Structure of repository
The original expO repository consists of gene expression data from over 2000 tumor samples. Samples can be accessed by tissue type or in batches. The aim of the GEMLeR repository is to use data from the expO repository only where 50 or more samples of the same tissue type are available. Therefore, nine tumor tissue types with more than 50 samples in the expO repository were selected (1545 samples altogether).
Each dataset can be obtained in "short" or "full" size. Datasets in the "short" format contain the 20% of genes with the highest variance across all samples. An unsupervised highest-variance filter was chosen to avoid so-called "selection bias". This filter eliminated genes with a practically constant signal, which allows faster computation and lower memory requirements.
GEMLeR datasets are divided into two sections: "one-versus-all" (OVA) and "all-paired" (AP) benchmarking datasets.
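The highest-variance filter behind the "short" format can be illustrated with a minimal sketch. The function name, the toy matrix, and the rounding of the 20% cutoff are assumptions made for illustration; the source does not state how GEMLeR implemented the filter or rounded the gene count.

```python
from statistics import pvariance

def top_variance_genes(rows, fraction=0.2):
    """Return sorted indices of the genes (columns) with the highest variance.

    rows: list of samples, each a list of expression values (one per gene).
    fraction: share of genes to keep; 0.2 mirrors the "short" format.
    Hypothetical helper, not GEMLeR's own code.
    """
    n_genes = len(rows[0])
    n_keep = max(1, int(n_genes * fraction))  # rounding rule is an assumption
    # Variance of each gene across all samples; class labels are never used,
    # which is what makes the filter unsupervised.
    variances = [pvariance([row[g] for row in rows]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_keep])

# Toy matrix: 3 samples x 5 genes (made-up values, not expO data).
toy = [
    [1.0, 5.0, 2.0, 7.0, 3.0],
    [1.0, 9.0, 2.1, 7.0, 3.0],
    [1.0, 1.0, 1.9, 7.0, 3.5],
]
selected = top_variance_genes(toy)
```

Because the ranking uses only the expression values and never the tissue labels, the same gene subset can be reused for every classification task without leaking class information into the feature selection step.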
The OVA benchmarking collection consists of two-class classification problems in which one cancer type is compared with the samples of all other types, which are labeled "Other" instead of their cancer type. There are nine OVA datasets, one for each tissue type. Each of them contains 1545 samples and 10935 (short version) or 54681 (full version) gene expression measurements.
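The relabeling behind an OVA dataset can be sketched as follows. The function name and the placeholder tissue labels are hypothetical; only the "Other" label comes from the description above.

```python
def to_one_versus_all(labels, target, other="Other"):
    """Keep `target` labels and replace every other tissue type with `other`."""
    return [label if label == target else other for label in labels]

# Placeholder tissue labels, not the actual GEMLeR tissue names.
tissues = ["type_1", "type_2", "type_1", "type_3"]
ova_labels = to_one_versus_all(tissues, "type_1")
```

The number of samples is unchanged by this step, which is why every OVA dataset keeps all 1545 samples.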
AP benchmarking datasets compare two specific cancer types, so each of them contains fewer samples than an OVA dataset, although the number of samples is still high compared with other publicly available datasets. There are 36 datasets, covering all pairwise combinations of the 9 tissue types. The number of genes is the same as in the OVA datasets.
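The count of 36 follows from choosing 2 of the 9 tissue types, 9 * 8 / 2 = 36. A minimal sketch, assuming placeholder tissue names and a hypothetical helper for extracting one pair, shows both the enumeration and the sample subsetting:

```python
from itertools import combinations

# Placeholder names standing in for the nine tissue types.
tissue_types = [f"type_{i}" for i in range(1, 10)]

# Every unordered pair of distinct tissue types defines one AP dataset.
pairs = list(combinations(tissue_types, 2))

def all_paired_subset(samples, labels, pair):
    """Keep only the samples whose label is one of the two types in `pair`."""
    keep = [i for i, label in enumerate(labels) if label in pair]
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Toy usage: sample identifiers are made up for illustration.
toy_samples = ["s1", "s2", "s3", "s4"]
toy_labels = ["type_1", "type_2", "type_3", "type_1"]
subset, subset_labels = all_paired_subset(toy_samples, toy_labels, ("type_1", "type_3"))
```

Unlike the OVA relabeling, this step drops samples, so the size of each AP dataset depends on how many samples the two chosen tissue types have.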