GEMLeR provides a collection of gene expression datasets that can be used for benchmarking
machine learning algorithms designed for gene expression data.
The datasets can be used to estimate quality metrics (e.g. accuracy, precision, area under the ROC curve)
for classification, feature selection, or clustering algorithms.
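As a minimal sketch of how the metrics named above can be estimated for a binary classifier, the following pure-Python example computes accuracy, precision, and area under the ROC curve. The labels and scores are made-up toy values, not GEMLeR data, and the function names are hypothetical helpers introduced only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of samples predicted positive that are truly positive."""
    true_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not true_of_predicted_pos:
        return 0.0  # convention chosen here: no positive predictions -> 0.0
    hits = sum(t == positive for t in true_of_predicted_pos)
    return hits / len(true_of_predicted_pos)


def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison: the probability
    that a random positive sample scores higher than a random negative
    one, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): true labels and classifier scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6]

# Turn scores into hard labels with an arbitrary 0.5 threshold.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy :", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
print("ROC AUC  :", roc_auc(y_true, scores))
```

In a real benchmark, these metrics would typically be computed on held-out samples (for example, within cross-validation) rather than on the data used to fit the model.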
This repository was inspired by an increasing need in the machine learning and bioinformatics communities for a collection
of microarray classification problems that could be used by different researchers. This
way, many different classification or feature selection techniques can finally be compared to each other on the same
set of problems.
Origin of data
Each gene expression sample in the GEMLeR repository comes from expO (Expression Project for Oncology), a large, publicly available repository maintained by the International Genomics Consortium.
Although various other sources of gene expression data are available, the decision was made to use data from the expO repository. The expO website describes the project's aims as follows:
"The goal of expO and its consortium supporters is to procure tissue samples under standard conditions and perform gene expression analyses on a clinically annotated set of deidentified tumor samples. The tumor data is updated with clinical outcomes and is released into the public domain without intellectual property restriction. The availability of this information translates into direct benefits for patients, researchers and pharma alike."
Source: expO website