GEMLeR - Gene Expression Machine Learning Repository

GEMLeR provides a collection of gene expression datasets that can be used for benchmarking gene expression oriented machine learning algorithms. They can be used for estimation of different quality metrics (e.g. accuracy, precision, area under ROC curve, etc.) for classification, feature selection or clustering algorithms.

This repository was inspired by an increasing need in machine learning / bioinformatics communities for a collection of microarray classification problems that could be used by different researches. This way many different classification or feature selection techniques can finally be compared to eachother on the same set of problems.

Origin of data

Each gene expression sample in GEMLeR repository comes from a large publicly available expO (Expression Project For Oncology) repository by International Genomics Consortium.

The goal of expO and its consortium supporters is to procure tissue samples under standard conditions and perform gene expression analyses on a clinically annotated set of deidentified tumor samples. The tumor data is updated with clinical outcomes and is released into the public domain without intellectual property restriction. The availability of this information translates into direct benefits for patients, researchers and pharma alike.

Source: expO website
Although there are various other sources of gene expression data available, a decision to use data from expO repository was made because of: