Benchmark Results
This section presents the results of a small benchmarking experiment that provides initial classification accuracy values for the GEMLeR datasets.
All measurements were made in the WEKA machine learning environment. Two methods that are popular in gene expression analysis were used: the Support Vector Machines (SVM) classifier and the Support Vector Machines - Recursive Feature Elimination (SVM-RFE) feature selection algorithm.
Evaluation (feature selection followed by classification) was performed inside a 10-fold cross-validation loop on all 45 GEMLeR datasets. Running feature selection inside the loop, on the training folds only, avoids the so-called selection bias (Ambroise and McLachlan, 2002).
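The evaluation scheme described above can be sketched in a few lines of plain Python. This is only an illustration of where feature selection sits relative to the cross-validation split, not the WEKA setup used for the benchmark: the ranking by difference of class means is a simple stand-in for SVM-RFE, the nearest-centroid rule is a stand-in for SVM, and all function names and default parameters are hypothetical.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for idx in indices:
            folds[position % k].append(idx)  # deal samples out round-robin
            position += 1
    return folds


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def select_features(X, y, n_features):
    """Stand-in for SVM-RFE (assumption): rank features by the absolute
    difference of the two class means. Assumes a two-class problem."""
    first, second = sorted(set(y))[:2]
    c_first = centroid([x for x, t in zip(X, y) if t == first])
    c_second = centroid([x for x, t in zip(X, y) if t == second])
    scores = [abs(a - b) for a, b in zip(c_first, c_second)]
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return ranked[:n_features]


def fit_predict(X_train, y_train, X_test):
    """Stand-in for SVM (assumption): nearest-centroid classifier."""
    centroids = {
        c: centroid([x for x, t in zip(X_train, y_train) if t == c])
        for c in sorted(set(y_train))
    }

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [
        min(centroids, key=lambda c: squared_distance(x, centroids[c]))
        for x in X_test
    ]


def cross_validated_accuracy(X, y, k=10, n_features=8, seed=0):
    """Accuracy estimate with feature selection repeated inside every fold."""
    correct = 0
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Selection sees the training fold only. Selecting on the full
        # dataset before this loop would leak information from the test fold.
        keep = select_features(X_train, y_train, n_features)
        X_train_sel = [[row[j] for j in keep] for row in X_train]
        X_test_sel = [[X[i][j] for j in keep] for i in test_idx]
        predictions = fit_predict(X_train_sel, y_train, X_test_sel)
        correct += sum(p == y[i] for p, i in zip(predictions, test_idx))
    return correct / len(y)
```

The important design choice is that `select_features` is called once per fold on the training samples, so the held-out samples never influence which genes are kept.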

To demonstrate the importance of the choice of accuracy metric, two different classification accuracy metrics were used: General Accuracy (ACC) and the area under the Receiver Operating Characteristic (ROC) curve, or simply AUC.
Results of all ACC and AUC comparisons are available in a supplemental document as .pdf or .doc.
The following chart presents average accuracy results over all datasets (Avg), the 36 All-Paired (AP) datasets and the 9 One-Versus-All (OVA) datasets:

NOTE: Average accuracy is used only for illustrative purposes. When comparing two or more classification methods, you are strongly advised to follow Demsar's methodology for comparisons over multiple datasets (Demsar, 2006), in which the Wilcoxon Signed Ranks Test is used to compare a pair of classifiers and Friedman's Test is used to compare multiple classifiers.
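As an illustration of the pairwise comparison mentioned in the note, the following is a minimal sketch of the Wilcoxon signed-ranks test with a normal approximation, written in plain Python. It assumes one score per dataset for each of two classifiers. Datasets on which both classifiers score the same are dropped here, which is one common convention (others exist), and the normal approximation is only reasonable for a larger number of datasets, such as the 45 used here. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1-based); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def wilcoxon_signed_ranks(scores_a, scores_b):
    """Return (T, z, two-sided p) for paired per-dataset scores.

    Zero differences are dropped (assumption: one common convention).
    The p-value uses the normal approximation, so it should not be
    trusted when only a few datasets are compared.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired scores are identical")
    ranks = average_ranks([abs(d) for d in diffs])
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t = min(r_plus, r_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return t, z, p
```

A call such as `wilcoxon_signed_ranks(auc_method_a, auc_method_b)`, where both arguments are hypothetical lists holding one AUC value per dataset, returns the test statistic together with an approximate p-value.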
In the next chart, the AUC metric can be observed for different numbers of features selected using SVM-RFE and then classified by SVM:

Comparing the ACC and AUC results reveals a difference that is a consequence of the highly unbalanced OVA datasets, in which the numbers of samples from class 1 (observed tumor samples) and class 2 (other samples) differ greatly. In the OVA datasets in particular, AUC clearly stabilizes: from 64 genes onward, increasing the number of selected genes no longer increases AUC.
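The effect of class imbalance on the two metrics can be illustrated with a small sketch in plain Python. The sample counts below (10 samples in class 1, 90 in class 2) are made up for the example and are not taken from any GEMLeR dataset, and the function names are hypothetical. AUC is computed here as the fraction of (class 1, class 2) pairs in which the class 1 sample receives the higher score, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical unbalanced dataset: 10 samples of class 1, 90 of class 2.
y_true = [1] * 10 + [2] * 90

# A useless classifier that always answers "class 2" with the same score.
y_pred = [2] * 100
scores = [0.0] * 100

print(accuracy(y_true, y_pred))  # 0.9
print(auc(y_true, scores))       # 0.5
```

The always-majority classifier looks strong under ACC but is no better than chance under AUC, which is why the two metrics can disagree on unbalanced data.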
Multiple runs of randomized, stratified cross-validation are recommended to make the obtained results more reliable.
Ambroise, C., McLachlan, G.J. (2002) Selection bias in gene extraction on the basis of microarray gene-expression data, Proc Natl Acad Sci USA, 99:6562-6566.
Demsar, J. (2006) Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, 7(Jan):1-30.