GEMLeR - Gene Expression Machine Learning Repository

File Formats

GEMLeR datasets are available in two file formats suitable for a range of machine learning tools. This section briefly describes both formats.

ARFF (Weka, Orange, YALE, Tanagra, ...)

ARFF is the default file format of the WEKA machine learning tool. It was later adopted as an input format by many other machine learning tools, such as Orange, YALE, and Tanagra.

ARFF files start with the reserved word @relation, followed by the name of the dataset. The following lines define the attributes by specifying their name and type. For a numeric attribute, the attribute name is followed by the reserved word numeric; for a nominal attribute, the name is followed by a comma-separated list of all possible attribute values enclosed in braces (see the Tissue attribute in the example below).

The attribute definitions are followed by the reserved word @data, which marks the beginning of the data section. Each sample is represented by a single line in which the attribute values are separated by commas.

Example of an ARFF file:

@relation AP_Breast_Colon.arff

@attribute ID_REF numeric
@attribute 1007_s_at numeric
@attribute 121_at numeric
@attribute 1405_i_at numeric
@attribute 1438_at numeric
@attribute 1487_at numeric
@attribute Tissue {Breast,Colon}

@data
137984,2503.7,669,509.8,559.1,664.5,Breast
117717,3596.2,7013.3,2922,148.3,286.7,Colon
46859,4335.3,484.1,629.2,278,675.9,Breast
89055,3582.5,864.5,1011,303,489,Breast
...
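
The ARFF layout described above can be read with a few lines of code. The following is a minimal Python sketch, not a full ARFF implementation: it assumes only the two attribute types used here (numeric and nominal), unquoted attribute names, and no missing values. The function name parse_arff is hypothetical, and the sample text is the example above trimmed to three columns and two samples.

```python
def parse_arff(text):
    """Parse the ARFF subset described above (numeric and nominal attributes only)."""
    relation = None
    attributes = []   # (name, kind); kind is "numeric" or a list of nominal values
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and ARFF comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            if len(values) != len(attributes):
                raise ValueError(f"expected {len(attributes)} values, got {len(values)}")
            row = {}
            for (name, kind), value in zip(attributes, values):
                if kind == "numeric":
                    row[name] = float(value)
                elif value in kind:
                    row[name] = value
                else:
                    raise ValueError(f"{value!r} is not a declared value of {name}")
            rows.append(row)
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()
        rest = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "@relation":
            relation = rest
        elif keyword == "@attribute":
            name, spec = rest.split(None, 1)
            spec = spec.strip()
            if spec.startswith("{") and spec.endswith("}"):
                kind = [v.strip() for v in spec[1:-1].split(",")]
            elif spec.lower() == "numeric":
                kind = "numeric"
            else:
                raise ValueError(f"unsupported attribute type: {spec}")
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
        else:
            raise ValueError(f"unexpected header line: {line}")
    return relation, attributes, rows


# Sample trimmed from the example above (three columns, two samples).
sample = """@relation AP_Breast_Colon.arff

@attribute ID_REF numeric
@attribute 1007_s_at numeric
@attribute Tissue {Breast,Colon}

@data
137984,2503.7,Breast
117717,3596.2,Colon
"""

relation, attributes, rows = parse_arff(sample)
print(relation, len(attributes), len(rows))
```

Each sample is returned as a dictionary keyed by attribute name, so the class value of the first sample is available as rows[0]["Tissue"]. For real work, the ARFF readers built into the tools listed above are likely a better choice.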


CSV

CSV is the classic comma-separated file format. The first line contains all attribute names (the patient ID column, the gene IDs, and the class attribute) separated by commas. Each following line contains a patient ID, followed by the gene expression values, and ends with one of the class values.

Example of a CSV file:

ID_REF,1007_s_at,121_at,1405_i_at,1438_at,1494_f_at,Tissue
53104,4029.8,3023.7,90,35.5,530.3,Colon
89101,1088.6,1070.9,495.4,678.8,281.9,Breast
152581,3801.7,2607.9,344.3,360.9,302.4,Breast
...
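
The CSV layout described above maps directly onto Python's standard csv module. The following is a minimal sketch that assumes, as in the description, that the first column holds the patient ID and the last column holds the class value. The function name read_gemler_csv is hypothetical, and the sample data is copied from the example above.

```python
import csv
import io


def read_gemler_csv(handle):
    """Split a GEMLeR-style CSV into gene ids, patient ids, expression values and classes."""
    reader = csv.reader(handle)
    header = next(reader)
    gene_ids = header[1:-1]          # everything between the ID column and the class column
    patient_ids, features, labels = [], [], []
    for record in reader:
        if not record:               # skip empty lines
            continue
        patient_ids.append(record[0])
        features.append([float(v) for v in record[1:-1]])
        labels.append(record[-1])
    return gene_ids, patient_ids, features, labels


# Sample copied from the example above.
sample = io.StringIO(
    "ID_REF,1007_s_at,121_at,1405_i_at,1438_at,1494_f_at,Tissue\n"
    "53104,4029.8,3023.7,90,35.5,530.3,Colon\n"
    "89101,1088.6,1070.9,495.4,678.8,281.9,Breast\n"
    "152581,3801.7,2607.9,344.3,360.9,302.4,Breast\n"
)

gene_ids, patient_ids, features, labels = read_gemler_csv(sample)
print(len(gene_ids), "genes,", len(patient_ids), "samples")
```

To read a file from disk instead, pass an open file handle, for example one obtained with open(path, newline=""), where path is the location of the downloaded dataset.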