File Formats
GEMLeR datasets are available in two file formats suitable for a variety of machine learning tools. This section briefly describes both formats.
ARFF (Weka, Orange, YALE, Tanagra, ...)
ARFF is the default file format of the WEKA machine learning tool. It was later adopted as
an input format by many other machine learning tools, such as Orange, YALE and Tanagra.
ARFF files start with the reserved word @relation, followed by the name of the dataset. The following lines define
the attributes, one per line, each introduced by the reserved word @attribute and specifying the attribute's name and type. For a numeric attribute, the attribute name is followed
by the reserved word numeric, while the name of a nominal attribute has to be followed by a definition of all possible attribute values (see the Tissue attribute in the example below).
The attribute definitions are followed by the reserved word @data, which marks the beginning of the data section. Each sample is represented by a single line in which the attribute values are separated by commas.
Example of an ARFF file:
@relation AP_Breast_Colon.arff
@attribute ID_REF numeric
@attribute 1007_s_at numeric
@attribute 121_at numeric
@attribute 1405_i_at numeric
@attribute 1438_at numeric
@attribute 1487_at numeric
@attribute Tissue {Breast,Colon}
@data
137984,2503.7,669,509.8,559.1,664.5,Breast
117717,3596.2,7013.3,2922,148.3,286.7,Colon
46859,4335.3,484.1,629.2,278,675.9,Breast
89055,3582.5,864.5,1011,303,489,Breast
...
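The ARFF layout described above (a @relation line, @attribute lines, then @data rows) can be read with a few lines of Python. The following is only a minimal sketch and not a complete ARFF reader: the function name parse_arff is hypothetical, and the code assumes the dense, unquoted subset of ARFF used in the example (no quoted names, no sparse rows, no missing values, no string or date attributes). The embedded sample is the first two data rows of the example above.

```python
def parse_arff(lines):
    """Parse a simple dense ARFF file given as an iterable of text lines.

    Returns (relation, attributes, rows), where attributes is a list of
    (name, kind) pairs and kind is either "numeric" or a list of the
    allowed nominal values.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                # Numeric attributes become floats; nominal values stay strings.
                row.append(float(value) if kind == "numeric" else value)
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            # Assumption: attribute names contain no spaces or quotes.
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


sample = """\
@relation AP_Breast_Colon.arff
@attribute ID_REF numeric
@attribute 1007_s_at numeric
@attribute 121_at numeric
@attribute 1405_i_at numeric
@attribute 1438_at numeric
@attribute 1487_at numeric
@attribute Tissue {Breast,Colon}
@data
137984,2503.7,669,509.8,559.1,664.5,Breast
117717,3596.2,7013.3,2922,148.3,286.7,Colon
"""

relation, attributes, rows = parse_arff(sample.splitlines())
print(relation)        # AP_Breast_Colon.arff
print(attributes[-1])  # ('Tissue', ['Breast', 'Colon'])
print(rows[0])
```

For real work, an established ARFF reader is likely a better choice than this sketch, since full ARFF also allows quoted names, missing values and sparse rows.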
CSV
CSV is the classical comma-separated file format. The first line contains all attribute names (i.e. gene IDs), separated by commas.
Each following line contains a patient ID, followed by the gene expression values, and ends with one of the class values.
Example of a CSV file:
ID_REF,1007_s_at,121_at,1405_i_at,1438_at,1494_f_at,Tissue
53104,4029.8,3023.7,90,35.5,530.3,Colon
89101,1088.6,1070.9,495.4,678.8,281.9,Breast
152581,3801.7,2607.9,344.3,360.9,302.4,Breast
...
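The CSV layout described above (header line, then one line per patient with the ID first and the class value last) maps directly onto Python's standard csv module. The following is a minimal sketch under that assumption; the function name read_gemler_csv is hypothetical, and the embedded sample is the three data rows of the example above.

```python
import csv
import io


def read_gemler_csv(handle):
    """Read a CSV file laid out as: ID column, gene columns, class column.

    Returns (gene_ids, samples), where each sample is a tuple of
    (patient_id, expression_values, class_label).
    """
    reader = csv.reader(handle)
    header = next(reader)
    gene_ids = header[1:-1]  # drop the ID column and the class column
    samples = []
    for record in reader:
        if not record:  # skip empty lines
            continue
        patient_id = record[0]
        expression = [float(v) for v in record[1:-1]]
        label = record[-1]
        samples.append((patient_id, expression, label))
    return gene_ids, samples


sample = io.StringIO(
    "ID_REF,1007_s_at,121_at,1405_i_at,1438_at,1494_f_at,Tissue\n"
    "53104,4029.8,3023.7,90,35.5,530.3,Colon\n"
    "89101,1088.6,1070.9,495.4,678.8,281.9,Breast\n"
    "152581,3801.7,2607.9,344.3,360.9,302.4,Breast\n"
)

gene_ids, samples = read_gemler_csv(sample)
print(gene_ids)
print(samples[0])
```

Keeping the patient ID as a string, rather than converting it to a number, is a design choice in this sketch: the ID identifies a sample and is not a feature to be used for learning.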