Classifica IO User Manual

User Manual: Pdf

Open the PDF directly: View PDF .
Page Count: 18

Download
Open PDF In Browser	View PDF

ClassificaIO: machine learning for classification graphical user
interface
Raeuf Roushangar1,2, George I. Mias1,2,*
1Department
2Institute

of Biochemistry and Molecular Biology,

for Quantitative Health Science and Engineering,

Michigan State University, East Lansing MI 48824, USA

Abstract
Background: Machine learning methods and algorithms have been used routinely by scientists
in many research areas. However, significant statistical and programing expertise are typically
required to use these methods and algorithms. Thus, an easy-to-use graphical user interface
software that enables biomedical researchers wanting to apply machine learning methods in their
research in biology is essential, and can facilitate further use of such methodology.
Results: Here we present ClassificaIO, an open-source Python graphical user interface for
machine learning classification for the scikit-learn Python module. ClassificaIO aims to provide an
easy-to-use interactive way to train, validate, and test data on a range of state-of-the-art
classification algorithms. The software enables fast comparisons within and across classifiers,
and facilitates uploading and exporting of trained models, and both validated, and tested data
results. ClassificaIO is implemented as a Python package and is available for download and
installation through the Python Package Index (PyPI) (http://pypi.python.org/pypi/ClassificaIO)
and it can be deployed using the “import” function once installed. The software is distributed under
an MIT license and source code is available for download (for Mac OS X, Linux and Microsoft

Windows)

through

PyPI

and

GitHub

(http://github.com/gmiaslab/ClassificaIO),

and

https://doi.org/10.5281/zenodo.1133266.
Conclusions: ClassificaIO facilitates the use of machine learning algorithms through a graphical
user interface (GUI), and can help biomedical and other researchers with broad machine learning
background to use machine learning, and apply it to their research in a simple and interactive
point-and-click way.

SUPPLEMENTARY CONTENTS
TABLE S1: Classification Algorithms ……………………………………………………. 3
SUPPLEMENTARY INFORMATION: ClassificaIO Software Manual…………………. 4
ATTACHMENTS: SUPPLEMENTARY FILES
(https://github.com/gmiaslab/manuals/tree/master/ClassificaIO/Supplementary%2
0Files)
1.

File Name
S1_Iris_Dependent_DataSet.csv

Description
Iris data set (150 data points)

S2_Iris_Target.csv

Iris Target data set (150 labels)

S3_Iris_Testing_DataSet.csv

Iris Testing data set (150 data points)

S4_Iris_FeatureNames.csv

S5_LogisticRegression_IrisTrainedModel.pkl

S6_TestingResult.csv

S7_TrainValidationResult.csv

Example Iris features (2 features: sepal length and
petal width)
Example ClassificaIO trained model using logistic
regression
Example ClassificaIO testing result using logistic
regression
Example ClassificaIO validation result using
logistic regression

Table S1. A list of all 25 classification algorithms, their corresponding scikit-learn functions, and
immutable (unchangeable) parameters with their default values are presented in additional file.

ClassificaIO
Machine Learning for Classification Graphical
User Interface User Manual 1.0.5 (01/2018)

Summary:
ClassificaIO is an open-source Python graphical user interface (GUI) for supervised
machine learning classification for the scikit-learn module (Pedregosa, et al., 2011).
ClassificaIO aims to provide an easy-to-use interactive way to train, validate, and test
data on a range of classification algorithms. The GUI enables fast comparisons within
and across classifiers, and facilitates uploading and exporting of trained models, and both
validated, and tested data results.

Prerequisites:
ClassificaIO is a Python library with the following external dependencies: nltk ≥ 3.2.5,
Tcl/Tk ≥ 8.6.7, Pillow ≥ 4.3, pandas ≥ 0.21, numpy ≥ 1.13, scikit-learn ≥ 0.19.1.
ClassificaIO requires Python version 3.5 or higher and we recommend using the Spyder
integrated
development
environment
(IDE)
in
Anaconda
Navigator
(https://www.anaconda.com/download/) on Mac OS High Sierra (10.13) and Microsoft
Windows 10 or higher.

Download and installation:
ClassificaIO can be installed using pip (https://pypi.python.org/pypi/pip) in the terminal:
$ pip install ClassificaIO

You can also install it directly from the main GitHub repository using:
$ pip install git+https://github.com/gmiaslab/ClassificaIO/

$ pip install git+https://github.com/georgemias/ClassificaIO.git
In case you do not have pip installed, you must install it first. Or you can obtain and install
ClassificaIO by downloading or cloning ClassificaIO source code from ClassificaIO
GitHub repository (https://github.com/gmiaslab/ClassificaIO)

Getting started:
Please Note:
•
•
•

ClassificaIO supports comma-separated values (CSV) input files only.
In this document we use the machine learning Iris dataset (Anderson, 1935; Fisher,
1936) (150 data points) as an example, to demonstrate model training, validation,
and testing, as well as the data formats that ClassificaIO relies on.
In the classification example below, we use 70% of the Iris dataset (105 data
points) for model training and 30% (45 data points) for model testing.

After installing ClassificaIO, please run it using the “import” function in Python:
>>> from ClassificaIO import ClassificaIO
>>> ClassificaIO.gui()

Once ClassificaIO’s main window appears on your screen, you can click on ‘Use My Own
Training Data’ button and start your new supervised machine learning classification
project.

Training data input:
You first need to make a selection (either ‘Dependent and Target’ or ‘Dependent, Target
and Features’) from the ‘UPLOAD TRAINING DATA FILES’ panel to upload training data
files. For this example, we select the ‘Dependent and Target’ radio butoon.

To begin uploading files, click the
corresponding buttons in the
‘UPLOAD TRAINING DATA FILES’
panel: a file selector directs you to
upload both, dependent data file
(Supplementary Figure 1.a) and
target data file (Supplementary
Figure 1.b). Once a file is uploaded
to ClassificaIO, the file name and
directory are automatically saved in
the ‘CURRENT DATA UPLOAD’
panel (Supplementary Figure 2).
This updatable log allows for
tracking current data files in use,
and maintains a history of all files
uploaded to the software.

Supplementary Figure 1. Graphical Control Element Dialog Box.
a. Dependent data file selected for upload. b. selected target data
file to upload. N.B. each file selection has to be done one at a time.

Supplementary Figure 2. Current Data
Upload Panel. Both dependent and target
data file names shown (red boxes). Scroll
down for uploaded data files directories.

Data format:
Data formats are shown in Supplementary Figure 3.a for dependent data and
Supplementary Figure 3.b for target data.

Attributes

Objects

Supplementary Figure 3.a Dependent Data. Example of partial dependent data file format. Testing data (not
shown) uses the same format.

Attributes

Objects

Supplementary Figure 3.b. Target Data. Example of partial
target data file format.

Classifier selection:
Once you have uploaded all required training data files, you can select between 25
different machine learning classification algorithms in the ‘CLASSIFER SELECTION’
panel.

Here are all classification algorithms in order of appearance in the ‘CLASSIFER
SELECTION’ panel. Also, immutable (unchangeable) parameters with their default values
are also listed for each classifier:
I.
Linear_model
1: LogisticRegression. (class_weight = None)
2: PassiveAggressiveClassifier. (class_weight = None, n_iter= None)
3: Perceptron. (class_weight = None)
4: RidgeClassifier. (class_weight = None)
5: Stochastic Gradient Descent (SGDClassifier).
II.

Discriminant_analysis
6: LinearDiscriminantAnalysis. (shrinkage= None, priors = None)
7: QuadraticDiscriminantAnalysis. (store_covariances = None, priors = None)

III.

Support vector machines (SVMs)
8: LinearSVC. (class_weight = None)
9: NuSVC. (class_weight = None)
10: SVC. (class_weight = None)
9

IV.

Neighbors
11: KNeighborsClassifier. (metric_params = None)
12: NearestCentroid.
13: RadiusNeighborsClassifier. (metric_params = None)

Gaussian_process
14: GaussianProcessClassifier. (kernel = None)

VI.

Naive_bayes
15: BernoulliNB. (class_prior = None)
16: GaussianNB. (class_prior = None)
17: MultinomialNB. (class_prior = None)

VII.

Trees
18: DecisionTreeClassifier. (class_weight = None)
19: ExtraTreeClassifier . (min_impurity_split = None, class_weight = None)

VIII.

Ensemble
20: AdaBoostClassifier. (base_estimator = None)
21: BaggingClassifier. (base_estimator = None)
22: ExtraTreesClassifier. (class_weight = None)
23: RandomForestClassifier. (class_weight = None)

IX.

Semi_supervised
24: LabelPropagation.

Neural_network
25: MLPClassifier.

The following will populate once you make a classifier selection:
• Supplementary Figure 4.a: The classifier definition with a clickable hyperlink
“learn more” in blue, which, once clicked, opens an external web-browser to the
scikit-learn documentation for the selected classifier.
• Supplementary Figure 4.b: Easy interactive way to select between train-validate
split and cross-validation methods (radio buttons), which are necessary to
prevent/minimize training model overfitting.
• Supplementary Figure 4.c: classifier parameters, to provide you with a point-andclick interface to set, modify, and test the influence of each parameter on your data

Supplementary Figure 4. Selected Logistic Regression Classifier. a. Classifier definition is displayed, together
with, b, the. training methods with ‘Train Sample Size(%)’ method selected, and c, the classifier parameters set to
their default values.

Model training, evaluation, validation and result output:
You can now click ‘submit’ to train your classifier using the uploaded training, dependent,
and target data in this example, and evaluate your result. Or, alternatively you can upload
testing data first ,and then click ‘submit’ to train and test a classifier on your data at the
same time! For this example, first: we train a selected classifier, ‘LogisticRegression’,
using its default parameters, and default train-validate split method ‘Train Sample
Size(%)’, and then, second: we upload testing data to test the trained model.
After clicking ‘submit’, our selected classifier, ‘LogisticRegression’ for this example, is
trained using the loaded training data, ‘Dependent and Target’ for this example.
Notes
ClassificaIO always shuffles your training data before splitting to eliminate mini batch
effects.
Internally, when ‘Train Sample Size(%)’ method is selected, ClassificaIO uses the scikitlearn train_test_split function, to allow for fast training data split into training and validation
subsets. With this method the parameter set is train_size, which takes the train sample
size set by you (e.g. Train Sample Size (%): set to 75% means train_size = 0.75 and
test_size= 0.25).
If the ‘K-fold Cross-Validation’ method is selected instead, ClassificaIO uses the scikitlearn cross_val_predict function where the training data is split into k-sets. The model is
trained on k-1 of the folds followed by a validation step on the remaining part of the data.
This will be repeated for each of the k-folds.
After training is completed, the confusion matrix, classifier accuracy and error are
displayed in the ‘CONFUSION MATRIX, MODEL ACCURACY & ERROR’ panel
(Supplementary Figure 5.a). Model validation data results are displayed in the
‘TRAINING RESULT: ID – ACTUAL – PREDICTION’ panel (Supplementary Figure 5.b)
with each data point ID is the first value, actual target value is displayed 2nd, and predicted
target value third.

Supplementary Figure 5. Trained Logistic Regression Classifier. a. Trained model using 78 data points (75% of
105 data points), classifier evaluation (confusion matrix, model accuracy and error). b. Model validated using 27 data
points (25% of 105 data points).

Testing data input and result output:
To test your trained model, first upload the testing data file by clicking the ‘Testing Data’
button in the ‘UPLOAD TESTING DATA FILE’ panel (Supplementary Figure 6.a). Once
clicked, a file selector directs you to upload the testing data file, see Supplementary
Figure 1. Once testing data is uploaded, the file name is automatically saved in the
‘CURRENT DATA UPLOAD’ panel to indicate that your file has been uploaded. The
Testing Data file format is the same as for the dependent data file, see Supplementary
Figure 3.a.
After clicking ‘Submit’, testing results are displayed in the ‘TESTING RESULT: ID –
PREDICTION’ panel (Supplementary Figure 6.b) with each data point ID shown first,
and the corresponding predicted target value displayed after it, separated by a hyphen.

Supplementary Figure 6. Tested Logistic Regression Classifier. a. Upload testing data panel. b. Model tested
using 45 data points.

Result export:
Now you are ready to export your trained model to preserve it for future use without having
to retrain. Simply, click the ‘Export Model’ button (Supplementary Figure 5.a) and save
your model. Your exported ClassificaIO model can then be used for future testing on new
data in the ‘Already Trained My Model’ window in ClassificaIO, shown below.

ClassificaIO model input:
You will need to upload ClassificaIO model by clicking the ‘Model File’ button in the
‘UPLOAD TRAINING MODEL FILE’ panel (Supplementary Figure 7.a). Once clicked, a
file selector directs you to upload a ClassificaIO trained model. Also, you will need to
upload a testing data file (the testing data file format is the same as explained above), by
clicking the ‘Testing Data’ button in the “UPLOAD TESTING DATA FILE” panel
(Supplementary Figure 7.b). Once a ClassificaIO model and testing data files are
uploaded, files names are automatically displayed in the ‘CURRENT DATA UPLOAD’
panel (Supplementary Figure 7.c).
After clicking ‘submit’, the uploaded model preset parameters will populate
(Supplementary Figure 7.d) to show the classifier used to originally train the uploaded
model. The confusion matrix, classifier accuracy and error of trained model are then
displayed in the ‘CONFUSION MATRIX, MODEL ACCURACY & ERROR’ panel
(Supplementary Figure 7.e). Testing data results are displayed in the ‘Testing RESULT:
ID – PREDICTION’ panel (Supplementary Figure 7.f) with the data point ID shown first,
followed by a hyphen and the predicted value displayed right after it.

Supplementary Figure 7. ‘Already Trained My Model’ window a. Upload ClassificaIO trained model panel. b. Upload
testing data panel. c. Current data upload panel with both model and testing data files names shown (red boxes). d.
Model preset parameters. e. Trained model result and model evaluation (confusion matrix, model accuracy and error).
f. Model testing result.

Results Export:
Full results (trained models, and both validated, and tested data) for both windows (‘Use
My Own Training Data’ and ‘Already Trained My Model’) can be exported as CSV files
for later use, e.g. further analysis, publication, sharing, etc. (for more details on the export
data file formats, see the Supplementary data files).

References
Anderson, E. The Irises of the Gaspe peninsula. Bulletin of American Iris Society
1935;59:2-5.
Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann Eugenic
1936;7:179-188.
Pedregosa, F., et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res
2011;12:2825-2830.

Source Exif Data:

File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
Linearized                      : No
Page Count                      : 18
PDF Version                     : 1.4
Title                           : ClassificaIO_UserManual
Author                          : George Mias
Subject                         : 
Producer                        : Mac OS X 10.13.2 Quartz PDFContext
Creator                         : Word
Create Date                     : 2018:01:15 20:57:38Z
Modify Date                     : 2018:01:15 20:57:38Z

EXIF Metadata provided by EXIF.tools

Classifica IO User Manual

Navigation menu

Versions of this User Manual:

Views

Navigation