Documentation

by m May 1, 2018

1. Parameters

Each classifier has its own advantages and disadvantages, and it can be helpful to change classifiers for different datasets.
In machine learning, cross-validation is commonly used to choose an algorithm for a specific dataset. Machine44 uses a Random Forest classifier because it delivers good results across a variety of tasks. Machine44 supports a selection of parameters for the Random Forest.

1.1 Maximum Depth
The value sets the maximum depth of each tree. If set to None, the full tree will be expanded. Limiting the maximum depth to a lower value can reduce overfitting.

1.2 Estimators
Number of estimators (trees). A larger number will usually improve the result but also slows down training and testing.

1.3 Jobs
The number of jobs to run in parallel. If set to -1, the number of cores of your computer's processor is used.
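
The -1 convention can be illustrated with a short sketch. This is not Machine44's code: the helper name `resolve_jobs` is hypothetical, and it assumes that the core count is taken from Python's `os.cpu_count()`, which reports logical cores.

```python
import os

def resolve_jobs(n_jobs):
    """Translate a jobs setting into a concrete worker count (hypothetical helper)."""
    if n_jobs == -1:
        # os.cpu_count() may return None if the count cannot be determined
        return os.cpu_count() or 1
    if n_jobs < 1:
        raise ValueError("jobs must be -1 or a positive integer")
    return n_jobs

workers = resolve_jobs(-1)  # one worker per available core
```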

1.4 Training/Test random shuffle
If activated, the process will randomly allocate rows to the training and test data sets. The slider setting for the division into training and test data will be taken into account. For example, if you want 70% of the data to be training data and 30% to be test data, the process will walk through the data and randomly assign each row of the dataset to either training or test data.

1.5 Training/Test no shuffle
If activated, the process will allocate rows to the training and test data sets in file order. The slider setting will be taken into account. For example, if you want 70% of the data to be training data and 30% to be test data, the process will assign the first 70% of the data rows to training and the remaining 30% to test data.

1.6 Slider Training %
The process will allocate the shown percentage of the data file rows to training data and the remaining part to test data.
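
The two allocation modes and the slider can be sketched as follows. This is a minimal illustration, not Machine44's actual implementation: the function name `split_rows` and its parameters are hypothetical, and the random mode is assumed to shuffle the row order and then cut at the slider percentage. That yields exactly the requested share, whereas assigning each row independently at random would only approximate it.

```python
import random

def split_rows(rows, train_percent, shuffle=True, seed=None):
    """Split rows into (train, test) according to a training percentage."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 0 and 100 (exclusive)")
    rows = list(rows)  # copy so the caller's data is not reordered
    if shuffle:
        # random mode: reorder the rows before cutting
        random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_percent / 100)
    return rows[:cut], rows[cut:]

data = list(range(10))
train, test = split_rows(data, 70, shuffle=False)
# serial mode: train holds the first 7 rows, test the remaining 3
```

Passing a fixed `seed` makes the random split reproducible between runs.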

1.7 TRAIN AND TEST
The model will be trained and evaluated using the chosen data file and the selected classifier, respecting the parameter settings. The process performs the training first and the test run afterwards. The result of the test will be shown in the result box on the right.

1.8 PREDICT
While training and test data files always include the observed classes in the last column, prediction data files need the same feature column headers but contain no class column. The classes will be predicted from the given features.

2. Data

2.1 Training/Test file
The data input file is a text (csv) file that needs to be prepared so that it fits the required format:

max. 200 columns
last column represents the observed class (dependent variable)
all other columns contain the features (explanatory variables)
max. 1,000,000 rows
features need to be numerical
the feature separator needs to be correct (default = comma)
(Please note that no thousands separators are allowed!)
the decimal separator needs to be correct (default = point)

After the data file is chosen, it will be analysed. If the data does not fit the format, a message box will appear.
The number of features, rows, and classes will be shown.
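
The format rules above can be expressed as a small check. This is only a sketch under stated assumptions, not the analysis Machine44 performs: the name `analyse_csv` is hypothetical, the first line is assumed to hold the column headers, and the file content is passed in as a string.

```python
import csv
import io

MAX_COLUMNS = 200
MAX_ROWS = 1_000_000

def analyse_csv(text, separator=",", decimal="."):
    """Check the format rules and return (n_features, n_rows, n_classes)."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader, None)  # assumption: first line holds the headers
    if header is None or len(header) < 2:
        raise ValueError("need at least one feature column and a class column")
    if len(header) > MAX_COLUMNS:
        raise ValueError(f"more than {MAX_COLUMNS} columns")
    classes = set()
    n_rows = 0
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            # a thousands separator equal to the feature separator ends up here
            raise ValueError(f"line {line_no}: expected {len(header)} columns")
        for value in row[:-1]:
            if decimal != "." and "." in value:
                raise ValueError(f"line {line_no}: wrong decimal separator in {value!r}")
            try:
                float(value.replace(decimal, "."))
            except ValueError:
                raise ValueError(f"line {line_no}: {value!r} is not numerical") from None
        classes.add(row[-1])  # last column is the observed class
        n_rows += 1
        if n_rows > MAX_ROWS:
            raise ValueError(f"more than {MAX_ROWS} rows")
    return len(header) - 1, n_rows, len(classes)

sample = "a,b,class\n1.5,2,x\n3,4.25,y\n5,6,x\n"
n_features, n_rows, n_classes = analyse_csv(sample)
# 2 features, 3 rows, 2 classes
```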

2.2 Separator
Set the feature separator (default = comma ",").

2.3 Decimal
Set the decimal separator (default = point ".").

2.4 Prediction file (optional)
Select a prediction file. Please make sure that all feature columns of the Training/Test data file are included. A prediction file has no last column containing the classes, because they are unknown and need to be predicted. All other format requirements for the Training/Test file also apply to the prediction file.
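
A quick way to verify the column requirement before starting a prediction run is to compare the two header rows. The helper below is a hypothetical sketch, not part of Machine44; it assumes both headers are available as lists of column names.

```python
def missing_feature_columns(train_header, prediction_header):
    """Return the training feature columns absent from the prediction file."""
    feature_names = train_header[:-1]  # the last training column is the class
    return [name for name in feature_names if name not in prediction_header]

missing = missing_feature_columns(["a", "b", "class"], ["a", "b"])
# an empty list means every feature column is present
```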

2.5 Single row entry
If you only have one set of features, you can enter or paste them here. The format requirements still apply (see Training/Test file).

2.6 Generate output file
If activated, an output file will be generated that includes all features and predicted classes for a prediction run.
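
As an illustration of what such a file could look like, the sketch below appends the predicted class as a last column after the features. The function name `format_predictions` and the column name `predicted_class` are assumptions made for this example; the actual layout of the Machine44 output file may differ.

```python
import csv
import io

def format_predictions(feature_header, feature_rows, predicted, separator=","):
    """Return csv text with all feature columns plus a predicted class column."""
    if len(feature_rows) != len(predicted):
        raise ValueError("need exactly one prediction per row")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=separator, lineterminator="\n")
    # "predicted_class" is a placeholder column name (assumption)
    writer.writerow(list(feature_header) + ["predicted_class"])
    for row, label in zip(feature_rows, predicted):
        writer.writerow(list(row) + [label])
    return buffer.getvalue()

text = format_predictions(["a", "b"], [[1.5, 2], [3, 4]], ["x", "y"])
```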

2.7 Output directory
Directory of the output file.

3. Result

3.1 Result box

In this frame the evaluation results will be displayed:
Number of training data rows
Number of test data rows
Feature columns
Class column
Confusion Matrix (see FAQ)
Feature Importances
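
For readers unfamiliar with the confusion matrix, the following sketch shows how one can be counted from observed and predicted classes. It is an illustration only: the function name is hypothetical, and the orientation (rows are observed classes, columns are predicted classes) is an assumption that may differ from the layout shown in the result box.

```python
from collections import Counter

def confusion_matrix(observed, predicted):
    """Return (labels, matrix); matrix[i][j] counts rows with observed
    class labels[i] that were predicted as labels[j]."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    labels = sorted(set(observed) | set(predicted))
    counts = Counter(zip(observed, predicted))
    matrix = [[counts[(o, p)] for p in labels] for o in labels]
    return labels, matrix

observed = ["x", "x", "y", "y", "y"]
predicted = ["x", "y", "y", "y", "x"]
labels, matrix = confusion_matrix(observed, predicted)
# labels == ["x", "y"], matrix == [[1, 1], [1, 2]]

# correct predictions sit on the diagonal
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(observed)
# accuracy == 0.6
```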

3.2 Top 10 predictions
Shows the predictions for the first 10 rows of a test or prediction run.

3.3 Top 10 probabilities
Shows the class probabilities for the first 10 rows of a test or prediction run.

4. Help
Opens the documentation website in a web browser.

5. Report Error/Feedback
We appreciate any feedback. Please report errors, feature requests, etc. here.
We can also adjust the program to your needs.