Tutorial
Simple Machine Learning Tutorial for Machine44 Classification
The best way to learn how to use machine learning algorithms is to start with a very simple example. This example shows you how to use Machine44 Classification with random forests. The dataset does not contain real customer data, but it is well suited to showing what the tool can do. If you like, you can open the Machine44_sample.csv file in a spreadsheet program to see how it is structured. The dataset has 99 rows. Each row describes a single customer: the customer's age and how many times that customer bought a product from each of the given product categories:
- Age
- Electronics
- Clothing
- Pets
- Toys
We will call them features.
Now we want to know which customers will buy our new product. Purchases of the new product are recorded in the last column of the file (Bought_Product_1). We will call this column the class. In this file we already know whether a customer bought (value "yes") or did not buy (value "no") the new product. The tool needs this information in order to learn and then to apply that knowledge to datasets with unknown classes, that is, to customers who have not yet had the chance to buy the product, so that we can target our advertising at those who will probably buy it.
Let's begin!
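To make the file layout concrete, here is a minimal Python sketch of how such a file separates into features and class. The column names follow the sample file as described above; the three data rows, the comma separator, and the helper name split_features_and_class are made up for illustration and are not taken from Machine44_sample.csv.

```python
import csv
import io

# Hypothetical excerpt in the layout described above; the values are invented.
SAMPLE = """Age,Electronics,Clothing,Pets,Toys,Bought_Product_1
34,2,5,0,1,yes
51,0,1,3,0,no
27,4,0,0,2,yes
"""


def split_features_and_class(text, separator=","):
    """Return (feature_names, feature_rows, classes); the last column is the class."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader)
    feature_names = header[:-1]
    feature_rows, classes = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        feature_rows.append([float(value) for value in row[:-1]])
        classes.append(row[-1])
    return feature_names, feature_rows, classes


feature_names, feature_rows, classes = split_features_and_class(SAMPLE)
print(feature_names)
print(len(feature_rows), "rows,", len(set(classes)), "classes")
```

Every column except the last becomes a feature, and the last column holds the class that the tool learns to predict.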
1. Open the Machine44 Graphical User Interface (GUI) by clicking on its icon after installation. The tool needs some time to load. In the meantime you should see the following splash screen:
2. After loading you should see the GUI:
3. In the Parameters section, set Maximum Depth to 10 (the maximum depth of every single tree). Set the number of Estimators to 200 (the number of trees in the forest). More estimators generally improve the prediction, but training also takes longer. Leave the value for Jobs at 2 (the number of jobs run in parallel; if set to -1, the number of cores of your computer's processor is used):
4. Choose Training/Test random shuffle to shuffle the rows of the data file randomly and split them into a training and a test dataset. Use the slide bar to choose the percentage of training data. More training rows usually give a better test result because the process has more data to learn from:
5. Now, in the Data section, click the first button, Choose data file, to select the training/test data file. Choose Machine44_sample.csv in the installation directory and confirm. The last column of this file contains the classes; all other columns contain the features. In the field Separator: you can choose the character that separates the columns. In the field Decimal: you can choose the decimal separator for floating-point numbers. For the sample file you do not need to change anything here:
6. After you confirm the sample file, the Data File Analysis shows the number of features, rows and classes:
7. If you like, you can also choose a prediction file, machine44_predict.csv, to predict the classes for data whose classes are unknown. This step is optional:
8. Now click on the big button "TRAIN AND TEST" on the left:
9. In the result window on the right you should see information about the test run. It shows the number of rows for training and test data according to your percentage selection in the slide bar and the names of the feature columns and class column:
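For readers who want to see what steps 3, 4 and 8 amount to outside the GUI, the following is a rough Python sketch using scikit-learn. It is an assumption that Machine44 behaves like scikit-learn's RandomForestClassifier; the tutorial does not say which library the tool uses. The data is randomly generated, and the 80 % training share is an arbitrary value standing in for the slide bar.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sample file: 99 rows, 5 numeric features.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(99, 5))
# Invented rule so that the class depends on the first two features.
y = np.where(X[:, 0] + X[:, 1] > 9, "yes", "no")

# Step 4: shuffle the rows and keep 80 % for training (assumed slider value).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, shuffle=True, random_state=0
)

# Step 3: Maximum Depth 10, 200 Estimators, 2 Jobs.
model = RandomForestClassifier(
    max_depth=10, n_estimators=200, n_jobs=2, random_state=0
)

# Step 8: train on the training rows, then test on the held-out rows.
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(len(X_train), "training rows,", len(X_test), "test rows")
print("accuracy on test rows:", accuracy)
```

Fixing random_state makes the shuffle repeatable in this sketch; whether the tool fixes its random seed is not stated in the tutorial.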
The result window also shows a Confusion Matrix and Feature importances. A confusion matrix is a quick way to see whether the model performs well for the different classes.
Please see FAQ for an example of a Confusion Matrix.
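As a small illustration, a confusion matrix for the two classes of this tutorial can be tallied in a few lines of Python. The actual and predicted labels below are invented for the example and are not output of Machine44.

```python
from collections import Counter

# Invented labels for eight test rows (not real tool output).
actual = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "no", "yes", "yes", "no", "yes"]

labels = ["no", "yes"]
counts = Counter(zip(actual, predicted))

# Rows are the actual class, columns are the predicted class.
matrix = [[counts[(a, p)] for p in labels] for a in labels]

print("actual \\ predicted", labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correctly predicted rows and the off-diagonal cells hold the errors. In this made-up case, 6 of the 8 rows lie on the diagonal.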
Feature importances show the relative importance (a value between 0 and 1) of each feature for the prediction process. Providing them is one advantage of the random forest classifier. If a feature has a high importance, it contains valuable information for predicting the correct class.
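The following sketch shows how such importances can be obtained, assuming a scikit-learn style random forest (the tutorial does not state which library Machine44 uses). The data is synthetic and deliberately built so that only the Electronics column determines the class; the feature names are those of the sample file.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Age", "Electronics", "Clothing", "Pets", "Toys"]

# Synthetic data: 99 rows, class depends only on the Electronics column.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(99, len(feature_names)))
y = np.where(X[:, 1] >= 5, "yes", "no")

model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
model.fit(X, y)

# feature_importances_ holds one value per feature; the values sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Because the class in this sketch is derived from Electronics alone, that feature should come out on top, while the remaining features receive only small importances.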