Frequently Asked Questions (FAQ)

What is Machine Learning?

Machine learning derives knowledge from experience: an algorithm learns patterns from example data. It already has many everyday applications, such as email spam filters. Classification is a form of supervised machine learning, because the classes are known in the training data. Clustering is an example of unsupervised machine learning, because the system forms clusters for data whose classes are unknown.

How does Classification work?

Classification assigns data records, described by their features (e.g. headline, text and sender of an email), to classes (e.g. spam and non-spam). A training data set that already includes the correct class for each record is needed so that the algorithm can learn which feature values belong to which class.

What are possible applications of classification?

There are many applications for classification tasks. Here are some business examples:

  • Will a certain customer respond to a certain marketing campaign?
  • Will a certain customer buy a proposed product?
  • Will a certain customer cancel within the next few months?
  • Will a product with certain attributes be successful?

What is the difference between training data, test data and prediction data?

Training data contains data features like headline, text and sender of 100 emails AND the correct classes (spam or non-spam) for these emails.
Test data contains the same features, together with the correct classes, for other (e.g. 30) emails. The process classifies these 30 emails and compares the predicted classes to the correct classes in order to calculate the accuracy. The test data is therefore needed for evaluation purposes.
Prediction data contains no information about the classes at all. You have feature data for other (e.g. 40) emails, but no classes. The process predicts the class of each email using the knowledge it gained from the training data set.
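
The three roles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two numerical features (whether the sender is known, number of links) and all values are made up, and a simple 1-nearest-neighbour rule stands in for whichever classifier is actually used.

```python
# Hypothetical toy data: each record is ((sender_known, num_links), class).
# Feature names and values are invented for illustration only.
training_data = [
    ((1, 0), "non-spam"),
    ((1, 1), "non-spam"),
    ((0, 7), "spam"),
    ((0, 5), "spam"),
]
# Test data: features AND correct classes, used only for evaluation.
test_data = [
    ((1, 2), "non-spam"),
    ((0, 6), "spam"),
]
# Prediction data: features only, no classes.
prediction_data = [(0, 8), (1, 0)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(train, features):
    """Stand-in classifier: return the class of the nearest training record."""
    nearest = min(train, key=lambda record: squared_distance(record[0], features))
    return nearest[1]


# Evaluation: compare predicted classes with the known classes of the test data.
correct = sum(predict(training_data, f) == label for f, label in test_data)
accuracy = correct / len(test_data)

# Prediction: classify records whose classes are unknown.
predicted_classes = [predict(training_data, f) for f in prediction_data]

print("accuracy on test data:", accuracy)
print("predicted classes:", predicted_classes)
```

Note that the accuracy can only be computed for the test data, because the prediction data has no correct classes to compare against.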

What is a classifier?

The classifier is an algorithm that is used to perform the classification. There are several classifiers that have different parameters.

What is a "Confusion Matrix"?

A confusion matrix is a quick way to see whether the model performs well for the different classes. It is shown as a table like the following (example: predicting whether an email is spam or not):

                    Predicted
Actual              spam    non-spam
spam                21      9
non-spam            10      20

You can see that 21 emails were correctly predicted as spam and 20 emails were correctly predicted as non-spam. However, 9 emails that are actually spam were predicted as non-spam, and 10 emails that are actually non-spam were predicted as spam.
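
As a small worked example, the counts from this confusion matrix can be turned into summary figures. The sketch below uses the standard definitions of accuracy, precision and recall; the dictionary layout and variable names are only one possible way to hold the numbers.

```python
# Confusion matrix from the example, keyed by (actual class, predicted class).
confusion = {
    ("spam", "spam"): 21,
    ("spam", "non-spam"): 9,
    ("non-spam", "spam"): 10,
    ("non-spam", "non-spam"): 20,
}

total = sum(confusion.values())

# Accuracy: share of all emails whose predicted class equals the actual class.
correct = confusion[("spam", "spam")] + confusion[("non-spam", "non-spam")]
accuracy = correct / total

# Recall for spam: share of actual spam emails that were found.
actual_spam = confusion[("spam", "spam")] + confusion[("spam", "non-spam")]
spam_recall = confusion[("spam", "spam")] / actual_spam

# Precision for spam: share of emails predicted as spam that really are spam.
predicted_spam = confusion[("spam", "spam")] + confusion[("non-spam", "spam")]
spam_precision = confusion[("spam", "spam")] / predicted_spam

print(f"accuracy:       {accuracy:.3f}")
print(f"spam recall:    {spam_recall:.3f}")
print(f"spam precision: {spam_precision:.3f}")
```

Here 41 of the 60 emails were classified correctly, so the accuracy is about 68%.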

What is "Cross Validation"?

In machine learning there are several ways to evaluate a model. We want to know whether the classifier can really make accurate predictions for unknown data with the chosen settings. In cross validation the data is split into k parts (the number you entered in the k-field), and each part takes its turn as the test part.

Currently Machine44 software doesn't support cross validation.

Example: If you have chosen k=5 folds and you have 100 rows in your data file, the process will use rows 1-20 (part 1) for testing and rows 21-100 (parts 2-5) for training, and then evaluate the run. Afterwards it will use part 2 (rows 21-40) for testing and the remaining data for training, and so on. In the end, every data row has been test data in exactly one of the runs and training data in all the others. Cross validation thus produces an evaluation of the model, which is printed in the result window for the classifier used.
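
The rotation in this example can be reproduced with a short sketch. It assumes that the rows are split in file order without shuffling, as in the example; the function name `k_fold_splits` is hypothetical and is not part of any particular tool.

```python
def k_fold_splits(n_rows, k):
    """Return a list of (train_rows, test_rows) pairs with 1-based row numbers.

    Rows are split in file order; the last part absorbs any remainder
    when n_rows is not divisible by k.
    """
    fold_size = n_rows // k
    splits = []
    for fold in range(k):
        first = fold * fold_size + 1
        last = n_rows if fold == k - 1 else first + fold_size - 1
        test_rows = list(range(first, last + 1))
        train_rows = [r for r in range(1, n_rows + 1) if r < first or r > last]
        splits.append((train_rows, test_rows))
    return splits


splits = k_fold_splits(100, 5)
for run, (train_rows, test_rows) in enumerate(splits, start=1):
    print(f"run {run}: test rows {test_rows[0]}-{test_rows[-1]}, "
          f"{len(train_rows)} training rows")
```

Each run would train a fresh model on `train_rows` and evaluate it on `test_rows`; the per-run results are then typically averaged.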

How to select the right classifier?

You can start a cross validation in order to analyze the performance of the different classifiers with the given parameters.

How to select the right features?

All information that could be important for the categorisation into the different classes can be used as features. For example, the sender of an email could be a good feature for determining whether the email belongs to the spam or non-spam class. If you want to know how important the given features were for the classification, you can choose "Random Forest" as the classifier; after the TEST run, all features are shown with their importances in the result window. Afterwards you could remove features that are not very important for the prediction, or add features that are related to the important ones.

Do you support images and text documents as features?

Currently only numerical values are supported as features. If you need document or image classification, please contact us and we can quickly create a custom solution for you.

What is random forest classification?

In simple terms, random forest is a classifier made up of many decision trees. Decision trees can be represented visually and they contain ordered decision rules. The trees are grown on a randomized basis while the system is learning. Every tree comes to its own decision, and the final decision is simply the class chosen by the most trees. Random forest is a fast and precise classifier.
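
The voting step can be sketched as follows. This is a simplified illustration: in a real random forest the trees are learned from randomized samples of the training data, whereas the three trees below are hand-written, hypothetical rules with invented feature names.

```python
from collections import Counter


# Three hypothetical decision trees, each a short sequence of ordered rules.
def tree_a(email):
    if email["num_links"] > 3:
        return "spam"
    return "non-spam"


def tree_b(email):
    if email["sender_known"] == 1:
        return "non-spam"
    if email["num_exclamations"] > 2:
        return "spam"
    return "non-spam"


def tree_c(email):
    if email["sender_known"] == 0 and email["num_links"] > 1:
        return "spam"
    return "non-spam"


def forest_predict(trees, email):
    """Let every tree vote and return the class with the most votes."""
    votes = [tree(email) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


email = {"sender_known": 0, "num_links": 5, "num_exclamations": 1}
final_class = forest_predict([tree_a, tree_b, tree_c], email)
print(final_class)
```

For this email two of the three trees vote for spam, so the forest decides on spam. With two classes, an odd number of trees avoids ties.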

What is overfitting?

If you have a lot of different features, there is a risk of overfitting. Overfitting means that the trained model is very good at predicting the classes of the training/test data but delivers very poor results for new (unknown) data (of a prediction file). In order to prevent overfitting, you can remove features that don't seem important or decrease the number in the parameter field "Maximum Depth".