Here we explore the pros and cons of some of the most popular
classical machine learning algorithms for supervised learning. To recap, this is a learning situation where we are given some labelled data and the model must predict the value or class of a new
datapoint using a hypothesis function that it has learned from studying the provided examples.
By ‘classical’ machine learning algorithms I mean anything that is not a neural network.
We will cover the advantages and disadvantages of various neural network architectures in a future post.
Also note that this post deals only with supervised
learning. There will be another dealing with clustering algorithms for unsupervised tasks.
Linear Regression
Note that there are two basic flavors of this: Lasso and Ridge regression, which regularize with the L1 and L2 norms respectively.
This is used for regression problems.
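To make this concrete, here is a minimal sketch assuming scikit-learn, with a synthetic dataset standing in for real data and arbitrary alpha values. It shows the L2 (Ridge) and L1 (Lasso) flavors, with scaling handled inside a pipeline, and prints the Lasso coefficients, whose zeros are what give it its feature-selection behaviour.

```python
# Minimal sketch assuming scikit-learn; the data and alpha values are placeholders.
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

# Synthetic regression problem: only 3 of the 10 features are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Regularized linear models care about feature scale, so scale inside a pipeline.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)  # L2 penalty
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)  # L1 penalty

# Lasso drives uninformative coefficients to zero, giving a rough
# feature-importance / feature-selection signal.
print(lasso[-1].coef_)
print(ridge[-1].coef_)
```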
Pros
- Small number of hyperparameters
- Easy to understand and explain
- Can be regularized to avoid overfitting, and the regularization is intuitive
- Lasso regression can provide feature importances
Cons
- Input data needs to be scaled, and there is a range of ways to do this
- May not work well when the hypothesis function is non-linear
- A complex hypothesis function is really difficult to fit. This can be done by using quadratic and higher-order features, but the number of these grows rapidly with the number of original features and may become very computationally expensive
- Prone to overfitting when a large number of features are present
- May not handle irrelevant features well
Logistic Regression
Again, this can be regularized with the L1 or L2 norm.
This is used for classification problems.
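As a rough illustration, here is a hedged scikit-learn sketch (the dataset and the value of C are placeholders) showing an L2-regularized logistic regression and the probabilistic output that feeds ROC or precision-recall analysis.

```python
# Sketch assuming scikit-learn; the dataset and C value are stand-ins.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# penalty selects the L1 or L2 norm; C is the inverse regularization strength.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
).fit(X_train, y_train)

# The probability estimates are what P-R and ROC curves are built from.
proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
```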
Pros
- Outputs have a probabilistic interpretation, which can be investigated with P-R curves, ROC curves etc.
- Has a small number of hyperparameters
- Overfitting can be addressed through regularization
- Using L1 regularization can provide feature importances, as with Lasso regression
Cons
- May overfit when provided with large numbers of features
- Can only learn linear hypothesis functions, so it is less suitable for complex relationships between features and target
- Input data might need scaling
- May not handle irrelevant features well, especially if the features are strongly correlated
Naive Bayes
This is used for classification problems.
An intuitive way of determining the probability that a sample falls into each class, based on a simplification of Bayes' Theorem in which we treat all the features as statistically independent. Specifically, we are attempting to determine the classification of some example given its feature values.
Bayes' Theorem states that this is proportional to the prior probability of each class multiplied by the product of the probabilities of observing each of the feature values given that class.
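In symbols: P(y | x) is proportional to P(y) multiplied by the product over features i of P(x_i | y). A small sketch, assuming scikit-learn's GaussianNB (which models each feature as a per-class Gaussian) and using the iris dataset as a stand-in, shows how little there is to set up.

```python
# Sketch assuming scikit-learn's GaussianNB; iris is just a convenient stand-in.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Fitting" just estimates per-class priors and per-class feature distributions,
# treating the features as independent given the class.
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))
```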
Pros
- No training involved - this is just based on the distribution of feature values. Thus it is easy to implement and fast
- Very little parameter tuning is required
- Works surprisingly well given its simplicity
- Features do not need scaling
Cons
- Assumes that the features are independent, which is rarely true
- Will usually not perform as well as classification algorithms that follow the ‘train-test’ approach
K Nearest Neighbors
This is an approach where the values of a datapoint’s nearest neighbors are used to either classify it or estimate its value
- For classification, the neighbors of the datapoint all vote for its class. The class with the most votes wins
- For regression, the value of the datapoint is determined as the average of its neighbors
This is also known as lazy learning, because there is no explicit training step - the training data is simply stored and consulted at prediction time
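A minimal sketch assuming scikit-learn, with the wine dataset as a placeholder. Scaling is included because the method is distance-based, and the fit step essentially just stores the training data - the real work happens at prediction time.

```python
# Sketch assuming scikit-learn; the dataset and n_neighbors are placeholders.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() mostly just stores the scaled training set ("lazy learning");
# distances are computed when predict()/score() are called.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```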
Pros
- No training involved
- The concept is intuitive and easy to understand, especially for classification
- Naturally handles multiclass classification and can learn complex decision boundaries
Cons
- The choice of distance metric is not obvious and is difficult to justify in many cases
- Performs poorly on high-dimensional datasets
- Expensive and slow at prediction time, because the distance from each new instance to every training point must be computed
- Data needs to be scaled before use
Decision Trees
Decision trees are rule-based learning methods that usually use either Gini impurity or entropy to split the dataset at a series of decisions, made at nodes. Each internal node represents a ‘test’ on an attribute and each test has a series of outcomes, which are represented by branches. At the very end of the branches are leaf nodes that represent a class or regression outcome from the tree. The tree is ‘learned’ by splitting the source set into subsets based on an attribute-value test. The process can then be repeated on each derived subset in a recursive fashion.
Decision trees are known as ‘weak learners’ and form the basis for ensemble methods like bagging and boosting.
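A brief sketch, assuming scikit-learn and an arbitrary dataset, showing a depth-limited tree (a simple stand-in for pruning) and the built-in feature importances.

```python
# Sketch assuming scikit-learn; dataset and max_depth are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gini impurity is the default split criterion; limiting depth curbs overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(tree.feature_importances_)  # importances come for free from the splits
```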
Pros
- Easy to interpret
- Can adapt to learn arbitrarily complex hypothesis functions
- Requires little data preparation - data typically does not need to be scaled
- Feature importance is built in, due to the way in which decision nodes are built
- Performs well on large datasets
Cons
- Prone to overfitting unless pruning is used
- Can be very non-robust, meaning that small changes in the training dataset can lead to quite major differences in the hypothesis function that gets learned
- Generally have worse performance than ensemble methods
Support Vector Machines (SVM)
SVMs come in a variety of different flavors and probably deserve their own blog post for a more detailed description. Ultimately their goal is to find optimal hyperplanes that best separate the dataset. Typically a hyperplane is chosen such that the ‘margin’ between the plane and the datapoints of different classes on either side is maximized.
- Hard margin SVMs: These are not robust to outliers (noise), meaning that all datapoints within each class must be on the correct side of the hyperplane
- Soft margin SVMs: A slack variable is introduced into the objective function so that the constraint can be satisfied even if it is not possible to fit a hyperplane to all datapoints. Typically these will have a regularization parameter C, whose size controls the extent of smoothing that occurs. A small C emphasizes the importance of the slack variables, while a large C diminishes it (see the sketch after this list)
- The kernel trick can be used to adapt SVMs so that they become capable of learning non-linear decision boundaries. A kernel is really just a measure of the similarity between a datapoint and all others represented in the dataset. The most commonly used kernel (the Radial Basis Function, or RBF) has a Gaussian shape
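Here is a minimal soft-margin, RBF-kernel sketch assuming scikit-learn's SVC; the dataset and the value of C are placeholders, and scaling is included because SVMs require it.

```python
# Sketch assuming scikit-learn's SVC; dataset and C are stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small C -> more slack / smoother decision boundary;
# large C -> the fit tracks the training data more closely.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```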
Pros
- Fairly robust against overfitting, especially in higher-dimensional space
- There are many kernels to choose from, and some may perform very well on the dataset in question
- The optimization problem is convex, so it is solvable with standard libraries and the solution is always unique
Cons
- Feature scaling is required
- There are many hyperparameters and their meanings are often not intuitive
- Do not scale well to large datasets (difficult to parallelize)
Bagging (random forest, for example)
The concept of bagging is to aggregate/combine the results of many weak learners. Each weak learner has low bias but high variance (it is not robust to variation in the training dataset), but when aggregated their variance is lowered. In classification, each tree votes for the class of the input datapoint. In regression, the value assigned to each input is the mean of the bagged learners.
In random forest, the weak learners are decision trees. Each tree is grown using a random subset of features and a random subset of the data. This decorrelates the trees by showing them different datasets. See the overview of random forest document for more information.
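As a sketch (scikit-learn assumed; the dataset and hyperparameter values are arbitrary), here is a random forest that uses the balanced class weight flag and exposes the built-in feature importances mentioned in the pros below.

```python
# Sketch assuming scikit-learn; dataset and hyperparameters are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,            # number of bagged trees
    max_features="sqrt",         # random feature subset considered at each split
    class_weight="balanced",     # the class-imbalance flag mentioned below
    random_state=0,
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
print(rf.feature_importances_)
```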
Pros
- Highly flexible and generally very accurate
- Naturally assigns feature importance scores, so can handle redundant feature columns
- Has the ability to address class imbalance by using the balanced class weight flag
- Scales to large datasets
- Generally robust to overfitting
- Data does not need to be scaled
- Can learn non-linear hypothesis functions
Cons
- Results may be difficult to interpret
- Feature importances may not be robust to variation in the training dataset
Boosting (gradient boosted trees, for example)
Boosting is an alternative ensemble method that relies on the idea of iteratively improving the performance of weak learners. In the case of gradient boosted decision trees, the trees are built sequentially and each tree is designed to correct the errors made by previous trees. At each step, the residual between the predicted and observed training data is added to the cost function, which is then minimized in the next step.
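A hedged sketch using scikit-learn's GradientBoostingClassifier (XGBoost and LightGBM expose similar interfaces with many more knobs); the dataset and hyperparameter values are placeholders.

```python
# Sketch assuming scikit-learn; dataset and hyperparameters are stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially, each one fitted to the errors of the current
# ensemble; learning_rate shrinks each tree's contribution to limit overfitting.
gbt = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
)
gbt.fit(X_train, y_train)
print(gbt.score(X_test, y_test))
```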
Pros
- Robust to missing data, highly correlated features and irrelevant features, in much the same way as random forest
- Naturally assigns feature importance scores
- In Kaggle competitions at least, XGBoost appears to be the method of choice, with slightly better performance than random forest
- Data does not need to be scaled
- Can learn non-linear hypothesis functions
Cons
- May be more prone to overfitting than random forest - the main purpose of the boosting approach is to reduce bias, not variance
- Has many hyperparameters to tune, so model development may not be as fast
- Feature importances may not be robust to variation in the training dataset