Here we explore the pros and cons of some of the most popular
classical machine learning algorithms for supervised learning. To recap, this is a learning situation where we are given some labelled data and the model must predict the value or class of a new
datapoint using a hypothesis function that it has learned from studying the provided examples.
By ‘classical’ machine learning algorithms I mean anything that is not a neural network.
We will cover the advantages and disadvantages of various neural network architectures in a future post.
Also note that this post deals only with supervised
learning. There will be another dealing with clustering algorithms for unsupervised tasks.
Linear Regression
Note that there are two basic flavors of this: Lasso and Ridge regression, which regularize with the L1 and L2 norms respectively.
This is used for regression problems.
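To make this concrete, here is a minimal sketch assuming scikit-learn, with a synthetic dataset standing in for real data and arbitrary alpha values. It shows the L2 (Ridge) and L1 (Lasso) flavors, with scaling handled inside a pipeline, and prints the Lasso coefficients, whose zeros are what give it its feature-selection behaviour.

```python
# Minimal sketch assuming scikit-learn; the data and alpha values are placeholders.
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

# Synthetic regression problem: only 3 of the 10 features are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Regularized linear models care about feature scale, so scale inside a pipeline.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)  # L2 penalty
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)  # L1 penalty

# Lasso drives uninformative coefficients to zero, giving a rough
# feature-importance / feature-selection signal.
print(lasso[-1].coef_)
print(ridge[-1].coef_)
```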
Pros
- Small number of hyperparameters
- Easy to understand and explain
- Can be regularized to avoid overfitting, and the regularization is intuitive
- Lasso regression can provide feature importances
Cons
- Input data needs to be scaled, and there is a range of ways to do this
- May not work well when the hypothesis function is non-linear
- A complex hypothesis function is really difficult to fit. This can be done by using quadratic and higher-order features, but the number of these grows rapidly with the number of original features and may become very computationally expensive
- Prone to overfitting when a large number of features are present
- May not handle irrelevant features well
Logistic Regression
Again, this can be regularized with the L1 or L2 norm.
This is used for classification problems.
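As a rough illustration, here is a hedged scikit-learn sketch (the dataset and the value of C are placeholders) showing an L2-regularized logistic regression and the probabilistic output that feeds ROC or precision-recall analysis.

```python
# Sketch assuming scikit-learn; the dataset and C value are stand-ins.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# penalty selects the L1 or L2 norm; C is the inverse regularization strength.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
).fit(X_train, y_train)

# The probability estimates are what P-R and ROC curves are built from.
proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
```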
Pros
- Outputs have a probabilistic interpretation, which can be investigated with P-R curves, ROC curves etc.
- Has a small number of hyperparameters
- Overfitting can be addressed through regularization
- Using L1 regularization can provide feature importances, as with Lasso regression
Cons
- May overfit when provided with large numbers of features
- Can only learn linear hypothesis functions, so it is less suitable for complex relationships between features and target
- Input data might need scaling
- May not handle irrelevant features well, especially if the features are strongly correlated
Naive Bayes
This is used for classification problems.
An intuitive way of determining the probability that a sample falls into each class, based on a simplification of Bayes' Theorem in which we treat all the features as statistically independent. Specifically, we are attempting to determine the classification of some example given its feature values.
Bayes' Theorem states that this is proportional to the prior probability of each class multiplied by the product of the probabilities of observing each of the feature values given that class.
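In symbols: P(y | x) is proportional to P(y) multiplied by the product over features i of P(x_i | y). A small sketch, assuming scikit-learn's GaussianNB (which models each feature as a per-class Gaussian) and using the iris dataset as a stand-in, shows how little there is to set up.

```python
# Sketch assuming scikit-learn's GaussianNB; iris is just a convenient stand-in.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Fitting" just estimates per-class priors and per-class feature distributions,
# treating the features as independent given the class.
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))
```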
Pros
- No training involved - this is just based on the distribution of feature values. Thus it is easy to implement and fast
- Very little parameter tuning is required
- Works surprisingly well given its simplicity
- Features do not need scaling
Cons
- Assumes that the features are independent, which is rarely true
- Will usually not perform as well as classification algorithms that follow the ‘train-test’ approach
K Nearest Neighbors
This is an approach where the values of a datapoint’s nearest neighbors are used to either classify it or estimate its value
- For classification, the neighbors of the datapoint all vote for its class. The class with the most votes wins
- For regression, the value of the datapoint is determined as the average of its neighbors
This is also known as lazy learning, because there is no explicit training step - the training data is simply stored and consulted at prediction time
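A minimal sketch assuming scikit-learn, with the wine dataset as a placeholder. Scaling is included because the method is distance-based, and the fit step essentially just stores the training data - the real work happens at prediction time.

```python
# Sketch assuming scikit-learn; the dataset and n_neighbors are placeholders.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() mostly just stores the scaled training set ("lazy learning");
# distances are computed when predict()/score() are called.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```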
Pros
- No training involved
- The concept is intuitive and easy to understand, especially for classification
- Naturally handles multiclass classification and can learn complex decision boundaries
Cons
- The choice of distance metric is not obvious and is difficult to justify in many cases
- Performs poorly on high-dimensional datasets
- Expensive and slow at prediction time, because the distance from each new instance to every training point must be computed
- Data needs to be scaled before use
Decision Trees
Decision trees are rule-based learning methods that usually use either Gini impurity or entropy to split the dataset at a series of decisions, made at nodes. Each internal node represents a ‘test’ on an attribute and each test has a series of outcomes, which are represented by branches. At the very end of the branches are leaf nodes that represent a class or regression outcome from the tree. The tree is ‘learned’ by splitting the source set into subsets based on an attribute-value test. The process can then be repeated on each derived subset in a recursive fashion.
Decision trees are known as ‘weak learners’ and form the basis for ensemble methods like bagging and boosting.
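A brief sketch, assuming scikit-learn and an arbitrary dataset, showing a depth-limited tree (a simple stand-in for pruning) and the built-in feature importances.

```python
# Sketch assuming scikit-learn; dataset and max_depth are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gini impurity is the default split criterion; limiting depth curbs overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(tree.feature_importances_)  # importances come for free from the splits
```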
Pros
- Easy to interpret
- Can adapt to learn arbitrarily complex hypothesis functions
- Requires little data preparation - data typically does not need to be scaled
- Feature importance is built in, due to the way in which decision nodes are built
- Performs well on large datasets
Cons
- Prone to overfitting unless pruning is used
- Can be very non-robust, meaning that small changes in the training dataset can lead to quite major differences in the hypothesis function that gets learned
- Generally have worse performance than ensemble methods
Support Vector Machines (SVM)
SVMs come in a variety of different flavors and probably deserve their own blog post for a more detailed description. Ultimately their goal is to find optimal hyperplanes that best separate the dataset. Typically a hyperplane is chosen such that the ‘margin’ between the plane and the datapoints of different classes on either side is maximized.
- Hard margin SVMs: These are not robust to outliers (noise), meaning that all datapoints within each class must be on the correct side of the hyperplane
- Soft margin SVMs: A slack variable is introduced into the objective function so that the constraint can be satisfied even if it is not possible to fit a hyperplane to all datapoints. Typically these will have a regularization parameter C, whose size controls the extent of smoothing that occurs. A small C emphasizes the importance of the slack variables, while a large C diminishes it (see the sketch after this list)
- The kernel trick can be used to adapt SVMs so that they become capable of learning non-linear decision boundaries. A kernel is really just a measure of the similarity between a datapoint and all others represented in the dataset. The most commonly used kernel (the Radial Basis Function, or RBF) has a Gaussian shape
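Here is a minimal soft-margin, RBF-kernel sketch assuming scikit-learn's SVC; the dataset and the value of C are placeholders, and scaling is included because SVMs require it.

```python
# Sketch assuming scikit-learn's SVC; dataset and C are stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small C -> more slack / smoother decision boundary;
# large C -> the fit tracks the training data more closely.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```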
Pros
- Fairly robust against overfitting, especially in higher-dimensional space
- There are many kernels to choose from, and some may perform very well on the dataset in question
- The optimization problem is convex, so it is solvable with standard libraries and the solution is always unique
Cons
- Feature scaling is required
- There are many hyperparameters and their meanings are often not intuitive
- Do not scale well to large datasets (difficult to parallelize)
Bagging (random forest, for example)
The concept of bagging is to aggregate/combine the results of many weak learners. Each weak learner has low bias but high variance (it is not robust to variation in the training dataset), but when aggregated their variance is lowered. In classification, each tree votes for the class of the input datapoint. In regression, the value assigned to each input is the mean of the bagged learners.
In random forest, the weak learners are decision trees. Each tree is grown using a random subset of features and a random subset of the data. This decorrelates the trees by showing them different datasets. See the overview of random forest document for more information.
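As a sketch (scikit-learn assumed; the dataset and hyperparameter values are arbitrary), here is a random forest that uses the balanced class weight flag and exposes the built-in feature importances mentioned in the pros below.

```python
# Sketch assuming scikit-learn; dataset and hyperparameters are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,            # number of bagged trees
    max_features="sqrt",         # random feature subset considered at each split
    class_weight="balanced",     # the class-imbalance flag mentioned below
    random_state=0,
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
print(rf.feature_importances_)
```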
Pros
- Highly flexible and generally very accurate
- Naturally assigns feature importance scores, so can handle redundant feature columns
- Has the ability to address class imbalance by using the balanced class weight flag
- Scales to large datasets
- Generally robust to overfitting
- Data does not need to be scaled
- Can learn non-linear hypothesis functions
Cons
- Results may be difficult to interpret
- Feature importances may not be robust to variation in the training dataset
Boosting (gradient boosted trees, for example)
Boosting is an alternative ensemble method that relies on the idea of iteratively improving the performance of weak learners. In the case of gradient boosted decision trees, the trees are built sequentially and each tree is designed to correct the errors made by previous trees. At each step, the residual between the predicted and observed training data is added to the cost function, which is then minimized in the next step.
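A hedged sketch using scikit-learn's GradientBoostingClassifier (XGBoost and LightGBM expose similar interfaces with many more knobs); the dataset and hyperparameter values are placeholders.

```python
# Sketch assuming scikit-learn; dataset and hyperparameters are stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially, each one fitted to the errors of the current
# ensemble; learning_rate shrinks each tree's contribution to limit overfitting.
gbt = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
)
gbt.fit(X_train, y_train)
print(gbt.score(X_test, y_test))
```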
Pros
- Robust to missing data, highly correlated features and irrelevant features, in much the same way as random forest
- Naturally assigns feature importance scores
- In Kaggle competitions at least, XGBoost appears to be the method of choice, with slightly better performance than random forest
- Data does not need to be scaled
- Can learn non-linear hypothesis functions
Cons
- May be more prone to overfitting than random forest - the main purpose of the boosting approach is to reduce bias, not variance
- Has many hyperparameters to tune, so model development may not be as fast
- Feature importances may not be robust to variation in the training dataset