Machine Learning Models - My Cheat Sheet
Supervised Models
This is a brief review of the advantages and disadvantages of each model, based on the models suggested in Udacity’s Machine Learning Engineer Nanodegree.
Logistic Regression
Advantages
- Don’t have to worry about features being correlated
- You can easily update your model to take in new data (unlike Decision Trees or SVM)
Disadvantages
- Handles outliers poorly
- Needs a reasonably large number of examples for each class
- Performs poorly in the presence of multicollinearity
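A minimal sklearn sketch of training a logistic regression; the iris dataset and the hyperparameters here are just placeholders for illustration, not from the original notes:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)  # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000)  # L2-regularised by default
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))      # mean accuracy on held-out data
print(clf.predict_proba(X_test[:3]))  # per-class probabilities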
Decision Tree
Advantages
- Easy to understand and interpret (for some people)
- Easy to use: doesn’t need data normalisation, dummy variables, etc
- Can handle multi-output problems
- Easily handles feature interactions
- Don’t have to worry about outliers
Disadvantages
- Prone to overfitting
- Unstable: small changes in the data can lead to a completely different tree
- Can be biased if one class dominates the dataset
- Doesn’t support online learning: the tree has to be rebuilt when new data arrives
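A minimal sklearn sketch; limiting max_depth is one simple way to fight the overfitting mentioned above (the dataset and depth value are illustrative assumptions):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)  # placeholder dataset
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # shallow tree to reduce overfitting
tree.fit(X, y)
print(tree.feature_importances_)  # which features drive the splits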
Ensemble Methods
Advantages
- Harder to overfit
- Usually better performance than a single model
Disadvantages
- Scaling —> several models are trained, so training can be slow on larger datasets
- Hard to deploy on real-time platforms
- Complexity increases
- Boosting delivers poor probability estimates (https://arxiv.org/ftp/arxiv/papers/1207/1207.1403.pdf)
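A minimal sklearn sketch comparing a bagging-style and a boosting-style ensemble; the models, hyperparameters and dataset are illustrative assumptions:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)  # placeholder dataset
forest = RandomForestClassifier(n_estimators=100, random_state=0)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)
# cross-validated accuracy for each ensemble
print(cross_val_score(forest, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())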
K-nearest Neighbors
Advantages
- Little training time
- Works well with multiclass datasets
- Good for highly unusual data
Disadvantages
- Need to choose the value of k (the number of neighbors) and a distance metric
- Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data
- Accuracy can be severely degraded with high-dimensional data, because there is little difference between the nearest and the farthest neighbor
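A minimal sklearn sketch; n_neighbors and the metric below are the choices mentioned above, and their values here are only illustrative assumptions:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)  # placeholder dataset
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X, y)  # "training" is essentially just storing the data
print(knn.predict(X[:3]))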
Gaussian Naive Bayes
Advantages
- Needs less training data than models like logistic regression
- Highly scalable
- Not sensitive to irrelevant features
- Returns the degree of certainty of its predictions
- Good when you need something fast that performs well
Disadvantages
- Can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together)
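A minimal sklearn sketch; predict_proba shows the degree of certainty mentioned above (the dataset is an illustrative assumption):
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)  # placeholder dataset
gnb = GaussianNB()
gnb.fit(X, y)  # training is very fast
print(gnb.predict_proba(X[:3]))  # per-class probabilities (degree of certainty)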
SVM
Advantages
- High accuracy
- Nice theoretical guarantees regarding overfitting
- Especially popular in text classification problems
Disadvantages
- Memory-intensive
- Hard to interpret
- Complicated to run and tune
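A minimal sklearn sketch; the kernel, C and gamma values below are the usual knobs to tune and their values here are only illustrative assumptions:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)  # placeholder dataset
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X, y)
print(svm.score(X, y))  # training accuracy, for illustration only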
Stochastic Gradient Descent
Advantages
- Efficiency
- Easy to implement
Disadvantages
- A lot of hyperparameters to tune
- Sensitive to feature scaling
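A minimal sklearn sketch; since SGD is sensitive to feature scaling, the features are standardised first (the pipeline and hyperparameters are illustrative assumptions):
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)  # placeholder dataset
# StandardScaler handles the feature-scaling sensitivity mentioned above
sgd = make_pipeline(StandardScaler(), SGDClassifier(alpha=1e-4, max_iter=1000))
sgd.fit(X, y)
print(sgd.score(X, y))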
Unsupervised Models
KMeans
Advantages
- Good when you have an idea of an ideal number of clusters
- Scales well with the number of samples; scales moderately with the number of clusters
Disadvantages
- Doesn’t handle missing values very well
- Can’t find clusters that aren’t circular or spherical
Choosing the value of K
To choose the number of clusters k we can score a range of candidate values (the code below uses the silhouette score; the elbow method with the model’s inertia works similarly):
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
X = pd.DataFrame(...)
# k must stay below the number of samples for the silhouette score to be defined
possible_k_values = range(2, len(X), 5)
scores = []
for k in possible_k_values:
    model = KMeans(n_clusters=k).fit(X)
    predictions = model.predict(X)
    score = silhouette_score(X, predictions)
    scores.append((k, score))
Then choose the number of clusters by picking a k with a good (higher) silhouette score that is still practical for your problem.
Hierarchical Clustering
Advantages
- Resulting hierarchical representation can be very informative
- Provides an additional ability to visualize
- Especially potent when the dataset contains real hierarchical relationships (e.g. evolutionary biology)
Disadvantages
- Sensitive to noise and outliers
- Computationally intensive O(N^2)
Implementation in sklearn
import pandas as pd
from sklearn import cluster
X = pd.DataFrame(...)
cls = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = cls.fit_predict(X)  # AgglomerativeClustering has no separate predict method
Get a dendrogram from a hierarchical clustering
from scipy.cluster.hierarchy import dendrogram, ward
import matplotlib.pyplot as plt
import pandas as pd
X = pd.DataFrame(...)
linkage_matrix = ward(X)
dendrogram(linkage_matrix)
plt.show()
DBSCAN
Advantages:
- We don’t need to specify the number of clusters
- Flexibility in shapes and sizes of clusters
- Able to deal with noise and outliers
Disadvantages
- A border point that is reachable from two clusters is assigned to the cluster that finds it first
- Faces difficulty finding clusters of varying densities
Tips:
- Small min_samples and small epsilon result in many small clusters
- Small min_samples and large epsilon result in most points ending up in the same cluster
- Large min_samples results in most points being classified as noise, except in dense regions when epsilon is high
- Do not use the silhouette coefficient to evaluate this model!
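A minimal sklearn sketch; eps and min_samples are the two knobs discussed in the tips above, and the values and the generated data here are only illustrative assumptions:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # placeholder data
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)  # points labelled -1 are treated as noise
print(set(labels))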
Gaussian Mixture Model
Advantages
- Soft-clustering (you can see percentages of cluster participation on each sample)
- Cluster shape flexibility
Disadvantages
- Sensitive to initialization values
- Possible to converge to a local optimum
- Slow convergence rate
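A minimal sklearn sketch; predict_proba gives the soft cluster memberships mentioned above, and the dataset, number of components and n_init are illustrative assumptions:
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # placeholder data
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0)  # several initialisations soften the sensitivity to initial values
gmm.fit(X)
print(gmm.predict_proba(X[:3]))  # soft cluster memberships per sample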
General References
- Choosing a machine learning classifier
- Sklearn documentation on Neighbors
- Sklearn documentation on Stochastic Gradient Descent
- Sklearn documentation on Ensemble Methods
- Logistic Regression Wikipedia
- Logistic Regression for machine learning
- What are the advantages of logistic regression
- The disadvantages of Logistic Regression