Machine Learning Models - My Cheat Sheet

Supervised Models

This is a short review of the advantages and disadvantages of each model, based on the models suggested in Udacity’s Machine Learning Engineer Nanodegree.

Logistic Regression


Pros:
  • Don’t have to worry much about features being correlated when predicting (unlike in Naive Bayes)
  • You can easily update your model to take in new data (unlike Decision Trees or SVM)


Cons:
  • Handles outliers poorly
  • Needs plenty of examples for each class
  • Multicollinearity makes the coefficients unstable and harder to interpret

Decision Tree


Pros:
  • Easy to understand and interpret (for some people)
  • Easy to use: doesn’t need data normalisation, dummy variables, etc.
  • Can handle multi-output models
  • Easily handles feature interactions
  • Robust to outliers


Cons:
  • Overfits easily
  • Unstable: small changes in the data can lead to a completely different tree
  • If one class dominates, the tree can easily become biased towards it
  • Doesn’t support online learning: you have to rebuild the tree when new data comes in
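
A quick way to see the overfitting point is to compare an unrestricted tree with a depth-limited one. This is only a sketch on synthetic data: the dataset parameters and `max_depth=4` are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y), purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree keeps splitting until it memorises the training set
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting depth is one simple way to regularise the tree
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

for name, tree in [("unrestricted", deep), ("max_depth=4", shallow)]:
    print(name,
          "train:", round(tree.score(X_train, y_train), 3),
          "test:", round(tree.score(X_test, y_test), 3))
```

A large gap between the train and test scores of the unrestricted tree is the usual symptom of overfitting.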

Ensemble Methods


Pros:
  • Harder to overfit
  • Usually better performance than a single model


K-nearest Neighbors


Pros:
  • Little training time
  • Works well with multiclass datasets
  • Can work well on highly unusual data, since it makes no assumptions about the data distribution


Cons:
  • Need to choose the value of k (the number of neighbors) and a distance metric
  • Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data
  • The accuracy of KNN can be severely degraded with high-dimensional data, because there is little difference between the distances to the nearest and farthest neighbors
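
The high-dimension problem can be checked numerically. The sketch below uses only NumPy and assumes uniformly random points, which is a simplification: it measures the ratio between the nearest and the farthest distance from a query point, and a ratio close to 1 means the "nearest" neighbor is barely nearer than anything else.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=1000):
    """Ratio of nearest to farthest distance from one query point to uniform random points."""
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"{d:>4} dimensions: nearest/farthest = {r:.2f}")
```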

Gaussian Naive Bayes


Pros:
  • Needs less training data than models like logistic regression
  • Highly scalable
  • Not sensitive to irrelevant features
  • Returns the degree of certainty of the answer
  • Good when you need something fast that performs well


Cons:
  • Can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).
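
The interaction limitation shows up clearly on XOR-style data, where the label depends only on the combination of two features. In this sketch (synthetic data, scikit-learn's `GaussianNB`, with a decision tree as an arbitrary point of comparison) each feature on its own says nothing about the class, so Naive Bayes should land near chance level.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# XOR-style data: the label depends only on the interaction (sign of the product)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

nb_acc = nb.score(X_test, y_test)
tree_acc = tree.score(X_test, y_test)
print(f"Gaussian Naive Bayes: {nb_acc:.2f}  decision tree: {tree_acc:.2f}")

# Naive Bayes also reports how confident it is in each prediction
print(nb.predict_proba(X_test[:3]).round(2))
```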



Support Vector Machines

Pros:
  • High accuracy
  • Nice theoretical guarantees regarding overfitting
  • Especially popular in text classification problems


Cons:
  • Memory-intensive
  • Hard to interpret
  • Complicated to run and tune

Stochastic Gradient Descent


Pros:
  • Efficient
  • Easy to implement


Cons:
  • A lot of hyperparameters to tune
  • Sensitive to feature scaling
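
The scaling sensitivity is easy to reproduce. The sketch below trains scikit-learn's `SGDClassifier` twice on the same synthetic data, once raw and once behind a `StandardScaler`; multiplying one column by 1000 is an artificial way to put the features on very different scales.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0, random_state=0)
# Blow up one feature so the columns live on very different scales
X[:, 0] *= 1000
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same model, with and without standardising the features first
raw = SGDClassifier(random_state=0).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), SGDClassifier(random_state=0)).fit(X_train, y_train)

raw_acc = raw.score(X_test, y_test)
scaled_acc = scaled.score(X_test, y_test)
print(f"without scaling: {raw_acc:.3f}  with StandardScaler: {scaled_acc:.3f}")
```

Putting the scaler inside a pipeline means it is fitted on the training split only, which avoids leaking test statistics into training.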

Unsupervised Models



K-Means

Pros:
  • Good when you already have an idea of the ideal number of clusters
  • Scales well to large numbers of samples, and moderately well with the number of clusters


Cons:
  • Doesn’t handle missing values very well
  • Can’t find clusters that aren’t roughly circular or spherical
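
To illustrate the shape limitation, and assuming the list above describes K-Means (the model used in the snippet further down), this sketch runs scikit-learn's `KMeans` on round blobs and on two interleaved half-moons. The adjusted Rand index compares the found clusters with the true groups, where 1.0 is a perfect match; the datasets are synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Round, well-separated blobs: the shape K-Means assumes
X_blobs, y_blobs = make_blobs(n_samples=500, centers=[(-5, -5), (5, 5)],
                              cluster_std=1.0, random_state=0)
# Two interleaved half-moons: clearly separated, but not spherical
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

def kmeans_ari(X, y_true):
    """Adjusted Rand index between K-Means labels and the true groups (1.0 = perfect)."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(y_true, labels)

ari_blobs = kmeans_ari(X_blobs, y_blobs)
ari_moons = kmeans_ari(X_moons, y_moons)
print(f"blobs ARI: {ari_blobs:.2f}  moons ARI: {ari_moons:.2f}")
```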

Choosing the value of K

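To choose the number of clusters k, we can compute a score for each candidate value and compare them, in the spirit of the elbow method. The snippet below uses the silhouette score: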

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = pd.DataFrame(...)

# silhouette_score needs between 2 and len(X) - 1 clusters
possible_k_values = range(2, len(X), 5)

scores = []
for k in possible_k_values:
    model = KMeans(n_clusters=k).fit(X)
    predictions = model.predict(X)
    # Mean silhouette coefficient: ranges from -1 to 1, higher is better
    score = silhouette_score(X, predictions)
    scores.append((k, score))

Then pick the number of clusters. The silhouette score ranges from -1 to 1 and higher is better, so choose a k with a high score that is still practical for your problem.
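
The selection step can be sketched as follows. The toy data (four well-separated blobs from `make_blobs`) and the candidate range 2 to 7 are assumptions made so the example runs on its own; with real data you would reuse the `scores` list built above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: four well-separated blobs
X, _ = make_blobs(n_samples=400,
                  centers=[(-10, -10), (-10, 10), (10, -10), (10, 10)],
                  cluster_std=1.0, random_state=0)

scores = []
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append((k, silhouette_score(X, labels)))

# Keep the (k, score) pair with the highest silhouette score
best_k, best_score = max(scores, key=lambda pair: pair[1])
print(f"best k = {best_k} (silhouette = {best_score:.2f})")
```

If several values of k score almost the same, the smaller one is often the more practical choice.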

Hierarchical Clustering


Pros:
  • The resulting hierarchical representation can be very informative
  • Provides an additional way to visualize the data (the dendrogram)
  • Especially potent when the dataset contains real hierarchical relationships (e.g. evolutionary biology)


Cons:
  • Sensitive to noise and outliers
  • Computationally intensive: at least O(N^2) in time and memory

Implementation on Sklearn

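import pandas as pd
from sklearn import cluster

X = pd.DataFrame(...)

cls = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
# AgglomerativeClustering has no predict(); fit_predict() fits and returns the labels
labels = cls.fit_predict(X)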

Get a dendrogram from a hierarchical clustering

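import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward

X = pd.DataFrame(...)
linkage_matrix = ward(X)  # linkage matrix from Ward's minimum-variance criterion

dendrogram(linkage_matrix)  # draw the merge tree
plt.show()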




DBSCAN

Pros:
  • We don’t need to specify the number of clusters
  • Flexibility in the shapes and sizes of clusters
  • Able to deal with noise and outliers


Cons:
  • A border point that is reachable from two clusters is assigned to whichever cluster finds it first
  • Has difficulty finding clusters of varying densities


Tips:
  • Small min samples and small epsilon result in many small clusters
  • Small min samples and large epsilon result in most points ending up in the same cluster
  • Large min samples results in most points being classified as noise, except in dense regions when epsilon is high
  • Do not use the silhouette coefficient to evaluate this model! It favours convex clusters, which density-based clusters often are not
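
The epsilon and min samples notes above read like the parameters of DBSCAN. Assuming scikit-learn's `DBSCAN` (where they are called `eps` and `min_samples`), this sketch counts clusters and noise points for a few settings on synthetic half-moon data; the specific values are arbitrary and only meant to show the trends.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one parameter setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # label -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# (eps, min_samples): small/small, large/small, small/large, and a middle setting
settings = [(0.05, 2), (5.0, 2), (0.05, 50), (0.2, 5)]
results = {s: summarize(*s) for s in settings}
for (eps, min_samples), (n_clusters, n_noise) in results.items():
    print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters, {n_noise} noise points")
```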

Gaussian Mixture Model


Pros:
  • Soft-clustering (you can see each sample’s degree of membership in every cluster)
  • Cluster shape flexibility


Cons:
  • Sensitive to initialization values
  • Possible to converge to a local optimum
  • Slow convergence rate
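
Soft clustering is easiest to see in code. This sketch assumes scikit-learn's `GaussianMixture` and synthetic overlapping blobs: `predict_proba` returns one membership probability per cluster for every sample, and `n_init=5` is one hedge against the initialization sensitivity listed above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: two overlapping blobs
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (3, 3)],
                  cluster_std=1.5, random_state=0)

# n_init > 1 restarts EM from several initialisations and keeps the best fit
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

hard_labels = gmm.predict(X)          # one cluster per sample
memberships = gmm.predict_proba(X)    # one probability per cluster, per sample

print(memberships[:5].round(3))
print("converged:", gmm.converged_, "after", gmm.n_iter_, "EM iterations")
```

Samples that sit between the two blobs get probabilities closer to 0.5 for each cluster, which a hard assignment would hide.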

General References