5 Comparing Classification Algorithms
Now that we have the basics (accessing data and comparing algorithms by time and space complexity), we can begin a small trek through a number of different classification algorithms. This section will discuss the algorithms for:
K-Nearest Neighbors classification
Naive Bayes classification
Decision Tree classification
By the end of this section, you should be able to compare and contrast these algorithms with respect to:
Training time
Prediction time
Storage considerations
Data types supported
Interpretability of results (e.g. parameters)
Extrapolation
5.0.1 Implementation notes
For the purposes of this class:
We will learn to implement K-nearest-neighbors “from scratch” using numerical data (where “distance between data points” is meaningful), but as you saw in the previous course, categorical data is also usable via one-hot encoding.
We will learn how to implement Naive Bayes with categorical data (but scikit-learn can use numerical as well; this just requires more calculus)
Decision trees can be easily used with a mix of categorical and numeric data.
5.1 What is classification? What is the intuitive difference between these three approaches?
Classification refers to the data task where your goal is to predict a category label, or class, for previously unseen data. For example, we might want to predict whether a new campsite might have a picnic table (Y) or not (N), in the absence of this information (say, if the campsite owner forgot to include it on their page). This is a binary classification problem; we will focus on implementing binary classification in this class.
The intuitive difference between the three approaches is as follows:
K-Nearest-Neighbors assumes that all future data we sample will come from a similar distribution. It matches each new data point to the \(k\) closest/most similar data points (by distance), then takes a majority vote among those neighbors for the most likely category.
Naive Bayes tries to estimate the probability that your data belongs to each category label, based on the values of each observation. It does this by pretending all of the columns are unrelated, obtaining a probability estimate from each column, and then multiplying them. It then picks the highest-probability label for each datapoint.
Decision trees also try to estimate the probability that your data belongs to each category label, but unlike Naive Bayes, they do not assume that each column is unrelated to the others. They try to identify the most informative sequence of questions to ask in order to determine the class label.
5.2 Reminder: Implementing these algorithms with scikit-learn.
To implement any one of these algorithms with scikit-learn and pandas, load the requisite parts.
Create a train-test split on your data, then use the .fit() method on the training data.
New data can be predicted using the .predict() method.
See documentation: K Nearest Neighbors, Naive Bayes, and Decision Trees.
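The workflow above can be sketched as follows. This is a minimal illustration using KNN on a synthetic dataset; the dataset, parameter choices, and variable names are illustrative, not from the course materials.

```python
# Sketch of the scikit-learn workflow: load parts, split, fit, predict.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real dataset (illustrative only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Create a train-test split, then fit on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict labels for new (held-out) data.
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))  # mean accuracy on the test set
```

Swapping `KNeighborsClassifier` for `CategoricalNB` or `DecisionTreeClassifier` leaves the rest of the workflow unchanged, which is the main convenience of the scikit-learn interface.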
5.3 Implementing these algorithms “from scratch” (and computational complexity)
5.3.1 Implementing KNN
See Class Notes
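Since the class notes are referenced but not reproduced here, the following is a minimal from-scratch sketch (not necessarily the class-notes version): numeric features, Euclidean distance, and a majority vote among the \(k\) closest training points. The toy data is made up for illustration.

```python
# Minimal from-scratch KNN: Euclidean distance + majority vote.
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    # Distance from x_new to every training point: O(n * d) per query,
    # which is why KNN "prediction time" grows with the training set.
    dists = [(math.dist(x, x_new), label)
             for x, label in zip(X_train, y_train)]
    # Keep the labels of the k nearest neighbors.
    neighbors = [label for _, label in sorted(dists)[:k]]
    # Majority vote among the neighbors' labels.
    return Counter(neighbors).most_common(1)[0][0]

# Toy data: two clusters with labels "N" and "Y" (illustrative).
X_train = [(1, 1), (1, 2), (8, 8), (9, 8)]
y_train = ["N", "N", "Y", "Y"]
print(knn_predict(X_train, y_train, (8, 9), k=3))  # → Y
```

Note that all of the "training" work is simply storing the data; the distance computation happens at prediction time.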
5.3.2 Implementing Naive Bayes
See Class Notes
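As above, a minimal from-scratch sketch for categorical data (not necessarily the class-notes version): counts are turned into the prior \(P(C)\) and the per-column conditionals \(P(x_i \mid C)\), which are multiplied as if the columns were independent. Laplace smoothing is omitted for brevity, so unseen feature values zero out a class.

```python
# Minimal from-scratch categorical Naive Bayes.
from collections import Counter, defaultdict

def nb_fit(X, y):
    class_counts = Counter(y)
    # feature_counts[class][column][value] = count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            feature_counts[label][j][value] += 1
    return class_counts, feature_counts

def nb_predict(class_counts, feature_counts, row):
    n = sum(class_counts.values())
    scores = {}
    for c, count in class_counts.items():
        score = count / n  # prior P(C)
        for j, value in enumerate(row):
            # P(x_j = value | C), pretending the columns are independent
            score *= feature_counts[c][j][value] / count
        scores[c] = score
    return max(scores, key=scores.get)

# Toy categorical data (illustrative).
X = [("sunny", "warm"), ("sunny", "cold"), ("rainy", "cold"), ("rainy", "cold")]
y = ["Y", "Y", "N", "N"]
model = nb_fit(X, y)
print(nb_predict(*model, ("sunny", "warm")))  # → Y
```

Training here is just counting, which is why Naive Bayes has very fast training time and small storage requirements.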
5.3.3 Implementing Decision Trees
See Class Notes
5.4 Model Selection Considerations
5.4.1 Performance Metrics
| Classification Result / Truth | Negative | Positive |
|---|---|---|
| Predicted Negative | True Negative (TN) | False Negative (FN) |
| Predicted Positive | False Positive (FP) | True Positive (TP) |
Accuracy: Percent of correctly-classified data points: (TP + TN)/(all data)
Precision (by class): Out of everything the model predicts as class A, what percent are truly class A? (TP)/(FP+TP)
Recall (by class): Out of everything that is truly class A, what percent is correctly classified as class A? (TP)/(FN+TP)
F1-Score: Harmonic mean of precision and recall, given by: \[\frac{2}{1/P+1/R} \] . This simplifies to: \[2\cdot \frac{P\cdot R}{P+R} = \frac{2TP}{2TP+FP+FN}\]
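The four metrics can be computed directly from confusion-matrix counts. The counts below are made up for illustration; the final line checks that the two F1 forms agree.

```python
# Metrics from confusion-matrix counts (made-up illustrative counts).
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct / all data
precision = tp / (tp + fp)                   # of predicted positives, how many are right
recall = tp / (tp + fn)                      # of true positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
# The simplified F1 form gives the same number:
assert abs(f1 - 2 * tp / (2 * tp + fp + fn)) < 1e-12
```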
5.4.2 Probability-based performance metrics (AUC)
If a classification model produces a probability that an object belongs to a particular class, then there is an additional metric called AUC (Area Under the Curve) that can be used to evaluate model performance. It corresponds to the area beneath a ROC (Receiver Operating Characteristic) curve, which will be discussed below. An AUC of 0.5 corresponds to a model that predicts completely randomly, and an AUC of 1 corresponds to a model that ranks every truly positive example above every truly negative one (a perfect classifier).
If probabilities range from 0 to 1, let \(F(p)\) be the false positive rate and \(T(p)\) be the true positive rate if you were to use probability \(p\) as the threshold for classification. Plotting all of the points \((F(p), T(p))\) produces a curve with both \(x\) and \(y\) ranging from \(0\) to \(1\). As you lower the threshold \(p\), more points are classified as positive, so the true positive rate increases, but so does the false positive rate. A successful classifier increases the true positive rate faster than the false positive rate, leading to a curve that lies above the line \(y=x\).
5.4.2.1 How can each of these classification models produce a probability instead of a classification decision?
KNN: Instead of taking a majority vote, compute the percentage of neighbors that are in the target class.
Decision tree: Once you reach a leaf node, instead of taking a majority vote, report the percentage of rows in that split that are in the target class.
Naive Bayes: Once you compute the Naive Bayes numerator for each class, divide by the sum of these values to normalize \(P(C_i | X)\)
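The Naive Bayes normalization step is simple division by the total. The numerator values below are made up for illustration.

```python
# Normalizing Naive Bayes numerators into class probabilities
# (made-up numerator values, illustrative only).
numerators = {"Y": 0.012, "N": 0.003}
total = sum(numerators.values())
probs = {c: v / total for c, v in numerators.items()}
print(probs)  # probabilities now sum to 1, with "Y" the more likely class
```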
Using scikit-learn, this is done via model_name.predict_proba(X_test) (note that the method takes the feature matrix, not the labels).
To produce a ROC curve, use sklearn.metrics.roc_curve(y_true, y_score) where y_true is the true classes, and y_score is the probability (or similar) predicted by sklearn’s predict_proba() function.
Examples of ROC curves produced from SKLearn models are in the Class Code directory.
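As a self-contained sketch (separate from the Class Code examples), the following produces an ROC curve and AUC for a probability-producing model on a synthetic dataset; the dataset and parameter choices are illustrative.

```python
# ROC curve and AUC from a probability-producing model.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
# Probability of the positive class for each test row.
y_score = model.predict_proba(X_test)[:, 1]

# fpr/tpr pairs for every threshold, i.e. the (F(p), T(p)) points above.
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)
print(auc)  # should be well above 0.5 for this separable synthetic data
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) gives the ROC curve itself.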
5.4.3 Interpretability
Classification models can be more or less (human)-interpretable.
Interrogating a model for interpretability means determining why that model made the prediction it did for a given point:
A KNN’s predictions are entirely determined by what the neighbors are for a given datapoint. Obtaining a list of the neighbors and their relative distance to your point can help explain why it has made the predictions it did for that point.
For a naive Bayes model, consider comparing each of the \(P(x_i|C)\) calculations that are used in the Naive Bayes numerator. If any of these values are particularly low for an individual class, this tells you that your datapoint is highly unusual for that class, typically ruling it out.
For a decision tree, you can see its entire decision process; it is maximally interpretable.
5.4.4 Model complexity
The bias/variance tradeoff tells us that as models become more complicated (capable of expressing more different/less smooth functions), they tend to overfit.
In order from “most smooth/simplest decision rules” to “least smooth/most complex decision rules”, we have:
Least complex: Naive Bayes
Moderate complexity: (depth-limited) decision trees
High complexity: KNN.
This tells us that KNN is the most prone to overfitting (especially if we do not carefully calibrate \(k\)), whereas Naive Bayes is the most prone to underfitting.
5.4.5 Model assumptions
5.4.6 Feature Selection
This class does not heavily emphasize feature selection. However, in practice you often want to select features (columns) that are meaningful to use for your predictions. Some things to consider are:
To measure whether a feature is useful, consider fitting your model with and without that feature (on a subset of your training set), then validating with prediction on the remainder of that training set (the validation set). Ideally, you’d use cross-validation (doing this several times for different train/validation splits) and compare multiple different validation metrics (not just accuracy).
Naive Bayes assumes that model features are uncorrelated. What this means is that including too many features that are heavily correlated with each other leads to worse performance. Removing correlated columns can improve the performance of Naive Bayes for this reason.
For KNN, irrelevant or noisy (inaccurate) features can distort the distance calculations. Scaling is also essential.
Beware of unintentional data leakage through choice of features. If a feature would require you to know the result of your prediction in order to produce that column, do not use it for prediction! (Example: Using the time on an Olympics ski run to help predict whether someone will be in the Olympics in a given year; if this column is not NA, the person made it to the Olympics that year…).
Relation to bias-variance tradeoff: Increasing the number of features you train your model on increases the possible complexity of your model, thereby increasing its possible variability. This means your model will be more prone to overfitting (and likewise, removing features will make it more prone to underfitting).
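Two of the considerations above (scaling for KNN, and comparing cross-validated performance with and without a feature) can be sketched together. The dataset, the exaggerated column scale, and the choice to drop column 0 are all illustrative.

```python
# Cross-validated comparison: scaled vs. unscaled KNN, and dropping a column.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
# Exaggerate one feature's scale so it dominates unscaled distances.
X[:, 0] *= 1000

knn = KNeighborsClassifier(n_neighbors=5)

# Mean 5-fold cross-validated accuracy, without and with scaling.
unscaled = cross_val_score(knn, X, y, cv=5).mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), knn), X, y, cv=5).mean()

# The with/without-a-feature comparison: drop column 0 and re-validate.
without_col0 = cross_val_score(
    make_pipeline(StandardScaler(), knn), np.delete(X, 0, axis=1), y, cv=5).mean()

print(unscaled, scaled, without_col0)
```

Using a `Pipeline` here ensures the scaler is re-fit on each training fold, avoiding leakage from the validation fold into the scaling statistics.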
It is always worth:
Removing features that are basically identical for all data points
Using domain knowledge or suggestions of a domain expert
There are also dimensionality reduction techniques that rely on linear algebra, and can help you find “combined” features of several different columns that summarize them effectively.