You get a decision tree! And YOU get a decision tree!
Oprah was so close to discovering random forests.
Our first method for modeling structured tabular data is to use ensembles of decision trees.
Decision trees: a decision tree asks a series of yes/no questions about the data. After each question, the data at that node splits into a "yes" group and a "no" group. After one or more questions, a prediction can be formed by finding the group (leaf) a row lands in at the bottom of the tree and returning the average value of the targets in that group.
To train a decision tree, we follow a greedy approach with six steps (a rough code sketch follows the list):
- Loop through each column of the data set.
- For each column, loop through each possible level of that column.
Level: for continuous and ordered (ordinal) categorical variables, the levels are the values the variable can take, which can be ordered. For example, sizes like "Small" < "Medium" < "Large". For other (unordered) categorical variables, the levels are simply the distinct values.
- Try splitting the data into two groups, based on whether each row's value is less than or equal to that level, or greater than it (or equal to versus not equal to, for unordered categorical variables).
- Take the average of the targets in each of those two groups as the group's prediction, and use your metric to see how close that prediction is to the actual target of each item in the group.
- After looping through all the possible columns and levels for each column, pick the split point that gave the best prediction.
- Now, we have two groups for our data set. Treat each of them as new data sets and repeat from step 1 until each group reaches your minimum size threshold.
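Here's a rough sketch of that greedy split search in code. It assumes a small pandas DataFrame `df` of ordered columns and a numpy array `y` of targets, and it uses the weighted within-group standard deviation as the metric; both the toy data and the choice of metric are just placeholders.

```python
import numpy as np
import pandas as pd

def split_score(lhs_y, rhs_y):
    """Weighted sum of within-group standard deviations (lower is better)."""
    def group_err(g):
        return g.std() * len(g) if len(g) > 0 else 0.0
    return group_err(lhs_y) + group_err(rhs_y)

def best_split(df, y):
    best_col, best_level, best_score = None, None, np.inf
    for col in df.columns:                          # step 1: loop over columns
        for level in df[col].unique():              # step 2: loop over levels
            mask = (df[col] <= level).values        # step 3: split into two groups
            score = split_score(y[mask], y[~mask])  # step 4: score the split
            if score < best_score:                  # step 5: keep the best split so far
                best_col, best_level, best_score = col, level, score
    return best_col, best_level, best_score

# toy example: only "age" cleanly separates the two clusters of targets
df = pd.DataFrame({"size": [3, 1, 5, 2, 6, 4], "age": [10, 12, 9, 40, 42, 38]})
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.9])
print(best_split(df, y))  # expect a split on "age" at level 12
```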
With decision trees, you have to be careful with how many leaf nodes you end up with. If you have too many (close to the number of data entries), then your model will overfit.
One year after his retirement, Leo Breiman published a paper on "bagging". Instead of training on the entire training set (or mini-batches), you
- randomly choose a subset of the rows of your data,
- train a model using this subset,
- save the model, and
- train more models on different subsets of the data.
Eventually, you end up with a number of models. To make a prediction, you predict using all of the models and take the average.
Each of the models has errors since it isn't trained on the full training set, but because different models make different errors (and these errors aren't correlated with each other, i.e., they're roughly independent), the errors tend to cancel out when we take the average.
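A minimal sketch of bagging along those steps, assuming scikit-learn is available; the synthetic `X` and `y` and the choice of 40 trees are placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=1000)

models = []
for _ in range(40):
    # randomly choose a subset of the rows (here, a bootstrap sample)
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx]))

# to make a prediction, predict with every model and take the average
bagged_preds = np.stack([m.predict(X) for m in models]).mean(axis=0)
```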
Seven years later, Breiman also coined "random forests", where you apply bagging to decision trees by not only randomly choosing a subset of the rows of your data, but also randomly choosing a subset of the columns to consider at each split in each decision tree.
Random forests: a specific type of ensemble of decision trees, where bagging is used to combine the results of several decision trees that were trained on random subsets of the rows of the data, with each split made on a random subset of the columns of the data.
Since the errors tend to cancel out, random forests are also less sensitive to hyperparameter changes. We can also have as many trees as we want; in fact, the error rate usually decreases as we add more trees.
Once we've trained our model, if the error rate on the validation set is higher than on the training set, we want to make sure it comes from generalization (or extrapolation) problems and not overfitting.
Out-of-bag error allows us to check if we're overfitting without the need for a validation set. Since each tree in a random forest is trained on a subset of the data, we can form a validation set for each tree from the rows that weren't included in training that tree.
What makes out-of-bag error different from validation set error is that the data in the former lies within the range of the training set, while the validation set is usually outside that range; this range matters most for time series data, since the validation set should contain data from later in time than the training set.
So, if our out-of-bag error is lower than the validation set error, the model isn't simply overfitting; it has some other problem instead.
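In practice, scikit-learn's random forest can report the out-of-bag error for you; a sketch, where `X` and `y` stand in for your training data:

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                           oob_score=True, n_jobs=-1).fit(X, y)

print(rf.oob_score_)            # R^2 measured only on the rows each tree did not see
preds_oob = rf.oob_prediction_  # per-row out-of-bag predictions, if you prefer your own metric
```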
In general, there are several things we want to be able to interpret about our model:
- how confident are we in our predictions for a particular row of data?
- for making our predictions on a specific row of data, what were the most important columns, and how did they influence the prediction?
- which columns are the most important; and which columns can we ignore (remove them from training)?
- which columns are effectively redundant in terms of prediction?
- how do predictions vary as we vary the columns (as in, what kind of relationship do the columns have with the predictions)?
When we want to predict for a particular row of data, we pass the data to each tree in our random forest and take the average of the results. To find the relative confidence of the prediction, we can take the standard deviation of the predictions instead of the average. So, if the standard deviation is high, we should be more wary of the prediction since the trees disagree more than if the standard deviation was low.
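A sketch of that confidence check, assuming a fitted scikit-learn forest `rf` and a validation feature matrix `X_valid` (both placeholders):

```python
import numpy as np

# one prediction per tree, for every validation row
tree_preds = np.stack([t.predict(X_valid) for t in rf.estimators_])

mean_pred = tree_preds.mean(axis=0)  # the usual random forest prediction
pred_std = tree_preds.std(axis=0)    # high std => the trees disagree => be more wary
```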
It's important to understand how our models are making predictions, not just how accurate the predictions are.
To find the importance of each column (feature) in our data, we can loop through each tree and recursively explore each branch. At each branch, look at which column was used for that split and how much the model improved because of it. That improvement, weighted by the number of rows in that group, is added to the importance score for that column, and the scores are summed across all branches of all trees. Then, you can normalize the scores (so that they sum to 1) and sort them in ascending order to see the least important columns, or in descending order to see the most important ones.
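scikit-learn computes essentially this quantity for a fitted forest as `feature_importances_`; a sketch, assuming `rf` was trained on the columns of a DataFrame `df`:

```python
import pandas as pd

fi = pd.DataFrame({"col": df.columns, "importance": rf.feature_importances_})
print(fi.sort_values("importance", ascending=False))  # most important columns first
```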
The "how" is mostly used in production (and not during model training) to see how the data leads to a prediction. To find how each column influenced the prediction, we take a single row of data and pass it through each of the decision trees in our random forest. At each split point, record how the prediction changes (increases or decreases) compared to the parent node and add that change to the column's score. Then, combine the scores across trees and you can see how much each column pushed the prediction up or down relative to the forest's starting point (the root prediction, which is the average of the targets over the training rows, averaged across the trees).
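One package that implements this per-row breakdown for scikit-learn forests is `treeinterpreter` (assuming it's installed); `rf` is the fitted forest and `row` is a one-row feature matrix:

```python
from treeinterpreter import treeinterpreter as ti

prediction, bias, contributions = ti.predict(rf, row)
# bias is the root prediction (the average target over the training set),
# contributions[0, i] is how much column i pushed this row's prediction up or down,
# and the prediction is the bias plus the contributions summed over columns
```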
Once you've found the importance of each column, you can set a threshold and ignore features whose importance scores fall below it (this is why we normalized the scores).
Try retraining your model with those columns ignored, and you can decide to keep the change (if the accuracy hasn't changed much) or adjust your threshold (if the accuracy decreased significantly). In any case, it's nicer to train your model with fewer, more important columns since you'll be able to train future models on the same data set faster.
To find redundant columns, you want to measure how similar each column is to the others. To do so, you calculate the rank correlation, where all the values in each column are replaced with their rank relative to the other values in the same column (think of it like a descending argsort, where each row in a column gets the index it would have if the column were sorted in descending order). Then, the correlation is calculated between columns (like the correlation coefficient $r$, but on ranks). Pairs of columns with a high rank correlation may be synonyms for each other, and one (or more) of them could be removed.
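A sketch of that check using scipy's Spearman rank correlation, assuming a DataFrame `df` with several numeric columns:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

corr, _ = spearmanr(df)  # rank-correlation matrix, column against column
corr_df = pd.DataFrame(np.round(corr, 2), index=df.columns, columns=df.columns)
print(corr_df)  # pairs close to 1 (or -1) are candidates for redundancy
```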
When removing redundant columns, retrain the model removing only one redundant column at a time. Then, try removing them in groups and eventually all together. The point of this tedious task is to make sure we're not significantly reducing the accuracy of the model. And some columns that seem redundant may not actually be, and would be important to keep in the model.
Although not necessary, you should remove unimportant and redundant columns when possible since it'll simplify your model.
To find the relationship between a column and the predictions, your first guess might be to keep every column constant except the column in question and vary only that one.
But we can't just take the average of the predictions for a specific level of a column, since the other variables also change across rows. Instead, we replace every single value in that column of the validation set with a specific level and record the predictions with the modified validation set as the input. Then, we do the same for every other level of that column.
With these predictions, we can form a line graph with the levels as the x-axis and the predictions as the y-axis. We call this graph a partial dependence plot.
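A manual sketch of that recipe; the fitted model `rf`, the validation DataFrame `valid_df`, and the column name `"year_made"` are all assumptions:

```python
import numpy as np

def partial_dependence(model, valid_df, col):
    levels = np.sort(valid_df[col].unique())
    means = []
    for level in levels:
        tmp = valid_df.copy()
        tmp[col] = level                         # hold the whole column at this level
        means.append(model.predict(tmp).mean())  # average prediction over all rows
    return levels, np.array(means)

levels, preds = partial_dependence(rf, valid_df, "year_made")
# plot `levels` on the x-axis against `preds` on the y-axis to get the partial dependence plot
```

scikit-learn also provides `sklearn.inspection.partial_dependence` if you'd rather not roll your own.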
Sometimes, you've trained your model and
- your accuracy is too good to be true,
- some features don't make sense as predictors, or
- the partial dependence plots look weird.
If so, your data might have data leakage, where the training set contains information that wouldn't be available in the data you get at inference (i.e., when using the model in practice and/or on your validation set).
Data leakage consists of subtleties that give away the correct answer. For example, if you trained a model to predict the weather and precipitation was an available column (and/or it was only filled out on rainy days), you can bet your model would predict "raining" on rainy days whenever there was any precipitation and "sunny" otherwise. So, when you interpret the model later, you might see suspiciously high accuracy, with precipitation as a top predictor.
To catch data leakage, train your model first and then look for leakage (and then clean or reprocess your data); this mirrors how you would train your model first before performing data cleaning.
With time series data, you usually want a model that can generalize to new data and extrapolate accurately. The downside of random forests is that they can only predict within the range of their training data. So, if the values in the validation set are outside the range of the training set, the accuracy of the random forest will always be low since it can't predict values that high.
Why might this be the case? A random forest returns a prediction based on the average of the predictions of its decision trees, where each tree predicts the average of the targets in the rows in a leaf node. So, a random forest can never predict a value that's outside of the range of the training set.
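A tiny demonstration of this limit on synthetic data (scikit-learn assumed): a forest trained only on x ≤ 30 tops out near the largest target it has seen, even though the true relationship keeps growing.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x = np.linspace(0, 40, 400).reshape(-1, 1)
y_lin = 2 * x[:, 0]                  # true relationship keeps growing with x

train = x[:, 0] <= 30
rf = RandomForestRegressor(n_estimators=40).fit(x[train], y_lin[train])

print(rf.predict(x[~train]).max())   # plateaus around 60 (= 2 * 30), never near the true 80
```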
In a general sense, a random forest can't generalize to out-of-domain data, so we need to make sure our validation, test, and future data sets contain the same kind of data as our training set.
To test if there's out-of-(the training set's)-domain data, we can build a random forest that predicts whether a row comes from the validation set or the training set. To do so, you can concatenate the validation and training sets and label each row by which set it came from. Then, look at the feature importance: if a particular column differs systematically between the two sets, it will stand out with a high importance score (the distribution of importance scores will be far from uniform).
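A sketch of that check, assuming DataFrames `train_df` and `valid_df` with the same feature columns (and no target column):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

combined = pd.concat([train_df, valid_df])
is_valid = np.concatenate([np.zeros(len(train_df)), np.ones(len(valid_df))])

clf = RandomForestClassifier(n_estimators=40, min_samples_leaf=15, n_jobs=-1)
clf.fit(combined, is_valid)

fi = pd.Series(clf.feature_importances_, index=combined.columns)
print(fi.sort_values(ascending=False).head())  # the columns that give away which set a row is from
```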
Sometimes, you can remove the columns this check flags as most important and actually improve the accuracy of the model, since the information in those columns might also be carried by other columns (so this amounts to removing redundant columns).
Removing those columns can also make your model more resilient over time, since those columns may be affected by domain shift, where the data fed into the model becomes significantly different from the training data.
Instead of random forests, which form an ensemble of decision trees through bagging, we can also build gradient boosted machines, which use boosting instead of bagging.
Bagging takes the average of the predictions from each decision tree. Boosting, on the other hand, adds the predictions of each decision tree. So, you also train your decision trees differently:
- train a decision tree that underfits the targets of your training set,
- calculate residuals by subtracting the predictions from the targets,
- repeat from the beginning, but train your future models with the residuals as the targets, and
- continue training more trees until you reach a certain maximum or your validation metric gets worse.
With boosting, we try to minimize the error by driving the residuals as close to zero as possible, one underfit tree at a time.
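A bare-bones sketch of those steps, with `X`, `y`, the tree depth, and the tree count as placeholders; note that real gradient boosting libraries also shrink each tree's contribution by a learning rate:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

trees, residual = [], y.astype(float).copy()
for _ in range(50):
    t = DecisionTreeRegressor(max_depth=3).fit(X, residual)  # a shallow tree that underfits
    trees.append(t)
    residual -= t.predict(X)                                 # what the ensemble still gets wrong

boosted_pred = np.sum([t.predict(X) for t in trees], axis=0)  # predictions are summed, not averaged
```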
Unlike with random forests, the trees aren't independent of each other, so the more trees we train, the more the overall model will overfit the training set.
In training a model for tabular data, you can get a boost in accuracy by training a random forest model, doing some analysis like feature importance and partial dependence plots to remove redundant columns, and then training a neural network that uses embeddings for the categorical variables/columns.
Then, we retrain our random forest model, but instead of using raw levels for the categorical variables, we use the embeddings trained by the neural network as the inputs. So, instead of using a neural network at inference, you can use an improved random forest model.
The same can be done for gradient boosted machines, and any model that uses categorical variables. Just use the embeddings trained by the neural network.
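A sketch of what that substitution might look like, assuming a hypothetical embedding matrix `emb` (shape `[n_levels, emb_dim]`, pulled out of the trained neural network) and an integer-coded categorical column `"store_id"` in a DataFrame `df`:

```python
import pandas as pd

# look up each row's embedding vector and turn it into emb_dim new columns
emb_cols = pd.DataFrame(emb[df["store_id"].values],
                        columns=[f"store_emb_{i}" for i in range(emb.shape[1])],
                        index=df.index)

df_emb = pd.concat([df.drop(columns=["store_id"]), emb_cols], axis=1)
# train the random forest (or gradient boosted machine) on df_emb instead of df
```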
We covered a machine learning technique called ensembles of decision trees. Here, we mentioned two methods of ensembling: bagging and boosting.
With bagging, you form a random forest that's quick and easy to train. Random forests are also resistant to hyperparameter changes, and since the trees are independent, it's very difficult to overfit as you increase the number of trees.
With boosting, you form a gradient boosted machine (or gradient boosted decision tree) that is just as fast to train as a random forest in theory, but requires more hyperparameter tuning and is susceptible to overfitting as you train more trees, since the trees aren't independent of each other. However, gradient boosted machines tend to have higher accuracy than random forests.
Overall, because of the limitations of decision trees, both random forests and gradient boosted machines can't extrapolate to out-of-domain data. Therefore, you sometimes have to turn to a neural network.
Neural networks take the longest to train and require more preprocessing, like normalization (which also needs to be applied at inference). With neural networks, you have to be careful with your hyperparameters since they can lead to overfitting. However, neural networks are great at extrapolating and can have the highest accuracy of the three models.
With neural networks, you can also use ensembles of decision trees to do some of the preprocessing to make them faster to train. And, once you train a neural network, you can use the embeddings trained by the neural networks as the inputs for the categorical variables in another ensemble of decision trees on the same data set. Doing so tends to produce much higher accuracy.
If the task doesn't require extrapolation (all future predictions are expected to be in the same range as the training set), then you can use the improved ensemble of decision trees, since it will be faster at inference compared to a neural network.
Moreover, if the response time at inference isn't a major concern, you can even form an ensemble of neural networks and an ensemble of decision trees, where you take the average of the predictions of each of the models. Following the theory behind random forests, since the two (or more) models were trained by two (or more) very different algorithms, the errors each makes are roughly independent of each other and will tend to cancel out, leading to higher accuracy with less chance of overfitting. Still, it won't make a bad model a good model.
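As a final sketch, averaging the two kinds of models is as simple as it sounds; the fitted forest `rf`, a function `nn_predict` wrapping the neural network, and `X_valid` are all assumptions:

```python
rf_preds = rf.predict(X_valid)             # ensemble of decision trees
nn_preds = nn_predict(X_valid)             # neural network (or an ensemble of them)
ensemble_pred = (rf_preds + nn_preds) / 2  # errors that disagree tend to cancel
```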