Training models on tabular data
Wow! Machine learning!
When we train a model on tabular data, we want to create a model that, given values in some columns, can predict the value in another column. In my collaborative filtering blog post, I gave the model a user's reviews of other movies as inputs and wanted a prediction of that user's review of another movie.
Tabular data have two types of variables: continuous variables (numerical data) and categorical variables (discrete data). In the collaborative filtering model, the users and movies were (high-cardinality) categorical variables. When we train a model, we want all our inputs to be continuous variables, so we need a way to turn the categorical variables into continuous ones.
So, you pass your categorical variables through embeddings. An embedding is equivalent to putting a linear layer after every one-hot-encoded input layer. To elaborate: each categorical value can be written as a one-hot-encoded vector, and multiplying that vector by a weight matrix just picks out one row of the matrix. An embedding layer skips the multiplication and grabs that row directly by indexing, while keeping track of the operation so that its derivative can be calculated later. When you pass your one-hot-encoded input layers through an embedding layer, you get continuous numbers, which you can pass through the other layers of your neural network.
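Here's a minimal sketch of that equivalence in PyTorch (the number of categories, the embedding size, and the index below are made up for illustration):

```python
import torch
import torch.nn.functional as F

num_categories, emb_size = 5, 3
emb = torch.nn.Embedding(num_categories, emb_size)

idx = torch.tensor([2])                        # a categorical value, as an index
one_hot = F.one_hot(idx, num_categories).float()

via_lookup = emb(idx)                          # what the embedding layer does
via_matmul = one_hot @ emb.weight              # one-hot vector times the same weights

print(torch.allclose(via_lookup, via_matmul))  # True: same numbers, the lookup is just faster
```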
When we train the model on these embeddings (the inputs), we can interpret the distances between the embeddings afterwards; since those distances were learned from patterns in the data, they also tend to match up with our intuition.
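For example, here's a rough sketch of how you might compare learned embeddings after training. The `movie_emb` layer and the movie names are made-up stand-ins for whatever your trained model actually contains:

```python
import torch
import torch.nn.functional as F

movie_emb = torch.nn.Embedding(3, 4)          # stand-in for a trained embedding layer
names = ["Toy Story", "Finding Nemo", "The Shining"]

vecs = movie_emb.weight.detach()
# cosine similarity of the first movie's embedding to every movie's embedding
sims = F.cosine_similarity(vecs[0:1].expand_as(vecs), vecs, dim=1)
for name, sim in zip(names, sims):
    print(f"{name}: {sim:.2f}")
```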
Since we can form continuous embeddings of our categorical variables, we can treat them like continuous variables when we train our models. So, we could perform probabilistic matrix factorization, or concatenate them with the actual continuous variables and pass the result through a neural network.
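Here's a rough sketch of that second option in PyTorch (the class name, layer sizes, and column counts are all made up): embed each categorical column, concatenate with the continuous columns, and pass everything through a couple of linear layers.

```python
import torch
import torch.nn as nn

class TabularNet(nn.Module):
    def __init__(self, cardinalities, emb_sizes, n_cont, n_out):
        super().__init__()
        # one embedding per categorical column
        self.embs = nn.ModuleList(
            [nn.Embedding(card, size) for card, size in zip(cardinalities, emb_sizes)]
        )
        self.layers = nn.Sequential(
            nn.Linear(sum(emb_sizes) + n_cont, 100), nn.ReLU(),
            nn.Linear(100, n_out),
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_categorical) integer indices; x_cont: (batch, n_cont) floats
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)]
        x = torch.cat(embedded + [x_cont], dim=1)
        return self.layers(x)

# e.g. two categorical columns (10 and 7 categories) and 3 continuous columns
model = TabularNet([10, 7], [4, 3], n_cont=3, n_out=1)
preds = model(torch.randint(0, 7, (8, 2)), torch.randn(8, 3))
```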
Below is how Google trains a model for recommendations on Google Play:
In modern machine learning, there are two main techniques that are widely applicable, each for specific kinds of data:
- Ensembles of decision trees (like random forests and gradient boosting machines) for structured data.
- Multilayered neural networks optimized with SGD (like shallow and/or deep learning) for unstructured data (like images, audio, and natural language).
Deep learning is almost always superior for unstructured data, and the two approaches tend to give similar results on structured data. But decision trees train much faster, are simpler to work with, and are easier to interpret (like seeing which columns were most important).
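For example, here's a rough sketch using scikit-learn; the DataFrame and its columns are made up, but it shows the kind of per-column feature importances a random forest gives you almost for free:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# made-up example data
df = pd.DataFrame({
    "sq_ft":     [1200, 850, 1500, 1100],
    "num_rooms": [3, 2, 4, 3],
    "price":     [450_000, 600_000, 380_000, 470_000],
})

X, y = df.drop(columns="price"), df["price"]
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# which columns mattered most to the model
for col, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{col}: {imp:.2f}")
```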
However, deep learning is a better choice than decision trees when
- there are some high-cardinality categorical variables that are very important (like zip codes); or
- there are columns containing data that would be best understood by a neural network, like plain text.
Still, you should try both to see which one works best. Usually, you'll start with decision trees as a baseline and try to achieve a higher accuracy with a deep learning model if either of those two conditions above applies.
So, in the next two blog posts, I'll be talking about decision trees and deep learning, respectively, for tabular data.