The #Cats Model
Feature Engineering is a crucial component of the machine learning (ML) model development pipeline.
At Compado we devote a significant amount of our time to processing and engineering features in order to build strong predictive models.
Often our datasets are of a mixed nature: they contain various types of features, such as numerical, categorical, text and DateTime, that (in general) need to be transformed, or encoded, into a numerical format for our ML algorithms to understand.
To convert text data into a numerical format, there are several encoding techniques available. Definitely take a look at how we embed keywords into vectors here…
👉 Learn how Compado personalizes brand recommendation and understands intent 👈
…Similarly, various methods can be used to transform categorical features into numerical format.
In this blog post we will talk about some popular encoding techniques used by ML folks, such as Ordinal Encoding, One-Hot Encoding, CatBoost Encoding – and we will reveal which encoding method of categorical variables makes the most sense for us in Compado.
Keep in mind that every ML task is unique and there is no “one fits all” encoding technique…
…It is our job as data scientists and ML engineers to thoroughly analyze the business case and come up with the best logical solution for a given problem.
Categorical features are often represented by strings and can be classified into two types: ordinal and nominal.
Ordinal features have a finite set of discrete values that are ranked in a specific order, such as age group or education.
For example, “Doctorate degree” > “Master’s degree” > “Bachelor’s degree” > “High School graduate” > … > “Daycare graduate” 😛
On the other hand, nominal features have a finite set of discrete values with no inherent ordering or relationship between them, such as country or language.
Since ordinal categorical features have a known relationship between their levels, in some ML models (such as tree-based models which we use at Compado) we can apply Ordinal Encoding (or Label Encoding) to represent them.
This means that we can replace education level strings with numbers that respect the same ordering…
…For example, we can replace “Doctorate” with 5, “Master” with 4, “Bachelor” with 3, “High School Graduate” with 2, and so on. No-brainer, right?
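As a quick illustration, here is a minimal sketch of such an explicit ordering with pandas and scikit-learn; the column name and the toy ordering simply mirror the education example above and are not our production setup:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: an ordinal feature with a known ranking between its levels
df = pd.DataFrame({
    "education": ["Bachelor", "Doctorate", "High School Graduate", "Master"]
})

# The order is supplied explicitly, from lowest to highest level
education_order = ["High School Graduate", "Bachelor", "Master", "Doctorate"]

encoder = OrdinalEncoder(categories=[education_order])
df["education_encoded"] = encoder.fit_transform(df[["education"]]).ravel()

print(df)
# "High School Graduate" -> 0, "Bachelor" -> 1, "Master" -> 2, "Doctorate" -> 3
# (for tree-based models only the order matters, not the exact values)
```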
For nominal variables, on the other hand, there is no inherent relationship between the levels or categories, and we cannot in general use ordinal encoders, even though it is sometimes very tempting.
I am talking about cases when in place of names (strings), categories are represented by ids (integers).
For example, rather than “domain_name” we often use “domain_id”, which is much more practical (just imagine how long some website names are).
However, we do not use these ids for ordering as they do not mean anything useful: Websites with similar ids are not per se related to each other. In those cases we could, for example, apply One-Hot Encoding.
One-Hot Encoding (or Dummy Encoding) is a technique where each category of a categorical feature is turned into a new binary feature that takes the value 0 or 1.
For example, suppose that we have a categorical feature “country” with three possible categories: “Germany”, “Spain”, and “France”.
In this case, we can use One-Hot Encoding to transform “country” into three new binary features: “country_germany”, “country_spain,” and “country_france”.
Each of these new features will take the value 1 if the original feature had that particular category. Otherwise, it will take the value 0, see the table below…
| country | country_germany | country_spain | country_france |
| --- | --- | --- | --- |
| Spain | 0 | 1 | 0 |
| France | 0 | 0 | 1 |
| Germany | 1 | 0 | 0 |
| France | 0 | 0 | 1 |
| … | … | … | … |
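A minimal sketch of this transformation with pandas (the column and category names come from the toy example above):

```python
import pandas as pd

df = pd.DataFrame({"country": ["Spain", "France", "Germany", "France"]})

# One binary column per unique category (lowercased to match the table above)
one_hot = pd.get_dummies(df["country"].str.lower(), prefix="country", dtype=int)

df = pd.concat([df, one_hot], axis=1)
print(df)
#   country  country_france  country_germany  country_spain
# 0   Spain               0                0              1
# 1  France               1                0              0
# 2  Germany              0                1              0
# 3  France               1                0              0
```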
…One-Hot Encoding is a standard and a very popular encoding method.
It preserves all the information in the original categorical feature while transforming it into a numerical format that can be used for training ML models.
It is a simple yet effective technique that handles categorical features well, as long as the number of unique categories is small.
The problem is that often it is not small…
…When there are too many unique categories (too many levels), a categorical feature is said to have high cardinality.
In such a case One-Hot Encoding can become computationally expensive and not feasible.
The reason is that One-Hot Encoding creates a new binary feature for each unique category value, resulting in a high number of features (N new features for N unique categories) that can quickly become impractical for training ML models.
As the number of unique categories increases, the number of new binary features increases with it, and the amount of data we need to distinguish between these features (for accurate predictions) grows exponentially.
This is the so-called “curse of dimensionality” that can result in overfitting, which means that the model does not generalize well to new, unseen data…
…ML models at Compado tend to have a lot of high-cardinality features. When feeding One-Hot Encoded features to Boosted Trees, we usually decrease the number of dimensions by keeping only the most frequent categories and aggregating the less frequent ones into an “Other” category. This works pretty well.
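A minimal sketch of that dimensionality cap (the feature name and the cut-off of 10 are purely illustrative, not our actual configuration):

```python
import pandas as pd

def keep_top_categories(series: pd.Series, top_n: int = 10, other_label: str = "Other") -> pd.Series:
    """Keep the top_n most frequent categories and map all others to a single 'Other' bucket."""
    top_categories = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top_categories), other_label)

# Example: cap a high-cardinality column before One-Hot Encoding it
# df["domain_capped"] = keep_top_categories(df["domain_id"].astype(str), top_n=10)
# one_hot = pd.get_dummies(df["domain_capped"], prefix="domain")
```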
However, as we are always challenging our setup and looking for improvements, we wanted to find a smarter encoding technique which would substantially reduce dimensionality of our datasets and result in better predictions.
There are several alternative ways of encoding that can do just that, but it turns out that they are not applicable for our ML models.
What I want to show you now is how feature engineering crucially depends on the data at hand and the business problem that we are trying to solve…
…Many ML practitioners turn to one of the most promising alternative techniques, which belongs to the group of target-based categorical encoders: CatBoost Encoding.
It is the built-in categorical feature encoding method of CatBoost, a popular gradient boosting library.
This encoding method is, among others, useful for handling categorical features in binary classification problems, because it encodes each category as a numerical value that is optimized for the target variable.
However, it is not really suitable for our case of a multilabel classification problem with high target cardinality since we have a lot of target classes representing brands.
Suppose there are 100 brands as target classes. Target-based encoders like this one expect a single numeric target, so for a multiclass problem they are applied per class, producing one encoded column per target class. Even the single “country” feature from the example above would then turn into 100 new columns, and every additional categorical feature would add 100 more. This would result in strongly increased dimensionality (instead of a decreased one), high memory consumption and longer training time.
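To make the column count concrete, here is a hedged sketch of the usual one-vs-rest workaround with the category_encoders library; the data, the brand labels and the column naming are made up for illustration:

```python
import pandas as pd
import category_encoders as ce

# Toy setup: one categorical feature, a multiclass target with many brands
X = pd.DataFrame({"country": ["Spain", "France", "Germany", "France", "Spain"]})
y = pd.Series(["brand_a", "brand_b", "brand_a", "brand_c", "brand_b"])

encoded_columns = []
for brand in y.unique():                              # with ~100 brands this loop runs ~100 times
    binary_target = (y == brand).astype(int)          # one-vs-rest target for this brand
    encoder = ce.CatBoostEncoder(cols=["country"])
    encoded = encoder.fit_transform(X, binary_target)  # one encoded "country" column per brand
    encoded_columns.append(encoded["country"].rename(f"country_te_{brand}"))

# One categorical feature -> one new column per target class
X_encoded = pd.concat(encoded_columns, axis=1)
print(X_encoded.shape)  # (5, 3) here; (n_rows, ~100) with 100 brand classes
```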
Hence, we were exploring better methods to manage the number of features in a practical manner.
We decided to try using domain knowledge to come up with a logic for ordering the categories.
This might be riskier and more time-consuming to do, but once we have found a good logic, we end up with one variable per categorical feature (a huge improvement in dimension reduction), which we can then also neatly scale to new categorical features.
With the help of our business analysts (BAs) we have done a thorough analysis of various nominal categorical features for different industries.
In particular, we have asked the BAs to place categories in order of similarity in terms of brand ranking.
This means that we wanted to have similar categories to be ordered close to each other…
…As we use a tree-based boosting algorithm, it does not matter whether exact quantities represent the categories; only the category order is important to us.
It turned out that, among all the KPIs considered, compensation was the one that most closely aligned with the BAs’ category ranking.
As a result, in the new model we are converting nominal variables to ordinal ones using our historical business knowledge about compensations.
This means that instead of One-Hot Encoding we apply Ordinal Encoding to all our categorical features, ordering unique categories according to compensations they produced in the given industry during a predefined time period.
We do not order categories manually, of course: we have set up a fully automated process to rank them, which scales well no matter how many features are added.
This way of encoding allows us to significantly reduce dimensionality of our data and even include new crucial high-cardinality features which was not possible before…
…Suppose that “Germany” is the country that has recently given us the largest amount of compensation, followed by “France” and “Spain”; then our example from above will look like the table below.
Instead of a huge amount of new binary columns being added for the model to understand, we have a single column with numerical values (the order can also be reversed).
Just think about it, a single column per feature instead of hundreds…
| country | country (encoded) |
| --- | --- |
| Spain | 3 |
| France | 2 |
| Germany | 1 |
| France | 2 |
| … | … |
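A minimal sketch of how such an automated, KPI-driven ordering could be built with pandas; the column names compensation and country, and the toy numbers, are assumptions for illustration rather than our actual schema:

```python
import pandas as pd

def rank_categories_by_kpi(df: pd.DataFrame, category_col: str, kpi_col: str) -> dict:
    """Rank categories by the total KPI (e.g. compensation) they produced, highest first."""
    totals = df.groupby(category_col)[kpi_col].sum().sort_values(ascending=False)
    # Rank 1 for the top category, 2 for the next, and so on
    return {category: rank for rank, category in enumerate(totals.index, start=1)}

# Toy historical data for one industry and one time period
history = pd.DataFrame({
    "country": ["Germany", "Germany", "France", "Spain", "France"],
    "compensation": [120.0, 80.0, 90.0, 40.0, 30.0],
})

ranking = rank_categories_by_kpi(history, "country", "compensation")
# {"Germany": 1, "France": 2, "Spain": 3}

# Apply the ordinal mapping to the training data: one numeric column per feature
train = pd.DataFrame({"country": ["Spain", "France", "Germany", "France"]})
train["country"] = train["country"].map(ranking)
print(train)  # 3, 2, 1, 2 — matching the table above
```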
…The new model has already won the A/B tests for large industries (compared to the baseline) in terms of compensation uplift: it learns category similarities better than our previous ML models.
That is why we called it the Cats model (derived from the word “categories”).
Instead of using the popular CatBoost Encoding technique for our multiclass models, we have released our own Cats model…
…As per the Vietnamese zodiac, this year is known as the Year of the Cat, and it looks like our Cats model will contribute to our success in many more industries. 😉