It’s well known that many machine learning models can’t process categorical features natively. While there are some exceptions, it’s usually up to the practitioner to decide on a numeric representation of each categorical feature. There are many ways to accomplish this, but one strategy that is seldom recommended is label encoding.
Label encoding replaces each categorical value with an arbitrary number. For instance, if we have a feature containing letters of the alphabet, label encoding might assign the letter “A” a value of 0, the letter “B” a value of 1, and continue this pattern until “Z”, which is assigned 25. After this process, technically speaking, any algorithm should be able to handle the encoded feature.
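As a minimal sketch of the alphabet example, here is how scikit-learn's LabelEncoder might be applied. The variable names (letters, codes, mapping) are chosen for illustration only and are not part of any dataset used later in this article.

```python
import string

from sklearn.preprocessing import LabelEncoder

# Illustrative feature: the 26 uppercase letters (an assumption for this sketch).
letters = list(string.ascii_uppercase)

encoder = LabelEncoder()
codes = encoder.fit_transform(letters)

# LabelEncoder sorts the unique values before numbering them,
# so "A" maps to 0, "B" maps to 1, and "Z" maps to 25.
mapping = dict(zip(letters, codes.tolist()))

print(mapping["A"], mapping["B"], mapping["Z"])
```

The integers carry no meaning beyond the sort order of the labels, which is why the assignment is described as arbitrary.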
But what’s the problem with this? Shouldn’t sophisticated machine learning models be able to handle this type of encoding? Why do libraries like CatBoost and other encoding strategies exist to deal with high-cardinality categorical features?
This article will explore two examples demonstrating why label encoding can be problematic for machine learning models. These examples will help us appreciate why there are so many alternatives to label encoding, and they will deepen our understanding of the relationship between data complexity and model performance.
One of the best ways to gain intuition for a machine learning concept is to understand how it works in a low-dimensional space and try to extrapolate the result to higher dimensions. This mental extrapolation doesn’t always align with reality, but for our purposes, all we need is a single feature to see why we need better categorical encoding strategies.
A Feature With 25 Categories
Let’s start by looking at a basic toy dataset with one feature and a continuous target. Here are the dependencies we need:
import numpy as np
import polars as pl
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split