CatBoost
Handling categorical variables is a tedious process, especially when there are many of them. When a categorical variable has too many labels (i.e. it is highly cardinal), one-hot encoding adds a new column for every label, which drastically increases the dimensionality and makes the dataset difficult to work with.
CatBoost handles categorical variables automatically and does not require the extensive data preprocessing that other machine learning algorithms do.
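To see why high cardinality is a problem, here is a minimal sketch of the one-hot blow-up. The column names and the number of labels are made up for illustration:

```python
import pandas as pd

# Hypothetical dataset: one high-cardinality categorical column ("city")
# and one numerical column ("price").
df = pd.DataFrame({
    "city": [f"city_{i}" for i in range(1000)],   # 1,000 distinct labels
    "price": range(1000),
})

encoded = pd.get_dummies(df, columns=["city"])
print(df.shape)       # (1000, 2)
print(encoded.shape)  # (1000, 1001) -- one new column per label
```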
Code:
The CatBoost algorithm deals with categorical variables natively, so you do not need to one-hot encode them. Just load the files, impute missing values, and you're good to go.
Sample code for a regression problem:
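The sketch below follows that recipe: load the files, impute missing values, and fit a regressor while passing the categorical column indices. The file names, the "target" column, and the imputation constants are placeholders for your own data:

```python
import pandas as pd
from catboost import CatBoostRegressor

# Load the files (placeholder file names).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")   # assumed to contain the same feature columns

# Impute missing values (a simple constant fill, shown for brevity):
# categorical columns get a "missing" label, numerical columns a sentinel.
for df in (train, test):
    for col in df.columns:
        if df[col].dtype == "object":
            df[col] = df[col].fillna("missing")
        else:
            df[col] = df[col].fillna(-999)

X = train.drop("target", axis=1)
y = train["target"]

# Indices of the categorical columns -- no one-hot encoding needed.
cat_features = [i for i, c in enumerate(X.columns) if X[c].dtype == "object"]

model = CatBoostRegressor(iterations=500, learning_rate=0.05, depth=6,
                          loss_function="RMSE", random_seed=42, verbose=100)
model.fit(X, y, cat_features=cat_features)
predictions = model.predict(test)
```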
Parameters
loss_function:
Defines the metric to be used for training.
iterations:
The maximum number of trees that can be built.
The final number of trees may be less than or equal to this number.
learning_rate:
Defines the learning rate.
Used for reducing the gradient step.
border_count:
It specifies the number of splits for numerical features.
It is similar to the max_bin parameter in LightGBM and XGBoost.
depth:
Defines the depth of the trees.
random_seed:
This parameter is similar to the ‘random_state’ parameter we have seen previously.
It is an integer value that defines the random seed for training.
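The snippet below shows how the parameters above are passed to the model constructor. The values are purely illustrative, not recommendations; tune them for your own dataset:

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    loss_function="RMSE",   # metric optimized during training
    iterations=1000,        # upper bound on the number of trees built
    learning_rate=0.03,     # shrinks the gradient step
    border_count=254,       # number of splits (bins) for numerical features
    depth=6,                # depth of each tree
    random_seed=42,         # makes training reproducible
)
```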
What is CatBoost?
CatBoost is a recently open-sourced machine learning algorithm from Yandex. It integrates easily with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML. It can work with diverse data types to help solve a wide range of problems that businesses face today. To top it off, it provides best-in-class accuracy.
It is especially powerful in two ways:
It yields state-of-the-art results without the extensive data training typically required by other machine learning methods, and
It provides powerful out-of-the-box support for the more descriptive data formats that accompany many business problems.
The name “CatBoost” comes from two words: “Category” and “Boosting”.
As discussed, the library works well with multiple categories of data, such as audio, text, and images, as well as historical data.
“Boost” comes from gradient boosting, the machine learning algorithm the library is built on. Gradient boosting is a powerful technique that is widely applied to many kinds of business problems, such as fraud detection, recommendation, and forecasting, and it performs well on them. It can also return very good results with relatively little data, unlike deep learning models that need to learn from massive amounts of data.
Advantages of CatBoost Library
Performance: CatBoost provides state-of-the-art results and is competitive with any leading machine learning algorithm on the performance front.
Handling categorical features automatically: We can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and on combinations of categorical and numerical features (see the sketch after this list).
Robust: It reduces the need for extensive hyper-parameter tuning and lowers the chance of overfitting, which leads to more generalized models. That said, CatBoost still has several parameters to tune, including the number of trees, learning rate, regularization, tree depth, fold size, and bagging temperature.
Easy to use: You can use CatBoost from the command line or through a user-friendly API for both Python and R.
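As a quick illustration of the automatic categorical handling mentioned above, here is a minimal sketch using the Python API’s Pool object. The toy data and column layout are made up; the point is that string-valued columns are passed by index, with no manual encoding:

```python
from catboost import CatBoostClassifier, Pool

# Toy data: columns 0 and 1 are categorical (season, city), column 2 is numeric.
train_data = [["summer", "London", 30],
              ["winter", "Moscow", -5],
              ["summer", "Paris", 25],
              ["winter", "London", 2]]
train_labels = [1, 0, 1, 0]

# CatBoost encodes the categorical columns internally.
train_pool = Pool(train_data, label=train_labels, cat_features=[0, 1])

model = CatBoostClassifier(iterations=10, depth=2, learning_rate=0.1, verbose=False)
model.fit(train_pool)

test_pool = Pool([["summer", "Paris", 28]], cat_features=[0, 1])
print(model.predict(test_pool))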
CatBoost – Comparison to other boosting libraries
There are multiple boosting libraries, such as XGBoost, H2O, and LightGBM, and all of them perform well on a variety of problems. The CatBoost developers have compared its performance with these competitors on standard ML datasets:
The comparison shows the log-loss values on test data, which are lowest for CatBoost in most cases. This indicates that CatBoost generally performs better, for both tuned and default models.
In addition, CatBoost does not require the dataset to be converted to any special internal format, unlike XGBoost and LightGBM.