LCE Presentation

As shown in “Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?” [8], the widely used tree-based models remain the state-of-the-art machine learning methods in many cases. Local Cascade Ensemble (LCE) [7] combines the strengths of the two top performing tree-based ensemble methods - Random Forest [3] and eXtreme Gradient Boosting (XGBoost) [4] - and integrates a supplementary diversification approach that makes it a better generalizing predictor.

Overview

The construction of an ensemble method involves combining accurate and diverse individual predictors. There are two complementary ways to generate diverse predictors: (i) by changing the training data distribution and (ii) by learning different parts of the training data.

LCE adopts these two diversification approaches. First, (i) LCE combines the two well-known methods that modify the distribution of the original training data with complementary effects on the bias-variance trade-off: bagging [2] (variance reduction) and boosting [11] (bias reduction). Then, (ii) LCE learns different parts of the training data to capture new relationships that cannot be discovered globally based on a divide-and-conquer strategy (a decision tree). Before detailing how LCE combines these methods, we introduce the key concepts behind them that will be used in the explanation of LCE.

Concepts

The bias-variance trade-off defines the capacity of a learning algorithm to generalize beyond the training set. The bias is the component of the prediction error that results from systematic errors of the learning algorithm. A high bias means that the learning algorithm is not able to capture the underlying structure of the training set (underfitting). The variance measures the sensitivity of the learning algorithm to changes in the training set. A high variance means that the algorithm learns the training set too closely (overfitting). The objective is to minimize both bias and variance. Bagging has a main effect on variance reduction: it generates multiple versions of a predictor, each trained on a bootstrap replicate of the training set, and aggregates them into a single predictor. The current state-of-the-art method that employs bagging is Random Forest [3]. Boosting, in contrast, has a main effect on bias reduction: it iteratively learns weak predictors and adds them to form a final strong one. After a weak learner is added, the data weights are readjusted so that future weak learners focus more on the examples that previous weak learners mispredicted. The current state-of-the-art method that uses boosting is XGBoost [4]. The following Figure illustrates the difference between bagging and boosting.

[Figure: Bagging vs. Boosting]
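
To make the contrast concrete, the short sketch below (assuming scikit-learn is installed) compares a bagging ensemble and a boosting ensemble of decision trees on the same dataset; it only illustrates the two mechanisms and is not part of LCE.

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: deep trees trained independently on bootstrap replicates,
# then aggregated (main effect: variance reduction)
bagging = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: shallow trees added sequentially, each focusing on the
# errors of the previous ones (main effect: bias reduction)
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=1, random_state=0)

for name, model in [("Bagging (Random Forest)", bagging),
                    ("Boosting (Gradient Boosting)", boosting)]:
    print("{}: {:.3f}".format(name, cross_val_score(model, X, y, cv=5).mean()))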

LCE

LCE adopts a boosting-bagging approach to handle the bias-variance trade-off faced by machine learning models; in addition, it follows a divide-and-conquer approach to individualize predictor errors on different parts of the training data. LCE is represented in the following Figure.

[Figure: LCE]

Specifically, LCE is based on cascade generalization: it uses a set of predictors sequentially and adds new attributes to the input dataset at each stage. The new attributes are derived from the output of a predictor (e.g., class probabilities for a classifier), called a base learner. LCE applies cascade generalization locally, following a divide-and-conquer strategy (a decision tree), and reduces bias across the decision tree through the use of boosting-based predictors as base learners. The current best performing state-of-the-art boosting algorithm, XGBoost, is adopted as base learner by default (e.g., XGB¹⁰, XGB¹¹ in the Figure above); CatBoost [10] and LightGBM [9] can also be chosen as base learners. When growing the tree, boosting is propagated down the tree by adding the output of the base learner at each decision node as new attributes to the dataset (e.g., XGB¹⁰(D¹) in the Figure above). These prediction outputs indicate the ability of the base learner to correctly predict a sample. At the next tree level, the outputs added to the dataset are exploited by the base learner as a weighting scheme to focus more on previously mispredicted samples. The overfitting generated by the boosted decision tree is then mitigated by bagging: bagging provides variance reduction by creating multiple predictors from random sampling with replacement of the original dataset (e.g., D¹, D² in the Figure above). Finally, trees are aggregated with a simple majority vote. In order to be applied as a predictor, LCE stores, in each node, the model generated by the base learner.
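
The following is a minimal, illustrative sketch of this local cascade idea, assuming xgboost and scikit-learn are installed. It is not the lcensemble implementation: bagging, the missing-data handling and the per-node hyperparameter search are omitted, and the stopping rules and parameter values are arbitrary.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier


def fit_local_cascade(X, y, depth=0, max_depth=2, min_samples=10):
    """Recursively fit an XGBoost base learner at each node of a decision tree."""
    classes, y_local = np.unique(y, return_inverse=True)
    node = {"classes": classes}
    if len(classes) == 1:
        return node  # pure node: nothing left to boost or split

    # Boosting base learner at the current node (bias reduction)
    booster = XGBClassifier(n_estimators=10, max_depth=3)
    booster.fit(X, y_local)
    node["booster"] = booster

    if depth >= max_depth or len(y) < min_samples:
        return node

    # Cascade step: append the base learner's class probabilities as new attributes
    X_aug = np.hstack([X, booster.predict_proba(X)])

    # Divide-and-conquer: split the augmented data with a one-level decision tree
    stump = DecisionTreeClassifier(max_depth=1).fit(X_aug, y_local)
    feature, threshold = stump.tree_.feature[0], stump.tree_.threshold[0]
    if feature < 0:
        return node  # no useful split was found

    # The appended outputs travel down with the data, so the next level's
    # base learner can focus on previously mispredicted samples
    left = X_aug[:, feature] <= threshold
    node["split"] = (feature, threshold)
    node["left"] = fit_local_cascade(X_aug[left], y[left], depth + 1, max_depth)
    node["right"] = fit_local_cascade(X_aug[~left], y[~left], depth + 1, max_depth)
    return node


data = load_iris()
tree = fit_local_cascade(data.data, data.target)

In the actual library, such a tree is grown on each bootstrap replicate of the training set and the resulting trees are aggregated by majority vote, as described above.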

Missing Data

LCE natively handles missing data. Similar to XGBoost, LCE excludes missing values for the split and uses block propagation. During a node split, block propagation sends all samples with missing data to the side of the decision node with fewer errors.
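
As a rough illustration of this idea (an assumption-level sketch, not the internal LCE or XGBoost code), the snippet below routes the whole block of samples with a missing value to the child of a split that misclassifies them the least, using each child's majority class as a stand-in for its model.

import numpy as np


def route_missing_block(x, y, threshold):
    """Send all samples with a missing value to the side with fewer errors."""
    missing = np.isnan(x)
    left = ~missing & (x <= threshold)
    right = ~missing & (x > threshold)

    # Each child predicts its majority class (a stand-in for the node model)
    pred_left = np.bincount(y[left]).argmax()
    pred_right = np.bincount(y[right]).argmax()

    # Errors incurred by routing the whole missing block to either side
    errors_left = np.sum(y[missing] != pred_left)
    errors_right = np.sum(y[missing] != pred_right)
    return "left" if errors_left <= errors_right else "right"


x = np.array([0.2, 0.7, np.nan, 0.9, np.nan, 0.1])
y = np.array([0, 1, 1, 1, 1, 0])
print(route_missing_block(x, y, threshold=0.5))  # the missing block goes right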

Hyperparameters

The hyperparameters of LCE are the classical ones in tree-based learning (e.g., max_depth, max_features, n_estimators). Moreover, LCE learns a specific XGBoost model at each node of a tree, and it only requires the ranges of the XGBoost hyperparameters to be specified. The hyperparameters of each XGBoost model are then automatically set by Hyperopt [1], a sequential model-based optimization method using the tree of Parzen estimators (TPE) algorithm. Hyperopt chooses the next hyperparameter configuration to evaluate based on the outcomes of the previous trials. The tree of Parzen estimators meets or exceeds the performance of grid search and random search for hyperparameter setting. The full list of LCE hyperparameters is available in its API documentation.
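
For intuition about this mechanism, here is a brief, hedged sketch (assuming hyperopt, xgboost and scikit-learn are installed) in which the tree of Parzen estimators searches illustrative XGBoost hyperparameter ranges. The ranges and trial count below are arbitrary examples, not LCE's defaults, and LCE performs this search internally at each node.

import numpy as np
from hyperopt import fmin, hp, tpe
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

# Illustrative XGBoost hyperparameter ranges
space = {
    "n_estimators": hp.quniform("n_estimators", 10, 100, 10),
    "max_depth": hp.quniform("max_depth", 1, 6, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
}


def objective(params):
    # TPE minimizes the objective, so return the negated cross-validation accuracy
    clf = XGBClassifier(n_estimators=int(params["n_estimators"]),
                        max_depth=int(params["max_depth"]),
                        learning_rate=params["learning_rate"])
    return -cross_val_score(clf, X, y, cv=3).mean()


# Each new configuration is proposed from the outcomes of the previous trials
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
print(best)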

Published Results

LCE was initially designed for a specific application in [6] and then evaluated on the public UCI datasets [5] in [7]. Results show that LCE obtains, on average, better prediction performance than state-of-the-art classifiers, including Random Forest and XGBoost. For a comparison between LCE, Random Forest and XGBoost on different public datasets, using the public implementations of the aforementioned algorithms, please refer to the article published in Towards Data Science, “LCE: The Most Powerful Machine Learning Method?”.

References

Installation

You can install LCE from PyPI with pip:

pip install lcensemble

Or conda:

conda install -c conda-forge lcensemble

Code Examples

The following examples illustrate the use of LCE on public datasets for a classification and a regression task. They also demonstrate the compatibility of LCE with scikit-learn pipelines and model selection tools through the use of cross_val_score. An example of LCE on a dataset including missing values is also shown.

Classification

  • Example 1: LCE on Iris Dataset

from lce import LCEClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


# Load data and generate a train/test split
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# Train LCEClassifier with default parameters
clf = LCEClassifier(n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)

# Make prediction and compute accuracy score
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.1f}%".format(accuracy*100))
Accuracy: 97.4%

  • Example 2: LCE with scikit-learn cross validation score

This example demonstrates the compatibility of LCE with scikit-learn pipelines and model selection tools through the use of cross_val_score.

from lce import LCEClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split

# Load data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# Set LCEClassifier with default parameters
clf = LCEClassifier(n_jobs=-1, random_state=0)

# Compute cross-validation scores
cv_scores = cross_val_score(clf, X_train, y_train, cv=3)
cv_scores = [round(elem*100, 1) for elem in cv_scores.tolist()]
print("Cross-validation scores on train set: ", cv_scores)
Cross-validation scores on train set:  [94.7, 100.0, 94.6]

Regression

  • Example 3: LCE on Diabetes Dataset

from lce import LCERegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


# Load data and generate a train/test split
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# Train LCERegressor with default parameters
reg = LCERegressor(n_jobs=-1, random_state=0)
reg.fit(X_train, y_train)

# Make prediction
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("The mean squared error (MSE) on test set: {:.0f}".format(mse))
The mean squared error (MSE) on test set: 3761

  • Example 4: LCE with missing values

This example illustrates the robustness of LCE to missing values. The Diabetes train set is modified with 20% of missing values per variable.

import numpy as np
from lce import LCERegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


# Load data and generate a train/test split
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# Introduce 20% of missing values per variable in the train set
np.random.seed(0)
m = 0.2
for j in range(0, X_train.shape[1]):
    sub = np.random.choice(X_train.shape[0], int(X_train.shape[0]*m))
    X_train[sub, j] = np.nan

# Train LCERegressor with default parameters
reg = LCERegressor(n_jobs=-1, random_state=0)
reg.fit(X_train, y_train)

# Make prediction
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("The mean squared error (MSE) on test set: {:.0f}".format(mse))
The mean squared error (MSE) on test set: 3895
