lce.LCERegressor

class lce.LCERegressor(n_estimators=10, bootstrap=True, criterion='squared_error', splitter='best', max_depth=2, max_features=None, max_samples=1.0, min_samples_leaf=1, metric='neg_mean_squared_error', n_iter=10, base_learner='xgboost', base_n_estimators=(10, 50, 100), base_max_depth=(3, 6, 9), base_num_leaves=(20, 50, 100, 500), base_learning_rate=(0.01, 0.1, 0.3, 0.5), base_booster=('gbtree',), base_boosting_type=('gbdt',), base_gamma=(0, 1, 10), base_min_child_weight=(1, 5, 15, 100), base_subsample=(1.0,), base_subsample_for_bin=(200000,), base_colsample_bytree=(1.0,), base_colsample_bylevel=(1.0,), base_colsample_bynode=(1.0,), base_reg_alpha=(0,), base_reg_lambda=(0.1, 1.0, 5.0), n_jobs=None, random_state=None, verbose=0)

A Local Cascade Ensemble (LCE) regressor. LCERegressor is compatible with scikit-learn: it passes scikit-learn's check_estimator checks, so it can be used with scikit-learn pipelines and model selection tools.

Parameters:
n_estimators: int, default=10

The number of trees in the ensemble.

bootstrap: bool, default=True

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

criterion: {“squared_error”, “friedman_mse”, “absolute_error”, “poisson”}, default=“squared_error”

The function to measure the quality of a split. Supported criteria are “squared_error” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node, “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, “absolute_error” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node, and “poisson” which uses reduction in Poisson deviance to find splits.
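
To make the “squared_error” and “absolute_error” criteria concrete, the sketch below computes a node's impurity under each one and the weighted impurity decrease of a candidate split. It uses only the standard library; the helper names are hypothetical and this is an illustration of the definitions above, not LCE's implementation.

```python
from statistics import mean, median


def squared_error_impurity(y):
    """Mean squared error around the node mean (the L2-optimal constant)."""
    m = mean(y)
    return sum((v - m) ** 2 for v in y) / len(y)


def absolute_error_impurity(y):
    """Mean absolute error around the node median (the L1-optimal constant)."""
    m = median(y)
    return sum(abs(v - m) for v in y) / len(y)


def impurity_decrease(y, left, right, impurity):
    """Weighted impurity reduction obtained by splitting y into left and right."""
    n = len(y)
    return (
        impurity(y)
        - (len(left) / n) * impurity(left)
        - (len(right) / n) * impurity(right)
    )


# Toy node: the last target is an outlier that a good split isolates.
y = [1.0, 2.0, 3.0, 10.0]
left, right = [1.0, 2.0, 3.0], [10.0]

gain_l2 = impurity_decrease(y, left, right, squared_error_impurity)
gain_l1 = impurity_decrease(y, left, right, absolute_error_impurity)
```

With squared error, the impurity of a node is the variance of its targets, which is why this criterion amounts to variance reduction.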

splitter: {“best”, “random”}, default=“best”

The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

max_depth: int, default=2

The maximum depth of a tree.

max_features: int, float or {“auto”, “sqrt”, “log2”}, default=None

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.

  • If “auto”, then max_features=sqrt(n_features).

  • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires inspecting more than max_features features.
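
The rules above can be sketched as a small resolver that maps a max_features value to a feature count. The function name is hypothetical, and two details are assumptions not stated in this reference: the square root and log2 results are truncated to an integer, and at least one feature is always kept.

```python
import math


def resolve_max_features(max_features, n_features):
    """Return the number of features considered at each split."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        # Assumption: truncate to an integer, never below 1.
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        # Assumption: truncate to an integer, never below 1.
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # Fraction of the features, rounded as described above.
        return max(1, round(max_features * n_features))
    raise ValueError(f"Unsupported max_features: {max_features!r}")
```

For example, with 16 features both “sqrt” and “log2” resolve to 4 features per split, while None keeps all 16.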

max_samples: int or float, default=1.0

The number of samples to draw from X to train each base estimator (with replacement by default, see bootstrap for more details).

  • If int, then draw max_samples samples.

  • If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0.0, 1.0].
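
As a rough illustration of how bootstrap and max_samples interact, the sketch below draws the row indices used to train one tree. The function name and the seed argument are hypothetical, rounding the fractional sample count is an assumption, and the standard library's random module stands in for whatever sampler LCE uses internally.

```python
import random


def draw_sample_indices(n_samples, max_samples=1.0, bootstrap=True, seed=None):
    """Return the row indices used to build one tree."""
    if not bootstrap:
        # Without bootstrapping, the whole dataset is used.
        return list(range(n_samples))

    if isinstance(max_samples, float):
        if not 0.0 < max_samples <= 1.0:
            raise ValueError("A float max_samples must be in (0.0, 1.0].")
        # Assumption: the fractional count is rounded to the nearest integer.
        n_draw = max(1, round(max_samples * n_samples))
    else:
        n_draw = max_samples

    rng = random.Random(seed)
    # Sampling with replacement: an index may appear more than once.
    return [rng.randrange(n_samples) for _ in range(n_draw)]
```

Calling draw_sample_indices(100, max_samples=0.5, seed=0) returns 50 indices, possibly with repeats, whereas bootstrap=False always returns every index exactly once.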

min_samples_leaf: int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.

  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
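
The min_samples_leaf rule can be sketched in a few lines. Both helper names are hypothetical; the code only restates the int and float cases described above.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Return the minimum number of samples required at a leaf."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    # A float is a fraction of the training set, rounded up.
    return math.ceil(min_samples_leaf * n_samples)


def split_is_allowed(n_left, n_right, min_leaf):
    """A split is considered only if both branches keep enough samples."""
    return n_left >= min_leaf and n_right >= min_leaf


min_leaf = resolve_min_samples_leaf(0.05, 101)
```

Here 5% of 101 samples is 5.05, which rounds up to 6, so a split leaving only 5 samples in one branch would be rejected.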

n_iter: int, default=10

The number of Hyperopt iterations used to tune the hyperparameters of the base regressor at each node.

metric: string, default=“neg_mean_squared_error”

The score of the base regressor that Hyperopt optimizes. Supported metrics are those provided by scikit-learn.

base_learner: {“catboost”, “lightgbm”, “xgboost”}, default=“xgboost”

The base regressor trained in each node of a tree.

base_n_estimators: tuple, default=(10, 50, 100)

The number of estimators of the base learner. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_max_depth: tuple, default=(3, 6, 9)

Maximum tree depth for base learners. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_num_leaves: tuple, default=(20, 50, 100, 500)

Maximum tree leaves (applicable to LightGBM only). The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_learning_rate: tuple, default=(0.01, 0.1, 0.3, 0.5)

learning_rate of the base learner. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_booster: (“dart”, “gblinear”, “gbtree”), default=(“gbtree”,)

The type of booster to use (applicable to XGBoost only). “gbtree” and “dart” use tree based models while “gblinear” uses linear functions. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_boosting_type: (“dart”, “gbdt”, “rf”), default=(“gbdt”,)

The boosting type to use (applicable to LightGBM only): “dart” for Dropouts meet Multiple Additive Regression Trees, “gbdt” for the traditional Gradient Boosting Decision Tree, and “rf” for Random Forest. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_gamma: tuple, default=(0, 1, 10)

gamma of XGBoost. gamma corresponds to the minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the XGBoost algorithm will be. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_min_child_weight: tuple, default=(1, 5, 15, 100)

min_child_weight of base learner (applicable to LightGBM and XGBoost only). min_child_weight defines the minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. The larger min_child_weight is, the more conservative the base learner algorithm will be. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).
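
To show how gamma and min_child_weight make a gradient-boosted base learner more conservative, the sketch below evaluates one candidate split using the split-gain formula from XGBoost's boosted-trees objective, simplified to the L2 term only (no reg_alpha). The function names are hypothetical; g and h denote the sums of first- and second-order gradients (the hessian) of the samples in each child.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, minus the gamma penalty for the extra leaf."""

    def score(g, h):
        return g * g / (h + reg_lambda)

    return 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    ) - gamma


def split_is_kept(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Reject splits whose children are too light or whose gain is not positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0.0


# Four samples per child; with squared-error loss each sample has hessian 1,
# so the hessian sum of a child equals its sample count.
gain = split_gain(-4.0, 4.0, 4.0, 4.0, reg_lambda=1.0, gamma=0.0)
```

In this toy case the split is kept with gamma=1, rejected with gamma=10, and also rejected with min_child_weight=5 because each child only carries a hessian sum of 4.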

base_subsample: tuple, default=(1.0,)

Base learner subsample ratio of the training instances (applicable to LightGBM and XGBoost only). Setting it to 0.5 means that the base learner randomly samples half of the training data prior to growing trees, which can help prevent overfitting. Subsampling occurs once in every boosting iteration. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_subsample_for_bin: tuple, default=(200000,)

Number of samples for constructing bins (applicable to LightGBM only). The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_colsample_bytree: tuple, default=(1.0,)

Base learner subsample ratio of columns when constructing each tree (applicable to LightGBM and XGBoost only). Subsampling occurs once for every tree constructed. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_colsample_bylevel: tuple, default=(1.0,)

Subsample ratio of columns for each level (applicable to CatBoost and XGBoost only). Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_colsample_bynode: tuple, default=(1.0,)

Subsample ratio of columns for each node split (applicable to XGBoost only). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_reg_alpha: tuple, default=(0,)

reg_alpha of the base learner (applicable to LightGBM and XGBoost only). reg_alpha corresponds to the L1 regularization term on the weights. Increasing this value will make the base learner more conservative. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).

base_reg_lambda: tuple, default=(0.1, 1.0, 5.0)

reg_lambda of the base learner. reg_lambda corresponds to the L2 regularization term on the weights. Increasing this value will make the base learner more conservative. The tuple provided is the search space used for the hyperparameter optimization (Hyperopt).
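
The base_* tuples together define a discrete search space, and n_iter bounds how many candidates are tried. The sketch below illustrates that idea with plain random sampling from the standard library; it is a stand-in and does not reproduce Hyperopt's search algorithm. The function names and the toy scoring function are hypothetical; in practice the score would be the chosen metric of a fitted base learner.

```python
import random

# A subset of the default search space documented above.
search_space = {
    "n_estimators": (10, 50, 100),
    "max_depth": (3, 6, 9),
    "learning_rate": (0.01, 0.1, 0.3, 0.5),
}


def random_search(space, score_fn, n_iter=10, seed=0):
    """Try n_iter random candidates and keep the one with the highest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Pick one value per hyperparameter from its tuple.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a metric such as neg_mean_squared_error."""
    return -((params["max_depth"] - 6) ** 2) - abs(params["learning_rate"] - 0.1)


best_params, best_score = random_search(search_space, toy_score, n_iter=10)
```

Higher is better here, which matches scikit-learn's convention for negated error metrics. Widening a tuple enlarges the space that the n_iter evaluations must cover, so larger search spaces generally call for a larger n_iter.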

n_jobs: int, default=None

The number of jobs to run in parallel. n_jobs=None means 1. n_jobs=-1 means using all processors.

random_state: int, RandomState instance or None, default=None

Controls the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True), the sampling of the features to consider when looking for the best split at each node (if max_features < n_features), the base regressor (XGBoost) and the Hyperopt algorithm.

verbose: int, default=0

Controls the verbosity when fitting.

Notes

The size of each tree is controlled by parameters such as max_depth and min_samples_leaf. The default max_depth=2 keeps the trees shallow; larger values of max_depth produce larger trees, with a base learner trained in each node, and these can become very large on some data sets. To reduce memory consumption, control the complexity and size of the trees through these parameters.

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.

Attributes:
base_estimator_: LCETreeRegressor

The child estimator template used to create the collection of fitted sub-estimators.

estimators_: list of LCETreeRegressor

The collection of fitted sub-estimators.

n_features_in_: int

The number of features when fit is performed.

fit(X, y)

Build a forest of LCE trees from the training set (X, y).

Parameters:
X: array-like of shape (n_samples, n_features)

The training input samples.

y: array-like of shape (n_samples,)

The target values (real numbers).

Returns:
self: object

predict(X)

Predict the regression target for X. The predicted regression target of an input sample is computed as the mean of the regression targets predicted by the trees in the forest.

Parameters:
X: array-like of shape (n_samples, n_features)

The input samples.

Returns:
y: ndarray of shape (n_samples,)

The predicted values.
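
The averaging step described for predict can be sketched as follows. The function name is hypothetical, and the per-tree predictions are passed in as plain lists rather than produced by fitted LCE trees.

```python
from statistics import mean


def average_predictions(per_tree_predictions):
    """Average the predictions of all trees, sample by sample.

    per_tree_predictions holds one list per tree, each of length n_samples.
    """
    # zip(*...) regroups the values so that each tuple holds every tree's
    # prediction for one sample.
    return [mean(sample_preds) for sample_preds in zip(*per_tree_predictions)]


# Three trees predicting two samples each.
ensemble_prediction = average_predictions([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Each entry of the result is the mean over the three trees for one sample, so the two samples above are predicted as 3.0 and 4.0.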

set_params(**params)

Set the parameters of the estimator.

Parameters:
**params: dict

Estimator parameters.

Returns:
self: object