eBoruta package

Submodules

eBoruta.algorithm module

A module containing the eBoruta master class, which encapsulates the algorithm’s execution.

class eBoruta.algorithm.eBoruta(n_iter=30, classification=True, percentile=100, pvalue=0.05, test_size=0, test_stratify=False, shap=True, shap_gpu_tree=False, shap_approximate=False, shap_check_additivity=False, importance_getter=None, verbose=1)

Bases: BaseEstimator, TransformerMixin

Flexible sklearn-compatible feature selection wrapper method.

__init__(n_iter=30, classification=True, percentile=100, pvalue=0.05, test_size=0, test_stratify=False, shap=True, shap_gpu_tree=False, shap_approximate=False, shap_check_additivity=False, importance_getter=None, verbose=1)

Parameters:

n_iter (int) – The number of trials to run the algorithm.
classification (bool) – True if the task is classification else False.
percentile (int) – Percentile of the shadow features’ importance values, used as an alternative to the maximum in the original Boruta.
pvalue (float) – Significance level for rejecting the null hypothesis (the absence of a feature’s importance).
test_size (int | float) – The test_size param passed to train_test_split(). Can be a number or a fraction.
test_stratify (bool) – Stratify the test examples based on the y class values to balance the split.
shap (bool) – Use shap explainer.
shap_gpu_tree (bool) – Use shap.GPUTree explainer.
shap_approximate (bool) – Approximate shap importance values. Caution: some estimators may not support it (e.g., CatBoost). Ignored for linear models.
shap_check_additivity (bool) – Passed to the explainer. Consult the shap documentation. Ignored for linear models.
importance_getter (ImportanceGetter | None) – A callable accepting either an estimator alone or an estimator and a TrialData instance, and returning a numpy array of length equal to the number of features in TrialData.x_test.
verbose (int) – 0 – no output; 1 – progress bar; 2 – progress bar and info; 3 – debug mode.
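The percentile parameter can be illustrated with a minimal sketch. It assumes that the threshold is the given percentile of the shadow features' importance values and that a real feature scores a "hit" when its importance exceeds that threshold. The helper names and the linear-interpolation percentile are illustrative assumptions, not the library's code.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) of values using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def mark_hits(real_importance, shadow_importance, q=100):
    """Flag each real feature whose importance exceeds the shadow threshold."""
    threshold = percentile(shadow_importance, q)
    return {name: imp > threshold for name, imp in real_importance.items()}


# Hypothetical importance values for one trial.
real = {"age": 0.30, "income": 0.08, "noise": 0.02}
shadow = [0.01, 0.05, 0.10]

hits_max = mark_hits(real, shadow, q=100)    # threshold is max(shadow) = 0.10
hits_median = mark_hits(real, shadow, q=50)  # threshold is the median = 0.05
```

With percentile=100 the threshold reduces to the maximum shadow importance, as in the original Boruta; lower values make it easier for a feature to score a hit.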

calculate_importance(model, trial_data, abs_=True)

Parameters:

model (_E) – Estimator with a fit method. When shap importance is used, a tree-based estimator supported by the Tree explainer is expected.
trial_data (TrialData) – Datasets generated for the current trial. See eBoruta.containers.Dataset.generate_trial_sample().
abs_ (bool) – Take the absolute value of the importance array. True by default since shap contributions may be negative.

Returns:

An array of importance values.

Return type:

ndarray
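The role of abs_ can be shown with a small sketch. It assumes, for illustration only, that per-sample contributions are averaged per feature; the function name and the aggregation by mean are not taken from the library. Without absolute values, contributions of opposite sign cancel out and an influential feature can look unimportant.

```python
def aggregate_importance(contributions, abs_=True):
    """Average per-sample contributions column-wise into one value per feature.

    contributions: list of rows, one row per sample, one column per feature.
    """
    n_samples = len(contributions)
    n_features = len(contributions[0])
    totals = [0.0] * n_features
    for row in contributions:
        for j, value in enumerate(row):
            totals[j] += abs(value) if abs_ else value
    return [total / n_samples for total in totals]


# Feature 0 pushes predictions up and down equally;
# feature 1 has a smaller, constant contribution.
shap_like = [
    [0.5, 0.25],
    [-0.5, 0.25],
]

with_abs = aggregate_importance(shap_like, abs_=True)      # [0.5, 0.25]
without_abs = aggregate_importance(shap_like, abs_=False)  # [0.0, 0.25]
```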

fit(x, y, sample_weight=None, model_type=None, callbacks_trial_start=None, callbacks_trial_end=None, model_init_kwargs=None, **kwargs)

Fit the Boruta algorithm.

Parameters:

x (_X) – Features collection as a 2D array or a pd.DataFrame.
y (_Y) – Response variable(s).
sample_weight (ndarray | None) – Optional sample weight for each instance in x for models that support it within the fit() method.
model_type (Type[_E] | None) – An uninitialized estimator type with fit(x, y) and predict(x) methods. If None, use RandomForestClassifier if classification is True else use RandomForestRegressor.
callbacks_trial_start (Sequence[CallbackFN | CallbackClass] | None) – Callbacks to call at each trial’s start.
callbacks_trial_end (Sequence[CallbackFN | CallbackClass] | None) – Callbacks to call at each trial’s end.
model_init_kwargs (dict[str, Any] | None) – Optional keyword arguments to initialize the estimator type with. If not provided, it is assumed that the estimator can be initialized without any arguments. If model_type is a Pipeline, the last step must be a supported classifier or regressor named model.
kwargs – Passed to model.fit() method.

Returns:

eBoruta object.

Return type:

eBoruta
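The interplay between model_type and model_init_kwargs can be sketched as follows. This is a hypothetical illustration of instantiating an uninitialized type with optional keyword arguments; ToyModel and init_model are made-up names, and the library's internal logic may differ.

```python
from typing import Any, Optional


class ToyModel:
    """Hypothetical stand-in for an estimator type passed as model_type."""

    def __init__(self, n_estimators: int = 100, max_depth: Optional[int] = None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth


def init_model(model_type: type, model_init_kwargs: Optional[dict[str, Any]] = None):
    """Instantiate model_type, using no arguments when kwargs are absent."""
    return model_type(**(model_init_kwargs or {}))


default_model = init_model(ToyModel)
tuned_model = init_model(ToyModel, {"n_estimators": 20, "max_depth": 3})
```

Following the signature above, the corresponding call on the selector itself would pass the type and the dictionary separately, for example fit(x, y, model_type=ToyModel, model_init_kwargs={"n_estimators": 20}), rather than an already constructed estimator.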

rank(features=None, step=None, fit=True, model=None, gen_sample=False, sort=False)

Rank (sort) features by feature importance values.

Uses calculate_importance() to obtain importance of selected features and dataset_ to obtain the data.

Parameters:

features (Sequence[str] | ndarray | None) – A sequence of features to select.
step (int | None) – A step (trial) number. If provided, the method will select the features accepted at this trial. If features were provided, the selection is intersected with that list.
fit (bool) – Fit the model before calculating importance. In most cases, this should be True, since otherwise the features used to fit the model_ would be different from the features being ranked (for which the model_ will be queried in order to calculate the importance values).
model (_E | None) – Use a prefit model instead of the stored model_.
gen_sample (bool) – Generate a trial sample from the dataset_ using the test_size and test_stratify values provided during init.
sort (bool) – Sort results by importance values in descending order.

Returns:

A DataFrame with Feature and Importance columns.

Return type:

DataFrame
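The selection logic described by features, step and sort can be sketched in plain Python. This is only an illustration under stated assumptions: rank_features is a hypothetical helper, the accepted names for a step are passed in directly, and rows are returned as tuples instead of a DataFrame.

```python
def rank_features(importance, features=None, accepted_at_step=None, sort=False):
    """Return (feature, importance) pairs for the selected features.

    importance: mapping of feature name to importance value.
    features: optional explicit selection of feature names.
    accepted_at_step: optional names accepted at a given trial.
    """
    selected = list(importance) if features is None else list(features)
    if accepted_at_step is not None:
        # Intersect the explicit selection with the accepted features.
        accepted = set(accepted_at_step)
        selected = [name for name in selected if name in accepted]
    rows = [(name, importance[name]) for name in selected]
    if sort:
        # Descending order of importance.
        rows.sort(key=lambda row: row[1], reverse=True)
    return rows


importance = {"a": 0.1, "b": 0.4, "c": 0.3}
ranked = rank_features(
    importance, features=["a", "b", "c"], accepted_at_step=["c", "b"], sort=True
)
```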

report_features(features=None, full=False)

Create a text report of the optimization progress. Printing it is practical mainly when debugging with a small number of features.

Parameters:

features (Features | None) – Current features.
full (bool) – Create a very lengthy report for each individual feature.

Returns:

Concatenated report messages as a single string.

Return type:

str

rough_fix(n_last_trials=None)

Apply “rough fix” strategy to handle remaining tentative features.

Tentative features with median(importance) > median(thresholds) are accepted and the remaining ones are rejected, where importance and thresholds are the history records of the tentative features’ importance values and of the percentile thresholds used to mark “hits” after each trial.

Parameters:

n_last_trials (int | None) – Consider only this number of last trials. If None, defaults to all trials.

Returns:

Modified features with resolved tentative ones.

Return type:

Features
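A minimal sketch of the median comparison, assuming that tentative features whose median importance exceeds the median threshold are accepted and the others rejected. The function and argument names are hypothetical, and plain lists stand in for the library's history records.

```python
from statistics import median


def rough_fix_sketch(importance_history, threshold_history, n_last_trials=None):
    """Split tentative features into accepted and rejected by comparing medians.

    importance_history: mapping of feature name to its per-trial importance.
    threshold_history: per-trial shadow thresholds, aligned with the histories.
    """
    if n_last_trials is None:
        window = slice(None)  # use all trials
    elif n_last_trials < 1:
        raise ValueError("n_last_trials must be at least 1")
    else:
        window = slice(-n_last_trials, None)  # use only the most recent trials

    threshold = median(threshold_history[window])
    accepted, rejected = [], []
    for name, history in importance_history.items():
        target = accepted if median(history[window]) > threshold else rejected
        target.append(name)
    return accepted, rejected


history = {"f1": [0.6, 0.7, 0.6, 0.7], "f2": [0.1, 0.1, 0.5, 0.5]}
thresholds = [0.35, 0.35, 0.35, 0.35]

all_trials = rough_fix_sketch(history, thresholds)
last_two = rough_fix_sketch(history, thresholds, n_last_trials=2)
```

In this toy data, f2 is rejected when all four trials are considered but accepted when only the last two are, which shows how n_last_trials can change the outcome for features whose importance drifts over time.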

set_fit_request(*, callbacks_trial_end='$UNCHANGED$', callbacks_trial_start='$UNCHANGED$', model_init_kwargs='$UNCHANGED$', model_type='$UNCHANGED$', sample_weight='$UNCHANGED$', x='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the scikit-learn User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in scikit-learn version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

callbacks_trial_end (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks_trial_end parameter in fit.
callbacks_trial_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks_trial_start parameter in fit.
model_init_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model_init_kwargs parameter in fit.
model_type (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model_type parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
x (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for x parameter in fit.

Returns:

self (object) – The updated object.

Return type:

eBoruta

set_transform_request(*, tentative='$UNCHANGED$', x='$UNCHANGED$')

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the scikit-learn User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in scikit-learn version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

tentative (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for tentative parameter in transform.
x (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for x parameter in transform.

Returns:

self (object) – The updated object.

Return type:

eBoruta

stat_tests(features, iter_i)

Test features at current iteration based on importance values.

Parameters:

features (Features) – Features state at current iteration.
iter_i (int) – Iteration starting from 1.

Returns:

Accepted and rejected feature names. Features present in eBoruta.containers.Features.names but not accepted or rejected are considered “tentative”.

Return type:

tuple[ndarray, ndarray]
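The kind of test involved can be sketched with an exact binomial tail, assuming (as in the original Boruta) that under the null hypothesis a feature scores a hit in each trial with probability 0.5. The function names are hypothetical and no multiple-testing correction is applied here; the library's actual procedure may differ on both points.

```python
from math import comb


def binom_tail_ge(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))


def binom_tail_le(hits, n, p=0.5):
    """P(X <= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, hits + 1))


def classify(hit_counts, iter_i, pvalue=0.05):
    """Return (accepted, rejected); names in neither list remain tentative."""
    accepted, rejected = [], []
    for name, hits in hit_counts.items():
        if binom_tail_ge(hits, iter_i) < pvalue:
            accepted.append(name)  # significantly more hits than chance
        elif binom_tail_le(hits, iter_i) < pvalue:
            rejected.append(name)  # significantly fewer hits than chance
    return accepted, rejected


# Hypothetical hit counts after 10 trials.
hit_counts = {"strong": 10, "weak": 0, "unclear": 5}
accepted, rejected = classify(hit_counts, iter_i=10)
```

Here "unclear" lands in neither list, which corresponds to the "tentative" state described above.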

transform(x, tentative=False)

Transform input data using the fitted model. Transformation means selecting the accepted features from x. Note that, because the class inherits from TransformerMixin, fit_transform() is also available.

Parameters:

x (_X) – Data used to fit() the algorithm.
tentative (bool) – Also select tentative features.

Returns:

A subset of x.

Return type:

DataFrame
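The effect of the tentative flag can be sketched without pandas. The example below uses a dict of columns in place of a DataFrame, and select_columns is a hypothetical helper rather than the library's implementation.

```python
def select_columns(table, accepted, tentative_names=(), tentative=False):
    """Keep accepted columns, plus tentative ones when requested.

    table: mapping of column name to column values (stand-in for a DataFrame).
    """
    keep = set(accepted)
    if tentative:
        keep |= set(tentative_names)
    # Iterate over the table to preserve the original column order.
    return {name: values for name, values in table.items() if name in keep}


table = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}

strict = select_columns(table, accepted=["c", "a"], tentative_names=["b"])
relaxed = select_columns(
    table, accepted=["c", "a"], tentative_names=["b"], tentative=True
)
```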

eBoruta.base module

Base types and objects to inherit from.

exception eBoruta.base.ValidationError

Bases: ValueError

Cases of failure to validate data.

class eBoruta.base.Estimator(*args, **kwargs)

Bases: Protocol

An estimator protocol encapsulating methods strictly necessary for the main algorithm’s functioning.

__init__(*args, **kwargs)

fit(x, y, **kwargs)

Fit the estimator.

Return type:

Estimator

get_params()

Get a dict with the estimator’s params.

Return type:

dict[str, Any]

predict(x, **kwargs)

Make predictions.

Return type:

ndarray
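As an illustration of what satisfying such a protocol looks like, the sketch below defines a local mirror of the three documented methods and a toy estimator that conforms to it. EstimatorLike and MeanRegressor are hypothetical names; to stay dependency-free, predict returns a plain list here, whereas the documented return type is an ndarray.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class EstimatorLike(Protocol):
    """Local mirror of the documented protocol, for illustration only."""

    def fit(self, x, y, **kwargs): ...

    def get_params(self) -> dict[str, Any]: ...

    def predict(self, x, **kwargs): ...


class MeanRegressor:
    """Toy estimator that always predicts the (scaled) training mean."""

    def __init__(self, shrink: float = 1.0):
        self.shrink = shrink
        self.mean_ = 0.0

    def fit(self, x, y, **kwargs):
        self.mean_ = self.shrink * sum(y) / len(y)
        return self  # returning self allows chained calls

    def get_params(self) -> dict[str, Any]:
        return {"shrink": self.shrink}

    def predict(self, x, **kwargs):
        return [self.mean_ for _ in x]


model = MeanRegressor().fit([[1], [2], [3]], [2.0, 4.0, 6.0])
predictions = model.predict([[10], [20]])
conforms = isinstance(model, EstimatorLike)
```

Because the protocol is structural, the toy class does not need to inherit from anything; having the three methods is enough for the isinstance check to pass.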

class eBoruta.base.ImportanceGetter(*args, **kwargs)

Bases: Protocol

__call__(estimator, trial_data=None)

Call self as a function.

Return type:

np.ndarray

__init__(*args, **kwargs)
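A callable matching the __call__ signature above might look like the following sketch. FittedToy is a hypothetical fitted estimator exposing an sklearn-style feature_importances_ attribute, and the getter returns a plain list to stay dependency-free; since the documented return type is a numpy array, real code would wrap the result with numpy.asarray.

```python
class FittedToy:
    """Hypothetical fitted estimator exposing sklearn-style importances."""

    feature_importances_ = [0.7, 0.2, 0.1]


def importance_from_attribute(estimator, trial_data=None):
    """Return one importance value per feature.

    trial_data is accepted to match the documented signature but is not
    needed when the estimator already stores its own importances.
    """
    return list(estimator.feature_importances_)


values = importance_from_attribute(FittedToy())
```

A function like this could then be supplied as the importance_getter argument of the eBoruta constructor.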