eBoruta package

Submodules

eBoruta.algorithm module

A module containing the eBoruta master class, which encapsulates the algorithm’s execution.

class eBoruta.algorithm.eBoruta(n_iter=30, classification=True, percentile=100, pvalue=0.05, test_size=0, test_stratify=False, shap=True, shap_gpu_tree=False, shap_approximate=False, shap_check_additivity=False, importance_getter=None, verbose=1)[source]

Bases: BaseEstimator, TransformerMixin

Flexible sklearn-compatible feature selection wrapper method.

__init__(n_iter=30, classification=True, percentile=100, pvalue=0.05, test_size=0, test_stratify=False, shap=True, shap_gpu_tree=False, shap_approximate=False, shap_check_additivity=False, importance_getter=None, verbose=1)[source]
Parameters:
  • n_iter (int) – The number of trials to run the algorithm.

  • classification (bool) – True if the task is classification else False.

  • percentile (int) – Percentile of the shadow features as alternative to max in original Boruta.

  • pvalue (float) – Level of rejecting the null hypothesis (the absence of a feature’s importance).

  • test_size (int | float) – The test_size param passed to train_test_split(). Can be a number or a fraction.

  • test_stratify (bool) – Stratify the test examples based on the y class values to balance the split.

  • shap (bool) – Use shap explainer.

  • shap_gpu_tree (bool) – Use shap.GPUTree explainer.

  • shap_approximate (bool) – Approximate shap importance values. Caution! some estimators may not support it (e.g., CatBoost). Ignored for linear models.

  • shap_check_additivity (bool) – Passed to the explainer. Consult with shap documentation. Ignored for linear models.

  • importance_getter (ImportanceGetter | None) – A callable accepting either an estimator or an estimator and TrialData instance and returning a numpy array of length equal to the number of features in TrialData.x_test.

  • verbose (int) – 0 – no output; 1 – progress bar; 2 – progress bar and info; 3 – debug mode
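
To make the percentile parameter concrete: in the original Boruta a feature scores a “hit” when its importance exceeds the maximum shadow importance, which corresponds to percentile=100, and lower values relax that threshold. Below is a minimal, dependency-free sketch of this idea. The helper names shadow_threshold and mark_hits are hypothetical and not part of eBoruta, and linear interpolation between ranks is an assumption.

```python
def shadow_threshold(shadow_importances, percentile=100):
    """Return the given percentile of the shadow importances (linear interpolation)."""
    if not shadow_importances:
        raise ValueError("at least one shadow importance is required")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be within [0, 100]")
    values = sorted(shadow_importances)
    # Fractional rank of the requested percentile among the sorted values.
    rank = (len(values) - 1) * percentile / 100
    lower = int(rank)
    upper = min(lower + 1, len(values) - 1)
    return values[lower] + (values[upper] - values[lower]) * (rank - lower)


def mark_hits(real_importances, shadow_importances, percentile=100):
    """Map each real feature name to True if its importance exceeds the threshold."""
    threshold = shadow_threshold(shadow_importances, percentile)
    return {name: imp > threshold for name, imp in real_importances.items()}


real = {"age": 0.30, "height": 0.12, "noise": 0.02}
shadow = [0.05, 0.10, 0.20]
hits_max = mark_hits(real, shadow, percentile=100)    # threshold is max(shadow)
hits_median = mark_hits(real, shadow, percentile=50)  # threshold is the shadow median
```

With percentile=100 only "age" clears the threshold; with percentile=50 "height" scores a hit as well.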

calculate_importance(model, trial_data, abs_=True)[source]
Parameters:
  • model (_E) – Estimator with fit method. In case of using shap importance, a tree-based estimator supported by Tree explainer is expected.

  • trial_data (TrialData) – Datasets generated for the current trial. See eBoruta.containers.Dataset.generate_trial_sample().

  • abs_ (bool) – Take the absolute value of the importance array. True by default since shap contributions may be negative.

Returns:

An array of importance values.

Return type:

ndarray
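
The reason abs_ defaults to True can be shown with a small sketch: shap contributions are signed, so averaging them per feature can cancel contributions of opposite sign. The function below aggregates a samples-by-features matrix of contributions into one value per feature. Its name, the use of plain lists, and the choice to take absolute values before averaging are assumptions made for illustration; eBoruta's own aggregation may differ.

```python
def aggregate_contributions(contributions, abs_=True):
    """Average per-sample contributions column-wise into one value per feature."""
    n_samples = len(contributions)
    n_features = len(contributions[0])
    totals = [0.0] * n_features
    for row in contributions:
        for j, value in enumerate(row):
            # With abs_=True the sign is discarded, so opposite-signed
            # contributions add up instead of cancelling.
            totals[j] += abs(value) if abs_ else value
    return [total / n_samples for total in totals]


# Two samples, two features: the first feature pushes predictions in opposite directions.
contributions = [[0.5, 0.1], [-0.5, 0.1]]
signed = aggregate_contributions(contributions, abs_=False)
unsigned = aggregate_contributions(contributions, abs_=True)
```

Without the absolute value, the first feature appears to have zero importance even though it moves every prediction.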

fit(x, y, sample_weight=None, model_type=None, callbacks_trial_start=None, callbacks_trial_end=None, model_init_kwargs=None, **kwargs)[source]

Fit the boruta algorithm.

Parameters:
  • x (_X) – Features collection as a 2D array or a pd.DataFrame.

  • y (_Y) – Response variable(s).

  • sample_weight (ndarray | None) – Optional sample weight for each instance in x for models that support it within the fit() method.

  • model_type (Type[_E] | None) – An uninitialized estimator type with fit(x, y) and predict(x) methods. If None, use RandomForestClassifier if classification is True, else use RandomForestRegressor.

  • callbacks_trial_start (Sequence[CallbackFN | CallbackClass] | None) – Callbacks to call at each trial’s start.

  • callbacks_trial_end (Sequence[CallbackFN | CallbackClass] | None) – Callbacks to call at each trial’s end.

  • model_init_kwargs (dict[str, Any] | None) – Optional keyword arguments to initialize the estimator type with. If not provided, it is assumed that the estimator can be initialized without any arguments. If model_type is a Pipeline, the last step must be a supported classifier or regressor named model.

  • kwargs – Passed to model.fit() method.

Returns:

eBoruta object.

Return type:

eBoruta

rank(features=None, step=None, fit=True, model=None, gen_sample=False, sort=False)[source]

Rank (sort) features by feature importance values.

Uses calculate_importance() to obtain importance of selected features and dataset_ to obtain the data.

Parameters:
  • features (Sequence[str] | ndarray | None) – A sequence of features to select.

  • step (int | None) – A step (trial) number. If provided, the method will select the features accepted at this trial. If features were also provided, the selection is intersected with that list.

  • fit (bool) – Fit the model before calculating importance. In most cases, this should be True, since otherwise the features used to fit the model_ would be different from the features being ranked (for which the model_ will be queried in order to calculate the importance values).

  • model (_E | None) – Use a prefit model instead of the stored model_.

  • gen_sample (bool) – Generate a trial sample from the dataset_ using the test_size and test_stratify values provided during init.

  • sort (bool) – Sort results by importance values in descending order.

Returns:

A DataFrame with Feature and Importance columns.

Return type:

DataFrame

report_features(features=None, full=False)[source]

Create a text report of the optimization progress. Printing it is practical mainly when debugging with a small number of features.

Parameters:
  • features (Features | None) – Current features.

  • full (bool) – Create a very lengthy report for each individual feature.

Returns:

Concatenated report messages as a single string.

Return type:

str

rough_fix(n_last_trials=None)[source]

Apply “rough fix” strategy to handle remaining tentative features.

Tentative features with median(importance) > median(thresholds) are accepted, and the remaining tentative features are rejected. Here, importance and thresholds are the history records of the tentative features’ importance values and of the percentile thresholds used to mark “hits” after each trial.

Parameters:

n_last_trials (int | None) – Consider only this number of last trials. If None, defaults to all trials.

Returns:

Modified features with resolved tentative ones.

Return type:

Features
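
The median comparison described above can be sketched without any dependencies. The function name, the dict-based history, and the decision to reject every tentative feature that fails the comparison are assumptions for illustration rather than eBoruta's implementation.

```python
from statistics import median


def rough_fix_sketch(importance_history, threshold_history, n_last_trials=None):
    """Split tentative features into accepted and rejected by comparing medians.

    importance_history: dict mapping a tentative feature name to its per-trial importance.
    threshold_history: per-trial thresholds that were used to mark hits.
    """
    if n_last_trials is not None:
        # Keep only the most recent trials.
        threshold_history = threshold_history[-n_last_trials:]
    threshold = median(threshold_history)
    accepted, rejected = [], []
    for name, importances in importance_history.items():
        if n_last_trials is not None:
            importances = importances[-n_last_trials:]
        (accepted if median(importances) > threshold else rejected).append(name)
    return accepted, rejected


history = {"f1": [0.10, 0.30, 0.40], "f2": [0.05, 0.10, 0.15]}
thresholds = [0.20, 0.20, 0.10]
accepted_all, rejected_all = rough_fix_sketch(history, thresholds)
accepted_last, rejected_last = rough_fix_sketch(history, thresholds, n_last_trials=1)
```

Restricting the comparison to the last trial changes the outcome for "f2", which shows why n_last_trials matters when importance values drift over the run.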

set_fit_request(*, callbacks_trial_end='$UNCHANGED$', callbacks_trial_start='$UNCHANGED$', model_init_kwargs='$UNCHANGED$', model_type='$UNCHANGED$', sample_weight='$UNCHANGED$', x='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • callbacks_trial_end (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the callbacks_trial_end parameter in fit.

  • callbacks_trial_start (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the callbacks_trial_start parameter in fit.

  • model_init_kwargs (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the model_init_kwargs parameter in fit.

  • model_type (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the model_type parameter in fit.

  • sample_weight (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in fit.

  • x (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the x parameter in fit.

Returns:

self, the updated object.

Return type:

eBoruta

set_transform_request(*, tentative='$UNCHANGED$', x='$UNCHANGED$')

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • tentative (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the tentative parameter in transform.

  • x (str, True, False, or None; default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the x parameter in transform.

Returns:

self, the updated object.

Return type:

eBoruta

stat_tests(features, iter_i)[source]

Test features at current iteration based on importance values.

Parameters:
  • features (Features) – Features state at current iteration.

  • iter_i (int) – Iteration starting from 1.

Returns:

Accepted and rejected feature names. Features present in eBoruta.containers.Features.names but not accepted or rejected are considered “tentative”.

Return type:

tuple[ndarray, ndarray]
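
How hit counts and pvalue can turn into accept/reject decisions is sketched below. It assumes a two-sided reading of a binomial test with success probability 0.5, in the spirit of the original Boruta, and applies no multiple-testing correction. The function names are hypothetical, and the test eBoruta actually runs may differ in detail.

```python
from math import comb


def binom_tail_ge(hits, n_trials, p=0.5):
    """P(X >= hits) for X ~ Binomial(n_trials, p)."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(hits, n_trials + 1)
    )


def binom_tail_le(hits, n_trials, p=0.5):
    """P(X <= hits) for X ~ Binomial(n_trials, p)."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(0, hits + 1)
    )


def classify(hit_counts, n_trials, pvalue=0.05):
    """Return (accepted, rejected); names in neither list stay tentative."""
    accepted, rejected = [], []
    for name, hits in hit_counts.items():
        if binom_tail_ge(hits, n_trials) < pvalue:
            accepted.append(name)  # far more hits than chance would produce
        elif binom_tail_le(hits, n_trials) < pvalue:
            rejected.append(name)  # far fewer hits than chance would produce
    return accepted, rejected


hit_counts = {"strong": 10, "weak": 0, "unclear": 5}
accepted, rejected = classify(hit_counts, n_trials=10)
```

Here "unclear" lands in neither list, which matches the documented notion of a tentative feature.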

transform(x, tentative=False)[source]

Transform input data using the fitted model. Transformation means selecting accepted features from x. Because the class inherits from TransformerMixin, fit_transform() is also available.

Parameters:
  • x (_X) – Data used to fit() the algorithm.

  • tentative (bool) – Also select tentative features.

Returns:

A subset of x.

Return type:

DataFrame

eBoruta.base module

Base types and objects to inherit from.

exception eBoruta.base.ValidationError[source]

Bases: ValueError

Cases of failure to validate data.

class eBoruta.base.Estimator(*args, **kwargs)[source]

Bases: Protocol

An estimator protocol encapsulating methods strictly necessary for the main algorithm’s functioning.

__init__(*args, **kwargs)
fit(x, y, **kwargs)[source]

Fit the estimator.

Return type:

Estimator

get_params()[source]

Get a dict with the estimator’s params.

Return type:

dict[str, Any]

predict(x, **kwargs)[source]

Make predictions.

Return type:

ndarray
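
Because Estimator is a Protocol, any class exposing fit, get_params and predict conforms structurally, without inheriting from it. The sketch below uses a local stand-in protocol (EstimatorLike) and a toy MeanRegressor; both names are hypothetical, and plain lists stand in for arrays to keep the example dependency-free. It illustrates structural typing only and says nothing about whether such a toy model yields usable importance values.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class EstimatorLike(Protocol):
    """Local stand-in mirroring the three documented methods."""

    def fit(self, x, y, **kwargs): ...

    def get_params(self) -> dict[str, Any]: ...

    def predict(self, x, **kwargs): ...


class MeanRegressor:
    """Toy estimator that always predicts the mean of the training targets."""

    def __init__(self, offset=0.0):
        self.offset = offset
        self.mean_ = None

    def fit(self, x, y, **kwargs):
        self.mean_ = sum(y) / len(y) + self.offset
        return self  # returning self allows chaining, as estimators usually do

    def get_params(self):
        return {"offset": self.offset}

    def predict(self, x, **kwargs):
        return [self.mean_ for _ in x]


model = MeanRegressor().fit([[1], [2], [3]], [1.0, 2.0, 3.0])
# isinstance() on a runtime-checkable protocol only checks that the methods exist.
conforms = isinstance(model, EstimatorLike)
predictions = model.predict([[4], [5]])
```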

class eBoruta.base.ImportanceGetter(*args, **kwargs)[source]

Bases: Protocol

__call__(estimator, trial_data=None)[source]

Call self as a function.

Return type:

np.ndarray

__init__(*args, **kwargs)

eBoruta.callbacks module

class eBoruta.callbacks.CallbackClass(*args, **kwargs)[source]

Bases: Protocol

__call__(estimator, features, dataset, trial_data, **kwargs)[source]

Call self as a function.

Return type:

Tuple[_E, Features, Dataset, TrialData, Dict[str, Any]]

__init__(*args, **kwargs)

class eBoruta.callbacks.CallbackFN(*args, **kwargs)[source]

Bases: Protocol

__call__(estimator, features, dataset, trial_data, **kwargs)[source]

Call self as a function.

Return type:

Tuple[_E, Features, Dataset, TrialData, Dict[str, Any]]

__init__(*args, **kwargs)

class eBoruta.callbacks.CatFeaturesSupplier[source]

Bases: object

__call__(estimator, features, dataset, trial_data, **kwargs)[source]

Call self as a function.

Return type:

Tuple[_E, Features, Dataset, TrialData, Dict[str, Any]]

class eBoruta.callbacks.EvalSetSupplier(param_name='eval_set')[source]

Bases: object

__call__(estimator, features, dataset, trial_data, **kwargs)[source]

Call self as a function.

Return type:

Tuple[_E, Features, Dataset, TrialData, Dict[str, Any]]

__init__(param_name='eval_set')[source]

class eBoruta.callbacks.IterationAdjuster(param_name, min_value, reducer=functools.partial(<function reduce_by_fraction>, frac=0.5))[source]

Bases: object

__call__(estimator, features, dataset, trial_data, **kwargs)[source]

Call self as a function.

Return type:

Tuple[_E, Features, Dataset, TrialData, Dict[str, Any]]

__init__(param_name, min_value, reducer=functools.partial(<function reduce_by_fraction>, frac=0.5))[source]

class eBoruta.callbacks.Scorer(scorers, verbose=2)[source]

Bases: object

__call__(estimator, features, dataset, trial_data, **kwargs)[source]

Call self as a function.

Return type:

Tuple[_E, Features, Dataset, TrialData, Dict[str, Any]]

__init__(scorers, verbose=2)[source]

eBoruta.callbacks.change_params_and_reinit(estimator, update)[source]
eBoruta.callbacks.reduce_by_fraction(num_features, frac)[source]
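
All callbacks in this module share one call signature: they receive the estimator, the features, the dataset, the trial data and extra keyword arguments, and return the same five items as a tuple. A minimal sketch of a custom callback following that shape is given below. TrialCounter is a hypothetical name, and the placeholder strings merely stand in for the real objects.

```python
class TrialCounter:
    """Hypothetical callback that counts how many times it has been invoked."""

    def __init__(self):
        self.n_calls = 0

    def __call__(self, estimator, features, dataset, trial_data, **kwargs):
        self.n_calls += 1
        # Hand everything back unchanged, in the documented order,
        # with the keyword arguments collected into a dict.
        return estimator, features, dataset, trial_data, kwargs


counter = TrialCounter()
# Placeholder objects stand in for the real estimator and containers.
state = ["estimator", "features", "dataset", "trial_data"]
extra = {"verbose": False}
for _ in range(3):
    *state, extra = counter(*state, **extra)
```

An instance like this would be passed to fit() through callbacks_trial_start or callbacks_trial_end.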

eBoruta.containers module

Types holding intermediate and final data for the algorithm.

class eBoruta.containers.Dataset(x, y, w=None, min_features=5)[source]

Bases: Generic[_X, _Y]

A container holding permanent data (x, y and weights) for training/validation/testing/etc.

__init__(x, y, w=None, min_features=5)[source]
generate_trial_sample(columns=None, **kwargs)[source]

Generates data for a single Boruta trial based on x, y, and w. Creates a copy of x, permutes rows, and renames columns as “shadow_{original_name}”. Concatenates original dataframe and the one with the shadow features to create a copy of the learning data with at least twice as many features.

If the number of features in x after selecting by columns is below min_features, existing features are randomly oversampled to make up the difference. Thus, the returned dataframe always has at least min_features columns.

Parameters:
  • columns (None | list[str] | ndarray) – An optional list or array of columns to select from x.

  • kwargs – Keyword args passed to train_test_split() used to create train/test splits. Enable this feature by passing test_size={f} where f is the test size fraction. This allows using different datasets for training and importance computation.

Returns:

A prepared trial data.

Raises:

RuntimeError – If resulting features have duplicate names.

Return type:

TrialData
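
The shadow-feature step described above (copy, permute rows, rename, concatenate) can be sketched as follows. A dict of column lists stands in for the DataFrame, the function name is hypothetical, and both the min_features oversampling and the train/test split are left out.

```python
import random


def add_shadow_features(columns, seed=0):
    """Return the original columns plus row-permuted copies named 'shadow_{name}'."""
    n_rows = len(next(iter(columns.values())))
    rng = random.Random(seed)
    # One permutation of row indices, applied to the whole copy.
    order = list(range(n_rows))
    rng.shuffle(order)
    combined = dict(columns)
    for name, values in columns.items():
        shadow_name = f"shadow_{name}"
        if shadow_name in combined:
            raise RuntimeError(f"duplicate feature name: {shadow_name}")
        combined[shadow_name] = [values[i] for i in order]
    return combined


x = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
trial_x = add_shadow_features(x)
```

The shadow columns keep the marginal distribution of the originals while losing their row-wise link to the response, which is what makes them a usable baseline for importance values.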

property w: ndarray | None
Returns:

Sample weights’ array.

property x: DataFrame
Returns:

Variables’ dataframe.

property y: ndarray
Returns:

Target variables’ array.

class eBoruta.containers.Features(names, _history=None)[source]

Bases: object

A dynamic container representing a set of features used by Boruta throughout the run.

It’s created internally and maintained by eBoruta.algorithm.eBoruta.

__init__(names, _history=None)
accepted_at_step(step)[source]
Parameters:

step (int) – Step (trial) number.

Returns:

Feature names accepted at step.

Return type:

ndarray

compose_history()[source]

Access the selection history and compose a summary table.

Returns:

A history dataframe.

Return type:

DataFrame

static melt_history(df, value_name)[source]
Return type:

DataFrame

reset_history_index()[source]

Bulk pd.DataFrame.reset_index() of the importance, decision and hit history dataframes.

property accepted: ndarray

Returns:

An array of feature names marked as accepted.

accepted_mask: ndarray
dec_history: DataFrame
property history: DataFrame
Returns:

A history dataframe, created using compose_history() if it doesn’t exist yet.

hit_history: DataFrame
imp_history: DataFrame
names: ndarray

An array of feature names.

property rejected: ndarray
Returns:

An array of feature names marked as rejected.

rejected_mask: ndarray
property shape: tuple[int, int]
Returns:

(# steps, # features)

property tentative: ndarray
Returns:

An array of feature names marked as tentative.

tentative_mask: ndarray

class eBoruta.containers.TrialData(x_train, x_test, y_train, y_test, w_train=None, w_test=None)[source]

Bases: Generic[_Y]

Data for a Boruta trial.

__init__(x_train, x_test, y_train, y_test, w_train=None, w_test=None)
property shapes: str

Descriptor property.

Returns:

A string with the shapes of the x and y attributes.

w_test: ndarray | None = None
w_train: ndarray | None = None

Weights for the train and test folds.

x_test: DataFrame
x_train: DataFrame
y_test: _Y
y_train: _Y

eBoruta.utils module

eBoruta.utils.convert_to_array(a)[source]
Parameters:

a (Any) – Any object.

Returns:

An np.array(a).

Raises:

TypeError – if the above fails.

Return type:

ndarray

eBoruta.utils.get_duplicates(it)[source]
Return type:

Iterator[A]

eBoruta.utils.sample_dataset(regression=False, multiclass=False, multitarget=False)[source]

Make a sample dataset with 30 features and 100 samples.

Parameters:
  • regression (bool) – Regression objective. Otherwise, classification assumed.

  • multiclass (bool) – Multiple (3) classes for classification.

  • multitarget (bool) – Multiple (3) targets for regression.

Returns:

DataFrame with predictors and DataFrame with response variables.

Return type:

tuple[DataFrame, DataFrame]

eBoruta.utils.setup_logger(log_path=None, file_level=None, stdout_level=None, stderr_level=None, logger=None)[source]
Return type:

Logger

eBoruta.utils.zip_partition(pred, a, b)[source]
Return type:

Tuple[Iterator[A], Iterator[A]]

Module contents