eBoruta package
Submodules
eBoruta.algorithm module
A module containing the eBoruta master class, which encapsulates the algorithm’s execution.
- class eBoruta.algorithm.eBoruta(n_iter=30, classification=True, percentile=100, pvalue=0.05, test_size=0, test_stratify=False, shap=True, shap_gpu_tree=False, shap_approximate=False, shap_check_additivity=False, importance_getter=None, verbose=1)[source]
Bases:
BaseEstimator, TransformerMixin
Flexible sklearn-compatible feature selection wrapper method.
- __init__(n_iter=30, classification=True, percentile=100, pvalue=0.05, test_size=0, test_stratify=False, shap=True, shap_gpu_tree=False, shap_approximate=False, shap_check_additivity=False, importance_getter=None, verbose=1)[source]
- Parameters:
n_iter (int) – The number of trials to run the algorithm.
classification (bool) – True if the task is classification else False.
percentile (int) – Percentile of the shadow features’ importance values to use as the threshold, as an alternative to the maximum used in the original Boruta.
pvalue (float) – Level of rejecting the null hypothesis (the absence of a feature’s importance).
test_size (int | float) – The test_size param passed to train_test_split(). Can be a number or a fraction.
test_stratify (bool) – Stratify the test examples based on the y class values to balance the split.
shap (bool) – Use shap explainer.
shap_gpu_tree (bool) – Use the shap.GPUTree explainer.
shap_approximate (bool) – Approximate shap importance values. Caution: some estimators may not support it (e.g., CatBoost). Ignored for linear models.
shap_check_additivity (bool) – Passed to the explainer. Consult the shap documentation. Ignored for linear models.
importance_getter (ImportanceGetter | None) – A callable accepting either an estimator or an estimator and TrialData instance and returning a numpy array of length equal to the number of features in TrialData.x_test.
verbose (int) – 0 – no output; 1 – progress bar; 2 – progress bar and info; 3 – debug mode
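The role of the percentile parameter can be pictured with a small sketch. This is an illustration under stated assumptions, not the library's code: the helper names shadow_threshold and count_hits are hypothetical, and the percentile is computed here with simple linear interpolation. With percentile=100 the threshold equals the maximum shadow importance, as in the original Boruta; lower values give a more lenient threshold.

```python
def shadow_threshold(shadow_importances, percentile=100):
    """Return the given percentile of the shadow importance values.

    Hypothetical helper: linear interpolation between sorted values.
    """
    if not shadow_importances:
        raise ValueError("need at least one shadow importance")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be within [0, 100]")
    values = sorted(shadow_importances)
    # Fractional position of the requested percentile in the sorted list.
    pos = (len(values) - 1) * percentile / 100
    lo = int(pos)
    hi = min(lo + 1, len(values) - 1)
    return values[lo] + (values[hi] - values[lo]) * (pos - lo)


def count_hits(real_importances, shadow_importances, percentile=100):
    """Mark a "hit" for every real feature above the shadow threshold."""
    threshold = shadow_threshold(shadow_importances, percentile)
    return [imp > threshold for imp in real_importances]
```

For example, with shadow importances 0.1 to 0.5, a real feature scoring 0.45 is not a hit at percentile=100 but becomes one at percentile=50.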
- calculate_importance(model, trial_data, abs_=True)[source]
- Parameters:
model (_E) – Estimator with fit method. In case of using shap importance, a tree-based estimator supported by Tree explainer is expected.
trial_data (TrialData) – Datasets generated for the current trial. See eBoruta.containers.Dataset.generate_trial_sample().
abs_ – Take the absolute value of the importance array. True by default since shap contributions may be negative.
- Returns:
An array of importance values.
- Return type:
ndarray
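Why abs_ defaults to True can be seen from a small sketch. It assumes, purely for illustration, that a feature's importance is the mean of its per-sample contributions; whether the library takes the absolute value before or after aggregation is not stated here. The function mean_contribution is a hypothetical helper, not part of eBoruta.

```python
def mean_contribution(contributions, abs_=True):
    """Average per-sample contributions into one value per feature.

    contributions: list of rows, one row per sample, one value per feature.
    """
    n_samples = len(contributions)
    n_features = len(contributions[0])
    totals = [0.0] * n_features
    for row in contributions:
        for j, value in enumerate(row):
            # Signed values may cancel out; absolute values cannot.
            totals[j] += abs(value) if abs_ else value
    return [total / n_samples for total in totals]


# Feature 0 pushes predictions up and down equally; feature 1 is weak.
contribs = [[2.0, 0.5], [-2.0, 0.5]]
signed = mean_contribution(contribs, abs_=False)
absolute = mean_contribution(contribs, abs_=True)
```

With signed values the influential feature 0 averages to zero and would rank below the weak feature 1, which is the situation the default guards against.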
- fit(x, y, sample_weight=None, model_type=None, callbacks_trial_start=None, callbacks_trial_end=None, model_init_kwargs=None, **kwargs)[source]
Fit the boruta algorithm.
- Parameters:
x (_X) – Features collection as a 2D array or a pd.DataFrame.
y (_Y) – Response variable(s).
sample_weight (ndarray | None) – Optional sample weight for each instance in x, for models that support it within the fit() method.
model_type (Type[_E] | None) – An uninitialized estimator type with fit(x, y) and predict(x) methods. If None, use RandomForestClassifier if classification is True, else use RandomForestRegressor.
callbacks_trial_start (Sequence[CallbackFN | CallbackClass] | None) – Callbacks to call at each trial’s start.
callbacks_trial_end (Sequence[CallbackFN | CallbackClass] | None) – Callbacks to call at each trial’s end.
model_init_kwargs (dict[str, Any] | None) – Optional keyword arguments to initialize the estimator type with. If not provided, it is assumed that the estimator can be initialized without any arguments. If model_type is a Pipeline, the last step must be a supported classifier or regressor named model.
kwargs – Passed to the model.fit() method.
- Returns:
eBoruta object.
- Return type:
eBoruta
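How model_type, model_init_kwargs and the extra keyword arguments relate can be sketched as below. This is an illustration rather than the library's code: DummyEstimator and build_and_fit are hypothetical names, and the real fit() does much more (trials, shadow features, statistical tests).

```python
class DummyEstimator:
    """Hypothetical estimator exposing fit(x, y) and predict(x)."""

    def __init__(self, constant=0):
        self.constant = constant
        self.fit_kwargs = {}

    def fit(self, x, y, **kwargs):
        # Remember extra fit arguments so the flow can be inspected.
        self.fit_kwargs = kwargs
        return self

    def predict(self, x):
        return [self.constant for _ in x]


def build_and_fit(model_type, x, y, model_init_kwargs=None, **kwargs):
    """Instantiate an uninitialized estimator type, then fit it."""
    # Without init kwargs the type must be constructible with no arguments.
    model = model_type(**(model_init_kwargs or {}))
    return model.fit(x, y, **kwargs)


model = build_and_fit(
    DummyEstimator, [[1], [2]], [0, 1],
    model_init_kwargs={"constant": 7}, verbose=False,
)
```

The point of passing a type rather than an instance is that a fresh estimator can be created whenever one is needed, with the init arguments kept separate from the fit arguments.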
- rank(features=None, step=None, fit=True, model=None, gen_sample=False, sort=False)[source]
Rank (sort) features by feature importance values.
Uses calculate_importance() to obtain importance of selected features and dataset_ to obtain the data.
- Parameters:
features (Sequence[str] | ndarray | None) – A sequence of features to select.
step (int | None) – A step (trial) number. If provided, the method will select the features accepted at this trial. If features were also provided, the selection is intersected with that list.
fit (bool) – Fit the model before calculating importance. In most cases, this should be True, since otherwise the features used to fit the model_ would be different from the features being ranked (for which the model_ will be queried in order to calculate the importance values).
model (_E | None) – Use a prefit model instead of the stored model_.
gen_sample (bool) – Generate trial sample from the dataset_ using test_size and stratify values provided during init.
sort (bool) – Sort results by importance values in descending order.
- Returns:
A DataFrame with Feature and Importance columns.
- Return type:
DataFrame
- report_features(features=None, full=False)[source]
Create a text report of the optimization progress. Printing it is practical mainly when debugging with a small number of features.
- Parameters:
features (Features | None) – Current features.
full (bool) – Create a very lengthy report for each individual feature.
- Returns:
Concatenated report messages as a single string.
- Return type:
str
- rough_fix(n_last_trials=None)[source]
Apply “rough fix” strategy to handle remaining tentative features.
Tentative features having median(importance) > median(thresholds) are accepted, where importance and thresholds are the history records of tentative features’ importance values and of the percentile thresholds used to mark “hits” after each trial.
- Parameters:
n_last_trials (int | None) – Consider only this number of last trials. If None, defaults to all trials.
- Returns:
Modified features with resolved tentative ones.
- Return type:
Features
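The median comparison behind the “rough fix” can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name resolve_tentative and the plain dict/list inputs are assumptions, and treating features that fail the comparison as rejected follows the original Boruta convention rather than anything stated above.

```python
from statistics import median


def resolve_tentative(importance_history, threshold_history, n_last_trials=None):
    """Split tentative features into accepted and rejected by medians.

    importance_history: dict mapping feature name to per-trial importances.
    threshold_history: per-trial shadow thresholds used to mark hits.
    """
    if n_last_trials is not None:
        if n_last_trials < 1:
            raise ValueError("n_last_trials must be a positive integer")
        # Keep only the most recent trials.
        threshold_history = threshold_history[-n_last_trials:]
        importance_history = {
            name: values[-n_last_trials:]
            for name, values in importance_history.items()
        }
    threshold = median(threshold_history)
    accepted, rejected = [], []
    for name, values in importance_history.items():
        if median(values) > threshold:
            accepted.append(name)
        else:
            rejected.append(name)  # assumption: the remainder is rejected
    return accepted, rejected


history = {"a": [0.1, 0.6, 0.7], "b": [0.5, 0.5, 0.2]}
thresholds = [0.4, 0.3, 0.5]
all_trials = resolve_tentative(history, thresholds)
last_trial = resolve_tentative(history, thresholds, n_last_trials=1)
```

Restricting the comparison to recent trials can change the outcome: feature "b" passes over all three trials but fails when only the last one is considered.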
- set_fit_request(*, callbacks_trial_end='$UNCHANGED$', callbacks_trial_start='$UNCHANGED$', model_init_kwargs='$UNCHANGED$', model_type='$UNCHANGED$', sample_weight='$UNCHANGED$', x='$UNCHANGED$')
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- callbacks_trial_end: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the callbacks_trial_end parameter in fit.
- callbacks_trial_start: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the callbacks_trial_start parameter in fit.
- model_init_kwargs: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the model_init_kwargs parameter in fit.
- model_type: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the model_type parameter in fit.
- sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in fit.
- x: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the x parameter in fit.
Returns
- self: object
The updated object.
- Return type:
eBoruta
- set_transform_request(*, tentative='$UNCHANGED$', x='$UNCHANGED$')
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters
- tentative: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the tentative parameter in transform.
- x: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the x parameter in transform.
Returns
- self: object
The updated object.
- Return type:
eBoruta
- stat_tests(features, iter_i)[source]
Test features at current iteration based on importance values.
- Parameters:
features (Features) – Features state at current iteration.
iter_i (int) – Iteration starting from 1.
- Returns:
Accepted and rejected feature names. Features present in eBoruta.containers.Features.names but not accepted or rejected are considered “tentative”.
- Return type:
tuple[ndarray, ndarray]
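One way to picture the accept/reject/tentative split is a pair of one-sided binomial tests on hit counts, as in the original Boruta. The sketch below rests on that assumption: the exact test, any multiple-testing correction, and the helper names binom_sf, binom_cdf and classify are illustrative and are not taken from eBoruta.

```python
from math import comb


def binom_sf(hits, n_trials, p=0.5):
    """P(X >= hits) for X ~ Binomial(n_trials, p)."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(hits, n_trials + 1)
    )


def binom_cdf(hits, n_trials, p=0.5):
    """P(X <= hits) for X ~ Binomial(n_trials, p)."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(0, hits + 1)
    )


def classify(hit_counts, n_trials, pvalue=0.05):
    """Return (accepted, rejected, tentative) feature names."""
    accepted, rejected = [], []
    for name, hits in hit_counts.items():
        if binom_sf(hits, n_trials) < pvalue:
            accepted.append(name)  # too many hits to be chance
        elif binom_cdf(hits, n_trials) < pvalue:
            rejected.append(name)  # too few hits to be chance
    decided = set(accepted) | set(rejected)
    # Anything neither accepted nor rejected stays tentative.
    tentative = [name for name in hit_counts if name not in decided]
    return accepted, rejected, tentative


result = classify({"strong": 10, "noise": 0, "unclear": 5}, n_trials=10)
```

A feature hit in every one of 10 trials is accepted, one never hit is rejected, and one hit half of the time remains tentative until more trials accumulate.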
- transform(x, tentative=False)[source]
Transform input data using fitted model. Transformation means selecting accepted features from x. Because the class inherits from TransformerMixin, fit_transform() is also available.
- Parameters:
x (_X) – Data used to fit() the algorithm.
tentative (bool) – Also select tentative features.
- Returns:
A subset of x.
- Return type:
DataFrame
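The selection that transform performs can be pictured with plain dictionaries standing in for a DataFrame. This is only a sketch: select_features and its arguments are hypothetical, and the real method works on the fitted object's own feature state.

```python
def select_features(columns, accepted, tentative_names=(), tentative=False):
    """Keep accepted columns, optionally adding tentative ones.

    columns: dict mapping column name to a list of values.
    """
    keep = set(accepted)
    if tentative:
        keep.update(tentative_names)
    # Iterate over the input so the original column order is preserved.
    return {name: values for name, values in columns.items() if name in keep}


data = {"a": [1], "b": [2], "c": [3]}
strict = select_features(data, accepted=["c", "a"], tentative_names=["b"])
lenient = select_features(
    data, accepted=["c", "a"], tentative_names=["b"], tentative=True
)
```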
eBoruta.base module
Base types and objects to inherit from.
- exception eBoruta.base.ValidationError[source]
Bases:
ValueError
Cases of failure to validate data.
eBoruta.callbacks module
- class eBoruta.callbacks.CallbackClass(*args, **kwargs)[source]
Bases:
Protocol
- __init__(*args, **kwargs)
- class eBoruta.callbacks.CallbackFN(*args, **kwargs)[source]
Bases:
Protocol
- __init__(*args, **kwargs)
- class eBoruta.callbacks.EvalSetSupplier(param_name='eval_set')[source]
Bases:
object
- class eBoruta.callbacks.IterationAdjuster(param_name, min_value, reducer=functools.partial(<function reduce_by_fraction>, frac=0.5))[source]
Bases:
object
eBoruta.containers module
Types holding intermediate and final data for the algorithm.
- class eBoruta.containers.Dataset(x, y, w=None, min_features=5)[source]
Bases:
Generic[_X, _Y]
A container holding permanent data (x, y and weights) for training/validation/testing/etc.
- generate_trial_sample(columns=None, **kwargs)[source]
Generates data for a single Boruta trial based on x, y, and w. Creates a copy of x, permutes rows, and renames columns as “shadow_{original_name}”. Concatenates the original dataframe and the one with the shadow features to create a copy of the learning data with at least twice as many features.
If the number of features in x after selecting by columns is below min_features, randomly oversamples existing features to account for the difference. Thus, the returned dataframe is guaranteed to have at least min_features columns.
- Parameters:
columns (None | list[str] | ndarray) – An optional list or array of columns to select from x.
kwargs – Keyword args passed to train_test_split() used to create train/test splits. Enable this feature by passing test_size={f} where f is the test size fraction. This allows using different datasets for training and importance computation.
- Returns:
A prepared trial data.
- Raises:
RuntimeError – If resulting features have duplicate names.
- Return type:
TrialData
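The shadow-feature construction described for generate_trial_sample can be sketched without pandas, using a dict of columns. This is an illustration under assumptions, not the library's code: add_shadow_features is a hypothetical name, a single row permutation is shared by all shadow columns (one reading of "permutes rows"), and the min_features oversampling and train/test split are left out.

```python
import random


def add_shadow_features(columns, seed=0):
    """Append a row-permuted "shadow_" copy of every column.

    columns: dict mapping column name to a list of values (one per row).
    """
    rng = random.Random(seed)
    n_rows = len(next(iter(columns.values())))
    # One shared row permutation for all shadow columns.
    order = list(range(n_rows))
    rng.shuffle(order)
    result = dict(columns)
    for name, values in columns.items():
        shadow_name = f"shadow_{name}"
        if shadow_name in result:
            # Mirrors the documented failure on duplicate feature names.
            raise RuntimeError(f"duplicate feature name: {shadow_name}")
        result[shadow_name] = [values[i] for i in order]
    return result


data = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
trial = add_shadow_features(data)
```

The result has twice as many columns as the input. Each shadow column holds the same values as its original, but the rows no longer line up with the target, which is what makes shadow features a reference for chance-level importance.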
- property w: ndarray | None
- Returns:
Sample weights’ array.
- property x: DataFrame
- Returns:
Variables’ dataframe.
- property y: ndarray
- Returns:
Target variables’ array.
- class eBoruta.containers.Features(names, _history=None)[source]
Bases:
object
A dynamic container representing a set of features used by Boruta throughout the run.
It’s created internally and maintained by eBoruta.algorithm.eBoruta.
- __init__(names, _history=None)
- accepted_at_step(step)[source]
- Parameters:
step (int) – Step (trial) number.
- Returns:
Feature names accepted at step.
- Return type:
ndarray
- compose_history()[source]
Access the selection history and compose a summary table.
- Returns:
A history dataframe.
- Return type:
DataFrame
- reset_history_index()[source]
Bulk pd.DataFrame.reset_index() of the importance, decision and hit history dataframes.
- property accepted: ndarray
- Returns:
An array of feature names marked as accepted.
- accepted_mask: ndarray
- dec_history: DataFrame
- property history: DataFrame
- Returns:
A history dataframe, created using compose_history() if it doesn’t exist.
- hit_history: DataFrame
- imp_history: DataFrame
- names: ndarray
An array of feature names.
- property rejected: ndarray
- Returns:
An array of feature names marked as rejected.
- rejected_mask: ndarray
- property shape: tuple[int, int]
- Returns:
(# steps, # features)
- property tentative: ndarray
- Returns:
An array of feature names marked as tentative.
- tentative_mask: ndarray
- class eBoruta.containers.TrialData(x_train, x_test, y_train, y_test, w_train=None, w_test=None)[source]
Bases:
Generic[_Y]
Data for a Boruta trial.
- __init__(x_train, x_test, y_train, y_test, w_train=None, w_test=None)
- property shapes: str
Descriptor property.
- Returns:
A string with the shapes of the x and y attributes.
- w_test: ndarray | None = None
- w_train: ndarray | None = None
Weights for train and test folds
- x_test: DataFrame
- x_train: DataFrame
- y_test: _Y
- y_train: _Y
eBoruta.utils module
- eBoruta.utils.convert_to_array(a)[source]
- Parameters:
a (Any) – Any object.
- Returns:
An np.array(a).
- Raises:
TypeError – If the above fails.
- Return type:
ndarray
- eBoruta.utils.sample_dataset(regression=False, multiclass=False, multitarget=False)[source]
Make a sample dataset with 30 features and 100 samples.
- Parameters:
regression (bool) – Regression objective. Otherwise, classification assumed.
multiclass (bool) – Multiple (3) classes for classification.
multitarget (bool) – Multiple (3) targets for regression.
- Returns:
DataFrame with predictors and DataFrame with response variables.
- Return type:
tuple[DataFrame, DataFrame]