Feature Selection Pipeline

class moosefs.feature_selection_pipeline.FeatureSelectionPipeline(data: DataFrame, fs_methods: list, merging_strategy: Any, num_repeats: int, num_features_to_select: int | None, metrics: list = ['logloss', 'f1_score', 'accuracy'], task: str = 'classification', min_group_size: int = 2, fill: bool = False, random_state: int | None = None, n_jobs: int | None = None)[source]

Bases: object

End-to-end pipeline for ensemble feature selection.

Orchestrates feature scoring, merging, metric evaluation, and Pareto-based selection across repeated runs and method subgroups.

__init__(data: DataFrame, fs_methods: list, merging_strategy: Any, num_repeats: int, num_features_to_select: int | None, metrics: list = ['logloss', 'f1_score', 'accuracy'], task: str = 'classification', min_group_size: int = 2, fill: bool = False, random_state: int | None = None, n_jobs: int | None = None) None[source]

Initialize the pipeline.

Parameters:
  • data – DataFrame including a ‘target’ column.

  • fs_methods – Feature selectors (identifiers or instances).

  • merging_strategy – Merging strategy (identifier or instance).

  • num_repeats – Number of repeats for the pipeline.

  • num_features_to_select – Desired number of features to select.

  • metrics – Metric functions (identifiers or instances).

  • task – ‘classification’ or ‘regression’.

  • min_group_size – Minimum number of methods in each subgroup.

  • fill – If True, enforce exact size after merging.

  • random_state – Seed for reproducibility.

  • n_jobs – Number of parallel jobs; when -1 or None, defaults to num_repeats.

Raises:

ValueError – If task is invalid or required parameters are missing.

static _validate_task(task: str) None[source]

Validate task string.

Parameters:

task – Must be ‘classification’ or ‘regression’.

static _set_seed(seed: int, idx: int | None = None) None[source]

Seed numpy/python RNGs for reproducibility.

_per_repeat_seed(idx: int) int[source]

Derive a per-repeat seed from the top-level seed.
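The exact derivation is internal to the pipeline; a minimal sketch of one common scheme (a hypothetical `per_repeat_seed` helper, assuming a simple index offset) illustrates the idea of making each repeat distinct yet reproducible:

```python
# Hypothetical sketch: derive a stable per-repeat seed from one
# top-level seed. The pipeline's actual derivation may differ.
import random


def per_repeat_seed(base_seed: int, idx: int) -> int:
    # Offset the base seed by the repeat index so every repeat gets a
    # distinct but deterministic seed.
    return base_seed + idx


def seeded_shuffle(items: list, base_seed: int, idx: int) -> list:
    # Same (base_seed, idx) pair always yields the same ordering.
    rng = random.Random(per_repeat_seed(base_seed, idx))
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```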

_effective_n_jobs() int[source]

Return parallel job count capped by number of repeats.

_generate_subgroup_names(min_group_size: int) list[source]

Generate all selector-name combinations of at least the minimum size.

Parameters:

min_group_size – Minimum subgroup size.

Returns:

List of tuples of selector names.
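The enumeration amounts to taking every combination of selector names whose size is at least min_group_size; a self-contained sketch (not the pipeline's internal code) using itertools:

```python
# Illustrative reimplementation of the subgroup enumeration: all
# selector-name combinations of size >= min_group_size.
from itertools import combinations


def generate_subgroup_names(selector_names: list, min_group_size: int) -> list:
    groups = []
    for size in range(min_group_size, len(selector_names) + 1):
        groups.extend(combinations(selector_names, size))
    return groups
```

With three selectors and min_group_size=2, this yields the three pairs plus the full triple.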

run(verbose: bool = True) tuple[source]

Execute the pipeline and return best merged features.

Returns:

(merged_features, best_repeat_idx, best_group_names).

_pipeline_run_for_repeat(i: int, verbose: bool) Any[source]

Execute one repeat and return partial results tuple.

_replace_none(metrics: list) list[source]

Replace any group whose metrics are None with a row of -inf values.

Parameters:

metrics – Per-group metric lists.

Returns:

Same shape with None replaced by -inf rows.
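A minimal sketch of this replacement step (standalone, with the row length inferred from the first valid row): a failed group becomes a row of -inf, so downstream comparisons can never select it.

```python
# Sketch: replace None metric rows with rows of -inf so a failed
# group is never preferred during selection.
import math


def replace_none(metrics: list) -> list:
    # Infer the row length from the first non-None row (0 if none exist).
    row_len = next((len(r) for r in metrics if r is not None), 0)
    return [[-math.inf] * row_len if r is None else r for r in metrics]
```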

_split_data(test_size: float, random_state: int) tuple[source]

Split data into train/test sets, stratifying on the target for classification tasks.
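The pipeline itself likely delegates to a library splitter; this pure-Python sketch only shows the stratification idea: split each class's indices separately, then recombine, so class proportions are preserved in both partitions.

```python
# Illustrative stratified split on labels (indices only): shuffle each
# class's indices with a seeded RNG, carve off a test fraction per class,
# and pool the remainders into the training set.
import random
from collections import defaultdict


def stratified_split(labels: list, test_size: float, seed: int):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_size))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)
```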

_compute_subset(train_data: DataFrame, idx: int) dict[source]

Compute selected Feature objects per method for this repeat.

_compute_merging(fs_subsets_local: dict, idx: int, verbose: bool = True) dict[source]

Merge per-group features and return mapping for this repeat.

_merge_group_features(fs_subsets_local: dict, idx: int, group: tuple) list[source]

Merge features for a specific group of methods.

Parameters:
  • idx – Repeat index.

  • group – Tuple of selector names.

Returns:

Merged features (type depends on strategy).

_compute_performance_metrics(X_train: DataFrame, y_train: Series, X_test: DataFrame, y_test: Series) list[source]

Compute performance metrics using configured metric methods.

Returns:

Averaged metric values per configured metric.

_compute_metrics(fs_subsets_local: dict, merged_features_local: dict, train_data: DataFrame, test_data: DataFrame, idx: int) list[source]

Compute and collect performance and stability metrics for subgroups.

Parameters:
  • fs_subsets_local – Local selected Feature lists per (repeat, method).

  • merged_features_local – Merged features per (repeat, group).

  • train_data – Training dataframe.

  • test_data – Test dataframe.

  • idx – Repeat index.

Returns:

List of per-metric dicts keyed by (repeat, group).

static _calculate_means(result_dicts: list, group_names: list) list[source]

Calculate mean metrics per subgroup across repeats.

Parameters:
  • result_dicts – Per-metric dicts keyed by (repeat, group).

  • group_names – Subgroup names to summarize.

Returns:

List of [means per metric] for each subgroup.
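A standalone sketch of this aggregation, assuming the documented (repeat, group) key shape: for each subgroup, average each metric's values across repeats.

```python
# Sketch: collapse per-(repeat, group) metric dicts into one row of
# per-metric means for each subgroup.
from statistics import mean


def calculate_means(result_dicts: list, group_names: list) -> list:
    summary = []
    for group in group_names:
        row = []
        for metric_dict in result_dicts:
            # Gather this metric's values for the group across all repeats.
            values = [v for (rep, g), v in metric_dict.items() if g == group]
            row.append(mean(values))
        summary.append(row)
    return summary
```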

static _compute_pareto(groups: list, names: list) Any[source]

Return the name of the winning subgroup using Pareto-dominance analysis.
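The pipeline's exact tie-breaking is not documented here; a hypothetical sketch of the general technique (assuming all metrics are higher-is-better after any sign normalization) keeps the non-dominated rows and picks one winner from that front:

```python
# Hypothetical sketch of Pareto-based winner selection over per-group
# mean-metric rows (higher is better for every metric).


def dominates(a: list, b: list) -> bool:
    # a dominates b if a is >= on every metric and > on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_winner(rows: list, names: list):
    # Pareto front: rows not dominated by any other row.
    front = [i for i, r in enumerate(rows)
             if not any(dominates(o, r) for o in rows if o is not r)]
    # Tie-break within the front by the sum of metrics (an assumption,
    # not necessarily the pipeline's rule).
    best = max(front, key=lambda i: sum(rows[i]))
    return names[best]
```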

_extract_repeat_metrics(group: Any, *result_dicts: dict) list[source]

Return a row per repeat for the given group.

Missing values remain as None and are later replaced by -inf.

_load_class(input: Any, instantiate: bool = False) Any[source]

Resolve identifiers to classes/instances and optionally instantiate.

Parameters:
  • input – Identifier or instance of a selector/merger/metric.

  • instantiate – If True, instantiate using extracted parameters.

Returns:

Class or instance.

Raises:

ValueError – If input is invalid.

_num_metrics_total() int[source]

Return total number of metrics tracked per group.

Includes performance metrics plus stability and optional agreement.