Feature Selection Pipeline

class moosefs.feature_selection_pipeline.FeatureSelectionPipeline(data: DataFrame, fs_methods: list, merging_strategy: Any, num_repeats: int, num_features_to_select: int | None, metrics: list = ['logloss', 'f1_score', 'accuracy'], task: str = 'classification', min_group_size: int = 2, fill: bool = False, random_state: int | None = None, n_jobs: int | None = None)[source]

Bases: object

End-to-end pipeline for ensemble feature selection.

Orchestrates feature scoring, merging, metric evaluation, and Pareto-based selection across repeated runs and method subgroups.

__init__(data: DataFrame, fs_methods: list, merging_strategy: Any, num_repeats: int, num_features_to_select: int | None, metrics: list = ['logloss', 'f1_score', 'accuracy'], task: str = 'classification', min_group_size: int = 2, fill: bool = False, random_state: int | None = None, n_jobs: int | None = None) None[source]

Initialize the pipeline.

Parameters:
  • data – DataFrame including a ‘target’ column.

  • fs_methods – Feature selectors (identifiers or instances).

  • merging_strategy – Merging strategy (identifier or instance).

  • num_repeats – Number of repeats for the pipeline.

  • num_features_to_select – Desired number of features to select.

  • metrics – Metric functions (identifiers or instances).

  • task – ‘classification’ or ‘regression’.

  • min_group_size – Minimum number of methods in each subgroup.

  • fill – If True, enforce exact size after merging.

  • random_state – Seed for reproducibility.

  • n_jobs – Number of parallel jobs; when -1 or None, defaults to num_repeats.

Raises:

ValueError – If task is invalid or required parameters are missing.

static _validate_task(task: str) None[source]

Validate task string.

Parameters:

task – Must be ‘classification’ or ‘regression’.

static _set_seed(seed: int, idx: int | None = None) None[source]

Seed numpy/python RNGs for reproducibility.

_per_repeat_seed(idx: int) int[source]

Derive a per-repeat seed from the top-level seed.
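The exact derivation is internal to the pipeline; a minimal sketch of one common scheme (a hypothetical `per_repeat_seed` helper, assuming a simple index offset) illustrates the idea of making each repeat distinct yet reproducible:

```python
# Hypothetical sketch: derive a stable per-repeat seed from one
# top-level seed. The pipeline's actual derivation may differ.
import random


def per_repeat_seed(base_seed: int, idx: int) -> int:
    # Offset the base seed by the repeat index so every repeat gets a
    # distinct but deterministic seed.
    return base_seed + idx


def seeded_shuffle(items: list, base_seed: int, idx: int) -> list:
    # Same (base_seed, idx) pair always yields the same ordering.
    rng = random.Random(per_repeat_seed(base_seed, idx))
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```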

_effective_n_jobs() int[source]

Return parallel job count capped by number of repeats.

_generate_subgroup_names(min_group_size: int) list[source]

Generate all selector-name combinations of at least the minimum size.

Parameters:

min_group_size – Minimum subgroup size.

Returns:

List of tuples of selector names.
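The enumeration amounts to taking every combination of selector names whose size is at least min_group_size; a self-contained sketch (not the pipeline's internal code) using itertools:

```python
# Illustrative reimplementation of the subgroup enumeration: all
# selector-name combinations of size >= min_group_size.
from itertools import combinations


def generate_subgroup_names(selector_names: list, min_group_size: int) -> list:
    groups = []
    for size in range(min_group_size, len(selector_names) + 1):
        groups.extend(combinations(selector_names, size))
    return groups
```

With three selectors and min_group_size=2, this yields the three pairs plus the full triple.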

run(verbose: bool = True) tuple[source]

Execute the pipeline and return best merged features.

Returns:

(merged_features, best_repeat_idx, best_group_names).

_pipeline_run_for_repeat(i: int, verbose: bool) Any[source]

Execute one repeat and return partial results tuple.

_replace_none(metrics: list) list[source]

Replace any group whose metrics are None with a row of -inf values.

Parameters:

metrics – Per-group metric lists.

Returns:

Same shape with None replaced by -inf rows.
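A minimal sketch of this replacement step (standalone, with the row length inferred from the first valid row): a failed group becomes a row of -inf, so downstream comparisons can never select it.

```python
# Sketch: replace None metric rows with rows of -inf so a failed
# group is never preferred during selection.
import math


def replace_none(metrics: list) -> list:
    # Infer the row length from the first non-None row (0 if none exist).
    row_len = next((len(r) for r in metrics if r is not None), 0)
    return [[-math.inf] * row_len if r is None else r for r in metrics]
```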

_split_data(test_size: float, random_state: int) tuple[source]

Split data into train/test sets, stratifying on the target for classification tasks.
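The pipeline itself likely delegates to a library splitter; this pure-Python sketch only shows the stratification idea: split each class's indices separately, then recombine, so class proportions are preserved in both partitions.

```python
# Illustrative stratified split on labels (indices only): shuffle each
# class's indices with a seeded RNG, carve off a test fraction per class,
# and pool the remainders into the training set.
import random
from collections import defaultdict


def stratified_split(labels: list, test_size: float, seed: int):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_size))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)
```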

_compute_subset(train_data: DataFrame, idx: int) dict[source]

Compute selected Feature objects per method for this repeat.

_compute_merging(fs_subsets_local: dict, idx: int, verbose: bool = True) dict[source]

Merge per-group features and return mapping for this repeat.

_merge_group_features(fs_subsets_local: dict, idx: int, group: tuple) list[source]

Merge features for a specific group of methods.

Parameters:
  • idx – Repeat index.

  • group – Tuple of selector names.

Returns:

Merged features (type depends on strategy).

_compute_performance_metrics(X_train: DataFrame, y_train: Series, X_test: DataFrame, y_test: Series) list[source]

Compute performance metrics using configured metric methods.

Returns:

Averaged metric values per configured metric.

_compute_metrics(fs_subsets_local: dict, merged_features_local: dict, train_data: DataFrame, test_data: DataFrame, idx: int) list[source]

Compute and collect performance and stability metrics for subgroups.

Parameters:
  • fs_subsets_local – Local selected Feature lists per (repeat, method).

  • merged_features_local – Merged features per (repeat, group).

  • train_data – Training dataframe.

  • test_data – Test dataframe.

  • idx – Repeat index.

Returns:

List of per-metric dicts keyed by (repeat, group).

static _calculate_means(result_dicts: list, group_names: list) list[source]

Calculate mean metrics per subgroup across repeats.

Parameters:
  • result_dicts – Per-metric dicts keyed by (repeat, group).

  • group_names – Subgroup names to summarize.

Returns:

List of [means per metric] for each subgroup.
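A standalone sketch of this aggregation, assuming the documented (repeat, group) key shape: for each subgroup, average each metric's values across repeats.

```python
# Sketch: collapse per-(repeat, group) metric dicts into one row of
# per-metric means for each subgroup.
from statistics import mean


def calculate_means(result_dicts: list, group_names: list) -> list:
    summary = []
    for group in group_names:
        row = []
        for metric_dict in result_dicts:
            # Gather this metric's values for the group across all repeats.
            values = [v for (rep, g), v in metric_dict.items() if g == group]
            row.append(mean(values))
        summary.append(row)
    return summary
```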

static _compute_pareto(groups: list, names: list) Any[source]

Return the name of the winning subgroup using Pareto-dominance analysis.
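The pipeline's exact tie-breaking is not documented here; a hypothetical sketch of the general technique (assuming all metrics are higher-is-better after any sign normalization) keeps the non-dominated rows and picks one winner from that front:

```python
# Hypothetical sketch of Pareto-based winner selection over per-group
# mean-metric rows (higher is better for every metric).


def dominates(a: list, b: list) -> bool:
    # a dominates b if a is >= on every metric and > on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_winner(rows: list, names: list):
    # Pareto front: rows not dominated by any other row.
    front = [i for i, r in enumerate(rows)
             if not any(dominates(o, r) for o in rows if o is not r)]
    # Tie-break within the front by the sum of metrics (an assumption,
    # not necessarily the pipeline's rule).
    best = max(front, key=lambda i: sum(rows[i]))
    return names[best]
```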

_extract_repeat_metrics(group: Any, *result_dicts: dict) list[source]

Return a row per repeat for the given group.

Missing values remain as None and are later replaced by -inf.

_load_class(input: Any, instantiate: bool = False) Any[source]

Resolve identifiers to classes/instances and optionally instantiate.

Parameters:
  • input – Identifier or instance of a selector/merger/metric.

  • instantiate – If True, instantiate using extracted parameters.

Returns:

Class or instance.

Raises:

ValueError – If input is invalid.

_num_metrics_total() int[source]

Return total number of metrics tracked per group.

Includes performance metrics plus stability and optional agreement.