Feature Selection Pipeline
- class moosefs.feature_selection_pipeline.FeatureSelectionPipeline(data: DataFrame, fs_methods: list, merging_strategy: Any, num_repeats: int, num_features_to_select: int | None, metrics: list = ['logloss', 'f1_score', 'accuracy'], task: str = 'classification', min_group_size: int = 2, fill: bool = False, random_state: int | None = None, n_jobs: int | None = None)[source]
Bases: object
End-to-end pipeline for ensemble feature selection.
Orchestrates feature scoring, merging, metric evaluation, and Pareto-based selection across repeated runs and method subgroups.
- __init__(data: DataFrame, fs_methods: list, merging_strategy: Any, num_repeats: int, num_features_to_select: int | None, metrics: list = ['logloss', 'f1_score', 'accuracy'], task: str = 'classification', min_group_size: int = 2, fill: bool = False, random_state: int | None = None, n_jobs: int | None = None) None [source]
Initialize the pipeline.
- Parameters:
data – DataFrame including a ‘target’ column.
fs_methods – Feature selectors (identifiers or instances).
merging_strategy – Merging strategy (identifier or instance).
num_repeats – Number of repeats for the pipeline.
num_features_to_select – Desired number of features to select.
metrics – Metric functions (identifiers or instances).
task – ‘classification’ or ‘regression’.
min_group_size – Minimum number of methods in each subgroup.
fill – If True, enforce the exact number of selected features after merging.
random_state – Seed for reproducibility.
n_jobs – Number of parallel jobs; num_repeats jobs are used when -1 or None.
- Raises:
ValueError – If task is invalid or required parameters are missing.
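A minimal usage sketch; the dataset file and the selector/merging-strategy identifiers below are illustrative placeholders, not a list of identifiers guaranteed to ship with moosefs:

```python
import pandas as pd
from moosefs.feature_selection_pipeline import FeatureSelectionPipeline

# The dataframe must contain a 'target' column alongside the feature columns.
data = pd.read_csv("my_dataset.csv")  # hypothetical file

pipeline = FeatureSelectionPipeline(
    data=data,
    fs_methods=["mutual_info", "lasso", "random_forest"],  # illustrative identifiers
    merging_strategy="union",                              # illustrative identifier
    num_repeats=5,
    num_features_to_select=20,
    metrics=["logloss", "f1_score", "accuracy"],
    task="classification",
    min_group_size=2,
    fill=True,
    random_state=42,
    n_jobs=-1,
)
```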
- static _validate_task(task: str) None [source]
Validate task string.
- Parameters:
task – Must be ‘classification’ or ‘regression’.
- static _set_seed(seed: int, idx: int | None = None) None [source]
Seed numpy/python RNGs for reproducibility.
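A rough sketch of what such a helper typically does; the exact way the repeat index is folded into the seed is an assumption:

```python
import random
import numpy as np

def set_seed(seed, idx=None):
    # Offsetting the base seed by the repeat index (assumed scheme) keeps
    # repeats distinct from one another yet reproducible run-to-run.
    effective = seed if idx is None else seed + idx
    random.seed(effective)
    np.random.seed(effective)
```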
- _generate_subgroup_names(min_group_size: int) list [source]
Generate all selector-name combinations of at least the minimum size.
- Parameters:
min_group_size – Minimum subgroup size.
- Returns:
List of tuples of selector names.
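Conceptually this is an itertools.combinations sweep over the configured selector names; a sketch with illustrative names:

```python
from itertools import combinations

def generate_subgroup_names(selector_names, min_group_size):
    # Every combination of selector names with at least `min_group_size`
    # members, up to and including the full set of selectors.
    groups = []
    for size in range(min_group_size, len(selector_names) + 1):
        groups.extend(combinations(selector_names, size))
    return groups

# generate_subgroup_names(["mutual_info", "lasso", "random_forest"], 2)
# -> [('mutual_info', 'lasso'), ('mutual_info', 'random_forest'),
#     ('lasso', 'random_forest'), ('mutual_info', 'lasso', 'random_forest')]
```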
- run(verbose: bool = True) tuple [source]
Execute the pipeline and return the best merged features.
- Returns:
(merged_features, best_repeat_idx, best_group_names).
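Continuing the construction sketch above, the returned tuple unpacks as:

```python
merged_features, best_repeat_idx, best_group_names = pipeline.run(verbose=True)

print(best_group_names)  # e.g. ('lasso', 'random_forest') -- illustrative
print(best_repeat_idx)   # index of the repeat that produced the winner
print(merged_features)   # features merged by the configured strategy
```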
- _pipeline_run_for_repeat(i: int, verbose: bool) Any [source]
Execute one repeat and return a tuple of partial results.
- _replace_none(metrics: list) list [source]
Replace any group whose metrics are None with a list of -inf values.
- Parameters:
metrics – Per-group metric lists.
- Returns:
Same shape with None replaced by -inf rows.
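A minimal sketch of the substitution, assuming each metric row has a known length:

```python
def replace_none(metrics, row_length):
    # Groups whose metrics are missing (None) become rows of -inf, so they
    # can never dominate a real result in the later Pareto comparison.
    return [
        row if row is not None else [float("-inf")] * row_length
        for row in metrics
    ]
```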
- _split_data(test_size: float, random_state: int) tuple [source]
Split data into train/test sets, using stratification for classification tasks.
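A sketch of the split using scikit-learn's train_test_split; stratifying on the 'target' column is an assumption consistent with the description above:

```python
from sklearn.model_selection import train_test_split

def split_data(data, test_size, random_state, task):
    # Preserve class proportions in both splits for classification;
    # plain random split for regression.
    stratify = data["target"] if task == "classification" else None
    train_df, test_df = train_test_split(
        data, test_size=test_size, random_state=random_state, stratify=stratify
    )
    return train_df, test_df
```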
- _compute_subset(train_data: DataFrame, idx: int) dict [source]
Compute selected Feature objects per method for this repeat.
- _compute_merging(fs_subsets_local: dict, idx: int, verbose: bool = True) dict [source]
Merge per-group features and return mapping for this repeat.
- _merge_group_features(fs_subsets_local: dict, idx: int, group: tuple) list [source]
Merge features for a specific group of methods.
- Parameters:
fs_subsets_local – Local selected Feature lists per (repeat, method).
idx – Repeat index.
group – Tuple of selector names.
- Returns:
Merged features (type depends on strategy).
- _compute_performance_metrics(X_train: DataFrame, y_train: Series, X_test: DataFrame, y_test: Series) list [source]
Compute performance metrics using configured metric methods.
- Returns:
Averaged metric values per configured metric.
- _compute_metrics(fs_subsets_local: dict, merged_features_local: dict, train_data: DataFrame, test_data: DataFrame, idx: int) list [source]
Compute and collect performance and stability metrics for subgroups.
- Parameters:
fs_subsets_local – Local selected Feature lists per (repeat, method).
merged_features_local – Merged features per (repeat, group).
train_data – Training dataframe.
test_data – Test dataframe.
idx – Repeat index.
- Returns:
List of per-metric dicts keyed by (repeat, group).
- static _calculate_means(result_dicts: list, group_names: list) list [source]
Calculate mean metrics per subgroup across repeats.
- Parameters:
result_dicts – Per-metric dicts keyed by (repeat, group).
group_names – Subgroup names to summarize.
- Returns:
List of [means per metric] for each subgroup.
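A sketch of the averaging, assuming each per-metric dict maps a (repeat, group) key to a single metric value:

```python
import numpy as np

def calculate_means(result_dicts, group_names, num_repeats):
    # For each subgroup, collect one row per repeat (one value per metric)
    # and average across repeats; missing repeats were already turned into
    # -inf, so they lower the mean instead of raising an error.
    means = []
    for group in group_names:
        rows = [
            [d[(rep, group)] for d in result_dicts]
            for rep in range(num_repeats)
        ]
        means.append(list(np.mean(np.asarray(rows, dtype=float), axis=0)))
    return means
```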
- static _compute_pareto(groups: list, names: list) Any [source]
Return the name of the winning subgroup using Pareto analysis.
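A sketch of the Pareto step under two assumptions: every metric is oriented so that higher is better, and ties on the Pareto front are broken by the highest metric sum:

```python
import numpy as np

def pareto_winner(groups, names):
    # `groups` holds one mean-metric vector per subgroup; a subgroup is
    # dominated if another is at least as good on every metric and strictly
    # better on at least one.
    scores = np.asarray(groups, dtype=float)

    def dominated(i):
        return any(
            np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
            for j in range(len(scores))
            if j != i
        )

    front = [i for i in range(len(scores)) if not dominated(i)]
    best = max(front, key=lambda i: scores[i].sum())  # assumed tie-break
    return names[best]
```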
- _extract_repeat_metrics(group: Any, *result_dicts: dict) list [source]
Return a row per repeat for the given group.
Missing values remain as None and are later replaced by -inf.
- _load_class(input: Any, instantiate: bool = False) Any [source]
Resolve identifiers to classes/instances and optionally instantiate.
- Parameters:
input – Identifier or instance of a selector/merger/metric.
instantiate – If True, instantiate using extracted parameters.
- Returns:
Class or instance.
- Raises:
ValueError – If input is invalid.
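A sketch of the resolution logic; the registry dictionary is an assumption about how moosefs maps identifiers to its built-in selectors, mergers, and metrics:

```python
def load_class(input, registry, instantiate=False):
    # String identifiers are looked up in a registry of known classes;
    # objects that are already instances pass through unchanged.
    if isinstance(input, str):
        try:
            cls = registry[input]
        except KeyError as exc:
            raise ValueError(f"Unknown identifier: {input!r}") from exc
        return cls() if instantiate else cls
    if input is None:
        raise ValueError("input must be an identifier or an instance")
    return input
```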