Core Module

class moosefs.core.data_processor.DataProcessor(categorical_columns: list | None = None, columns_to_drop: list | None = None, drop_missing_values: bool = False, merge_key: str | None = None, normalize: bool = True, target_column: str = 'target')

Bases: object

__init__(categorical_columns: list | None = None, columns_to_drop: list | None = None, drop_missing_values: bool = False, merge_key: str | None = None, normalize: bool = True, target_column: str = 'target') → None

Initialize the DataProcessor with specific parameters for preprocessing.

Parameters:

categorical_columns – List of column names to treat as categorical.
columns_to_drop – List of column names to drop from the dataset.
drop_missing_values – Whether to drop rows that contain missing values.
merge_key – Column name to use as a key when merging data with metadata.
normalize – Whether to scale numerical features.
target_column – Name of the target column in the dataset.

preprocess_data(data: Any, index_col: str | None = None, metadata: Any | None = None) → DataFrame

Load and preprocess data from a CSV file or DataFrame, with optional metadata merging.

Parameters:

data – Path to the CSV file or a pandas DataFrame.
index_col – Column to set as index. Defaults to None.
metadata – Path to the CSV file or DataFrame containing metadata. Defaults to None.

Returns:

The preprocessed data as a pandas DataFrame.

_load_data(data: Any, index_col: str | None = None) → DataFrame

Helper method to load data and set the index if specified.

Parameters:

data – Path to the CSV file or a pandas DataFrame.
index_col – Column to set as index. Defaults to None.

Returns:

The loaded pandas DataFrame with index set if specified.

_merge_data_and_metadata(data_df: DataFrame, meta_df: DataFrame) → DataFrame

Merge the main data frame with metadata.

Parameters:

data_df – The main data DataFrame.
meta_df – The metadata DataFrame.

Returns:

The merged DataFrame.
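
As a rough illustration of merging on a key column, the sketch below uses plain pandas rather than DataProcessor itself. The column name sample_id stands in for merge_key, and the inner join is an assumption; the documentation above does not state which join type the library uses.

```python
import pandas as pd

# Hypothetical data and metadata sharing a key column ("sample_id" plays the role of merge_key).
data_df = pd.DataFrame({"sample_id": [1, 2, 3], "gene_a": [0.1, 0.5, 0.9]})
meta_df = pd.DataFrame({"sample_id": [1, 2], "label": ["case", "control"]})

# Assumption: an inner join, so rows without matching metadata are dropped.
merged = pd.merge(data_df, meta_df, on="sample_id", how="inner")
print(merged)
```

With a left join (how="left") the unmatched row would be kept and its metadata columns filled with NaN.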

_rename_target_column(data_df: DataFrame) → DataFrame

Rename the target column in the data frame to ‘target’.

Parameters:

data_df – The data DataFrame to be modified.

Returns:

The DataFrame with the renamed target column.

_drop_columns(data_df: DataFrame) → DataFrame

Drop specified columns from the data frame.

Parameters:

data_df – The data DataFrame to be modified.

Returns:

The DataFrame with specified columns dropped.

_drop_missing_values(data_df: DataFrame) → DataFrame

Drop rows that contain NaN values.

Parameters:

data_df – The data DataFrame with missing values.

Returns:

The DataFrame with rows containing missing values dropped.

_encode_categorical_variables(data_df: DataFrame) → DataFrame

Encode categorical variables using label encoding and store the mappings.

Parameters:

data_df – The data DataFrame with categorical columns.

Returns:

The DataFrame with categorical variables encoded.

get_label_mapping(column_name: str) → dict

Retrieve the label encoding mapping for a specific column.

Parameters:

column_name – The column for which to get the label encoding mapping.

Returns:

A dictionary mapping original labels to encoded values.
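
To make the relationship between label encoding and the stored mapping concrete, here is a minimal pandas-only sketch. The helper encode_categoricals is hypothetical and not part of moosefs; assigning codes in sorted label order is also an assumption.

```python
import pandas as pd

def encode_categoricals(df, categorical_columns):
    """Label-encode the given columns; return the encoded copy and the mappings."""
    encoded = df.copy()
    mappings = {}
    for col in categorical_columns:
        # Assumption: sorted unique labels receive consecutive codes 0, 1, 2, ...
        labels = sorted(encoded[col].dropna().unique())
        mappings[col] = {label: code for code, label in enumerate(labels)}
        encoded[col] = encoded[col].map(mappings[col])
    return encoded, mappings

df = pd.DataFrame({"tissue": ["lung", "brain", "lung"], "x": [1.0, 2.0, 3.0]})
encoded_df, label_mappings = encode_categoricals(df, ["tissue"])
print(label_mappings["tissue"])  # original label -> encoded value
```

The per-column dictionary is the kind of object get_label_mapping is documented to return: original labels as keys, encoded values as values.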

_scale_numerical_features(data_df: DataFrame) → DataFrame

Scale numerical features using standard scaling.

Parameters:

data_df – The data DataFrame with numerical columns.

Returns:

The DataFrame with numerical features scaled.
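
Standard scaling subtracts each column's mean and divides by its standard deviation. The sketch below shows the arithmetic with pandas only; standard_scale is a hypothetical helper, and which columns the library actually scales is not stated above, so they are passed explicitly here.

```python
import pandas as pd

def standard_scale(df, columns):
    """Return a copy of df with the given columns z-score scaled."""
    scaled = df.copy()
    for col in columns:
        mean = scaled[col].mean()
        std = scaled[col].std(ddof=0)  # population standard deviation
        # Constant columns have std == 0; divide by 1 so they become all zeros.
        scaled[col] = (scaled[col] - mean) / (std if std > 0 else 1.0)
    return scaled

df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "flat": [5.0, 5.0, 5.0]})
scaled_df = standard_scale(df, ["height", "flat"])
print(scaled_df)
```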

_filtered_time_dataset(data_df: DataFrame, min_num_timepoints: int, clone_column: str) → DataFrame

Filter dataset to retain only clones with at least min_num_timepoints.

Parameters:

data_df – DataFrame containing the dataset.
min_num_timepoints – Minimum number of time points required per clone.
clone_column – Column name for the clone identifier.

Returns:

DataFrame with clones filtered based on time points.
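
A filter of this kind can be sketched in pandas as follows. The helper filter_by_timepoints is hypothetical, and it assumes that each row of a clone is one time point, since the documented signature takes no time column.

```python
import pandas as pd

def filter_by_timepoints(df, clone_column, min_num_timepoints):
    """Keep only rows whose clone appears at least min_num_timepoints times."""
    # Assumption: one row per (clone, time point), so row counts equal time-point counts.
    rows_per_clone = df[clone_column].map(df[clone_column].value_counts())
    return df[rows_per_clone >= min_num_timepoints]

df = pd.DataFrame({"clone": ["a", "a", "a", "b"], "time": [0, 1, 2, 0]})
filtered = filter_by_timepoints(df, "clone", min_num_timepoints=2)
print(filtered)
```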

_fill_nan(df: DataFrame, method: str = 'mean', **knn_kwargs: Any) → DataFrame

Fill NaN values in df according to method.

Parameters:

df (pd.DataFrame) – The data whose missing values should be filled.
method ({"mean", "knn"}, default "mean") – Imputation strategy:
- “mean”: column-wise mean for numeric columns, mode for categorical columns.
- “knn”: KNNImputer for numeric columns, mode for categorical columns.
**knn_kwargs (Any) – Extra keyword arguments passed straight to sklearn.impute.KNNImputer when method == “knn”. Example: n_neighbors=5, weights="distance".

Returns:

A copy of df with NaNs imputed.

Return type:

pd.DataFrame
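
The two strategies described above can be sketched as follows. This is an illustrative stand-in, not the library's code: the function fill_nan and its treatment of every non-numeric column as categorical are assumptions. sklearn.impute.KNNImputer is imported only when method == "knn".

```python
import pandas as pd

def fill_nan(df, method="mean", **knn_kwargs):
    """Return a copy of df with NaNs imputed (mean or KNN for numeric, mode otherwise)."""
    filled = df.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    other_cols = filled.columns.difference(numeric_cols)  # assumed categorical

    if method == "mean":
        if len(numeric_cols) > 0:
            filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
    elif method == "knn":
        from sklearn.impute import KNNImputer  # lazy import: only needed for this branch
        if len(numeric_cols) > 0:
            # Note: KNNImputer drops columns that are entirely NaN by default.
            imputer = KNNImputer(**knn_kwargs)
            filled[numeric_cols] = imputer.fit_transform(filled[numeric_cols])
    else:
        raise ValueError(f"Unknown method: {method!r}")

    for col in other_cols:
        modes = filled[col].mode()
        if not modes.empty:
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled

df = pd.DataFrame({"x": [1.0, None, 3.0], "c": ["a", None, "a"]})
result = fill_nan(df, method="mean")
print(result)
```

For the KNN variant, a call such as fill_nan(df, method="knn", n_neighbors=5, weights="distance") forwards the extra keyword arguments to KNNImputer.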

flatten_time(data_df: DataFrame, clone_column: str, time_column: str, time_dependent_columns: list, min_num_timepoints: int | None = None, fill_nan_method: str = 'mean', **kwargs: Any) → DataFrame

Flatten dataset based on time-dependent columns, optionally filtering by minimum time points and filling NaNs.

Parameters:

data_df – DataFrame containing the dataset.
clone_column – Column name for the clone identifier.
time_column – Column name for the time variable.
time_dependent_columns – List of columns that vary with time.
min_num_timepoints – Optional minimum number of time points per clone for filtering.
fill_nan_method – Method to fill NaN values. Defaults to “mean”.

Returns:

DataFrame where time-dependent columns are pivoted and flattened by clone, with NaN values filled.
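
The pivot-and-flatten idea can be illustrated with a small pandas sketch. The function flatten_time_sketch is hypothetical; the output column naming ("<column>_t<time>") and the mean fill are assumptions, and the library's actual output layout may differ.

```python
import pandas as pd

def flatten_time_sketch(df, clone_column, time_column, time_dependent_columns):
    """Pivot time-dependent columns to one row per clone and flatten the column names."""
    # pivot raises if a (clone, time) pair occurs more than once.
    wide = df.pivot(index=clone_column, columns=time_column, values=time_dependent_columns)
    # Flatten the (column, time) MultiIndex into single names (naming scheme is an assumption).
    wide.columns = [f"{col}_t{t}" for col, t in wide.columns]
    # Clones missing a time point get NaN; fill with the column mean here.
    wide = wide.fillna(wide.mean())
    return wide.reset_index()

long_df = pd.DataFrame(
    {"clone": ["a", "a", "b"], "time": [0, 1, 0], "size": [1.0, 2.0, 3.0]}
)
flat = flatten_time_sketch(long_df, "clone", "time", ["size"])
print(flat)
```

Clone b has no measurement at time 1, so its size_t1 cell is filled with the mean of that column.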

class moosefs.core.feature.Feature(name: str, score: float | None = None, selected: bool = False)

Bases: object

Container for a single feature.

Stores the feature name, an optional score, and whether it is selected.

Parameters:

name – Feature identifier (e.g., column name).
score – Optional importance/score for ranking.
selected – Whether the feature is selected.

__init__(name: str, score: float | None = None, selected: bool = False) → None

name: str

score: float | None

selected: bool

set_score(score: float) → None

Set the feature score.

Parameters:

score – Importance/score value.

set_selected(selected: bool) → None

Set the selected flag.

Parameters:

selected – True if selected; otherwise False.

class moosefs.core.novovicova.StabilityNovovicova(selected_features: list)

Bases: object

Computes the stability of feature selection algorithms based on Novovicová et al. (2009).

References

Novovicová, J., Somol, P., & Pudil, P. (2009). “A New Measure of Feature Selection Algorithms’ Stability.” IEEE International Conference on Data Mining Workshops.

__init__(selected_features: list)

Parameters:

selected_features – A list of sets or lists, each holding the features selected in one dataset.

static _validate_inputs(selected_features: list) → None

Validate the input format, ensuring consistency and non-emptiness.

compute_stability() → float

Computes the stability measure SH(S), ranging from 0 (no stability) to 1 (full stability).

Returns:

Stability score.
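
The formula is not given above. As a sketch only, the code below assumes the entropy-style form usually quoted for this measure: the sum over features of F_f · log2(F_f), divided by N · log2(n), where F_f is the number of subsets containing feature f, N is the total number of feature occurrences, and n is the number of subsets. The function name is hypothetical, and the library's implementation may differ in detail.

```python
import math
from collections import Counter

def novovicova_stability(selected_features):
    """Entropy-style stability in [0, 1] under the assumed formula described above."""
    n = len(selected_features)
    if n < 2:
        raise ValueError("At least two feature subsets are required.")
    # F_f: in how many subsets each feature occurs (duplicates within a subset ignored).
    counts = Counter(f for subset in selected_features for f in set(subset))
    total = sum(counts.values())  # N: total number of feature occurrences
    if total == 0:
        raise ValueError("Feature subsets must not all be empty.")
    return sum(c * math.log2(c) for c in counts.values()) / (total * math.log2(n))

identical = [{"f1", "f2"}, {"f1", "f2"}, {"f1", "f2"}]
disjoint = [{"f1"}, {"f2"}, {"f3"}]
print(novovicova_stability(identical), novovicova_stability(disjoint))
```

Identical subsets give the maximum score and fully disjoint subsets give the minimum, which matches the documented range from 0 to 1.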

class moosefs.core.pareto.ParetoAnalysis(data: list, group_names: list)

Bases: object

Rank groups by dominance and break ties using utopia distance.

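For each group, computes a scalar dominance score: the number of groups it dominates minus the number of groups that dominate it (dominate_count − is_dominated_count). If several groups tie for the top score, their metric vectors are rescaled to [0, 1] (within the tie) and the group closest to the utopia point (1, …, 1) is picked.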

__init__(data: list, group_names: list) → None

Initialize the analysis state.

Parameters:

data – Metric vectors per group.
group_names – Display names for groups.

Raises:

ValueError – If data is empty.

_dominate_count(i: int) → int

_is_dominated_count(i: int) → int

get_results() → list

Compute dominance and return ranked rows.

Returns:

Rows [name, dominate_count, is_dominated_count, scalar] sorted by rank.
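
The ranking and tie-break described for this class can be sketched in pure Python. The sketch assumes that higher is better for every metric, that one vector dominates another when it is at least as good everywhere and strictly better somewhere, and that the rescaling is min-max. The functions dominates, rank_groups, and break_tie are hypothetical names, not part of moosefs.

```python
import math

def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def rank_groups(data, group_names):
    """Return rows [name, dominate_count, is_dominated_count, scalar], best first."""
    if not data:
        raise ValueError("data must not be empty")
    rows = []
    for i, vec in enumerate(data):
        others = [v for j, v in enumerate(data) if j != i]
        dominate_count = sum(dominates(vec, o) for o in others)
        is_dominated_count = sum(dominates(o, vec) for o in others)
        rows.append([group_names[i], dominate_count, is_dominated_count,
                     dominate_count - is_dominated_count])
    return sorted(rows, key=lambda r: r[3], reverse=True)

def break_tie(vectors):
    """Index of the vector closest to (1, ..., 1) after min-max scaling within the tie."""
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]

    def scaled(v):
        # A metric on which all tied vectors agree is treated as already optimal (1.0).
        return [(v[d] - lows[d]) / (highs[d] - lows[d]) if highs[d] > lows[d] else 1.0
                for d in range(dims)]

    distances = [math.dist(scaled(v), [1.0] * dims) for v in vectors]
    return distances.index(min(distances))

data = [[0.9, 0.8, 0.7], [0.7, 0.6, 0.5], [0.8, 0.9, 0.6]]
names = ["A", "B", "C"]
rows = rank_groups(data, names)
winner = break_tie([data[0], data[2]])  # A and C tie on the scalar score
print(rows, winner)
```

Here A and C each dominate only B, so they tie on the scalar score; A is best on two of the three metrics within the tie and therefore lies closer to the utopia point.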