Core Module

class moosefs.core.data_processor.DataProcessor(categorical_columns: list | None = None, columns_to_drop: list | None = None, drop_missing_values: bool = False, merge_key: str | None = None, normalize: bool = True, target_column: str = 'target')[source]

Bases: object

__init__(categorical_columns: list | None = None, columns_to_drop: list | None = None, drop_missing_values: bool = False, merge_key: str | None = None, normalize: bool = True, target_column: str = 'target') None[source]

Initialize the DataProcessor with specific parameters for preprocessing.

Parameters:
  • categorical_columns – List of column names to treat as categorical.

  • columns_to_drop – List of column names to drop from the dataset.

  • drop_missing_values – Flag to determine if missing values should be dropped.

  • merge_key – Column name to use as a key when merging data with metadata.

  • normalize – Flag to determine if numerical features should be normalized.

  • target_column – Name of the target column in the dataset.

preprocess_data(data: Any, index_col: str | None = None, metadata: Any | None = None) DataFrame[source]

Load and preprocess data from a CSV file or DataFrame, with optional metadata merging.

Parameters:
  • data – Path to the CSV file or a pandas DataFrame.

  • index_col – Column to set as index. Defaults to None.

  • metadata – Path to the CSV file or DataFrame containing metadata. Defaults to None.

Returns:

The preprocessed data as a pandas DataFrame.

_load_data(data: Any, index_col: str | None = None) DataFrame[source]

Helper method to load data and set the index if specified.

Parameters:
  • data – Path to the CSV file or a pandas DataFrame.

  • index_col – Column to set as index. Defaults to None.

Returns:

The loaded pandas DataFrame with index set if specified.
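The behavior described above can be sketched with plain pandas; this is an illustrative reimplementation under the documented signature, not the library's code, and the column names are hypothetical.

```python
import pandas as pd

def load_data(data, index_col=None):
    # Sketch of _load_data: accept either a CSV path or an existing DataFrame.
    df = pd.read_csv(data) if isinstance(data, str) else data.copy()
    if index_col is not None:
        df = df.set_index(index_col)
    return df

loaded = load_data(pd.DataFrame({"id": [1, 2], "x": [3.0, 4.0]}), index_col="id")
```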

_merge_data_and_metadata(data_df: DataFrame, meta_df: DataFrame) DataFrame[source]

Merge the main data frame with metadata.

Parameters:
  • data_df – The main data DataFrame.

  • meta_df – The metadata DataFrame.

Returns:

The merged DataFrame.
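The merge step can be sketched as a pandas join on the configured merge_key. The join type actually used by the library is not documented; an inner join, with hypothetical column names, is a reasonable sketch.

```python
import pandas as pd

# Hypothetical main data and metadata sharing a "sample_id" merge key.
data_df = pd.DataFrame({"sample_id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
meta_df = pd.DataFrame({"sample_id": [1, 2, 3], "site": ["A", "B", "A"]})

merged = data_df.merge(meta_df, on="sample_id")  # inner join on the merge key
```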

_rename_target_column(data_df: DataFrame) DataFrame[source]

Rename the target column in the data frame to ‘target’.

Parameters:

data_df – The data DataFrame to be modified.

Returns:

The DataFrame with the renamed target column.

_drop_columns(data_df: DataFrame) DataFrame[source]

Drop specified columns from the data frame.

Parameters:

data_df – The data DataFrame to be modified.

Returns:

The DataFrame with specified columns dropped.

_drop_missing_values(data_df: DataFrame) DataFrame[source]

Drop rows that contain NaN values.

Parameters:

data_df – The data DataFrame with missing values.

Returns:

The DataFrame with missing values dropped.

_encode_categorical_variables(data_df: DataFrame) DataFrame[source]

Encode categorical variables using label encoding and store the mappings.

Parameters:

data_df – The data DataFrame with categorical columns.

Returns:

The DataFrame with categorical variables encoded.

get_label_mapping(column_name: str) dict[source]

Retrieve the label encoding mapping for a specific column.

Parameters:

column_name – The column for which to get the label encoding mapping.

Returns:

A dictionary mapping original labels to encoded values.
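A minimal sketch of label encoding plus the stored mapping, using pandas factorize (which assigns codes in order of first appearance; whether the library encodes in this order or alphabetically is not documented):

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green"])
codes, uniques = pd.factorize(s)  # integer codes in order of first appearance
s_encoded = pd.Series(codes)

# The per-column mapping that get_label_mapping would return for this column.
mapping = {label: code for code, label in enumerate(uniques)}
# mapping == {"red": 0, "blue": 1, "green": 2}
```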

_scale_numerical_features(data_df: DataFrame) DataFrame[source]

Scale numerical features using standard scaling.

Parameters:

data_df – The data DataFrame with numerical columns.

Returns:

The DataFrame with numerical features scaled.
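Standard scaling centers each numeric column to mean 0 and unit standard deviation. A one-line sketch (pandas std uses the sample estimate, ddof=1; whether the library uses sample or population std is not documented):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])
scaled = (s - s.mean()) / s.std()  # -> [-1.0, 0.0, 1.0]
```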

_filtered_time_dataset(data_df: DataFrame, min_num_timepoints: int, clone_column: str) DataFrame[source]

Filter the dataset to retain only clones with at least min_num_timepoints time points.

Parameters:
  • data_df – DataFrame containing the dataset.

  • min_num_timepoints – Minimum number of time points required per clone.

  • clone_column – Column name for the clone identifier.

Returns:

DataFrame with clones filtered based on time points.
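The filtering step can be sketched with a pandas groupby filter; the clone and time column names below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "clone": ["c1", "c1", "c1", "c2"],
    "time": [0, 1, 2, 0],
})

min_num_timepoints = 2
# Keep only clones observed at >= min_num_timepoints distinct time points.
kept = df.groupby("clone").filter(lambda g: g["time"].nunique() >= min_num_timepoints)
```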

_fill_nan(df: DataFrame, method: str = 'mean', **knn_kwargs: Any) DataFrame[source]

Fill NaN values in df according to method.

Parameters:
  • df (pd.DataFrame) – The data whose missing values should be filled.

  • method ({"mean", "knn"}, default "mean") – Imputation strategy: "mean" uses the column-wise mean for numeric columns and the mode for categorical columns; "knn" uses sklearn.impute.KNNImputer for numeric columns and the mode for categorical columns.

  • **knn_kwargs (Any) – Extra keyword arguments passed straight to sklearn.impute.KNNImputer when method == “knn”. Example: n_neighbors=5, weights="distance".

Returns:

A copy of df with NaNs imputed.

Return type:

pd.DataFrame
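The "mean" strategy described above can be sketched with pandas: column-wise mean for numeric columns, mode for everything else. This is an illustrative reimplementation, not the library's code.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, None, 3.0],
    "cat": ["a", "a", None],
})

filled = df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric: replace NaNs with the column mean.
        filled[col] = filled[col].fillna(filled[col].mean())
    else:
        # Categorical: replace NaNs with the most frequent value.
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
```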

flatten_time(data_df: DataFrame, clone_column: str, time_column: str, time_dependent_columns: list, min_num_timepoints: int | None = None, fill_nan_method: str = 'mean', **kwargs: Any) DataFrame[source]

Flatten dataset based on time-dependent columns, optionally filtering by minimum time points and filling NaNs.

Parameters:
  • data_df – DataFrame containing the dataset.

  • clone_column – Column name for the clone identifier.

  • time_column – Column name for the time variable.

  • time_dependent_columns – List of columns that vary with time.

  • min_num_timepoints – Optional minimum number of time points per clone for filtering.

  • fill_nan_method – Method used to fill NaN values ("mean" or "knn"). Defaults to "mean".

  • **kwargs – Extra keyword arguments forwarded to the NaN-filling step (e.g. KNNImputer options when fill_nan_method is "knn").

Returns:

DataFrame where time-dependent columns are pivoted and flattened by clone, with NaN values filled.
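The core pivot step can be sketched for a single time-dependent column; the column names and the flattened naming scheme (size_t0, size_t1, …) are hypothetical, since the library's exact output names are not documented.

```python
import pandas as pd

# Hypothetical long-format data: one row per (clone, time point).
df = pd.DataFrame({
    "clone": ["c1", "c1", "c2", "c2"],
    "time": [0, 1, 0, 1],
    "size": [1.0, 2.0, 3.0, 4.0],
})

# Pivot so each clone becomes one row with a column per time point.
wide = df.pivot(index="clone", columns="time", values="size")
wide.columns = [f"size_t{t}" for t in wide.columns]
wide = wide.reset_index()
```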

class moosefs.core.feature.Feature(name: str, score: float | None = None, selected: bool = False)[source]

Bases: object

Container for a single feature.

Stores the feature name, an optional score, and whether it is selected.

Parameters:
  • name – Feature identifier (e.g., column name).

  • score – Optional importance/score for ranking.

  • selected – Whether the feature is selected.

__init__(name: str, score: float | None = None, selected: bool = False) None[source]
set_score(score: float) None[source]

Set the feature score.

Parameters:

score – Importance/score value.

set_selected(selected: bool) None[source]

Set the selected flag.

Parameters:

selected – True if selected; otherwise False.

class moosefs.core.novovicova.StabilityNovovicova(selected_features: list)[source]

Bases: object

Computes the stability of feature selection algorithms based on Novovicová et al. (2009).

References

Novovicová, J., Somol, P., & Pudil, P. (2009). “A New Measure of Feature Selection Algorithms’ Stability.” IEEE International Conference on Data Mining Workshops.

__init__(selected_features: list)[source]
Parameters:

selected_features – A list of sets or lists, each representing the features selected on one dataset or run.

static _validate_inputs(selected_features: list) None[source]

Validates the input format, ensuring consistency and non-emptiness.

compute_stability() float[source]

Computes the stability measure SH(S), ranging from 0 (no stability) to 1 (full stability).

Returns:

Stability score.
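The documentation does not reproduce the formula; in the cited paper the measure is SH(S) = (1/(N log₂ n)) Σ_f F_f log₂ F_f, where n is the number of subsets, F_f is how many subsets contain feature f, and N = Σ F_f. A minimal sketch assuming that formula (an illustrative reimplementation, not the library's code):

```python
import math

def stability_SH(selected_features):
    """SH(S) from Novovicova et al. (2009): 1 when all runs select the
    same subset, 0 when no feature is selected more than once."""
    n = len(selected_features)
    if n < 2 or any(len(s) == 0 for s in selected_features):
        raise ValueError("need at least two non-empty feature subsets")
    freq = {}
    for subset in selected_features:
        for f in set(subset):
            freq[f] = freq.get(f, 0) + 1
    N = sum(len(set(s)) for s in selected_features)
    return sum(c * math.log2(c) for c in freq.values()) / (N * math.log2(n))

stability_SH([{"a", "b"}, {"a", "b"}, {"a", "b"}])  # identical runs -> 1.0
stability_SH([{"a"}, {"b"}, {"c"}])                 # disjoint runs  -> 0.0
```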

class moosefs.core.pareto.ParetoAnalysis(data: list, group_names: list)[source]

Bases: object

Rank groups by dominance and break ties using utopia distance.

For each group, computes a scalar dominance score: dominate_count − is_dominated_count. If the top score is tied, the tied metric vectors are scaled to [0, 1] (within the tie) and the group closest to the utopia point (1, …, 1) is ranked first.

__init__(data: list, group_names: list) None[source]

Initialize the analysis state.

Parameters:
  • data – Metric vectors per group.

  • group_names – Display names for groups.

Raises:

ValueError – If data is empty.

_dominate_count(i: int) int[source]
_is_dominated_count(i: int) int[source]
get_results() list[source]

Compute dominance and return ranked rows.

Returns:

Rows [name, dominate_count, is_dominated_count, scalar] sorted by rank.
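The dominance-counting core can be sketched in plain Python, assuming higher metric values are better (the library's orientation is not documented); the utopia-distance tie-break is omitted here for brevity.

```python
def dominates(a, b):
    # a dominates b if a >= b in every metric and > b in at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_scores(data, group_names):
    rows = []
    for i, a in enumerate(data):
        dom = sum(dominates(a, b) for j, b in enumerate(data) if j != i)
        sub = sum(dominates(b, a) for j, b in enumerate(data) if j != i)
        rows.append([group_names[i], dom, sub, dom - sub])
    # Rank by the scalar dominance score, best first.
    return sorted(rows, key=lambda r: r[3], reverse=True)

# m2 is dominated by both other groups, so it ranks last.
ranked = pareto_scores([[0.9, 0.8], [0.5, 0.4], [0.7, 0.9]], ["m1", "m2", "m3"])
```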