Core Module
- class moosefs.core.data_processor.DataProcessor(categorical_columns: list | None = None, columns_to_drop: list | None = None, drop_missing_values: bool = False, merge_key: str | None = None, normalize: bool = True, target_column: str = 'target')[source]
Bases: object
- __init__(categorical_columns: list | None = None, columns_to_drop: list | None = None, drop_missing_values: bool = False, merge_key: str | None = None, normalize: bool = True, target_column: str = 'target') None [source]
Initialize the DataProcessor with specific parameters for preprocessing.
- Parameters:
categorical_columns – List of column names to treat as categorical.
columns_to_drop – List of column names to drop from the dataset.
drop_missing_values – Whether to drop rows that contain missing values.
merge_key – Column name to use as a key when merging data with metadata.
normalize – Whether to normalize numerical features.
target_column – Name of the target column in the dataset.
- preprocess_data(data: Any, index_col: str | None = None, metadata: Any | None = None) DataFrame [source]
Load and preprocess data from a CSV file or DataFrame, with optional metadata merging.
- Parameters:
data – Path to the CSV file or a pandas DataFrame.
index_col – Column to set as index. Defaults to None.
metadata – Path to the CSV file or DataFrame containing metadata. Defaults to None.
- Returns:
The preprocessed data as a pandas DataFrame.
- _load_data(data: Any, index_col: str | None = None) DataFrame [source]
Helper method to load data and set the index if specified.
- Parameters:
data – Path to the CSV file or a pandas DataFrame.
index_col – Column to set as index. Defaults to None.
- Returns:
The loaded pandas DataFrame with index set if specified.
- _merge_data_and_metadata(data_df: DataFrame, meta_df: DataFrame) DataFrame [source]
Merge the main data frame with metadata.
- Parameters:
data_df – The main data DataFrame.
meta_df – The metadata DataFrame.
- Returns:
The merged DataFrame.
- _rename_target_column(data_df: DataFrame) DataFrame [source]
Rename the target column in the data frame to ‘target’.
- Parameters:
data_df – The data DataFrame to be modified.
- Returns:
The DataFrame with the renamed target column.
- _drop_columns(data_df: DataFrame) DataFrame [source]
Drop specified columns from the data frame.
- Parameters:
data_df – The data DataFrame to be modified.
- Returns:
The DataFrame with specified columns dropped.
- _drop_missing_values(data_df: DataFrame) DataFrame [source]
Drop rows that contain NaN values.
- Parameters:
data_df – The data DataFrame with missing values.
- Returns:
The DataFrame with missing values dropped.
- _encode_categorical_variables(data_df: DataFrame) DataFrame [source]
Encode categorical variables using label encoding and store the mappings.
- Parameters:
data_df – The data DataFrame with categorical columns.
- Returns:
The DataFrame with categorical variables encoded.
- get_label_mapping(column_name: str) dict [source]
Retrieve the label encoding mapping for a specific column.
- Parameters:
column_name – The column for which to get the label encoding mapping.
- Returns:
A dictionary mapping original labels to encoded values.
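The kind of mapping get_label_mapping returns can be pictured with a small pure-Python sketch. It assumes integer codes are assigned in sorted label order, as sklearn.preprocessing.LabelEncoder does. This page does not say which encoder the class uses, and build_label_mapping and encode are hypothetical helpers, not part of moosefs.

```python
def build_label_mapping(values):
    """Map each distinct label to an integer code (assumption: sorted label order)."""
    labels = sorted(set(values))
    return {label: code for code, label in enumerate(labels)}


def encode(values, mapping):
    """Replace each original label with its integer code."""
    return [mapping[value] for value in values]


colors = ["red", "green", "blue", "green"]
mapping = build_label_mapping(colors)   # original label -> encoded value
encoded = encode(colors, mapping)
print(mapping)
print(encoded)
```

Keeping the mapping around is what makes it possible to translate encoded values back to the original labels later.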
- _scale_numerical_features(data_df: DataFrame) DataFrame [source]
Scale numerical features using standard scaling.
- Parameters:
data_df – The data DataFrame with numerical columns.
- Returns:
The DataFrame with numerical features scaled.
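As a rough illustration of standard scaling, the sketch below rescales one column to zero mean and unit variance. It assumes the population standard deviation (the convention used by sklearn.preprocessing.StandardScaler); this page only says "standard scaling", and standard_scale is a hypothetical helper, not the library's implementation.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Return (x - mean) / std for each x, using the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Assumption: a constant column is mapped to zeros rather than dividing by zero.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


scaled = standard_scale([2.0, 4.0, 6.0])
print(scaled)
```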
- _filtered_time_dataset(data_df: DataFrame, min_num_timepoints: int, clone_column: str) DataFrame [source]
Filter the dataset to retain only clones with at least min_num_timepoints time points.
- Parameters:
data_df – DataFrame containing the dataset.
min_num_timepoints – Minimum number of time points required per clone.
clone_column – Column name for the clone identifier.
- Returns:
DataFrame with clones filtered based on time points.
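The filtering rule can be sketched on plain dictionaries instead of a DataFrame. This sketch assumes each row is one time point, so it counts rows per clone; the library may instead count distinct time values. filter_by_timepoints is a hypothetical name used only for illustration.

```python
from collections import Counter


def filter_by_timepoints(rows, min_num_timepoints, clone_column):
    """Keep only rows whose clone has at least min_num_timepoints rows."""
    counts = Counter(row[clone_column] for row in rows)
    return [row for row in rows if counts[row[clone_column]] >= min_num_timepoints]


rows = [
    {"clone": "A", "time": 0, "value": 1.0},
    {"clone": "A", "time": 1, "value": 1.5},
    {"clone": "A", "time": 2, "value": 2.0},
    {"clone": "B", "time": 0, "value": 0.7},
]
kept = filter_by_timepoints(rows, min_num_timepoints=2, clone_column="clone")
print(kept)  # clone "B" has a single time point and is removed
```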
- _fill_nan(df: DataFrame, method: str = 'mean', **knn_kwargs: Any) DataFrame [source]
Fill NaN values in df according to method.
- Parameters:
df (pd.DataFrame) – The data whose missing values should be filled.
method ({"mean", "knn"}, default "mean") – Imputation strategy:
- "mean": column-wise mean for numeric columns, mode for categorical columns.
- "knn": KNNImputer for numeric columns, mode for categorical columns.
**knn_kwargs (Any) – Extra keyword arguments passed straight to sklearn.impute.KNNImputer when method == "knn". Example: n_neighbors=5, weights="distance".
- Returns:
A copy of df with NaNs imputed.
- Return type:
pd.DataFrame
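The "mean" strategy described above (mean for numeric columns, mode for categorical ones) can be sketched without pandas as follows. The column-as-list representation and the helpers is_missing and fill_column are assumptions made for illustration; the "knn" strategy is not shown.

```python
import math
from statistics import fmean, mode


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_column(values):
    """Fill missing entries with the mean (numeric) or the mode (categorical)."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return list(values)  # nothing to impute from
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    fill = fmean(present) if numeric else mode(present)
    return [fill if is_missing(v) else v for v in values]


table = {"height": [1.0, float("nan"), 3.0], "color": ["red", None, "red"]}
filled = {name: fill_column(column) for name, column in table.items()}
print(filled)
```

Like the documented method, the sketch builds new lists and leaves the input table unchanged.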
- flatten_time(data_df: DataFrame, clone_column: str, time_column: str, time_dependent_columns: list, min_num_timepoints: int | None = None, fill_nan_method: str = 'mean', **kwargs: Any) DataFrame [source]
Flatten dataset based on time-dependent columns, optionally filtering by minimum time points and filling NaNs.
- Parameters:
data_df – DataFrame containing the dataset.
clone_column – Column name for the clone identifier.
time_column – Column name for the time variable.
time_dependent_columns – List of columns that vary with time.
min_num_timepoints – Optional minimum number of time points per clone for filtering.
fill_nan_method – Method to fill NaN values. Defaults to “mean”.
- Returns:
DataFrame where time-dependent columns are pivoted and flattened by clone, with NaN values filled.
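The pivot-and-flatten step can be illustrated on plain dictionaries. The sketch assumes one output record per clone, with each time-dependent column expanded to one key per time point and named "<column>_t<time>". That naming scheme and the helper flatten_long_to_wide are hypothetical; the page does not specify how the flattened columns are named.

```python
def flatten_long_to_wide(rows, clone_column, time_column, time_dependent_columns):
    """Turn one row per (clone, time) into one record per clone."""
    wide = {}
    for row in rows:
        clone = row[clone_column]
        record = wide.setdefault(clone, {clone_column: clone})
        for column in time_dependent_columns:
            # Assumed naming: original column name plus a time suffix.
            record[f"{column}_t{row[time_column]}"] = row[column]
    return list(wide.values())


rows = [
    {"clone": "A", "time": 0, "value": 1.0},
    {"clone": "A", "time": 1, "value": 1.5},
    {"clone": "B", "time": 0, "value": 0.7},
]
flat = flatten_long_to_wide(rows, "clone", "time", ["value"])
print(flat)
```

Clone "B" has no entry for time 1 here; in a DataFrame that gap would be a NaN, which is the kind of value fill_nan_method is meant to handle.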
- class moosefs.core.feature.Feature(name: str, score: float | None = None, selected: bool = False)[source]
Bases: object
Container for a single feature.
Stores the feature name, an optional score, and whether it is selected.
- Parameters:
name – Feature identifier (e.g., column name).
score – Optional importance/score for ranking.
selected – Whether the feature is selected.
- class moosefs.core.novovicova.StabilityNovovicova(selected_features: list)[source]
Bases: object
Computes the stability of feature selection algorithms based on Novovicová et al. (2009).
References
Novovicová, J., Somol, P., & Pudil, P. (2009). “A New Measure of Feature Selection Algorithms’ Stability.” IEEE International Conference on Data Mining Workshops.
- __init__(selected_features: list)[source]
- Parameters:
selected_features – A list of sets or lists, where each represents selected features in a dataset.
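This page does not state the formula, so the sketch below rests on an assumption: that the measure has the entropy-based form usually attributed to the cited paper, SH = (1 / (q · log2 n)) · Σ F_f · log2 F_f, where n is the number of feature subsets, F_f is the number of subsets containing feature f, and q is the sum of all F_f. The function novovicova_stability is a hypothetical stand-in and may differ from what StabilityNovovicova computes; check it against the reference before relying on it.

```python
from collections import Counter
from math import log2


def novovicova_stability(selected_features):
    """Entropy-based stability in [0, 1] under the assumed formula."""
    n = len(selected_features)
    if n < 2:
        raise ValueError("need at least two feature subsets")
    # F_f: in how many subsets each feature appears (duplicates within a subset ignored).
    counts = Counter(f for subset in selected_features for f in set(subset))
    q = sum(counts.values())
    if q == 0:
        raise ValueError("no features were selected")
    return sum(c * log2(c) for c in counts.values()) / (q * log2(n))


identical = [{"a", "b"}, {"a", "b"}, {"a", "b"}]
disjoint = [{"a"}, {"b"}, {"c"}]
print(novovicova_stability(identical))  # every run agrees
print(novovicova_stability(disjoint))   # no overlap at all
```

Under this form, identical subsets give the maximum value and fully disjoint subsets give the minimum, which matches the intuition of a stability score.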
- class moosefs.core.pareto.ParetoAnalysis(data: list, group_names: list)[source]
Bases: object
Rank groups by dominance and break ties using utopia distance.
For each group, computes a scalar dominance score: dominated − is_dominated. If the top score ties, scales tied vectors to [0, 1] (within the tie) and picks the one closest to the utopia point (1, …, 1).
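The ranking rule can be sketched in plain Python. Several details are assumptions not confirmed by this page: higher values are better for every metric, a group dominates another when it is at least as good everywhere and strictly better somewhere, and a metric that is constant within the tie is scaled to 1.0. The helpers dominates, dominance_scores, and pick_winner are hypothetical and are not the ParetoAnalysis API.

```python
from math import dist


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def dominance_scores(data):
    """Score per group: groups it dominates minus groups that dominate it."""
    scores = []
    for i, a in enumerate(data):
        dominated = sum(dominates(a, b) for j, b in enumerate(data) if j != i)
        is_dominated = sum(dominates(b, a) for j, b in enumerate(data) if j != i)
        scores.append(dominated - is_dominated)
    return scores


def pick_winner(data, group_names):
    """Pick the top-scoring group, breaking ties by distance to the utopia point."""
    scores = dominance_scores(data)
    best = max(scores)
    tied = [i for i, score in enumerate(scores) if score == best]
    if len(tied) == 1:
        return group_names[tied[0]]
    # Min-max scale each metric using only the tied vectors.
    columns = list(zip(*(data[i] for i in tied)))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]

    def scaled(vector):
        return [(v - lo) / span if span else 1.0
                for v, lo, span in zip(vector, lows, spans)]

    utopia = [1.0] * len(columns)
    return group_names[min(tied, key=lambda i: dist(scaled(data[i]), utopia))]


data = [[0.9, 0.1], [0.6, 0.6], [0.1, 0.9], [0.05, 0.05]]
names = ["A", "B", "C", "D"]
print(dominance_scores(data))
print(pick_winner(data, names))
```

In this example A, B, and C each dominate only D, so they tie on score, and the balanced group B ends up closest to the utopia point after scaling.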