spacr.sim¶
Module Contents¶
- spacr.sim.generate_gene_list(number_of_genes, number_of_all_genes)[source]¶
Generates a list of randomly selected genes.
- Parameters:
number_of_genes (int) – The number of genes to be selected.
number_of_all_genes (int) – The total number of genes available.
- Returns:
A list of randomly selected genes.
- Return type:
list
- spacr.sim.generate_plate_map(nr_plates)[source]¶
Generate a plate map based on the number of plates.
Parameters: nr_plates (int): The number of plates to generate the map for.
Returns: pandas.DataFrame: The generated plate map dataframe.
- spacr.sim.gini_coefficient(x)[source]¶
Compute Gini coefficient of array of values.
Parameters: x (array-like): Array of values.
Returns: float: Gini coefficient.
- spacr.sim.gini(x)[source]¶
Calculate the Gini coefficient for a given array of values.
Parameters: x (array-like): Input array of values.
Returns: float: The Gini coefficient.
Notes: This implementation has a time and memory complexity of O(n**2), where n is the length of x. Avoid passing in large samples to prevent performance issues.
- spacr.sim.gini_gene_well(x)[source]¶
Calculate the Gini coefficient for a given income distribution.
The Gini coefficient measures income inequality in a population. A value of 0 represents perfect income equality (everyone has the same income), while a value of 1 represents perfect income inequality (one individual has all the income).
Parameters: x (array-like): An array-like object representing the income distribution.
Returns: float: The Gini coefficient for the given income distribution.
- spacr.sim.gini(x)[source]¶
Calculate the Gini coefficient for a given array of values.
Parameters: x (array-like): The input array of values.
Returns: float: The Gini coefficient.
References: - Based on bottom eq: http://www.statsdirect.com/help/content/image/stat0206_wmf.gif - From: http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm - All values are treated equally, arrays must be 1d.
- spacr.sim.dist_gen(mean, sd, df)[source]¶
Generate a Poisson distribution based on a gamma distribution.
Parameters: mean (float): Mean of the gamma distribution. sd (float): Standard deviation of the gamma distribution. df (pandas.DataFrame): Input data.
Returns: tuple: A tuple containing the generated Poisson distribution and the length of the input data.
- spacr.sim.generate_gene_weights(positive_mean, positive_variance, df)[source]¶
Generate gene weights using a beta distribution.
Parameters: - positive_mean (float): The mean value for the positive distribution. - positive_variance (float): The variance value for the positive distribution. - df (pandas.DataFrame): The DataFrame containing the data.
Returns: - weights (numpy.ndarray): An array of gene weights generated using a beta distribution.
- spacr.sim.normalize_array(arr)[source]¶
Normalize an array by scaling its values between 0 and 1.
Parameters: arr (numpy.ndarray): The input array to be normalized.
Returns: numpy.ndarray: The normalized array.
- spacr.sim.generate_power_law_distribution(num_elements, coeff)[source]¶
Generate a power law distribution.
Parameters: - num_elements (int): The number of elements in the distribution. - coeff (float): The coefficient of the power law.
Returns: - normalized_distribution (ndarray): The normalized power law distribution.
- spacr.sim.power_law_dist_gen(df, avg, well_ineq_coeff)[source]¶
Generate a power-law distribution for wells.
Parameters: - df: DataFrame: The input DataFrame containing the wells. - avg: float: The average value for the distribution. - well_ineq_coeff: float: The inequality coefficient for the power-law distribution.
Returns: - dist: ndarray: The generated power-law distribution for the wells.
- spacr.sim.run_experiment(plate_map, number_of_genes, active_gene_list, avg_genes_per_well, sd_genes_per_well, avg_cells_per_well, sd_cells_per_well, well_ineq_coeff, gene_ineq_coeff)[source]¶
Run a simulation experiment.
- Parameters:
plate_map (DataFrame) – The plate map containing information about the wells.
number_of_genes (int) – The total number of genes.
active_gene_list (list) – The list of active genes.
avg_genes_per_well (float) – The average number of genes per well.
sd_genes_per_well (float) – The standard deviation of genes per well.
avg_cells_per_well (float) – The average number of cells per well.
sd_cells_per_well (float) – The standard deviation of cells per well.
well_ineq_coeff (float) – The coefficient for well inequality.
gene_ineq_coeff (float) – The coefficient for gene inequality.
- Returns:
- A tuple containing the following:
cell_df (DataFrame): The DataFrame containing information about the cells.
genes_per_well_df (DataFrame): The DataFrame containing gene counts per well.
wells_per_gene_df (DataFrame): The DataFrame containing well counts per gene.
df_ls (list): A list containing gene counts per well, well counts per gene, Gini coefficients for wells, Gini coefficients for genes, gene weights array, and well weights.
- Return type:
tuple
- spacr.sim.classifier(positive_mean, positive_variance, negative_mean, negative_variance, classifier_accuracy, df)[source]¶
Classifies the data in the DataFrame based on the given parameters and a classifier error rate.
- Parameters:
positive_mean (float) – The mean of the positive distribution.
positive_variance (float) – The variance of the positive distribution.
negative_mean (float) – The mean of the negative distribution.
negative_variance (float) – The variance of the negative distribution.
classifier_accuracy (float) – The likelihood (0 to 1) that a gene is correctly classified according to its true label.
df (pandas.DataFrame) – The DataFrame containing the data to be classified.
- Returns:
The DataFrame with an additional ‘score’ column containing the classification scores.
- Return type:
pandas.DataFrame
- spacr.sim.classifier_v2(positive_mean, positive_variance, negative_mean, negative_variance, df)[source]¶
Classifies the data in the DataFrame based on the given parameters.
- Parameters:
positive_mean (float) – The mean of the positive distribution.
positive_variance (float) – The variance of the positive distribution.
negative_mean (float) – The mean of the negative distribution.
negative_variance (float) – The variance of the negative distribution.
df (pandas.DataFrame) – The DataFrame containing the data to be classified.
- Returns:
The DataFrame with an additional ‘score’ column containing the classification scores.
- Return type:
pandas.DataFrame
- spacr.sim.compute_roc_auc(cell_scores)[source]¶
Compute the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) for cell scores.
Parameters: - cell_scores (DataFrame): DataFrame containing cell scores with columns ‘is_active’ and ‘score’.
Returns: - cell_roc_dict (dict): Dictionary containing the ROC curve information, including the threshold, true positive rate (TPR), false positive rate (FPR), and ROC AUC.
- spacr.sim.compute_precision_recall(cell_scores)[source]¶
Compute precision, recall, F1 score, and PR AUC for a given set of cell scores.
Parameters: - cell_scores (DataFrame): A DataFrame containing the cell scores with columns ‘is_active’ and ‘score’.
Returns: - cell_pr_dict (dict): A dictionary containing the computed precision, recall, F1 score, PR AUC, and threshold values.
- spacr.sim.get_optimum_threshold(cell_pr_dict)[source]¶
Calculates the optimum threshold based on the f1_score in the given cell_pr_dict.
Parameters: cell_pr_dict (dict): A dictionary containing precision, recall, and f1_score values for different thresholds.
Returns: float: The optimum threshold value.
- spacr.sim.update_scores_and_get_cm(cell_scores, optimum)[source]¶
Update the cell scores based on the given optimum value and calculate the confusion matrix.
- Parameters:
cell_scores (DataFrame) – The DataFrame containing the cell scores.
optimum (float) – The optimum value used for updating the scores.
- Returns:
A tuple containing the updated cell scores DataFrame and the confusion matrix.
- Return type:
tuple
- spacr.sim.cell_level_roc_auc(cell_scores)[source]¶
Compute the ROC AUC and precision-recall metrics at the cell level.
- Parameters:
cell_scores (list) – List of scores for each cell.
- Returns:
DataFrame containing the ROC AUC metrics for each cell. cell_pr_dict_df (DataFrame): DataFrame containing the precision-recall metrics for each cell. cell_scores (list): Updated list of scores after applying the optimum threshold. cell_cm (array): Confusion matrix for the cell-level classification.
- Return type:
cell_roc_dict_df (DataFrame)
- spacr.sim.generate_well_score(cell_scores)[source]¶
Generate well scores based on cell scores.
- Parameters:
cell_scores (DataFrame) – DataFrame containing cell scores.
- Returns:
DataFrame containing well scores with average active score, gene list, and score.
- Return type:
DataFrame
- spacr.sim.sequence_plates(well_score, number_of_genes, avg_reads_per_gene, sd_reads_per_gene, sequencing_error=0.01)[source]¶
Simulates the sequencing of plates and calculates gene fractions and metadata.
Parameters: well_score (pd.DataFrame): DataFrame containing well scores and gene lists. number_of_genes (int): Number of genes. avg_reads_per_gene (float): Average number of reads per gene. sd_reads_per_gene (float): Standard deviation of reads per gene. sequencing_error (float, optional): Probability of introducing sequencing error. Defaults to 0.01.
Returns: gene_fraction_map (pd.DataFrame): DataFrame containing gene fractions for each well. metadata (pd.DataFrame): DataFrame containing metadata for each well.
- spacr.sim.regression_roc_auc(results_df, active_gene_list, control_gene_list, alpha=0.05, optimal=False)[source]¶
Calculate regression ROC AUC and other statistics.
Parameters: results_df (DataFrame): DataFrame containing the results of regression analysis. active_gene_list (list): List of active gene IDs. control_gene_list (list): List of control gene IDs. alpha (float, optional): Significance level for determining hits. Default is 0.05. optimal (bool, optional): Whether to use the optimal threshold for classification. Default is False.
Returns: tuple: A tuple containing the following: - results_df (DataFrame): Updated DataFrame with additional columns. - reg_roc_dict_df (DataFrame): DataFrame containing regression ROC curve data. - reg_pr_dict_df (DataFrame): DataFrame containing precision-recall curve data. - reg_cm (ndarray): Confusion matrix. - sim_stats (DataFrame): DataFrame containing simulation statistics.
- spacr.sim.plot_histogram(data, x_label, ax, color, title, binwidth=0.01, log=False)[source]¶
Plots a histogram of the given data.
Parameters: - data: The data to be plotted. - x_label: The label for the x-axis. - ax: The matplotlib axis object to plot on. - color: The color of the histogram bars. - title: The title of the plot. - binwidth: The width of each histogram bin. - log: Whether to use a logarithmic scale for the y-axis.
Returns: None
- spacr.sim.plot_roc_pr(data, ax, title, x_label, y_label)[source]¶
Plot the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) curves.
Parameters: - data: DataFrame containing the data to be plotted. - ax: The matplotlib axes object to plot on. - title: The title of the plot. - x_label: The label for the x-axis. - y_label: The label for the y-axis.
- spacr.sim.plot_confusion_matrix(data, ax, title)[source]¶
Plots a confusion matrix using a heatmap.
Parameters: data (numpy.ndarray): The confusion matrix data. ax (matplotlib.axes.Axes): The axes object to plot the heatmap on. title (str): The title of the plot.
Returns: None
- spacr.sim.run_simulation(settings)[source]¶
Run the simulation based on the given settings.
- Parameters:
settings (dict) – A dictionary containing the simulation settings.
- Returns:
A tuple containing the simulation results and distances. - cell_scores (DataFrame): Scores for each cell. - cell_roc_dict_df (DataFrame): ROC AUC scores for each cell. - cell_pr_dict_df (DataFrame): Precision-Recall AUC scores for each cell. - cell_cm (DataFrame): Confusion matrix for each cell. - well_score (DataFrame): Scores for each well. - gene_fraction_map (DataFrame): Fraction of genes for each well. - metadata (DataFrame): Metadata for each well. - results_df (DataFrame): Results of the regression analysis. - reg_roc_dict_df (DataFrame): ROC AUC scores for each gene. - reg_pr_dict_df (DataFrame): Precision-Recall AUC scores for each gene. - reg_cm (DataFrame): Confusion matrix for each gene. - sim_stats (dict): Additional simulation statistics. - genes_per_well_df (DataFrame): Number of genes per well. - wells_per_gene_df (DataFrame): Number of wells per gene. dists (list): List of distances.
- Return type:
tuple
- spacr.sim.vis_dists(dists, src, v, i)[source]¶
Visualizes the distributions of given distances.
- Parameters:
dists (list) – List of distance arrays.
src (str) – Source directory for saving the plot.
v (int) – Number of vertices.
i (int) – Index of the plot.
- Returns:
None
- spacr.sim.visualize_all(output)[source]¶
Visualizes various plots based on the given output data.
- Parameters:
output (list) – A list containing the following elements: - cell_scores (DataFrame): DataFrame containing cell scores. - cell_roc_dict_df (DataFrame): DataFrame containing ROC curve data for cell classification. - cell_pr_dict_df (DataFrame): DataFrame containing precision-recall curve data for cell classification. - cell_cm (array-like): Confusion matrix for cell classification. - well_score (DataFrame): DataFrame containing well scores. - gene_fraction_map (dict): Dictionary mapping genes to fractions. - metadata (dict): Dictionary containing metadata. - results_df (DataFrame): DataFrame containing results. - reg_roc_dict_df (DataFrame): DataFrame containing ROC curve data for gene regression. - reg_pr_dict_df (DataFrame): DataFrame containing precision-recall curve data for gene regression. - reg_cm (array-like): Confusion matrix for gene regression. - sim_stats (dict): Dictionary containing simulation statistics. - genes_per_well_df (DataFrame): DataFrame containing genes per well data. - wells_per_gene_df (DataFrame): DataFrame containing wells per gene data.
- Returns:
The generated figure object.
- Return type:
fig (matplotlib.figure.Figure)
- spacr.sim.create_database(db_path)[source]¶
Creates a SQLite database at the specified path.
- Parameters:
db_path (str) – The path where the database should be created.
- Returns:
None
- spacr.sim.append_database(src, table, table_name)[source]¶
Append a pandas DataFrame to an SQLite database table.
Parameters: src (str): The source directory where the database file is located. table (pandas.DataFrame): The DataFrame to be appended to the database table. table_name (str): The name of the database table.
Returns: None
- spacr.sim.save_data(src, output, settings, save_all=False, i=0, variable='all')[source]¶
Save simulation data to specified location.
- Parameters:
src (str) – The directory path where the data will be saved.
output (list) – A list of dataframes containing simulation output.
settings (dict) – A dictionary containing simulation settings.
save_all (bool, optional) – Flag indicating whether to save all tables or only a subset. Defaults to False.
i (int, optional) – The simulation number. Defaults to 0.
variable (str, optional) – The variable name. Defaults to ‘all’.
- Returns:
None
- spacr.sim.save_plot(fig, src, variable, i)[source]¶
Save a matplotlib figure as a PDF file.
Parameters: - fig: The matplotlib figure to be saved. - src: The directory where the file will be saved. - variable: The name of the variable being plotted. - i: The index of the figure.
Returns: None
- spacr.sim.run_and_save(i, settings, time_ls, total_sims)[source]¶
Run the simulation and save the results.
- Parameters:
i (int) – The simulation index.
settings (dict) – The simulation settings.
time_ls (list) – The list to store simulation times.
total_sims (int) – The total number of simulations.
- Returns:
A tuple containing the simulation index, simulation time, and None.
- Return type:
tuple
- spacr.sim.validate_and_adjust_beta_params(sim_params)[source]¶
Validates and adjusts Beta distribution parameters in simulation settings to ensure they are possible.
Args: sim_params (list of dict): List of dictionaries, each containing the simulation parameters.
Returns: list of dict: The adjusted list of simulation parameter sets.
- spacr.sim.generate_paramiters(settings)[source]¶
Generate a list of parameter sets for simulation based on the given settings.
- Parameters:
settings (dict) – A dictionary containing the simulation settings.
- Returns:
A list of parameter sets for simulation.
- Return type:
list
- spacr.sim.run_multiple_simulations(settings)[source]¶
Run multiple simulations in parallel using the provided settings.
- Parameters:
settings (dict) – A dictionary containing the simulation settings.
- Returns:
None
- spacr.sim.remove_columns_with_single_value(df)[source]¶
Removes columns from the DataFrame that have the same value in all rows.
Args: df (pandas.DataFrame): The original DataFrame.
Returns: pandas.DataFrame: A DataFrame with the columns removed that contained only one unique value.
- spacr.sim.read_simulations_table(db_path)[source]¶
Reads the ‘simulations’ table from an SQLite database into a pandas DataFrame.
Args: db_path (str): The file path to the SQLite database.
Returns: pandas.DataFrame: DataFrame containing the ‘simulations’ table data.
- spacr.sim.plot_simulations(df, variable, x_rotation=None, legend=False, grid=False, clean=True, verbose=False)[source]¶
Creates separate line plots for ‘prauc’ against a specified ‘variable’, for each unique combination of conditions defined by ‘grouping_vars’, displayed on a grid.
Args: df (pandas.DataFrame): DataFrame containing the necessary columns. variable (str): Name of the column to use as the x-axis for grouping and plotting. x_rotation (int, optional): Degrees to rotate the x-axis labels. legend (bool, optional): Whether to display a legend. grid (bool, optional): Whether to display grid lines. verbose (bool, optional): Whether to print the filter conditions.
Returns: None
- spacr.sim.plot_correlation_matrix(df, annot=False, cmap='inferno', clean=True)[source]¶
Plots a correlation matrix for the specified variables and the target variable.
Args: df (pandas.DataFrame): The DataFrame containing the data. variables (list): List of column names to include in the correlation matrix. target_variable (str): The target variable column name.
Returns: None
- spacr.sim.plot_feature_importance(df, target='prauc', exclude=None, clean=True)[source]¶
Trains a RandomForestRegressor to determine the importance of each feature in predicting the target.
Args: df (pandas.DataFrame): The DataFrame containing the data. target (str): The target variable column name. exclude (list or str, optional): Column names to exclude from features.
Returns: matplotlib.figure.Figure: The figure object containing the feature importance plot.
- spacr.sim.calculate_permutation_importance(df, target='prauc', exclude=None, n_repeats=10, clean=True)[source]¶
Calculates permutation importance for the given features in the dataframe.
Args: df (pandas.DataFrame): The DataFrame containing the data. features (list): List of column names to include as features. target (str): The name of the target variable column.
Returns: dict: Dictionary containing the importances and standard deviations.
- spacr.sim.plot_partial_dependences(df, target='prauc', clean=True)[source]¶
Creates partial dependence plots for the specified features, with improved layout to avoid text overlap.
Args: df (pandas.DataFrame): The DataFrame containing the data. target (str): The target variable.
Returns: None
- spacr.sim.generate_shap_summary_plot(df, target='prauc', clean=True)[source]¶
Generates a SHAP summary plot for the given features in the dataframe.
Args: df (pandas.DataFrame): The DataFrame containing the data. features (list): List of column names to include as features. target (str): The name of the target variable column.
Returns: None