The new interface for hyperparameter selection and modeling is the "Trainer", which implements an sklearn-style API and can therefore be tuned using sklearn model-selection constructs. Import the trainers and some other useful packages.
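A minimal import cell might look like the following; the kladi import path for the trainers is an assumption and may differ in your installation.

    import numpy as np
    import scanpy as sc
    from sklearn.model_selection import train_test_split

    # Assumed import location -- adjust to match your kladi installation
    from kladi import ExpressionTrainer, AccessibilityTrainer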

Expression Modeling

Preprocess expression count matrix

The first step in training the Expression model is, of course, QC. Follow the standard QC procedures from the scanpy tutorial.

Cells with fewer than 400 expressed genes tend to produce poor topics and embeddings, so this is a good threshold for removing cells in order to train a quality model.

This dataset has a lot of mitochondrial reads, so filter cells with a high proportion of those.

Subset cells by % mitochondrial reads, and also remove cells with a very large number of genes, since these may be doublets or technical anomalies.
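A sketch of these QC steps with scanpy; the mitochondrial-fraction and maximum-gene cutoffs below are placeholders to tune for your dataset.

    # Annotate mitochondrial genes (mouse prefix; adjust for your organism)
    adata.var['mt'] = adata.var_names.str.startswith('mt-')
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None,
                               log1p=False, inplace=True)

    # Remove low-complexity cells -- fewer than 400 genes gives poor topics
    sc.pp.filter_cells(adata, min_genes=400)

    # Example thresholds for high-mito cells and probable doublets/anomalies
    keep = (adata.obs.pct_counts_mt < 20) & (adata.obs.n_genes_by_counts < 6000)
    adata = adata[keep].copy()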

Feature Selection

With QC complete, we need to perform feature selection to identify two groups of genes. The first is the set of genes whose expression will be modeled: essentially, genes that meet a minimum mean_expression threshold. The second group is a subset of the first, and contains the more highly dispersed genes, which serve as features for the encoder. These genes are more likely to follow interesting patterns of dynamic response and regulation. scIPM will learn module relationships and impute expression patterns for all genes in the first group, but will use only the second group's expression when mapping cells to latent module compositions. This reduces training time and ensures that modules follow dynamic changes in expression rather than the basal trends that are more likely to arise in non-deviant genes.

Important: Freeze the state of the AnnData object here to preserve raw counts.
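For example:

    # Freeze the current raw counts; feature selection below log-normalizes the
    # working copy, and these counts are restored afterwards
    adata.raw = adata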

Next, find genes that meet scanpy's default min_mean requirement, which is a sensible cutoff.

Here, reduce the min_disp parameter so that genes that are expressed, but not necessarily variable, are included. All genes flagged with "highly_variable" will be modeled by scIPM.
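A sketch of this step, assuming standard scanpy log-normalization before dispersion-based selection; the exact min_disp value is a placeholder.

    # Log-normalize the working copy for dispersion estimation
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Keep scanpy's default min_mean, but relax min_disp (default 0.5) so that
    # expressed-but-stable genes are also flagged highly_variable
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.0)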

With the first group of genes selected, choose the subset of them that will be used as features for the encoder. For this, simply set a higher threshold for dispersion. Add an is_feature column to the .var dataframe which flags genes with a dispersions_norm greater than 0.8. This particular threshold should be adjusted so that between 1000 and 2000 dispersed genes are used as features.
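For example:

    # Flag the more dispersed genes as encoder features
    adata.var['is_feature'] = adata.var.dispersions_norm > 0.8

    # Check that roughly 1000-2000 genes are flagged; adjust the threshold if not
    print(adata.var.is_feature.sum())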

Important: Restore the raw count matrix from the .raw attribute, then subset the counts to genes flagged highly_variable, that is, all genes that passed the minimum-expression test.
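One way to do this, assuming no genes were dropped between freezing .raw and here, so the raw matrix still aligns with the annotated .var table:

    import anndata

    # Rebuild an AnnData of raw counts carrying the feature-selection annotations
    raw_counts = anndata.AnnData(
        X=adata.raw.X.copy(),
        obs=adata.obs.copy(),
        var=adata.var.copy(),
    )

    # Keep only the genes that passed the minimum-expression test
    data = raw_counts[:, raw_counts.var.highly_variable].copy()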

Hyperparameter Tuning

Now move on to hyperparameter tuning, which seeks to find the model that best describes your data. Kladi uses iterative Bayesian optimization with aggressive pruning to converge on the best model in the shortest possible time. The basic workflow is:

  1. Tune learning rate boundaries
  2. Iteratively optimize model hyperparameters on training set
  3. Select best model trained using validation set

First, separate cells into training and validation sets using sklearn.model_selection.train_test_split. The split should be adjusted depending on the size of the dataset: maximize the proportion of cells in the training set while still keeping enough cells for a reliable evaluation of model quality.
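For example, holding out 20% of cells for validation (the split fraction and random seed are placeholders):

    from sklearn.model_selection import train_test_split

    # Split cell indices, then slice the count matrix
    train_idx, val_idx = train_test_split(
        np.arange(data.shape[0]), train_size=0.8, random_state=0)
    train_data, val_data = data[train_idx], data[val_idx]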

Next, instantiate an ExpressionTrainer. The two arguments that must be provided are the gene names to be modeled (the features argument) and the flags marking which of those genes are used as encoder features (the is_feature column computed above).
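A sketch of instantiation; apart from features, the argument names below are assumptions and may differ from the actual kladi signature.

    # Hypothetical argument names -- check the kladi documentation for the exact signature
    trainer = ExpressionTrainer(
        features=data.var_names.values,                  # genes to be modeled
        encoder_features=data.var.is_feature.values,     # mask of encoder features (name assumed)
    )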

After instantiating a trainer, run the tune_learning_rate_bounds function to identify the minimum and maximum learning rates appropriate for the dataset. This function gradually increases the learning rate of the optimizer while monitoring the loss. The optimal boundaries for the learning rate are at the start and end of the region of the loss curve with the steepest decreasing slope.

Plot the loss curve using the ExpressionTrainer.plot_learning_rate_bounds function. The trainer will try to detect good boundaries automatically, but if they need to be adjusted, use the ExpressionTrainer.trim_learning_rate_bounds function. The first and second arguments move the minimum and maximum boundaries inward by one order of magnitude, respectively. In this example, I adjust the minimum rate upward by one order of magnitude (5e-4 --> 5e-3) so that the boundary rests at the start of the decline.

One may also directly set the learning rates with the ExpressionTrainer.set_learning_rates function.
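Putting these steps together; the function names come from the text above, but the argument values are illustrative and the exact signatures may differ.

    # Sweep learning rates against the training data and plot the loss curve
    trainer.tune_learning_rate_bounds(train_data)
    trainer.plot_learning_rate_bounds()

    # If the detected bounds need adjusting, trim them (here: raise the minimum
    # by one order of magnitude, leave the maximum alone) ...
    trainer.trim_learning_rate_bounds(1, 0)

    # ... or set them explicitly (values below are placeholders)
    trainer.set_learning_rates(5e-3, 0.1)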

With the learning rate set, use Bayesian optimization to tune the remaining hyperparameters:

The tuner updates a histogram with the number of modules chosen for each trial. When the optimizer repeatedly chooses similar num_modules hyperparameters, this is a sign that it is converging toward the optimal model. The process may be stopped after a set number of iterations, or by pressing Esc then I, I to interrupt the kernel. This will end optimization and save the results.

This function returns an optuna study, whose methods and visualization utilities are documented here: Optuna Study
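For instance (the method name tune_hyperparameters below is hypothetical; substitute the tuning method exposed by your version of the trainer):

    # Run Bayesian optimization over the remaining hyperparameters
    study = trainer.tune_hyperparameters(train_data, val_data)   # hypothetical method name

    # The returned optuna study can be inspected with the usual optuna utilities
    print(study.best_params)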

The best models found in the procedure above must now be retrained on all of the training data and compared on how well they minimize validation-set loss. The best model from this final comparison is then fit again using all of the data and returned to the user. All of this is performed by the ExpressionTrainer.select_best_model function. Changing top_n_models changes how many models from the optimization procedure are compared on validation-set loss.
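For example (arguments other than top_n_models are assumptions about the signature):

    # Retrain the top candidates, compare on validation loss, refit the winner on all data
    model = trainer.select_best_model(train_data, val_data, top_n_models=5)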

The result of this function is an ExpressionModel object which contains the methods needed for downstream analysis of the discovered modules. Now is a good time to save the model. The trainer object implements the save function, which simply takes a file name and saves the parameters and weights needed to recreate the best model.
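For example:

    # Persist the parameters and weights of the best model
    trainer.save('data/mouse_prostate/best_model.pth')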

To reload the model later, simply use:

    model = ExpressionTrainer.load('data/mouse_prostate/best_model.pth')

Using model

With the trained model, there are three main functions we can use to represent and visualize the data:

Accessibility Modeling

Preprocessing

I used MAESTRO to convert the fragment file into a binary peak-count matrix, then filtered for cells with greater than 400 accessible regions.

Feature selection

Feature selection in scATAC-seq is not as well defined as in scRNA-seq, where dispersed genes are well established as good features for differentiating cell states. No such analogue is accepted in scATAC-seq, so there are two options for choosing features for the encoder network:

  1. Use every peak
  2. If memory is limited or there are many peaks in the sample (>150K), randomly downsample to ~100K peaks.

In this example, I do not subset the features of the accessibility model.
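If option 2 were needed, a simple random downsample might look like this; the AnnData name atac_data and the peak count are placeholders.

    # Randomly keep ~100K peaks to reduce memory usage
    rng = np.random.default_rng(0)
    keep = rng.choice(atac_data.shape[1], size=100_000, replace=False)
    atac_data = atac_data[:, keep].copy()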

Hyperparameter selection

The procedure and interface for hyperparameter selection are identical, except we use AccessibilityTrainer instead of ExpressionTrainer. Rather than gene names, pass the peak locations to the features argument in the format:

[[chr, start, end], ...]

One can also pass a list of strings in the format:

["<chr>:<start>-<end>", ...]

Using accessibility model

The accessibility model has similar functions to the expression model for representation, except that impute is not available due to the memory constraints of imputing a dense probability for every peak in every cell.

Creating joint representation

To create the joint representation, we need to find cells that passed QC for both assays by intersecting the cell barcodes. Unfortunately for this assay, only 1300 cells passed both scRNA and scATAC QC thresholds.
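For example, using the expression and accessibility AnnData objects from the sketches above (names are assumptions):

    # Keep only barcodes that passed QC in both assays
    shared = np.intersect1d(data.obs_names, atac_data.obs_names)
    rna_joint = data[shared].copy()
    atac_joint = atac_data[shared].copy()
    print(len(shared))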

If I look at what sorts of cells are shared between the two assays, they appear evenly mixed among the major cell types in the RNA-seq representation (bottom), but according to the ATAC-seq representation (top), we will lose a major cell population in the joint representation. This cell type may not be captured well by scRNA-seq or may have too few transcripts to pass QC.

Since I lost a significant portion of cells by joining the representations, this UMAP loses some clarity, but it still shows separated cell types. The RNA-seq topics still appear coherent with respect to the cell representations.

Continue with previous analysis notebook ...