Cis Modeling Tutorial

In this tutorial I will cover how to use the new RP model API to predict gene expression and find trans-regulated genes. CisModeler is the entry point for all methods and implements an sklearn-style API similar to the topic models, namely fit, score, predict, and get_logp.

To instantiate a CisModeler object, we need to provide trained AccessibilityModel and ExpressionModel objects.

We also need to load training data. To train cis-models, we must provide expression counts and accessibility counts with shared barcodes.
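If the two count matrices were not exported together, their barcodes may differ in order or membership. Here is a minimal sketch of aligning them, using only the standard library. The function name align_barcodes and the example barcodes are made up for illustration and are not part of the package.

```python
def align_barcodes(rna_barcodes, atac_barcodes):
    """Find cells present in both assays.

    Returns the shared barcodes (in RNA order) plus the row indices needed
    to reorder each matrix so that row i refers to the same cell in both.
    """
    atac_position = {bc: i for i, bc in enumerate(atac_barcodes)}
    shared, rna_idx, atac_idx = [], [], []
    for i, bc in enumerate(rna_barcodes):
        j = atac_position.get(bc)
        if j is not None:
            shared.append(bc)
            rna_idx.append(i)
            atac_idx.append(j)
    return shared, rna_idx, atac_idx


# Hypothetical barcodes, for illustration only.
rna_barcodes = ["AAAC-1", "AAAG-1", "AACT-1", "AAGG-1"]
atac_barcodes = ["AAGG-1", "AAAC-1", "TTTT-1", "AACT-1"]

shared, rna_idx, atac_idx = align_barcodes(rna_barcodes, atac_barcodes)
print(shared, rna_idx, atac_idx)
```

Indexing the expression rows with rna_idx and the accessibility rows with atac_idx gives two matrices with matching cells.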

Lastly, when we instantiate a CisModeler, by default it will attempt to learn a model for every gene modeled by the expression topic model. This can be time-consuming, and not every gene may be appropriate or needed for downstream analysis.

I recommend only training RP models for highly variable genes, at least to start. For this tutorial, I will demonstrate with a sample set of 40 genes.
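As a rough illustration of picking variable genes, the sketch below ranks genes by their variance-to-mean ratio on raw counts. This is a deliberately simple stand-in and not the selection I used for the 40 genes here. Dedicated tools such as scanpy's highly_variable_genes normalize the mean-variance trend more carefully. The function name and the toy counts are assumptions.

```python
from statistics import mean, pvariance


def top_variable_genes(counts, gene_names, n_top):
    """Rank genes by variance / mean of their counts and keep the top n_top.

    counts: list of per-cell rows, each row holding one count per gene.
    """
    scored = []
    for g, name in enumerate(gene_names):
        column = [row[g] for row in counts]
        mu = mean(column)
        # Genes with no counts get a score of zero rather than dividing by zero.
        score = pvariance(column) / mu if mu > 0 else 0.0
        scored.append((score, name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n_top]]


# Toy data: 4 cells x 3 genes (hypothetical gene names).
gene_names = ["GeneA", "GeneB", "GeneC"]
counts = [
    [5, 0, 1],
    [5, 10, 2],
    [5, 0, 1],
    [5, 10, 2],
]
selected = top_variable_genes(counts, gene_names, n_top=2)
print(selected)
```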

General Training Procedure

When fitting/predicting/scoring RP models, the expression and accessibility states of the cells must be processed into features. Each modeling method (fit, score, predict, and get_logp) takes as arguments:

Processing these features is time-consuming, often taking longer than the actual modeling step itself. To mitigate this, when you supply these features to the cis_model object, it will process and save the features.

Then, for each method call, if features are not provided, the saved features will be used. To get this step out of the way, you can use CisModeler.load_features:
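The process-once, reuse-later behavior can be pictured with a small stand-in class. This is only a sketch of the pattern, not the CisModeler implementation: FeatureCache, its process_fn argument, and get_features are hypothetical names.

```python
class FeatureCache:
    """Hypothetical stand-in: process features once, then reuse them."""

    def __init__(self, process_fn):
        self._process_fn = process_fn  # the expensive featurization step
        self._features = None
        self.n_processed = 0           # counts how often processing ran

    def load_features(self, expression, accessibility):
        self._features = self._process_fn(expression, accessibility)
        self.n_processed += 1
        return self

    def get_features(self, expression=None, accessibility=None):
        # New data replaces the saved features; otherwise fall back to the cache.
        if expression is not None and accessibility is not None:
            self.load_features(expression, accessibility)
        if self._features is None:
            raise ValueError("No features registered: pass data or call load_features first.")
        return self._features


# Toy "processing" that just pairs up the two inputs.
cache = FeatureCache(lambda expr, acc: list(zip(expr, acc)))
cache.load_features([1, 2], [3, 4])
first = cache.get_features()
second = cache.get_features()
print(first, cache.n_processed)
```

Both calls to get_features return the saved result, and the processing function runs only once.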

Now, if you want to subset the cells into train/test sets, you can simply pass the indices of the sets to the modeling methods, and it will subset the registered features.

In my case, I only have 1000 cells, so holding out a test set would noticeably shrink my training set, and the held-out cells may not represent the data well. I skip this step here, but a split would look like this:

import numpy as np
from sklearn.model_selection import train_test_split

train_idx, test_idx = train_test_split(np.arange(len(joint_rna_data)), train_size=0.8)

cis_model.fit(idx=train_idx)

Next, fit the models. Each gene's model is fit using an L-BFGS optimizer, a quasi-Newton method that uses approximate second-order information, so its learning rate is robust and does not need the tuning that SGD might.
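To show why L-BFGS needs little step-size tuning, here is a small pure-Python sketch of the method: a two-loop recursion builds the search direction from recent gradient changes, and a backtracking line search picks the step. This is a generic textbook version written for illustration, not the optimizer used inside the package, and the function names and the toy objective are my own.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lbfgs_minimize(f, grad, x0, history=5, max_iter=100, tol=1e-8):
    """Minimize f with L-BFGS and a backtracking (Armijo) line search."""
    x, g = list(x0), grad(x0)
    pairs = []  # (s, y) = (step taken, change in gradient), oldest first
    for _ in range(max_iter):
        if max(abs(gi) for gi in g) < tol:
            break
        # Two-loop recursion: r approximates (inverse Hessian) @ gradient.
        q, alphas = list(g), []
        for s, y in reversed(pairs):
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        scale = dot(pairs[-1][0], pairs[-1][1]) / dot(pairs[-1][1], pairs[-1][1]) if pairs else 1.0
        r = [scale * qi for qi in q]
        for (s, y), a in zip(pairs, reversed(alphas)):
            b = dot(y, r) / dot(y, s)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        d = [-ri for ri in r]
        # Backtracking: start from a unit step and halve until f decreases enough.
        fx, slope, t = f(x), dot(g, d), 1.0
        while t > 1e-12 and f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * slope:
            t *= 0.5
        x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            pairs = (pairs + [(s, y)])[-history:]
        x, g = x_new, g_new
    return x


# Toy objective with very different curvature in each direction; minimum at (3, -1).
def objective(p):
    return (p[0] - 3.0) ** 2 + 10.0 * (p[1] + 1.0) ** 2


def objective_grad(p):
    return [2.0 * (p[0] - 3.0), 20.0 * (p[1] + 1.0)]


solution = lbfgs_minimize(objective, objective_grad, [0.0, 0.0])
print(solution)
```

No learning rate appears anywhere: the curvature estimate sets the scale of the step, and the line search corrects it when the estimate is poor.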

Saving and loading are simple: just provide a prefix and each model will be saved as <prefix>_<gene>.pth. To reload models, instantiate a CisModeler object as above, then provide the prefix:

cis_model.load('data/mouse_prostate/cis_models/')
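To check which genes have a saved model under a given prefix, the <prefix>_<gene>.pth naming can be matched with the standard library. The helpers model_path and saved_genes below are hypothetical and not part of the package. The demo writes empty placeholder files to a temporary directory rather than real models.

```python
import glob
import os
import tempfile


def model_path(prefix, gene):
    """Build the file name for one gene, following <prefix>_<gene>.pth."""
    return f"{prefix}_{gene}.pth"


def saved_genes(prefix):
    """List the genes that have a file matching <prefix>_<gene>.pth."""
    suffix = ".pth"
    paths = glob.glob(glob.escape(prefix) + "_*" + suffix)
    start = len(prefix) + 1  # skip the prefix and the underscore
    return sorted(p[start:-len(suffix)] for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "cis")
    for gene in ["Krt5", "Ar"]:
        # Empty placeholder files stand in for saved models.
        open(model_path(prefix, gene), "wb").close()
    found = saved_genes(prefix)

print(found)
```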

Let's evaluate the models:

To make predictions on all cells:

To analyze a particular gene's model, you may use the CisModeler.get_model method.

The .guide() method returns the MAP estimates of the model's parameters. Most notably, the a parameter scales the effects of upstream/promoter/downstream accessibility, and the logdistance parameter gives the upstream and downstream decay distances in kb, respectively.
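To build intuition for what these parameters do, here is a toy regulatory potential calculation. The functional form is an assumption on my part: accessibility near the TSS counts in full as "promoter", and accessibility further away is down-weighted by half for every decay distance travelled. The promoter half-width of 1.5 kb, the function name, and all numbers are made up, and the decay distances are passed directly in kb rather than on a log scale.

```python
def regulatory_potential(peaks, a, decay_kb, promoter_halfwidth_kb=1.5):
    """Toy regulatory potential for one gene in one cell.

    peaks: list of (signed distance to TSS in kb, accessibility);
           negative distances are upstream, positive are downstream.
    a: scale factors keyed by 'upstream', 'promoter', 'downstream'.
    decay_kb: half-decay distances keyed by 'upstream', 'downstream'.
    """
    totals = {"upstream": 0.0, "promoter": 0.0, "downstream": 0.0}
    for dist_kb, accessibility in peaks:
        if abs(dist_kb) <= promoter_halfwidth_kb:
            totals["promoter"] += accessibility
        else:
            side = "upstream" if dist_kb < 0 else "downstream"
            offset = abs(dist_kb) - promoter_halfwidth_kb
            # Weight halves for every decay distance beyond the promoter window.
            totals[side] += accessibility * 0.5 ** (offset / decay_kb[side])
    return sum(a[region] * totals[region] for region in totals)


peaks = [(-11.5, 4.0), (0.5, 2.0), (21.5, 8.0)]
a = {"upstream": 1.0, "promoter": 2.0, "downstream": 0.5}
decay_kb = {"upstream": 10.0, "downstream": 10.0}

rp = regulatory_potential(peaks, a, decay_kb)
print(rp)
```

A larger decay distance lets distal peaks contribute more, while a larger a for a region increases that region's overall influence on predicted expression.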

Cis/Trans test

Next, we want to identify genes that show interesting behavior with respect to their proximal chromatin, and to categorize genes based on that relationship. To facilitate this, we must train a second group of models, this time providing those models with access to the ATAC latent features, giving them a "view" of the entire cell state from which to make predictions. The performance of these "trans" models relative to the "cis" models serves as the cis/trans test.

Trans models are instantiated in exactly the same way as cis models, except for the differences noted below:

The trans models have exactly the same interface and methods as the cis models. To identify trans-regulated genes, use CisModeler.cis_trans_test. It is a good idea to explicitly provide features to this method so that you can be sure the cis and trans models are being scored on the same cells.

(Note: you must call this method from the trans model)

Visualizing the results shows that most genes have test_stat < 20, which marks them as cis-regulated. For some genes, the trans model describes the data significantly better:

It can be useful to compare the probability of observing the data under both models to see where the trans model performed better. We can compare the log-prob of the data given the cis and trans models like so:
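As a plain-Python picture of this comparison, the sketch below takes per-cell log-probs from the two models for one gene and computes the per-cell gain of the trans model, plus a summary statistic. I assume the usual likelihood-ratio form, 2 * (trans log-prob - cis log-prob) summed over cells. The exact statistic returned by cis_trans_test is not spelled out here, so treat this as an illustration. The function name and numbers are made up.

```python
def cis_trans_statistic(cis_logp, trans_logp):
    """Compare per-cell log-probs of one gene under the cis and trans models.

    Returns (statistic, per_cell_gain), where per_cell_gain[i] > 0 means the
    trans model explained cell i better than the cis model.
    """
    if len(cis_logp) != len(trans_logp):
        raise ValueError("Both models must be scored on the same cells.")
    per_cell_gain = [t - c for c, t in zip(cis_logp, trans_logp)]
    # Assumed likelihood-ratio form; larger values favor the trans model.
    return 2.0 * sum(per_cell_gain), per_cell_gain


# Hypothetical log-probs for four cells.
cis_logp = [-3.0, -2.5, -4.0, -6.0]
trans_logp = [-3.0, -2.0, -3.5, -1.0]

stat, gain = cis_trans_statistic(cis_logp, trans_logp)
best_cell = max(range(len(gain)), key=gain.__getitem__)
print(stat, gain, best_cell)
```

Plotting the per-cell gain on the cell embedding shows which cell states drive a high statistic for a given gene.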

Visualization

Coming soon! I'll work on the install requirements for the Dynamic tracks package.

Other stuff: Motifs

Here's a rundown on the steps for finding motif hits. I removed the downloading step for motifs and saved them with the package, and I also ran the motifs through a parser so they should be formatted correctly.

Save hits data

Filter factors in the hits data for those found in the expression data
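The filtering step amounts to keeping only the factors whose names appear among the genes in the expression data. A minimal sketch follows, with hypothetical names and a dictionary standing in for the hits data. Matching names case-insensitively is my assumption, since motif databases often use upper-case symbols while mouse gene symbols are title-case.

```python
def filter_factors(factor_names, hits_by_factor, expressed_genes):
    """Keep factors that are also present in the expression data.

    hits_by_factor: dict mapping factor name -> its motif hits
    (any per-factor object works; lists of peak indices are used below).
    """
    expressed = {gene.upper() for gene in expressed_genes}
    kept = [name for name in factor_names if name.upper() in expressed]
    return kept, {name: hits_by_factor[name] for name in kept}


# Toy inputs; "ZNF999" is a made-up factor with no matching gene.
factor_names = ["GATA3", "FOXA1", "ZNF999"]
hits_by_factor = {"GATA3": [0, 4, 7], "FOXA1": [2, 4], "ZNF999": [1]}
expressed_genes = ["Gata3", "Foxa1", "Krt5"]

kept, kept_hits = filter_factors(factor_names, hits_by_factor, expressed_genes)
print(kept, kept_hits)
```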

List TFs by enrichment in topic:

Compare topic TF enrichments: