Tutorial V3

Since the last tutorial, I have reworked nearly all of the interfaces in the code so that every method that requires data simply takes the RNA AnnData, the ATAC AnnData, or both as arguments. The method will extract what it needs from the adata, or ask you to run other functions if something is missing. No code for the topic models has changed, so we'll start from trained topic models.

The biggest changes to the package reflect a re-think of what is data and what are models. Data is no longer stored with the models in opaque objects, but is kept within anndata objects where it can be subset, saved, and plotted with other cell and feature-associated data.

Old methods that I have not revamped will be imported from kladi; new methods will be imported from kladiv2 (soon to be mira).

0. Imports

1. Load data

This rna adata object has been subset for cells passing QC, but not for highly-variable genes. Sometimes it's interesting to plot genes that weren't modeled as part of the topic model.

1. Load topic models

For the RNA model, you can simply use "load_old_model".

For the accessibility model, however, the model now relies on the var_names of peaks instead of their (chr, start, end) coordinates, so you must pass new var_names in the same order as the peaks the model was trained on.

If you save these models after loading them (save method), then you don't have to use load_old_model, just use load to reload them.

2. Model data, get joint representation

Using the topic models is now much more straightforward. Simply provide an AnnData with all the genes used in the training step, and the method will subset and extract the correct features automatically.

If your raw counts are stored in a layer other than .X, you must set the "counts_layer" attribute of the model (usually this would be done during training).

2A. Modeling data, getting UMAP features

After each method, you can see what's new in the adata object you passed. This method added topic compositions to "obsm", as well as a column for each topic that makes plotting more convenient.

To get features for the UMAP representation, run get_umap_features:

And lastly, to get imputed values for the RNA data:

2B. Make UMAP

To make the UMAP, we must get nearest neighbors for each cell using Manhattan distance to leverage the sparse, orthonormal features.
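
As a rough illustration of what a Manhattan-distance neighbor search does, here is a minimal pure-Python sketch. It is not the scanpy or kladiv2 implementation; the `manhattan_knn` helper and the toy feature values are made up for this example.

```python
def manhattan(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def manhattan_knn(features, k):
    """For each row, return the indices of its k nearest other rows."""
    neighbors = []
    for i, u in enumerate(features):
        dists = [(manhattan(u, v), j) for j, v in enumerate(features) if j != i]
        dists.sort()  # closest first; ties broken by index
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


# Toy "UMAP features": four cells, three features (hypothetical values).
toy_features = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]

knn = manhattan_knn(toy_features, k=1)
print(knn)
```

Cells 0 and 1 pair up, as do cells 2 and 3, because their feature vectors differ in only a few coordinates.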

Using the distance matrix produced above, we may also perform clustering on cells based on topics, which may yield clusters that are slightly less biologically-arbitrary.

To plot topics, you can invoke the topic columns by passing rna_model.topic_cols attribute to the color argument. Here I also pass topic_umap() as kwargs, which is a generator from the kladiv2.preferences file. This simply provides pre-built arguments to change the plotting behavior.

You can adjust these defaults by passing the preferred value as an argument:

3. RNA topic analysis

The most basic step in topic analysis is simply getting the top N genes for a topic:
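
Conceptually, this amounts to ranking genes by their weight in the topic and keeping the first N. The sketch below shows that idea in plain Python; `top_genes` is a hypothetical helper, not the kladiv2 method, and the weights are invented.

```python
def top_genes(gene_names, topic_weights, n):
    """Return the n gene names with the largest weights for one topic."""
    ranked = sorted(zip(gene_names, topic_weights), key=lambda gw: gw[1], reverse=True)
    return [gene for gene, _ in ranked[:n]]


# Hypothetical gene weights for a single topic.
genes = ["LEF1", "WNT3", "RUNX2", "TCF7"]
weights = [2.1, 0.3, -1.0, 1.5]

top2 = top_genes(genes, weights, n=2)
print(top2)
```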

Which can then be used to plot representative genes:

To do enrichment analysis, you can either post genelists one-at-a-time, or post them all.

To fetch ontology enrichments, you can also do one topic at a time, or all at once. You can pass the fetch methods your own list of ontologies of interest. By default, we use the "Legacy" ontologies, which are listed under rich.LEGACY_ONTOLOGIES. For a list of all searchable ontologies, see https://maayanlab.cloud/Enrichr/#libraries.

To manually view the enrichment results:

Enrichment results are organized as:

{
    ontology : [
        {
            rank : 
            term : 
            pvalue : 
            zscore :
            combined_score :
            genes :
            adj_pvalue:
        },

        ...
    ]
}
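
Given results shaped like the structure above, pulling out significant terms is a matter of walking the nested dictionary. The sketch below assumes exactly that layout; the ontology name, terms, and numbers are placeholders, and `significant_terms` is a hypothetical helper.

```python
# Placeholder results following the layout described above (values are invented).
results = {
    "ExampleOntology": [
        {"rank": 1, "term": "term A", "pvalue": 0.001, "zscore": -2.0,
         "combined_score": 13.8, "genes": ["LEF1", "WNT3"], "adj_pvalue": 0.01},
        {"rank": 2, "term": "term B", "pvalue": 0.2, "zscore": -0.5,
         "combined_score": 0.8, "genes": ["TCF7"], "adj_pvalue": 0.4},
    ]
}


def significant_terms(results, alpha=0.05):
    """Collect (ontology, term, adj_pvalue) for terms with adj_pvalue below alpha."""
    hits = []
    for ontology, records in results.items():
        for record in records:
            if record["adj_pvalue"] < alpha:
                hits.append((ontology, record["term"], record["adj_pvalue"]))
    return sorted(hits, key=lambda hit: hit[2])  # most significant first


hits = significant_terms(results)
print(hits)
```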

To plot:

If you're interested in the top topics for a certain gene:

If you save either the ATAC or RNA model after getting enrichment results, that data will be saved with it and will be available upon reload.

4. Accessibility Topic Analysis

4A. Motif scanning

To analyze ATAC-seq topics, we first need to find motif hits in our peaks using the find_motifs function.

Pass the ATAC adata object to the method, and using the chrom, start, end kwargs, indicate which columns of the data contain that information. Finally, pass the file location of a fasta for your genome to scan sequences.

This function will add binding data to your adata object. Save it once it's done.

Not every motif-associated factor is expressed in our data. You can filter out non-expressed or lowly-expressed TFs using the function below. This function simply masks factors that aren't found in the provided list. If you want to change your criteria later, you can just provide a new list.
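
The masking idea can be pictured as a boolean vector over the scanned factors. This is only a sketch of the concept, not the package's function: `factor_mask` is a hypothetical helper, and the case-insensitive matching is an assumption I added to cope with differing gene-name capitalization.

```python
def factor_mask(motif_factors, expressed_genes):
    """True for factors found in the expressed list (case-insensitive match)."""
    expressed = {gene.upper() for gene in expressed_genes}
    return [factor.upper() in expressed for factor in motif_factors]


# Hypothetical inputs: factors with scanned motifs, and genes passing an expression filter.
motif_factors = ["LEF1", "RUNX2", "NFIL3", "TCF7"]
expressed_genes = ["Lef1", "Tcf7"]

mask = factor_mask(motif_factors, expressed_genes)
print(mask)

# Changing the criteria later just means recomputing the mask from a new list.
new_mask = factor_mask(motif_factors, ["Runx2"])
```

Because the motif hits themselves are untouched, the mask can be recomputed at any time without rescanning.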

Don't forget to save! The steps above are time-consuming but only need to be executed once.

4B. Analysis

Finally, with motifs scanned, we can find enrichments in the ATAC-seq topics. Often, it's most interesting to compare the enrichments of TFs during transitions in cell state. Below, I compare enrichments moving from undifferentiated TAC cells (topic 0) to pre-branch TAC cells (topic 15).

Let's mask for factors which are expressed highly in the HF:

This function performs a Fisher's exact test to find TFs which are preferentially enriched for peaks in the top quantile of activation for a given topic.
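
To make the test concrete, here is a small, self-contained sketch of a one-sided Fisher's exact test on a 2x2 table of peaks (top-quantile vs. other peaks, motif hit vs. no hit). It shows the statistical idea only; how the package builds the table and which alternative it tests are assumptions here, and `fisher_exact_greater` is a hypothetical helper.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value P(X >= a) for the 2x2 table [[a, b], [c, d]].

    a: top-quantile peaks with a motif hit     b: top-quantile peaks without
    c: other peaks with a motif hit            d: other peaks without
    """
    row1 = a + b          # number of top-quantile peaks
    col1 = a + c          # number of peaks with a motif hit
    total = a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    numer = sum(comb(row1, x) * comb(total - row1, col1 - x) for x in range(a, upper + 1))
    return numer / denom


# Toy table (invented counts): 3 of 4 top-quantile peaks have a hit, 1 of 4 others do.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(round(p_value, 4))
```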

The test function above returns results, but as with the RNA model, they can be accessed from the ATAC model later using the get_enrichments method.

Let's plot enrichment of TFs transitioning from topic 0 to 15. Just for fun, let's color the TFs by whether they are highly expressed in the Cortex or Medulla topic. Coloring is a bit tricky. To color a TF on the plot, we have to pass a dictionary where the keys are TF names, and the values are the value to be mapped to a color. If a TF is on the plot, but not in the dictionary, it will be assigned to the na_color.

From the plot above, we can see that many TFs are important to both topics (DLX, LHX, MSX, HOX etc.), while NFIL3 might be more influential in topic 0. Strikingly, LEF1, RUNX2, TCF7, and TCF7L2 are very specific for topic 15, suggesting these factors are guiding TACs to their branch point decision. LEF1 is also highly expressed in the Cortex or Medulla.

If we want to visualize the influence of factors cell-by-cell, we can compute a motif scores adata:

This takes a while, so be sure to save it once finished. The method above returns an entirely new adata object where the features are factors, and each value is the normalized score of that factor for that cell. If you transfer the UMAP to this new anndata, you can investigate patterns of influence for each TF:

5. Pseudotime

Next, we can do pseudotime analysis, which has an all-new functional API. The workflow is now:

diffmap --> get_transport_map --> get_branch_probabilities --> get_tree_structure

To start, we must produce a denoised diffusion map representation of the data, and for this, we can use existing scanpy functions.

To start with get_transport_map, we have to choose a start cell. Here, I choose the cell with the highest composition of ATAC topic 10, which corresponds with the root of the differentiation in the ORS. This method does three things:

  1. From the start cell, finds pseudotime of each cell
  2. Constructs affinity matrix between cells, using pseudotime to prune backwards edges
  3. Converts affinity to stochastic transition matrix describing forward differentiation process
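
Steps 2 and 3 above can be sketched in a few lines of plain Python: zero out edges that point backwards in pseudotime, then normalize each row so it sums to one. This is an illustration under simplifying assumptions, not the get_transport_map algorithm; in particular, treating cells with no forward edges as absorbing is my choice for the example.

```python
def forward_transition_matrix(affinity, pseudotime):
    """Prune backward edges, then row-normalize into a stochastic matrix."""
    n = len(affinity)
    transitions = []
    for i in range(n):
        # Keep only edges to cells at the same or a later pseudotime.
        row = [affinity[i][j] if pseudotime[j] >= pseudotime[i] else 0.0 for j in range(n)]
        total = sum(row)
        if total == 0:
            # No forward neighbors: treat the cell as absorbing (assumption).
            row = [1.0 if j == i else 0.0 for j in range(n)]
        else:
            row = [weight / total for weight in row]
        transitions.append(row)
    return transitions


# Toy example: three cells ordered along pseudotime, symmetric affinities (invented).
pseudotime = [0.0, 0.5, 1.0]
affinity = [
    [0, 1, 1],
    [1, 0, 3],
    [1, 3, 0],
]

T = forward_transition_matrix(affinity, pseudotime)
for row in T:
    print(row)
```

Each row of `T` sums to one, and the latest cell only transitions to itself.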

Using the eigenvectors of the transition matrix, we may also find possible terminal states.

The states found aren't always that reasonable, so plot them and make sure they make sense:

Next, we must provide our terminal cells so that we can calculate branch probabilities. To the terminal_cells arg, provide a dictionary where the keys are lineage names and the values are the terminal cell number.
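
One way to think about branch probabilities is as absorption probabilities of a Markov chain whose terminal cells are absorbing states. The sketch below computes them by simple fixed-point iteration on a toy transition matrix. It assumes this interpretation and is not the package's implementation; only the shape of the `terminal_cells` dictionary (lineage name to cell number) follows the description above.

```python
def branch_probabilities(transitions, terminal_cells, n_iter=100):
    """Probability that each cell ends in each lineage's terminal cell."""
    n = len(transitions)
    terminal_ids = set(terminal_cells.values())
    probs = {}
    for lineage, target in terminal_cells.items():
        # Start with certainty only at the lineage's own terminal cell.
        p = [1.0 if i == target else 0.0 for i in range(n)]
        for _ in range(n_iter):
            p = [
                p[i] if i in terminal_ids
                else sum(transitions[i][j] * p[j] for j in range(n))
                for i in range(n)
            ]
        probs[lineage] = p
    return probs


# Toy chain (invented): cell 0 -> cell 1 -> cell 2 (70%) or cell 3 (30%).
transitions = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
terminal_cells = {"Cortex": 2, "Medulla": 3}

probs = branch_probabilities(transitions, terminal_cells)
print(probs["Cortex"])
print(probs["Medulla"])
```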

Finally, the get_tree_structure function decomposes the Markov chain into a series of discrete bifurcations. This representation makes it possible to plot the pseudotemporal differentiation using streams.

The function has one parameter, threshold, which changes the tolerance to divergence at a branch site. If threshold is high, branches occur later, while if threshold is low, the algorithm is very sensitive to branches. In general, values from 0.1 to 1 work well, but you may adjust this higher to ensure a certain size of pre-branch populations.

We can now plot topics along the pseudotime. The plot_stream function does a lot of different things, but to start, you just provide the adata that you ran the previous pseudotime methods on. Then to the data arg, pass the names of columns that you want to plot, in this case, topics that are influential in the HF subsystem.

Besides various standard plotting functionalities (palette, ax, title), you can also change the style arg to any in

and you can set split to True if you want multiple plots instead of stacked streams! Lastly, I pass adjusted_pseudotime to the pseudotime_key arg, but if you leave it unset, it defaults to mira_pseudotime.

Interestingly, expression topics around the branch point don't change very much, but accessibility topics show major changes prior to the branch point. Let's plot some key accessibility and expression topics at the same time. Here, I pass a custom palette as a list so that ATAC topics are Blues and expression topics are Greys.

If you want to subset to only see parts of the tree, you can just subset the input anndata. For example, if I wanted to see only cells which are going to differentiate into Cortex or Medulla cells, I can subset based on the tree_states column found using the get_tree_structure function.

And if you only want to see one single lineage, select from the tree_states column, including only states that contain the lineage-of-interest. To format the plot as a linear differentiation, set tree_structure to False.
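
Selecting states that contain a lineage is essentially a string-membership filter. The sketch below assumes each tree state is a comma-separated list of the lineages still reachable from it; that label format, the "Other" lineage, and the `in_lineage` helper are assumptions for illustration, so check the actual values in your tree_states column.

```python
def in_lineage(state, lineage, sep=", "):
    """True if the tree state's label lists the given lineage."""
    return lineage in state.split(sep)


# Hypothetical tree_states values, one per cell.
tree_states = [
    "Cortex, Medulla, Other",
    "Cortex, Medulla",
    "Cortex",
    "Medulla",
    "Other",
]

# Boolean mask of cells on the path to the Cortex lineage.
cortex_mask = [in_lineage(state, "Cortex") for state in tree_states]
print(cortex_mask)
```

A mask like this can then be used to subset the adata before plotting.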

One may also visualize data using the swarm style:

Or with split streams:

Lastly, the scatter mode is useful for making two- or three-way comparisons between features, for example TF expression vs influence or gene expression vs accessibility:

Below, I show an example where I pass a list to the layers argument. This just tells plot_stream where to look for columns which have multiple representations (raw, imputed, normalized, etc.). Passing None tells the function to look in the default place. If you pass just a string, that layer will be expanded as the location of all features.

To save this pseudotime model of the data, we can just save the adata:

6. RP Modeling

Next, we can train RP models to study the timing of accessibility vs expression changes, and also to do cis/trans testing.

6A.

First, we need to find distances between peaks and genes to train the RP functions using the function add_peak_gene_distances. This function will add a "distance_to_TSS" matrix to the ATAC adata showing the distance (positive for downstream, negative for upstream) between every peak and every gene in a provided annotation.
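
The sign convention can be illustrated with a small function. This sketch measures from the peak center and flips the sign for minus-strand genes so that "downstream" always follows the direction of transcription; both of those details are assumptions for the example, and `signed_distance_to_tss` is a hypothetical helper rather than part of the package.

```python
def signed_distance_to_tss(peak_start, peak_end, tss, strand):
    """Signed distance from the TSS to the peak center.

    Positive means downstream of the TSS (in the direction of transcription),
    negative means upstream.
    """
    center = (peak_start + peak_end) // 2
    offset = center - tss
    return offset if strand == "+" else -offset


# Same peak, same TSS coordinate, opposite strands (invented coordinates).
plus = signed_distance_to_tss(1000, 1200, tss=1000, strand="+")
minus = signed_distance_to_tss(1000, 1200, tss=1000, strand="-")
print(plus, minus)
```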

First, we need to get a gene annotation for the data. A good way to get all of the needed information (a dataframe containing chr, start, end, gene_name, strand) is to go to the UCSC Table Browser.

  1. Go to UCSC table browser: https://genome.ucsc.edu/cgi-bin/hgTables
  2. Select your species and assembly, then under "group" choose "Genes and Gene Prediction", for "track", select "GENCODE VM23", for "table", select "knownGene".
  3. Under "Retrieve and Display Data", set the "output format" box to "selected fields from primary and related tables"
  4. Hit "get output"
  5. Next, a field selection window will pop up. Scroll down to "Linked Tables", and choose "kgXref" and "knownCanonical". These tables contain information on gene IDs and canonical splice variants, respectively. At the very bottom, hit "allow selection from checked tables".
  6. Scroll back up to the top, then check the fields we need from each table:
  7. Hit "get output" under the top table.

Done!

We'll read in the annotation data, do a little pre-processing, then use it to get peak-gene distances. Once I figure out the API for UCSC, I can automate this or host it for select species.

Now we can get TSS distances:

6B. Training Models

Next, we can train RP Models. To instantiate a modeler, pass an RNA model, ATAC model, and a list of genes that you want to analyze.

I recommend modeling all highly-variable genes, as well as genes which are in the top 200 for each topic to really cover all the bases. In this example, I'll just model a few genes that are interesting.

To fit the models, pass the RNA and ATAC adatas. To start, the method will check for certain fields in the adatas, and will calculate them if not detected. This can take a considerable amount of time, so you can resave the adatas afterwards to skip this step if you want to run it again.

Also, if you're training a lot of genes, it can be good to prepare for random errors that might occur. You can pass a SaveCallback object to the fit method, which will save each model after it's trained so you don't lose your progress. Just provide the callback with a filename prefix.

Alternatively, you can save the models after fitting:

And reload them later:

If you want to access the parameters of a certain model, use:

We can see the decay distances for WNT3 are 8 kb upstream and 28.5 kb downstream.

There are a couple of useful functions with the RP model:
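
To give a feel for what a decay distance means, the sketch below assumes that a peak's regulatory weight halves every decay distance away from the TSS, with separate upstream and downstream rates. The half-life form is an assumption on my part, not something confirmed by the model's code; `rp_weight` is a hypothetical helper, and the defaults simply reuse the WNT3 values quoted above.

```python
def rp_weight(signed_distance_bp, upstream_decay_bp=8_000, downstream_decay_bp=28_500):
    """Weight of a peak at a signed distance from the TSS (negative = upstream).

    Assumes the weight halves every `decay` base pairs (an illustrative choice).
    """
    decay = downstream_decay_bp if signed_distance_bp >= 0 else upstream_decay_bp
    return 0.5 ** (abs(signed_distance_bp) / decay)


# A peak 8 kb upstream and a peak 28.5 kb downstream get the same weight.
print(rp_weight(-8_000), rp_weight(28_500))
print(rp_weight(-16_000))  # two upstream decay distances away
```

Under this assumption, downstream peaks retain influence over a much longer range than upstream peaks for this gene.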

Notably, you can use any data that has been labeled with get_distance_to_TSS to make predictions, since the trained models are data-agnostic. Here, I predict expression of my genes on the hair follicle subset of the data.

Now we can plot from these layers:

WNT3's cis model does pretty well! But LEF1 appears difficult to model. The Cis Model expects higher expression at the branch, while expression actually peaks at the tip of the Cortex lineage. Looking at the logP(data) for LEF1 shows the model is most surprised by the data at the tip of the cortex lineage. Let's visualize this with a stream:

LEF1 shows some interesting behavior. Just how bad is LEF1's Cis model compared to WNT3's? We can train a Trans model to find out.

The arguments to a TransModel object are exactly the same as CisModel, except it takes an additional arg: initialization_model. To lower the variance of the trans model and ensure convergence, we initialize the trans model with the learned parameters of the cis model.

The trans model represents the best estimate of gene expression we could ever make using chromatin-based features. Let's see how it compares to true and cis-predicted expression.

7. Cis/Trans Testing

Cis/Trans testing is most interesting with a lot of genes, so I will import a prepared adata object with predictions and logps from 4500 models that I've already trained.

To make this dataset, I already trained Cis and Trans models, then ran predict and get_logp using each.

First, we'll run the Cis/Trans test. The degrees_of_freedom is equal to the number of parameters of the ATAC model.
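
A standard way to compare two such models is a likelihood-ratio statistic: twice the difference in total log-likelihood between the richer (Trans) model and the simpler (Cis) model, which is then compared against a chi-square distribution with the stated degrees of freedom. Whether the package computes its test statistic exactly this way is an assumption; the sketch below only shows the statistic itself, with invented per-cell log-probabilities.

```python
def likelihood_ratio_statistic(cis_logp, trans_logp):
    """2 * (total trans log-likelihood - total cis log-likelihood) for one gene."""
    return 2.0 * (sum(trans_logp) - sum(cis_logp))


# Invented per-cell log-probabilities for two hypothetical genes.
logps = {
    "geneA": {"cis": [-2.0, -3.0, -4.0], "trans": [-1.5, -2.5, -2.0]},
    "geneB": {"cis": [-2.0, -2.0, -2.0], "trans": [-2.0, -1.5, -2.0]},
}

stats = {
    gene: likelihood_ratio_statistic(values["cis"], values["trans"])
    for gene, values in logps.items()
}

# Rank genes from most to least improved by the Trans model.
ranking = sorted(stats, key=stats.get, reverse=True)
print(stats, ranking)
```

Larger values indicate genes whose expression the Trans model explains much better than the Cis model.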

The test added a new column: global_regulation_test_statistic, which we can use to rank genes based on the performance of the Trans vs Cis model. Let's check to make sure that high-count genes are not biased towards global regulation.

Let's see the top globally-regulated genes and plot a few of them.

The plot_chromatin_differential function is useful for exploring sets of genes and analyzing their relationship between chromatin accessibility and expression. Just list genes you want to see for the genes arg. The last figure in the panel shows for each cell the difference between Cis and Trans model predictions. Assuming the Trans model is the best model one can make using chromatin-based features, points which are off the line show where the Cis model over or under-estimated expression.

An interesting question for any given enrichment is whether that geneset is especially Cis- or Trans-regulated. We can test that using the global_ontology_term_test function. Provided the results from an enrichment, it will test the overlapping genes to see if they are significantly Cis- or Trans-regulated. We can use enrichments from our RNA topics like so:

The top terms show very significant enrichment for nonlocal regulation. Checking the chromatin differentials below, we see that the chromatin around these genes appears to be active and increasing at the branch point before they are expressed.

If we have a geneset that we would like to test, perhaps downloaded from enrichr, we can use the global_geneset_test function.

To analyze general trends in globally-regulated genes, we can take the top 500 and see what they are enriched for using the new enrichr API.

The 4th result under the WikiPathways_2019_Mouse ontology is very interesting. Shown below, EDA signaling and some downstream effectors show similar patterns of expression. Particularly, they are highly expressed at the branch, but shut off quickly afterwards. The chromatin around these genes, however, remains accessible. This suggests signaling is exerting control over expression of these genes.

8. Driver TF Analysis

Coming soon!