from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
Since the last tutorial, I have reworked nearly 100% of the interfaces in the code so that every method that requires data simply takes the RNA anndata, the ATAC anndata, or both as arguments. The method will extract what it needs from the adata, or ask you to run other functions if something is missing! No code for the topic models has changed, so we'll start from trained topic models.
The biggest changes to the package reflect a re-think of what is data and what are models. Data is no longer stored with the models in opaque objects, but is kept within anndata objects where it can be subset, saved, and plotted with other cell and feature-associated data.
Old methods that I have not revamped will be imported from kladi, and new methods will be imported from kladiv2 (soon to be mira).
from kladiv2.topic_model.expression_model import ExpressionModel #topic models
from kladiv2.topic_model.accessibility_model import AccessibilityModel
from kladiv2.cis_model.cis_model import CisModel, TransModel #RP models
from kladiv2.tools import pseudotime #new pseudotime api
from kladiv2.tools.motif_scan import find_motifs #motif scanning tool is separate from topic models
from kladiv2.tools.connect_genes_peaks import get_distance_to_TSS #finds distance between TSS in rna data and peaks in atac data
from kladiv2.tools import enrichr_enrichments as rich #enrichment API and plots separate from topic models.
from kladiv2.tools import utils
from kladiv2.tools import global_local_test as glt
from kladiv2.plots.chromatin_differential_plot import plot_chromatin_differential
from kladiv2.plots.streamplot import plot_stream
from kladiv2.preferences import raw_umap, topic_umap
import scanpy as sc
import anndata
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info('test')
INFO:root:test
rna_data = anndata.read_h5ad('data/shareseq/2021-08-12_checkpoints/rna_data.h5ad')
atac_data = anndata.read_h5ad('data/shareseq/2021-08-12_checkpoints/atac_data.h5ad')
This rna adata object has been subset for cells passing QC, but not for highly-variable genes. Sometimes it's interesting to plot genes that weren't modeled as part of the topic model.
rna_data.shape
(28429, 18938)
For the RNA model, you can simply use "load_old_model".
rna_model = ExpressionModel.load_old_model('/Users/alynch/Dropbox (Partners HealthCare)/Data/shareseq/2021-08-05_best_rna_model.pth')
WARNING:kladiv2.topic_model.base:Cuda unavailable. Will not use GPU speedup while training. INFO:kladiv2.topic_model.base:Moving model to device: cpu
For the accessibility model, however, the code now relies on the var_names of peaks instead of their (chr, start, end) coordinates, so you must pass new var_names in the same order as the peaks the model was trained on.
atac_model = AccessibilityModel.load_old_model('/Users/alynch/Dropbox (Partners HealthCare)/Data/shareseq/2021-08-05_best_atac_model.pth',
atac_data.var_names.values)
WARNING:kladiv2.topic_model.base:Cuda unavailable. Will not use GPU speedup while training. INFO:kladiv2.topic_model.base:Moving model to device: cpu
If you save these models after loading them (save method), then you don't have to use load_old_model, just use load to reload them.
Using the topic models is now much more straightforward. Simply provide an anndata with all the genes used in the training step, and the method will subset and extract the correct features automatically.
If your raw counts are stored in a layer other than .X, you must set the "counts_layer" attribute of the model (usually this would be done during training).
rna_model.counts_layer = None #keep as None
help(rna_model.predict)
Help on method predict in module kladiv2.topic_model.base: predict(adata, batch_size=512, add_key='X_topic_compositions', add_cols=True, col_prefix='topic_') method of kladiv2.topic_model.expression_model.ExpressionModel instance
rna_model.predict(rna_data)
Predicting latent vars: 100%|██████████| 56/56 [00:06<00:00, 8.55it/s] INFO:kladiv2.core.adata_interface:Added key to obsm: X_topic_compositions INFO:kladiv2.core.adata_interface:Added cols: topic_0, topic_1, topic_2, topic_3, topic_4, topic_5, topic_6, topic_7, topic_8, topic_9, topic_10, topic_11, topic_12, topic_13, topic_14, topic_15, topic_16, topic_17, topic_18, topic_19, topic_20, topic_21, topic_22, topic_23
After each method, you can see what's new in the adata object you passed. This method added topic compositions to "obsm", as well as a column for each topic that makes plotting more convenient.
atac_model.predict(atac_data)
Predicting latent vars: 100%|██████████| 56/56 [00:46<00:00, 1.21it/s] INFO:kladiv2.core.adata_interface:Added key to obsm: X_topic_compositions INFO:kladiv2.core.adata_interface:Added cols: topic_0, topic_1, topic_2, topic_3, topic_4, topic_5, topic_6, topic_7, topic_8, topic_9, topic_10, topic_11, topic_12, topic_13, topic_14, topic_15, topic_16, topic_17, topic_18, topic_19, topic_20, topic_21, topic_22, topic_23
To get features for the UMAP representation, run get_umap_features:
rna_model.get_umap_features(rna_data)
atac_model.get_umap_features(atac_data)
Predicting latent vars: 100%|██████████| 56/56 [00:06<00:00, 8.72it/s] INFO:kladiv2.core.adata_interface:Added key to obsm: X_umap_features Predicting latent vars: 100%|██████████| 56/56 [00:46<00:00, 1.21it/s] INFO:kladiv2.core.adata_interface:Added key to obsm: X_umap_features
rna_data, atac_data = utils.make_joint_representation(rna_data, atac_data)
INFO:kladiv2.tools.utils:28429 out of 28429 cells shared between datasets (100%). INFO:kladiv2.tools.utils:Key added to obsm: X_joint_umap_features
And lastly, to get imputed values for the RNA data:
rna_model.impute(rna_data)
INFO:kladiv2.core.adata_interface:Fetching key X_topic_compositions from obsm INFO:kladiv2.topic_model.base:Predicting latent variables ... Imputing features: 100%|██████████| 56/56 [00:00<00:00, 81.92it/s] INFO:kladiv2.core.adata_interface:Added layer: imputed
To make the UMAP, we must get nearest neighbors for each cell using Manhattan distance to leverage the sparse, orthonormal features.
sc.pp.neighbors(rna_data, use_rep='X_joint_umap_features', metric='manhattan') #manhattan distance is crucial
sc.tl.umap(rna_data, min_dist = 0.1) # min_dist = 0.1 makes more pretty maps
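To make the distance choice concrete, here is a minimal sketch of Manhattan (L1) distance between per-cell feature vectors. The function name and the three vectors are made up for illustration; they are not output from the models above.

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical sparse feature vectors for three cells (illustrative values only).
cell_a = [0.9, 0.0, 0.1, 0.0]
cell_b = [0.8, 0.0, 0.2, 0.0]
cell_c = [0.0, 0.7, 0.0, 0.3]

dist_ab = manhattan_distance(cell_a, cell_b)  # cells with similar features
dist_ac = manhattan_distance(cell_a, cell_c)  # cells with non-overlapping features
print(dist_ab, dist_ac)
```

Cells that load on the same features end up close together, while cells with non-overlapping features are far apart.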
Using the distance matrix produced above, we may also perform clustering on cells based on topics, which may yield clusters that are slightly less biologically-arbitrary.
fig, ax = plt.subplots(1,1,figsize=(10,10))
sc.pl.umap(rna_data, color = 'true_cell', legend_loc='on data',
palette= 'tab20', show = False, ax = ax, frameon=False, title = '')
plt.show()
To plot topics, pass the rna_model.topic_cols attribute to the color argument. Here I also pass the output of topic_umap() as kwargs. topic_umap is a helper from the kladiv2.preferences file that simply provides pre-built arguments to change the plotting behavior.
topic_umap()
{'ncols': 8, 'frameon': False, 'color_map': 'inferno'}
You can adjust these defaults by passing the preferred value as an argument:
topic_umap(ncols=1)
{'ncols': 1, 'frameon': False, 'color_map': 'inferno'}
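A helper like this is easy to write for your own plotting defaults. The sketch below is not the kladiv2.preferences code; umap_defaults is a hypothetical name, and the default values are copied from the output shown above.

```python
def umap_defaults(**overrides):
    """Return default plotting kwargs, with any caller-supplied overrides applied."""
    defaults = {'ncols': 8, 'frameon': False, 'color_map': 'inferno'}
    defaults.update(overrides)  # caller-supplied values win over defaults
    return defaults

default_kwargs = umap_defaults()
single_column = umap_defaults(ncols=1)
print(default_kwargs)
print(single_column)
```

Because the function returns a plain dictionary, it can be unpacked into any plotting call with `**`.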
sc.pl.umap(rna_data, color = rna_model.topic_cols, **topic_umap())
sc.pl.umap(atac_data, color = atac_model.topic_cols, **topic_umap())
The most basic step in topic analysis is simply getting the top N genes of a topic:
rna_model.get_top_genes(10, top_n = 8)
array(['5430421N21RIK', 'KRT33A', 'GM11571', 'DSG4', 'RNASET2B', 'KRT31', 'MTCL1', 'KRT35'], dtype=object)
Which can then be used to plot representative genes:
sc.pl.umap(rna_data, color = rna_model.get_top_genes(13, top_n = 6), **raw_umap(ncols=6))
To do enrichment analysis, you can either post genelists one-at-a-time, or post them all.
help(rna_model.post_topic) # one topic at a time
Help on method post_topic in module kladiv2.topic_model.expression_model: post_topic(topic_num, top_n=None, min_genes=200, max_genes=600) method of kladiv2.topic_model.expression_model.ExpressionModel instance
help(rna_model.post_topics) # all at once
Help on method post_topics in module kladiv2.topic_model.expression_model: post_topics(top_n=None, min_genes=200, max_genes=600) method of kladiv2.topic_model.expression_model.ExpressionModel instance
rna_model.post_topic(12, top_n=250) # Usually 250 genes works pretty well.
To fetch ontology enrichments, you can also do one topic at a time, or all at once. You can provide the fetch methods with your own list of ontologies that you are interested in. By default, we use the "Legacy" ontologies, which are listed under rich.LEGACY_ONTOLOGIES. For a list of all searchable ontologies, see https://maayanlab.cloud/Enrichr/#libraries.
rich.LEGACY_ONTOLOGIES
['WikiPathways_2019_Human', 'WikiPathways_2019_Mouse', 'KEGG_2019_Human', 'KEGG_2019_Mouse', 'GO_Molecular_Function_2018', 'GO_Cellular_Component_2018', 'GO_Biological_Process_2018', 'BioPlanet_2019']
rna_model.fetch_topic_enrichments(12, ontologies = rich.LEGACY_ONTOLOGIES)
help(rna_model.fetch_enrichments) # all at once
Help on method fetch_enrichments in module kladiv2.topic_model.expression_model: fetch_enrichments(ontologies=['WikiPathways_2019_Human', 'WikiPathways_2019_Mouse', 'KEGG_2019_Human', 'KEGG_2019_Mouse', 'GO_Molecular_Function_2018', 'GO_Cellular_Component_2018', 'GO_Biological_Process_2018', 'BioPlanet_2019']) method of kladiv2.topic_model.expression_model.ExpressionModel instance
To manually view the enrichment results:
rna_model.get_enrichments(12)['WikiPathways_2019_Human'][0]
{'rank': 1, 'term': 'Sphingolipid pathway WP1422', 'pvalue': 0.004522125888624752, 'zscore': 9.982793522267206, 'combined_score': 53.89483679511331, 'genes': ['SGPL1', 'SPHK2', 'SGPP2'], 'adj_pvalue': 0.22671935701100768}
Enrichment results are organized as:
{
ontology : [
{
rank :
term :
pvalue :
zscore :
combined_score :
genes :
adj_pvalue:
},
...
]
}
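Given that layout, pulling out terms below a p-value cutoff is a short loop over the nested dictionary. This is a minimal sketch: significant_terms is a hypothetical helper, and the single record is copied from the example output above (a real result holds many records per ontology).

```python
# One record copied from the example output above.
enrichments = {
    'WikiPathways_2019_Human': [
        {'rank': 1, 'term': 'Sphingolipid pathway WP1422',
         'pvalue': 0.004522125888624752, 'zscore': 9.982793522267206,
         'combined_score': 53.89483679511331,
         'genes': ['SGPL1', 'SPHK2', 'SGPP2'],
         'adj_pvalue': 0.22671935701100768},
    ],
}

def significant_terms(enrichments, key='pvalue', cutoff=0.01):
    """Collect (ontology, term, value) for records whose `key` is below `cutoff`."""
    hits = []
    for ontology, records in enrichments.items():
        for record in records:
            if record[key] < cutoff:
                hits.append((ontology, record['term'], record[key]))
    return sorted(hits, key=lambda hit: hit[2])  # most significant first

raw_hits = significant_terms(enrichments)                    # filter on raw p-value
adjusted_hits = significant_terms(enrichments, key='adj_pvalue')  # filter on adjusted p-value
print(raw_hits)
print(adjusted_hits)
```

Note that the example term passes a raw p-value cutoff of 0.01 but not the same cutoff on the adjusted p-value.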
To plot:
rna_model.plot_enrichments(12, show_top=10, enrichments_per_row=4)
If you're interested in the top topics for a certain gene:
rna_model.rank_modules('FLG')[-5:]
[(22, array([0.76390576], dtype=float32)), (7, array([3.301608], dtype=float32)), (19, array([3.6907465], dtype=float32)), (12, array([4.829633], dtype=float32)), (13, array([6.47767], dtype=float32))]
If you save either the ATAC or RNA model after getting enrichment results, that data will be saved with it and will be available upon reload.
rna_model.save('data/shareseq/test_save.pth')
rna_model = ExpressionModel.load('data/shareseq/test_save.pth')
rna_model.plot_enrichments(12, show_top = 10, enrichments_per_row=4)
WARNING:kladiv2.topic_model.base:Cuda unavailable. Will not use GPU speedup while training. INFO:kladiv2.topic_model.base:Moving model to device: cpu
To analyze ATAC-seq topics, we first need to find motif hits in our peaks using the find_motifs function.
help(find_motifs)
Help on function find_motifs in module kladiv2.tools.motif_scan: find_motifs(adata, chrom='chr', start='start', end='end', pvalue_threshold=0.0001, *, genome_fasta, factor_type='motifs')
Pass the ATAC adata object to the method, and using the chrom, start, end kwargs, indicate which columns of the data contain that information. Finally, pass the file location of a fasta for your genome to scan sequences.
This function will add binding data to your adata object. Save it once it's done.
find_motifs(atac_data, genome_fasta='/Users/alynch/genomes/mm10/mm10.fa')
INFO:kladiv2.tools.motif_scan:Getting peak sequences ... 334124it [01:05, 5123.87it/s] INFO:kladiv2.tools.motif_scan:Scanning peaks for motif hits with p >= 0.0001 ... INFO:kladiv2.tools.motif_scan:Building motif background models ... INFO:kladiv2.tools.motif_scan:Starting scan ... INFO:kladiv2.tools.motif_scan:Found 1000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 2000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 3000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 4000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 5000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 6000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 7000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 8000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 9000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 10000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 11000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 12000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 13000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 14000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 15000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 16000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 17000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 18000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 19000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 20000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 21000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 22000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 23000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 24000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 25000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 26000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 27000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 28000000 motif hits ... 
INFO:kladiv2.tools.motif_scan:Found 29000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 30000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 31000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 32000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 33000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 34000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 35000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 36000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 37000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 38000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 39000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 40000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 41000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 42000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 43000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 44000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 45000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 46000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 47000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 48000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 49000000 motif hits ... INFO:kladiv2.tools.motif_scan:Found 50000000 motif hits ... INFO:kladiv2.tools.motif_scan:Formatting hits matrix ... INFO:kladiv2.core.adata_interface:Added key to varm: motifs_hits INFO:kladiv2.core.adata_interface:Added key to uns: motifs_hits
Not every motif-associated factor is expressed in our data. You can filter out non-expressed or lowly-expressed TFs using the function below. This function simply masks factors that aren't found in the provided list. If you want to change your criteria later, you can just provide a new list.
utils.mask_non_expressed_factors(atac_data, expressed_genes = rna_data.var_names)
INFO:kladiv2.tools.utils:Found 555 factors in expression data.
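Conceptually, the mask is a membership check of each factor against the list you provide. The sketch below illustrates that idea only; expressed_mask and both name lists are hypothetical, and the real function may match names differently.

```python
def expressed_mask(factor_names, expressed_genes):
    """Return one boolean per factor: True if the factor is in the expressed list."""
    expressed = set(expressed_genes)  # set gives fast membership checks
    return [name in expressed for name in factor_names]

# Hypothetical inputs for illustration.
factors = ['LEF1', 'RUNX2', 'NFIL3']
expressed = ['LEF1', 'NFIL3', 'KRT35']

mask = expressed_mask(factors, expressed)
kept = [name for name, keep in zip(factors, mask) if keep]
print(mask, kept)
```

Since the mask is recomputed from the list each time, changing your expression criteria only requires passing a different list.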
Don't forget to save! The steps above are time-consuming but only need to be executed once.
Finally, with motifs scanned, we can find enrichments in the ATAC-seq topics. Often, it's most interesting to compare the enrichments of TFs during transitions in cell state. Below, I compare enrichments moving from undifferentiated TAC cells (topic 0) to pre-branch TAC cells (topic 15).
rna_data.obs['is_follicle'] = rna_data.obsm['X_umap'][:,0] < 1.5 #isolate HF subsystem
HF_rna_data = rna_data[rna_data.obs.is_follicle].copy()
sc.pp.neighbors(HF_rna_data, use_rep='joint_umap_features', metric='manhattan')
sc.tl.umap(HF_rna_data, min_dist = 0.1, negative_sample_rate=5)
HF_atac_data = atac_data[HF_rna_data.obs_names].copy()
HF_atac_data.obsm['X_umap'] = HF_rna_data.obsm['X_umap']
sc.pl.umap(HF_atac_data, color = ['topic_0','topic_15'], color_map='inferno', frameon=False)
Let's mask for factors which are highly expressed in the HF:
HF_rna_data.var['log10_counts'] = np.log10(np.array(HF_rna_data.X.sum(0)).reshape(-1) + 1) #see how many counts in HF
ax = sns.histplot(data = HF_rna_data.var, x = 'log10_counts', color = 'lightgrey')
ax.vlines(2, ymin = 0, ymax = 1000, color = 'black')
sns.despine()
utils.mask_non_expressed_factors(HF_atac_data, expressed_genes=HF_rna_data[:, HF_rna_data.var.log10_counts >= 2].var_names)
INFO:kladiv2.tools.utils:Found 318 factors in expression data.
This function performs a Fisher's exact test to find TFs whose motif hits are preferentially enriched in peaks in the top quantile of activation for a given topic.
atac_model.get_enriched_TFs(HF_atac_data, module_num=0)
_ = atac_model.get_enriched_TFs(HF_atac_data, module_num=15)
Finding enrichments: 100%|██████████| 318/318 [00:02<00:00, 128.12it/s] Finding enrichments: 100%|██████████| 318/318 [00:02<00:00, 116.60it/s]
The test function above returns results, but as with the RNA model, they can be accessed from the ATAC model later using the get_enrichments method.
results = atac_model.get_enrichments(15)
pd.DataFrame(results).sort_values('pval').head()
| | id | name | parsed_name | pval | test_statistic |
|---|---|---|---|---|---|
| 125 | MA0523.1 | TCF7L2 | TCF7L2 | 0.0 | 1.651095 |
| 254 | MA0885.1 | DLX2 | DLX2 | 0.0 | 2.792547 |
| 31 | MA0708.1 | MSX2 | MSX2 | 0.0 | 2.145216 |
| 266 | MA0879.1 | DLX1 | DLX1 | 0.0 | 2.555530 |
| 239 | MA0151.1 | ARID3A | ARID3A | 0.0 | 1.493923 |
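For intuition about the kind of test behind these p-values, here is a one-sided Fisher's exact test on a 2x2 table of peak counts, written with only the standard library. It is a sketch of the general technique, not the get_enriched_TFs implementation; the function name and the counts are made up, and it does not reproduce the test_statistic column.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    a: top-quantile peaks with a motif hit    b: top-quantile peaks without
    c: other peaks with a motif hit           d: other peaks without
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    n_total = a + b + c + d
    n_hits = a + c   # all peaks with a motif hit
    n_top = a + b    # all top-quantile peaks
    upper = min(n_hits, n_top)
    numerator = sum(comb(n_hits, k) * comb(n_total - n_hits, n_top - k)
                    for k in range(a, upper + 1))
    return numerator / comb(n_total, n_top)

# Hypothetical counts: 3 of 4 top peaks carry the motif, 1 of 4 other peaks do.
pval = fisher_exact_greater(3, 1, 1, 3)
print(pval)
```

With real data the counts are in the thousands, which is how p-values can underflow to the 0.0 shown in the table.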
Let's plot enrichment of TFs transitioning from topic 0 to 15. Just for fun, let's color the TFs by whether they are highly expressed in the Cortex or Medulla topic. Coloring is a bit tricky. To color a TF on the plot, we have to pass a dictionary where the keys are TF names, and the values are the value to be mapped to a color. If a TF is on the plot, but not in the dictionary, it will be assigned to the na_color.
cortex_or_medulla_factors = np.union1d(rna_model.get_top_genes(10, top_n = 400), rna_model.get_top_genes(5, top_n = 400))
hue = {k : 'Expressed in Cortex/Medulla' for k in cortex_or_medulla_factors}
atac_model.plot_compare_module_enrichments(15, 0, pval_threshold=(1e-100, 1e-100), hue = hue, na_color='lightgrey', palette = 'Set1',
label_closeness=2.5, figsize=(10,7))
plt.tight_layout()
From the plot above, we can see that many TFs are important to both topics (DLX, LHX, MSX, HOX, etc.), while NFIL3 might be more influential in topic 0. Strikingly, LEF1, RUNX2, TCF7, and TCF7L2 are very specific for topic 15, suggesting these factors are guiding TACs to their branch point decision. LEF1 is also highly expressed in the Cortex or Medulla.
If we want to visualize the influence of factors cell-by-cell, we can compute a motif scores adata:
help(atac_model.get_motif_scores)
Help on method get_motif_scores in module kladiv2.topic_model.accessibility_model: get_motif_scores(adata, factor_type='motifs', mask_factors=True, key='X_topic_compositions', batch_size=512) method of kladiv2.topic_model.accessibility_model.AccessibilityModel instance
motif_scores = atac_model.get_motif_scores(HF_atac_data)
INFO:kladiv2.core.adata_interface:Fetching key X_topic_compositions from obsm INFO:kladiv2.topic_model.base:Predicting latent variables ... Imputing features: 100%|██████████| 13/13 [01:03<00:00, 4.91s/it]
motif_scores.write_h5ad('data/shareseq/HF_motif_scores.h5ad')
... storing 'name' as categorical ... storing 'parsed_name' as categorical
This takes a while, so be sure to save it once finished. The method above returns an entirely new adata object where the features are factors, and each value is the normalized score of that factor for that cell. If you transfer the UMAP to this new anndata, you can investigate patterns of influence for each TF:
motif_scores.obsm['X_umap'] = HF_rna_data.obsm['X_umap']
motif_scores.var['name'] = motif_scores.var.name.astype(str)
motif_scores.var = motif_scores.var.set_index('name')
sc.pl.umap(HF_rna_data, color = ['LEF1','RUNX2','DLX3','NFIL3'], **raw_umap())
sc.pl.umap(motif_scores, color = ['LEF1','RUNX2','DLX3','NFIL3'], frameon=False, color_map='inferno')
Next, we can do pseudotime analysis, which has an all-new functional API. The workflow is now:
diffmap --> get_transport_map --> get_branch_probabilities --> get_tree_structure
To start, we must produce a denoised diffusion map representation of the data, and for this, we can use existing scanpy functions.
sc.pp.neighbors(HF_rna_data, use_rep='joint_umap_features', n_pcs=None) #works best with 30 neighbors
sc.tl.diffmap(HF_rna_data) #produce diffmap using joint representation distance matrix (already computed)
pseudotime.normalize_diffmap(HF_rna_data) # select number of diffmap components using eigengap and rescale
sc.pp.neighbors(HF_rna_data, use_rep='X_diffmap', key_added='X_diffmap', n_neighbors = 30) #make a new distance matrix in rescaled diffmap space, works best with 30 or so neighbors
INFO:root:Added key to obsm: X_diffmap, normalized diffmap with 4 components.
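The comment on normalize_diffmap mentions choosing the number of components with an eigengap. A generic version of that heuristic cuts at the largest drop between consecutive eigenvalues. The sketch below shows the heuristic only; normalize_diffmap may handle the trivial first eigenvalue and rescaling differently, and the eigenvalues are made up.

```python
def eigengap_num_components(eigenvalues):
    """Number of leading components to keep: cut at the largest gap
    between consecutive eigenvalues sorted in decreasing order."""
    vals = sorted(eigenvalues, reverse=True)
    if len(vals) < 2:
        return len(vals)
    gaps = [vals[i] - vals[i + 1] for i in range(len(vals) - 1)]
    largest_gap_index = max(range(len(gaps)), key=gaps.__getitem__)
    return largest_gap_index + 1  # keep everything before the largest gap

# Hypothetical eigenvalues with a clear drop after the fourth one.
example_eigenvalues = [0.98, 0.95, 0.93, 0.90, 0.60, 0.58, 0.55]
n_components = eigengap_num_components(example_eigenvalues)
print(n_components)
```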
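To start with get_transport_map, we have to choose a start cell. Here, I choose the cell with the highest composition of ATAC topic 10, which corresponds with the root of the differentiation in the ORS. As its log output shows, the method calculates diffusion pseudotime, calculates a transport map, and adds mira_pseudotime, transport_map, and iroot to the adata.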
start_cell = HF_atac_data.obs.topic_10.argmax()
pseudotime.get_transport_map(HF_rna_data, start_cell = start_cell, n_waypoints = 2500)
INFO:kladiv2.tools.pseudotime:Calculating diffusion pseudotime ... INFO:kladiv2.tools.pseudotime:Calculating transport map ... INFO:kladiv2.core.adata_interface:Added key to obs: mira_pseudotime INFO:kladiv2.core.adata_interface:Added key to obsp: transport_map INFO:kladiv2.core.adata_interface:Added key to uns: iroot
Using the eigenvectors of the transition matrix, we may also find possible terminal states.
terminal_cellnums = pseudotime.find_terminal_cells(HF_rna_data)
INFO:kladiv2.tools.pseudotime:Found 3 terminal states from stationary distribution.
The states found aren't always that reasonable, so plot them and make sure they make sense:
fig,ax = plt.subplots(1,1,figsize=(6,4))
sc.pl.umap(HF_rna_data, color = 'mira_pseudotime', **topic_umap(color_map='viridis'), show = False, ax = ax)
for terminal_cell in list(terminal_cellnums) + [start_cell]:
ax.text(*HF_rna_data.obsm["X_umap"][terminal_cell], str(terminal_cell))
plt.show()
Next, we must provide our terminal cells so that we can calculate branch probabilities. To the terminal_cells arg, provide a dictionary where the keys are lineage names and the values are the terminal cell number.
pseudotime.get_branch_probabilities(HF_rna_data, terminal_cells= {
'Medulla' : 3061,
'Cortex' : 1387,
'IRS' : 2318
})
INFO:kladiv2.core.adata_interface:Added key to obsm: branch_probs INFO:kladiv2.core.adata_interface:Added key to uns: lineage_names
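One textbook way to obtain such probabilities is to treat the terminal cells as absorbing states of a Markov chain and compute, for every other state, the chance of ending in each one. The sketch below does this by fixed-point iteration on a four-state toy chain. It is an assumption-laden illustration and not necessarily what get_branch_probabilities does; the function name, the chain, and the lineage names 'A' and 'B' are hypothetical.

```python
def absorption_probabilities(transitions, terminal_states, n_iter=200):
    """Estimate, for every state, the probability of ending in each terminal state.

    transitions: dict state -> dict of next_state -> probability (rows sum to 1)
    terminal_states: dict lineage_name -> state, treated as absorbing
    """
    names = list(terminal_states)
    absorbing = {state: name for name, state in terminal_states.items()}
    # Terminal states are certain of their own lineage; others start at zero.
    probs = {s: {n: (1.0 if absorbing.get(s) == n else 0.0) for n in names}
             for s in transitions}
    for _ in range(n_iter):
        updated = {}
        for s, row in transitions.items():
            if s in absorbing:
                updated[s] = probs[s]
            else:
                # Average the neighbors' current estimates, weighted by transition probability.
                updated[s] = {n: sum(p * probs[t][n] for t, p in row.items())
                              for n in names}
        probs = updated
    return probs

# Toy chain: states 0 and 1 are transient, states 2 and 3 are terminal.
toy_transitions = {
    0: {1: 0.5, 2: 0.5},
    1: {0: 0.5, 3: 0.5},
    2: {2: 1.0},
    3: {3: 1.0},
}
branch_probs = absorption_probabilities(toy_transitions, {'A': 2, 'B': 3})
print(branch_probs[0], branch_probs[1])
```

Each state's probabilities sum to one across lineages, which is what makes them usable as per-cell branch probabilities.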
Finally, the get_tree_structure function decomposes the Markov chain into a series of discrete bifurcations. This representation makes it possible to plot the pseudotemporal differentiation using streams.
The function has one parameter: threshold, which changes the tolerance to divergence at a branch site. If threshold is high, branches occur later, while if threshold is low, the algorithm is very sensitive to branches. In general, values from 0.1 to 1 work well, but you may adjust this higher to ensure a certain size of pre-branch populations.
pseudotime.get_tree_structure(HF_rna_data, threshold=0.5)
INFO:kladiv2.core.adata_interface:Added key to obs: tree_states INFO:kladiv2.core.adata_interface:Added key to uns: tree_state_names INFO:kladiv2.core.adata_interface:Added key to uns: connectivities_tree
sc.pl.umap(HF_rna_data, color = 'tree_states', frameon=False, palette='Set2')
... storing 'tree_states' as categorical
We can now plot topics along the pseudotime. The plot_stream function does a lot of different things, but to start, you just provide the adata that you ran the previous pseudotime methods on. Then, to the data arg, pass the names of the columns that you want to plot, in this case, topics that are influential in the HF subsystem.
Besides various standard plotting options (palette, ax, title), you can also change the style arg (this tutorial uses stream, swarm, and scatter), and you can set split to True if you want multiple plots instead of stacked streams! Lastly, I pass adjusted_pseudotime to the pseudotime_key arg, but if you leave it unset, it defaults to mira_pseudotime.
fig, ax = plt.subplots(2,1,figsize=(15,10))
plot_stream(HF_rna_data[HF_rna_data.obs.mira_pseudotime >= 7], data = ['topic_6','topic_10','topic_5','topic_11','topic_4','topic_9'], max_bar_height=0.6,
style='stream', log_pseudotime=False, pseudotime_key = 'adjusted_pseudotime', ax = ax[0], palette='Set2', window_size=101, linewidth=0.2, title = 'Expression Topics')
plot_stream(HF_rna_data[HF_rna_data.obs.mira_pseudotime >= 7], data = ['topic_{}_atac'.format(str(x)) for x in [0,15,14,17,23,21,11]],
max_bar_height=0.6, hide_feature_threshold=0, style='stream', log_pseudotime=False, pseudotime_key = 'adjusted_pseudotime',
ax = ax[1], palette='Set3', window_size=101, linewidth=0.2, title = 'Accessibility Topics')
Interestingly, expression topics around the branch point don't change very much, but accessibility topics show major changes prior to the branch point. Let's plot some key accessibility and expression topics at the same time. Here, I pass a custom palette as a list so that ATAC topics are Blues and expression topics are Greys.
plot_stream(HF_rna_data[HF_rna_data.obs.mira_pseudotime >= 7], data = ['topic_6', 'topic_10', 'topic_5', 'topic_4','topic_0_atac', 'topic_15_atac', 'topic_17_atac'], max_bar_height=0.7,
style='stream', log_pseudotime=False, size = 5, pseudotime_key = 'adjusted_pseudotime', figsize=(15,6), center_baseline=True, scale_features=False,
palette=['lightgrey','darkgrey','lightslategrey', 'silver', 'cornflowerblue','skyblue','royalblue'], window_size=101, linewidth=0.3)
If you want to see only parts of the tree, you can just subset the input anndata. For example, if I wanted to see only cells which are going to differentiate into Cortex or Medulla cells, I can subset based on the tree_states column found using the get_tree_structure function.
plot_stream(HF_rna_data[HF_rna_data.obs.tree_states.isin(['Cortex','Medulla','Medulla, Cortex'])],
data = ['topic_6', 'topic_10', 'topic_5', 'topic_4','topic_0_atac', 'topic_15_atac', 'topic_17_atac'], max_bar_height=0.7, scaffold_linecolor='white',
style='stream', log_pseudotime=False, size = 5, pseudotime_key = 'adjusted_pseudotime', figsize=(10,5), center_baseline=True, scale_features=False,
palette=['lightgrey','darkgrey','lightslategrey', 'silver', 'cornflowerblue','skyblue', 'royalblue'], window_size=101, linewidth=0.3)
And if you only want to see one single lineage, select from the tree_states column, including only states that contain the lineage-of-interest. To format the plot as a linear differentiation, set tree_structure to False.
plot_stream(HF_rna_data[np.logical_and(HF_rna_data.obs.tree_states.str.contains('Cortex'), HF_rna_data.obs.mira_pseudotime > 6)],
data = ['topic_6', 'topic_10','topic_0_atac', 'topic_15_atac', 'topic_23_atac'], max_bar_height=0.7,
style='stream', log_pseudotime=True, size = 5, figsize=(10,2), center_baseline=False,
palette=['lightgrey','lightslategrey', 'cornflowerblue','skyblue','royalblue'], window_size=301, linewidth=0.3, tree_structure=False)
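The lineage selection used in that call can be expressed without pandas as well. The sketch below flags states whose comma-separated name includes a lineage; states_containing is a hypothetical helper, and the state names follow the tree_states values used earlier in this tutorial.

```python
def states_containing(tree_states, lineage):
    """Flag each state whose comma-separated name includes the given lineage."""
    flags = []
    for state in tree_states:
        parts = [part.strip() for part in state.split(',')]
        flags.append(lineage in parts)  # exact match on a part, not a substring
    return flags

# State names as they appear in this tutorial's tree_states column.
example_states = ['Medulla, Cortex', 'Cortex', 'Medulla', 'IRS']
cortex_flags = states_containing(example_states, 'Cortex')
print(cortex_flags)
```

Matching on whole parts rather than substrings avoids accidental hits if one lineage name happens to be contained in another.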
One may also visualize data using the swarm style:
plot_stream(HF_rna_data[HF_rna_data.obs.mira_pseudotime >= 7], data = ['LEF1','JAG1','JAG2'], layers = 'normalized', max_bar_height=0.7, palette = "Reds", max_swarm_density = 250,
style='swarm', log_pseudotime=False, pseudotime_key = 'adjusted_pseudotime', aspect=2, height=4, split = True, size = 8, linecolor='grey', linewidth=0.3)
Or with split streams:
plot_stream(HF_rna_data[HF_rna_data.obs.mira_pseudotime >= 7], data = ['LEF1','JAG1','JAG2'], layers='imputed', max_bar_height=0.99, palette = "plasma_r", max_swarm_density = 100,
style='stream', log_pseudotime=False, pseudotime_key = 'adjusted_pseudotime', aspect=2., height=4, split = True, scale_features = True)
Lastly, the scatter mode is useful for making two- or three-way comparisons between features, for example TF expression vs. influence or gene expression vs. accessibility:
HF_rna_data.obs['LEF1_motif_score'] = motif_scores.obs_vector('LEF1')
Below, I show an example where I pass a list to the layers argument. This just tells plot_stream where to look for columns which have multiple representations (raw, imputed, normalized, etc.). Passing None tells the function to look in the default place. If you pass just a string, that layer will be used as the location of all features.
plot_stream(HF_rna_data[HF_rna_data.obs.mira_pseudotime >= 6], data = ['LEF1', 'LEF1_motif_score'], layers= ['imputed', None],
log_pseudotime=False, style = 'scatter', palette=['slategrey','cornflowerblue'], figsize=(15,6),
scale_features=True, size = 3)
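The pairing between features and layers can be pictured as a simple broadcast: a single value applies to every feature, while a list is matched one-to-one. The sketch below illustrates that behavior as described in the text; expand_layers is a hypothetical helper, not part of plot_stream.

```python
def expand_layers(data, layers=None):
    """Pair every feature name with the layer it should be read from."""
    if layers is None or isinstance(layers, str):
        layers = [layers] * len(data)  # one value applies to all features
    if len(layers) != len(data):
        raise ValueError("layers must be a single value or match data in length")
    return list(zip(data, layers))

per_feature = expand_layers(['LEF1', 'LEF1_motif_score'], ['imputed', None])
broadcast = expand_layers(['LEF1', 'JAG1', 'JAG2'], 'imputed')
print(per_feature)
print(broadcast)
```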
To save this pseudotime model of the data, we can just save the adata:
HF_rna_data.write('data/shareseq/HF/pseudotime_rna.h5ad')
Next, we can train RP models to study the timing of accessibility vs. expression changes, and also to do cis/trans testing.
First, we need to find distances between peaks and genes to train the RP functions using the get_distance_to_TSS function. This function will add a "distance_to_TSS" matrix to the ATAC adata showing the distance (positive for downstream, negative for upstream) between every peak and every gene in a provided annotation.
Before that, we need to get a gene annotation for the data. A good way to get all of the needed information (a dataframe containing chr, start, end, gene_name, strand) is to go to the UCSC Table Browser.
Done!
We'll read in the annotation data, do a little pre-processing, then use it to get peak-gene distances. Once I figure out the API for UCSC, I can automate this or host it for select species.
tss_data = pd.read_csv('/Users/alynch/genomes/mm10/canonical_tss.tsv', sep = '\t')
tss_data.head(3)
| | #mm10.knownGene.name | mm10.knownGene.chrom | mm10.knownGene.strand | mm10.knownGene.txStart | mm10.knownGene.txEnd | mm10.kgXref.geneSymbol | mm10.knownCanonical.chromStart | mm10.knownCanonical.chromEnd | mm10.knownCanonical.transcript |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSMUST00000193812.1 | chr1 | + | 3073252 | 3074322 | 4933401J01Rik | 3073252.0 | 3074322.0 | ENSMUST00000193812.1 |
| 1 | ENSMUST00000082908.1 | chr1 | + | 3102015 | 3102125 | Gm26206 | 3102015.0 | 3102125.0 | ENSMUST00000082908.1 |
| 2 | ENSMUST00000159265.1 | chr1 | - | 3206522 | 3215632 | Xkr4 | NaN | NaN | NaN |
tss_data.columns = tss_data.columns.str.split('.').str.get(-1) # parse the column titles to get rid of table name
tss_data.geneSymbol = tss_data.geneSymbol.str.upper() # make the gene symbols uppercase
tss_data = tss_data.dropna() # drop gene entries that do not correspond with the gene's canonical splice variant
tss_data = tss_data.drop_duplicates('geneSymbol') # drop duplicate gene symbols to remove ambiguity. Most gene symbols correspond to a unique locus, but some don't.
# If you've done your whole analysis up to this point with Ensembl IDs, you could potentially test specific loci for these genes.
tss_data.head(3)
| | name | chrom | strand | txStart | txEnd | geneSymbol | chromStart | chromEnd | transcript |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSMUST00000193812.1 | chr1 | + | 3073252 | 3074322 | 4933401J01RIK | 3073252.0 | 3074322.0 | ENSMUST00000193812.1 |
| 1 | ENSMUST00000082908.1 | chr1 | + | 3102015 | 3102125 | GM26206 | 3102015.0 | 3102125.0 | ENSMUST00000082908.1 |
| 4 | ENSMUST00000192857.1 | chr1 | + | 3252756 | 3253236 | GM18956 | 3252756.0 | 3253236.0 | ENSMUST00000192857.1 |
Now we can get TSS distances:
help(get_distance_to_TSS)
Help on function get_distance_to_TSS in module kladiv2.tools.connect_genes_peaks:

get_distance_to_TSS(adata, tss_data=None, peak_chrom='chr', peak_start='start', peak_end='end', gene_id='geneSymbol', gene_chrom='chrom', gene_start='txStart', gene_end='txEnd', gene_strand='strand', max_distance=600000.0, promoter_width=3000, *, genome_file)
get_distance_to_TSS(atac_data, # ATAC data
tss_data = tss_data, # annotation dataframe
peak_chrom='chr', # column from ATAC data with peak chromosome locs
peak_start= 'start',
peak_end = 'end',
gene_id = 'geneSymbol', # column from TSS data with gene name,
gene_chrom = 'chrom',
gene_strand = 'strand',
gene_start = 'txStart',
gene_end = 'txEnd',
genome_file= '/Users/alynch/genomes/mm10/mm10.genome' # chromosome lengths of genome
)
INFO:kladiv2.tools.connect_genes_peaks:Finding peak intersections with promoters ...
INFO:kladiv2.tools.connect_genes_peaks:Calculating distances between peaks and TSS ...
INFO:kladiv2.tools.connect_genes_peaks:Masking other genes' promoters ...
INFO:kladiv2.core.adata_interface:Added key to var: distance_to_TSS
INFO:kladiv2.core.adata_interface:Added key to uns: distance_to_TSS_genes
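To make the strand handling concrete, here is a minimal sketch of a strand-aware peak-to-TSS distance. It is not the code inside get_distance_to_TSS: the helper names, the use of the peak midpoint, and the sign convention (negative means upstream) are all assumptions made for illustration. The one firm fact it relies on is that UCSC tables store txStart < txEnd regardless of strand, so the TSS is txStart on the + strand and txEnd on the - strand.

```python
def tss_position(strand, tx_start, tx_end):
    """Return the TSS coordinate: txStart on '+', txEnd on '-'."""
    if strand == '+':
        return tx_start
    if strand == '-':
        return tx_end
    raise ValueError(f"unknown strand: {strand!r}")


def signed_distance_to_tss(peak_start, peak_end, strand, tx_start, tx_end):
    """Signed distance (bp) from the peak midpoint to the TSS.

    Assumed convention: negative = upstream of the TSS, positive = downstream,
    both measured in the direction of transcription.
    """
    tss = tss_position(strand, tx_start, tx_end)
    midpoint = (peak_start + peak_end) // 2
    offset = midpoint - tss
    return offset if strand == '+' else -offset


# Xkr4 from the annotation table: minus strand, txStart=3206522, txEnd=3215632
xkr4 = ('-', 3206522, 3215632)

# Hypothetical peak just past txEnd; on the minus strand that is upstream
print(signed_distance_to_tss(3216000, 3216400, *xkr4))  # -568
```

The sign matters later because the RP model learns separate upstream and downstream decay distances.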
Next, we can train RP Models. To instantiate a modeler, pass an RNA model, ATAC model, and a list of genes that you want to analyze.
I recommend modeling all highly-variable genes, as well as genes which are in the top 200 for each topic to really cover all the bases. In this example, I'll just model a few genes that are interesting.
cis_model = CisModel(expr_model=rna_model, accessibility_model=atac_model,
genes = ['LEF1','SHH','NFKB2','RELB','EGR2','WNT3'])
To fit the models, pass the RNA and ATAC adatas. To start, the method will check for certain fields in the adatas, and will calculate them if not detected. This can take a considerable amount of time, so you can resave the adatas afterwards to skip this step if you want to run it again.
Also, if you're training a lot of genes, it can be good to prepare for random errors that might occur. You can pass a SaveCallback object to the fit method, which will save each model after it's trained so you don't lose your progress. Just provide the callback with a filename prefix.
from kladiv2.cis_model.cis_model import SaveCallback
cis_model.fit(
atac_adata= atac_data,
expr_adata=rna_data,
callback=SaveCallback('data/shareseq/HF/test_models_')
)
Fitting models: 100%|██████████| 6/6 [00:33<00:00, 5.63s/it]
<kladiv2.cis_model.cis_model.CisModel at 0x7fdbc1b06310>
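The idea behind saving from a callback can be sketched in plain Python. Everything in this block is hypothetical (JsonSaveCallback, fit_all, toy_fit, and the JSON file layout); it does not reproduce how kladiv2's SaveCallback stores models. It only shows why saving after each gene protects earlier work when a later gene fails.

```python
import json
import os
import tempfile


class JsonSaveCallback:
    """Hypothetical callback: writes each finished model to '<prefix><gene>.json'."""

    def __init__(self, prefix):
        self.prefix = prefix

    def __call__(self, gene, params):
        with open(f"{self.prefix}{gene}.json", "w") as handle:
            json.dump(params, handle)


def fit_all(genes, fit_one, callback=None):
    """Fit one model per gene, invoking the callback right after each fit."""
    fitted = {}
    for gene in genes:
        fitted[gene] = fit_one(gene)
        if callback is not None:
            # The result is on disk before the next gene starts training
            callback(gene, fitted[gene])
    return fitted


def toy_fit(gene):
    """Stand-in for real training; fails on one gene to mimic a random error."""
    if gene == "SHH":
        raise RuntimeError("simulated failure")
    return {"gene": gene, "bias": 0.0}


checkpoint_dir = tempfile.mkdtemp()
prefix = os.path.join(checkpoint_dir, "test_models_")
try:
    fit_all(["LEF1", "WNT3", "SHH", "EGR2"], toy_fit, JsonSaveCallback(prefix))
except RuntimeError:
    pass  # LEF1 and WNT3 were already saved before the failure

saved_files = sorted(os.listdir(checkpoint_dir))
print(saved_files)  # ['test_models_LEF1.json', 'test_models_WNT3.json']
```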
Alternatively, you can save the models after fitting:
cis_model.save('data/shareseq/HF/test_models_')
And reload them later:
cis_model.load('data/shareseq/HF/test_models_')
If you want to access the parameters of a certain model, use:
cis_model.get_model('WNT3').get_normalized_params()
{'a': array([0.31201848, 2.8915608 , 0.31894672], dtype=float32), 'logdistance': array([ 8.52299, 28.5409 ], dtype=float32), 'theta': array(0.707018, dtype=float32), 'gamma': array(0.9881284, dtype=float32), 'bias': array(-0.28603807, dtype=float32)}
We can see the decay distances for WNT3 are about 8.5 kb upstream and 28.5 kb downstream.
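To get a feel for what these numbers imply, the sketch below converts a decay distance into a per-peak weight. It assumes a half-life form, where a peak's influence halves every decay distance. That functional form and the decay_weight helper are assumptions for illustration, not a statement of what the RP model computes internally.

```python
def decay_weight(distance_bp, decay_kb):
    """Weight of a peak `distance_bp` from the TSS; halves every `decay_kb` kilobases.

    The half-life form is an assumption made for illustration.
    """
    return 0.5 ** (abs(distance_bp) / (decay_kb * 1000.0))


# Decay distances reported for WNT3 (in kb), taken from the 'logdistance' entry
upstream_kb, downstream_kb = 8.52299, 28.5409

for distance in (1_000, 10_000, 50_000):
    up = decay_weight(distance, upstream_kb)
    down = decay_weight(distance, downstream_kb)
    print(f"{distance:>6} bp  upstream weight {up:.3f}  downstream weight {down:.3f}")
```

Under this assumption, a peak 10 kb downstream of WNT3 keeps considerably more influence than one 10 kb upstream, because the downstream decay distance is more than three times longer.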
The RP model has a couple of useful functions.
Notably, you can use any data that has been labeled with get_distance_to_TSS to make predictions, since the trained models are data-agnostic. Here, I predict expression of my genes on the hair follicle subset of the data.
cis_model.get_logp(atac_adata=HF_atac_data, expr_adata=HF_rna_data)
INFO:kladiv2.core.adata_interface:Fetching key X_topic_compositions from obsm
INFO:kladiv2.topic_model.base:Predicting latent variables ...
Calculating softmax summary data: 100%|██████████| 13/13 [00:00<00:00, 71.92it/s]
INFO:kladiv2.core.adata_interface:Fetching key X_topic_compositions from obsm
INFO:kladiv2.topic_model.base:Predicting latent variables ...
Calculating softmax summary data: 100%|██████████| 13/13 [00:08<00:00, 1.54it/s]
Getting logp(Data): 0%| | 0/6 [00:00<?, ?it/s]/Users/alynch/projects/multiomics/kladi/kladiv2/cis_model/cis_model.py:344: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(X, requires_grad=False)
Getting logp(Data): 100%|██████████| 6/6 [00:00<00:00, 11.06it/s]
INFO:kladiv2.core.adata_interface:Added layer: cis_logp
cis_model.predict(atac_adata=HF_atac_data, expr_adata=HF_rna_data)
Predicting expression: 100%|██████████| 6/6 [00:00<00:00, 11.67it/s]
INFO:kladiv2.core.adata_interface:Added layer: cis_prediction
Now we can plot from these layers:
sc.pl.umap(HF_rna_data, color = ['LEF1', 'WNT3'], layer = 'imputed', **topic_umap(), title=['LEF1 Expression', 'WNT3 Expression'])
sc.pl.umap(HF_rna_data, color = ['LEF1', 'WNT3'], layer = 'cis_prediction', **topic_umap(color_map='viridis'), title = ['LEF1 Cis Prediction', 'WNT3 Cis Prediction'])
sc.pl.umap(HF_rna_data, color = ['LEF1', 'WNT3'], layer = 'cis_logp', **topic_umap(color_map='viridis'), vmin = -3, title = ['LEF1 logP', 'WNT3 logP'])
WNT3's Cis model does pretty well! But LEF1 appears difficult to model. The Cis model expects higher expression at the branch, while expression actually peaks at the tip of the cortex lineage. Looking at the logP(data) for LEF1 shows the model is most surprised by the data at the tip of the cortex lineage. Let's visualize this with a stream:
plot_stream(HF_rna_data[HF_rna_data.obs.mira_pseudotime > 6.5], data = ['LEF1', 'LEF1'], layers=['cis_prediction','imputed'],
palette=['slategrey','indianred'], log_pseudotime=True, figsize=(13,5), style = 'scatter', size = 1, scale_features=True)
LEF1 shows some interesting behavior. Just how bad is LEF1's Cis model compared to WNT3's? We can train a Trans model to find out.
The arguments to a TransModel object are exactly the same as for CisModel, except it takes an additional arg: initialization_model. To lower the variance of the trans model and ensure convergence, we initialize the trans model with the learned parameters of the cis model.
trans_model = TransModel(expr_model=rna_model, accessibility_model= atac_model, genes= ['LEF1','SHH','NFKB2','RELB','EGR2','WNT3'],
initialization_model=cis_model)
trans_model.fit(atac_adata=atac_data, expr_adata=rna_data)
Fitting models: 100%|██████████| 6/6 [00:22<00:00, 3.82s/it]
<kladiv2.cis_model.cis_model.TransModel at 0x7fd79761cfd0>
trans_model.predict(atac_adata=HF_atac_data, expr_adata=HF_rna_data)
Predicting expression: 100%|██████████| 6/6 [00:00<00:00, 11.61it/s]
INFO:kladiv2.core.adata_interface:Added layer: trans_prediction
trans_model.get_logp(atac_adata=HF_atac_data, expr_adata=HF_rna_data)
Getting logp(Data): 0%| | 0/6 [00:00<?, ?it/s]/Users/alynch/projects/multiomics/kladi/kladiv2/cis_model/cis_model.py:344: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(X, requires_grad=False)
Getting logp(Data): 100%|██████████| 6/6 [00:00<00:00, 11.37it/s]
INFO:kladiv2.core.adata_interface:Added layer: trans_logp
The trans model represents the best estimate of gene expression we could ever make using chromatin-based features. Let's see how it compares to true and cis-predicted expression.
fig,ax = plt.subplots(1,3,figsize=(20,4))
sc.pl.umap(HF_rna_data, color = 'LEF1', **raw_umap(), ax = ax[0], show = False, title = 'LEF1 Raw Expression')
sc.pl.umap(HF_rna_data, color = 'LEF1', layer='cis_prediction', **topic_umap(), ax = ax[1], show = False, title = 'LEF1 Cis Prediction')
sc.pl.umap(HF_rna_data, color = 'LEF1', layer='trans_prediction', **topic_umap(), ax = ax[2], show = False, title = 'LEF1 Trans Prediction')
plt.show()
Cis/Trans testing is most interesting with a lot of genes, so I will import a prepared adata object with predictions and logps from 4500 models that I've already trained.
modeled_data = anndata.read_h5ad('data/shareseq/2021-08-12_checkpoints/modeled_genes.h5ad')
To make this dataset, I already trained Cis and Trans models, then ran predict and get_logp using each.
modeled_data.layers
Layers with keys: cis_logp, cis_prediction, cis_velocity, imputed, trans_logp, trans_prediction
First, we'll run the Cis/Trans test. The degrees_of_freedom argument is equal to the number of parameters of the ATAC model, which here is its number of topics.
glt.globally_regulated_gene_test(modeled_data, degrees_of_freedom= atac_model.num_topics)
INFO:kladiv2.core.adata_interface:Added keys to var: global_regulation_test_statistic, global_regulation_pval, nonzero_counts
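One way to read a test that takes a degrees_of_freedom argument is as a likelihood-ratio comparison between a restricted model (Cis) and a fuller one (Trans). The sketch below follows that reading. It is an assumption, not a description of what globally_regulated_gene_test computes, and the log-likelihood values are made up. To stay dependency-free, the chi-square tail probability uses the closed form that holds only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Survival function P(X > x) of a chi-square variable with even `df`.

    For df = 2k the closed form is exp(-x/2) * sum_{i<k} (x/2)**i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(cis_logp, trans_logp, df):
    """Compare summed per-cell log-likelihoods of a restricted and a fuller model."""
    statistic = 2.0 * (sum(trans_logp) - sum(cis_logp))
    return statistic, chi2_sf_even_df(statistic, df)


# Hypothetical per-cell log-likelihoods for one gene
cis = [-1.9, -2.4, -1.7, -2.8, -2.2]
trans = [-1.1, -1.5, -1.2, -1.6, -1.3]
stat, pval = likelihood_ratio_test(cis, trans, df=4)
print(f"statistic = {stat:.2f}, p-value = {pval:.4f}")
```

Under this reading, a large statistic means the Trans model explains the gene's expression much better than the Cis model, which is how the ranking further down is interpreted.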
The test added a new column: global_regulation_test_statistic, which we can use to rank genes based on the performance of the Trans vs Cis model. Let's check to make sure that high-count genes are not biased towards global regulation.
ax = sc.pl.scatter(modeled_data, y = 'global_regulation_test_statistic', x = 'nonzero_counts', color = 'global_regulation_pval', show = False)
ax.set(xscale = 'log', yscale = 'log')
plt.show()
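A numeric complement to the scatter plot is a rank correlation between counts and the test statistic; a value near zero would suggest no monotonic count bias. The sketch below implements Spearman's correlation with the standard library only. The helper names and the five data points are hypothetical and are not taken from modeled_data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values standing in for nonzero_counts and the test statistic
nonzero_counts = [12, 40, 150, 900, 3000]
test_statistic = [3.1, 0.4, 8.0, 1.2, 2.5]
print(f"Spearman rho = {spearman(nonzero_counts, test_statistic):.2f}")
```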
Let's see the top globally-regulated genes and plot a few of them.
modeled_data.var.global_regulation_test_statistic.dropna().sort_values().tail(25)
gene
CMAH       114.003902
NLRP1A     116.008133
TOP2A      117.000890
ZBTB7C     120.014333
NKAIN3     122.066808
PIK3C2G    123.703160
FCAMR      124.662707
DSC1       125.576277
FHOD3      126.660995
SPECC1     127.190914
ACACB      128.323325
CASP14     130.354176
PADI3      132.782485
SOAT1      135.534155
DLG2       136.134508
SCD1       152.272416
GRIP1      154.530462
ZFHX3      160.184407
GM15848    168.203803
FLG2       179.116516
FA2H       190.907322
PDZRN4     201.250570
GM11571    219.232413
KRT31      234.947621
KRT33A     519.731069
Name: global_regulation_test_statistic, dtype: float64
glt.get_chromatin_differential(modeled_data)
INFO:kladiv2.core.adata_interface:Added key to layers: chromatin_differential
### TRANSFERRING DATA TO MODELED_DATA ADATA
shared_cells = np.intersect1d(modeled_data.obs_names, HF_rna_data.obs_names)
HF_modeled_data = modeled_data[shared_cells].copy()
HF_modeled_data.obsm['X_umap'] = HF_rna_data[shared_cells].obsm['X_umap']
HF_modeled_data.obs['tree_states'] = HF_rna_data[shared_cells].obs.tree_states
HF_modeled_data.obs['mira_pseudotime'] = HF_rna_data[shared_cells].obs.mira_pseudotime
HF_modeled_data.uns['tree_state_names'] = HF_rna_data.uns['tree_state_names']
HF_modeled_data.uns['connectivities_tree'] = HF_rna_data.uns['connectivities_tree']
###
The plot_chromatin_differential function is useful for exploring sets of genes and analyzing the relationship between their chromatin accessibility and expression. Just list the genes you want to see in the genes arg. The last figure in the panel shows, for each cell, the difference between Cis and Trans model predictions. Assuming the Trans model is the best model one can make using chromatin-based features, points which are off the line show where the Cis model over- or under-estimated expression.
plot_chromatin_differential(HF_modeled_data, genes = ['DLG2','SOAT1','GRIP1','DSC1'], size = 2)
plt.show()
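As a rough illustration of the quantity being plotted, the sketch below treats the chromatin differential as a per-cell log2 ratio of Cis to Trans predictions. That definition is an assumption about what get_chromatin_differential stores, and the prediction values are made up.

```python
import math


def chromatin_differential(cis_prediction, trans_prediction):
    """Per-cell log2 ratio of cis to trans predictions (assumed definition)."""
    return [math.log2(c / t) for c, t in zip(cis_prediction, trans_prediction)]


# Hypothetical predictions for one gene in four cells
cis = [2.0, 4.0, 1.0, 3.0]
trans = [2.0, 1.0, 4.0, 3.0]

for cell, diff in enumerate(chromatin_differential(cis, trans)):
    if diff > 0:
        verdict = "cis over-estimates"
    elif diff < 0:
        verdict = "cis under-estimates"
    else:
        verdict = "models agree"
    print(f"cell {cell}: differential {diff:+.1f} ({verdict})")
```

A positive value corresponds to a point where local chromatin predicts more expression than the Trans model does, and a negative value to the opposite.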
An interesting question to ask of a given enrichment is whether that geneset is especially Cis- or Trans-regulated. We can test that using the global_ontology_term_test function. Provided the results from an enrichment, it will test the overlapping genes to see if they are significantly Cis- or Trans-regulated. We can use enrichments from our RNA topics like so:
pd.DataFrame(
glt.global_ontology_term_test(modeled_data, enrichments=rna_model.get_enrichments(5))
).sort_values('global_pval').head(5)
|   | ontology | term | num_genes_tested | global_pval | genes |
|---|---|---|---|---|---|
| 0 | GO_Molecular_Function_2018 | hydrolase activity, acting on carbon-nitrogen ... | 3 | 0.001693 | (PADI1, PADI4, PADI3) |
| 1 | GO_Biological_Process_2018 | membrane lipid biosynthetic process (GO:0046467) | 5 | 0.003331 | (SPTSSB, PRKD1, B3GNT5, ST8SIA6, CERS4) |
| 2 | GO_Biological_Process_2018 | peptidyl-arginine modification (GO:0018195) | 2 | 0.007657 | (PADI4, PADI3) |
| 3 | BioPlanet_2019 | Proteins and DNA sequences in cardicac structures | 3 | 0.008149 | (BMP2, JAG1, RUNX2) |
| 4 | GO_Molecular_Function_2018 | ionotropic glutamate receptor activity (GO:000... | 2 | 0.011763 | (GRIK4, GRIA3) |
The top terms show very significant enrichment for nonlocal regulation. Checking the chromatin differentials below, we see that the chromatin around these genes appears to be active and increasing at the branch point before they are expressed.
plot_chromatin_differential(HF_modeled_data, genes = 'SPTSSB, PRKD1, B3GNT5, ST8SIA6, CERS4'.split(', '), size = 2)
plt.show()
If we have a geneset that we would like to test, perhaps downloaded from enrichr, we can use the global_geneset_test function.
notch_signaling = 'TCF7L1,LEF1,HIVEP3,RUNX2'.split(',')
glt.global_geneset_test(modeled_data, test_gene_group= notch_signaling)
0.0009748940851143133
To analyze general trends in globally-regulated genes, we can take the top 500 and see what they are enriched for using the new enrichr API.
rich.post_genelist(modeled_data.var.dropna().sort_values('global_regulation_test_statistic').tail(500).index.values) #posts genelist to enrichr, returns list ID
41575926
global_enrichments = rich.fetch_ontologies(41575926) # takes list ID and ontologies, returns enrichments results
rich.plot_enrichments(global_enrichments, show_top=10, enrichments_per_row=4) #plot enrichments, same as RNA model
The 4th result under the WikiPathways_2019_Mouse ontology is very interesting. Shown below, EDA signaling and some downstream effectors show similar patterns of expression. Particularly, they are highly expressed at the branch, but shut off quickly afterwards. The chromatin around these genes, however, remains accessible. This suggests signaling is exerting control over expression of these genes.
plot_chromatin_differential(HF_modeled_data, genes=global_enrichments['WikiPathways_2019_Mouse'][3]['genes'], size = 3)
plt.show()
Coming soon!