In 2018, Buenrostro et al. published a chromatin accessibility landscape of human hematopoiesis using single-cell ATAC-seq (https://www.ncbi.nlm.nih.gov/pubmed/29706549).

They obtained single-cell ATAC-seq profiles for ~3000 human hematopoietic cells. All cells were FACS-sorted, so Buenrostro et al. provide a ground truth for cell identity that can be used to check data processing, cell type clustering and differentiation trajectory analyses.

This tutorial aims to help you:

  • get familiar with AnnData objects
  • process (filter and normalize) single cell ATACseq data
  • visualize the data
  • identify cell clusters and corresponding cell types
  • identify differentially open features
  • explore possible cell trajectories

Import the required packages

In [1]:
import anndata as ad
import episcanpy.api as epi
import scanpy as sc
import numpy as np
/Users/anna.danese/anaconda3/lib/python3.7/site-packages/scanpy/api/__init__.py:6: FutureWarning: 

In a future version of Scanpy, `scanpy.api` will be removed.
Simply use `import scanpy as sc` and `import scanpy.external as sce` instead.

  FutureWarning
In [2]:
# settings for the plots
sc.set_figure_params(scanpy=True, dpi=80, dpi_save=250,
                     frameon=True, vector_friendly=True,
                     color_map="YlGnBu", format='pdf', transparent=False,
                     ipython_format='png2x')

Load the data to analyze

In [10]:
# specify the directory 
DATADIR = ''
# Load the data
adata = ad.read_h5ad(DATADIR+'GSE96769_anndata.h5ad')
adata
Out[10]:
AnnData object with n_obs × n_vars = 2953 × 491437 
    obs: 'cellID', 'line', 'celltype11', 'celltype8', 'ToKeep'
    var: 'region'

Exploring the loaded data

For a more detailed description of the AnnData object, check out: https://anndata.readthedocs.io/en/stable/

cell names:

In [6]:
adata.obs_names
Out[6]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '2943', '2944', '2945', '2946', '2947', '2948', '2949', '2950', '2951',
       '2952'],
      dtype='object', name='index', length=2953)

The cell names are not set in adata.obs_names, which only holds a running index. However, you can find them in the cell metadata (adata.obs).

In [7]:
# current cell annotations
adata.obs
Out[7]:
cellID line celltype11 celltype8 ToKeep
index
0 singles-BM0828-HSC-fresh-151027-1 1 HSC HSC 1
1 singles-BM0828-HSC-fresh-151027-2 2 HSC HSC 1
2 singles-BM0828-HSC-fresh-151027-3 3 HSC HSC 1
3 singles-BM0828-HSC-fresh-151027-4 4 HSC HSC 1
4 singles-BM0828-HSC-fresh-151027-5 5 HSC HSC 1
5 singles-BM0828-HSC-fresh-151027-6 6 HSC HSC 1
6 singles-BM0828-HSC-fresh-151027-7 7 HSC HSC 1
7 singles-BM0828-HSC-fresh-151027-8 8 HSC HSC 1
8 singles-BM0828-HSC-fresh-151027-9 9 HSC HSC 1
9 singles-BM0828-HSC-fresh-151027-10 10 HSC HSC 1
10 singles-BM0828-HSC-fresh-151027-11 11 HSC HSC 1
11 singles-BM0828-HSC-fresh-151027-12 12 HSC HSC 1
12 singles-BM0828-HSC-fresh-151027-13 13 HSC HSC 1
13 singles-BM0828-HSC-fresh-151027-14 14 HSC HSC 1
14 singles-BM0828-HSC-fresh-151027-15 15 HSC HSC 1
15 singles-BM0828-HSC-fresh-151027-16 16 HSC HSC 1
16 singles-BM0828-HSC-fresh-151027-17 17 HSC HSC 1
17 singles-BM0828-HSC-fresh-151027-18 18 HSC HSC 1
18 singles-BM0828-HSC-fresh-151027-19 19 HSC HSC 1
19 singles-BM0828-HSC-fresh-151027-20 20 HSC HSC 1
20 singles-BM0828-HSC-fresh-151027-21 21 HSC HSC 1
21 singles-BM0828-HSC-fresh-151027-22 22 HSC HSC 1
22 singles-BM0828-HSC-fresh-151027-23 23 HSC HSC 1
23 singles-BM0828-HSC-fresh-151027-24 24 HSC HSC 1
24 singles-BM0828-HSC-fresh-151027-25 25 HSC HSC 1
25 singles-BM0828-HSC-fresh-151027-26 26 HSC HSC 1
26 singles-BM0828-HSC-fresh-151027-27 27 HSC HSC 1
27 singles-BM0828-HSC-fresh-151027-28 28 HSC HSC 1
28 singles-BM0828-HSC-fresh-151027-29 29 HSC HSC 1
29 singles-BM0828-HSC-fresh-151027-30 30 HSC HSC 1
... ... ... ... ... ...
2923 singles-160822-BM1137-CMP-LS-67 2924 CMP CMP 1
2924 singles-160822-BM1137-CMP-LS-68 2925 CMP CMP 1
2925 singles-160822-BM1137-CMP-LS-69 2926 CMP CMP 1
2926 singles-160822-BM1137-CMP-LS-70 2927 CMP CMP 1
2927 singles-160822-BM1137-CMP-LS-71 2928 CMP CMP 1
2928 singles-160822-BM1137-CMP-LS-72 2929 CMP CMP 1
2929 singles-160822-BM1137-CMP-LS-73 2930 CMP CMP 1
2930 singles-160822-BM1137-CMP-LS-74 2931 CMP CMP 1
2931 singles-160822-BM1137-CMP-LS-75 2932 CMP CMP 1
2932 singles-160822-BM1137-CMP-LS-76 2933 CMP CMP 1
2933 singles-160822-BM1137-CMP-LS-77 2934 CMP CMP 1
2934 singles-160822-BM1137-CMP-LS-78 2935 CMP CMP 1
2935 singles-160822-BM1137-CMP-LS-79 2936 CMP CMP 1
2936 singles-160822-BM1137-CMP-LS-80 2937 CMP CMP 1
2937 singles-160822-BM1137-CMP-LS-81 2938 CMP CMP 1
2938 singles-160822-BM1137-CMP-LS-82 2939 CMP CMP 1
2939 singles-160822-BM1137-CMP-LS-83 2940 CMP CMP 1
2940 singles-160822-BM1137-CMP-LS-84 2941 CMP CMP 1
2941 singles-160822-BM1137-CMP-LS-85 2942 CMP CMP 1
2942 singles-160822-BM1137-CMP-LS-86 2943 CMP CMP 1
2943 singles-160822-BM1137-CMP-LS-87 2944 CMP CMP 1
2944 singles-160822-BM1137-CMP-LS-88 2945 CMP CMP 1
2945 singles-160822-BM1137-CMP-LS-89 2946 CMP CMP 1
2946 singles-160822-BM1137-CMP-LS-90 2947 CMP CMP 1
2947 singles-160822-BM1137-CMP-LS-91 2948 CMP CMP 1
2948 singles-160822-BM1137-CMP-LS-92 2949 CMP CMP 1
2949 singles-160822-BM1137-CMP-LS-93 2950 CMP CMP 1
2950 singles-160822-BM1137-CMP-LS-94 2951 CMP CMP 1
2951 singles-160822-BM1137-CMP-LS-95 2952 CMP CMP 1
2952 singles-160822-BM1137-CMP-LS-96 2953 CMP CMP 1

2953 rows × 5 columns
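If you prefer to index cells by their identifier, the metadata column can be promoted to the cell names. The sketch below uses a small pandas DataFrame as a stand-in for adata.obs (the three cellID values are copied from the table above; the DataFrame itself is a toy). On the real object, the analogous assignment would presumably be adata.obs_names = adata.obs['cellID'], provided the identifiers are unique.

```python
import pandas as pd

# Toy stand-in for adata.obs: a running string index plus a 'cellID' column.
obs = pd.DataFrame(
    {
        "cellID": [
            "singles-BM0828-HSC-fresh-151027-1",
            "singles-BM0828-HSC-fresh-151027-2",
            "singles-160822-BM1137-CMP-LS-67",
        ],
        "celltype8": ["HSC", "HSC", "CMP"],
    },
    index=["0", "1", "2"],
)

# Names used as an index should be unique; check before renaming.
assert obs["cellID"].is_unique

# Promote the identifiers to the index (the column itself is kept).
obs.index = obs["cellID"].astype(str).to_list()

first_hsc = obs.loc["singles-BM0828-HSC-fresh-151027-1", "celltype8"]
```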

Feature names:

In [8]:
adata.var_names
Out[8]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '491427', '491428', '491429', '491430', '491431', '491432', '491433',
       '491434', '491435', '491436'],
      dtype='object', name='index', length=491437)

This is the same scenario as for the cells: adata.var_names only holds a running index, and the feature names are stored in the variable metadata (adata.var).

In [11]:
adata.var
Out[11]:
region
index
0 chr1_10279_10779
1 chr1_13252_13752
2 chr1_16019_16519
3 chr1_29026_29526
4 chr1_96364_96864
5 chr1_115440_115940
6 chr1_237535_238035
7 chr1_240811_241311
8 chr1_540469_540969
9 chr1_713909_714409
10 chr1_752503_753003
11 chr1_762546_763046
12 chr1_773631_774131
13 chr1_778115_778615
14 chr1_779604_780104
15 chr1_791908_792408
16 chr1_793334_793834
17 chr1_794055_794555
18 chr1_800952_801452
19 chr1_805097_805597
20 chr1_826009_826509
21 chr1_832576_833076
22 chr1_833996_834496
23 chr1_839080_839580
24 chr1_839870_840370
25 chr1_840445_840945
26 chr1_841428_841928
27 chr1_842012_842512
28 chr1_845328_845828
29 chr1_845891_846391
... ...
491407 chrX_154447087_154447587
491408 chrX_154470476_154470976
491409 chrX_154485437_154485937
491410 chrX_154492269_154492769
491411 chrX_154492803_154493303
491412 chrX_154493475_154493975
491413 chrX_154527322_154527822
491414 chrX_154543378_154543878
491415 chrX_154549084_154549584
491416 chrX_154560826_154561326
491417 chrX_154561575_154562075
491418 chrX_154562710_154563210
491419 chrX_154563838_154564338
491420 chrX_154624431_154624931
491421 chrX_154663986_154664486
491422 chrX_154664555_154665055
491423 chrX_154666950_154667450
491424 chrX_154738976_154739476
491425 chrX_154807298_154807798
491426 chrX_154822578_154823078
491427 chrX_154840821_154841321
491428 chrX_154841449_154841949
491429 chrX_154841956_154842456
491430 chrX_154842517_154843017
491431 chrX_154862057_154862557
491432 chrX_154870909_154871409
491433 chrX_154880741_154881241
491434 chrX_154891824_154892324
491435 chrX_154896342_154896842
491436 chrX_154912441_154912941

491437 rows × 1 columns

In [12]:
# renaming the features using the metadata
adata.var_names = adata.var['region']
In [13]:
adata.var_names
Out[13]:
Index(['chr1_10279_10779', 'chr1_13252_13752', 'chr1_16019_16519',
       'chr1_29026_29526', 'chr1_96364_96864', 'chr1_115440_115940',
       'chr1_237535_238035', 'chr1_240811_241311', 'chr1_540469_540969',
       'chr1_713909_714409',
       ...
       'chrX_154840821_154841321', 'chrX_154841449_154841949',
       'chrX_154841956_154842456', 'chrX_154842517_154843017',
       'chrX_154862057_154862557', 'chrX_154870909_154871409',
       'chrX_154880741_154881241', 'chrX_154891824_154892324',
       'chrX_154896342_154896842', 'chrX_154912441_154912941'],
      dtype='object', name='region', length=491437)
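The feature names encode genomic coordinates as chromosome_start_end. A minimal sketch for turning such a name back into its parts is given below; parse_region and Region are hypothetical helpers, not part of episcanpy. Splitting from the right keeps contig names that themselves contain underscores intact.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(name: str) -> Region:
    """Split 'chrom_start_end' on the last two underscores."""
    chrom, start, end = name.rsplit("_", 2)
    return Region(chrom, int(start), int(end))


# Two names taken from the output above.
regions = [parse_region(n) for n in ["chr1_10279_10779", "chrX_154912441_154912941"]]
widths = [r.end - r.start for r in regions]
```

In this dataset every window shown above is 500 bp wide, which the widths list reflects for the two examples.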

Get a feeling for the data:

  • What's the matrix data type?
In [15]:
adata.X
Out[15]:
<2953x491437 sparse matrix of type '<class 'numpy.int8'>'
	with 19395264 stored elements in Compressed Sparse Column format>

If the matrix is not binary:

In [12]:
epi.pp.binarize(adata, copy=False)
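Conceptually, binarizing means that every non-zero count becomes 1, so the matrix only records whether a feature is open in a cell. The sketch below shows this on a toy scipy sparse matrix; it illustrates the idea and is not necessarily how epi.pp.binarize is implemented.

```python
import numpy as np
from scipy import sparse

# Toy count matrix: 3 cells x 3 features, stored column-wise like adata.X above.
counts = sparse.csc_matrix(
    np.array([[0, 2, 1],
              [3, 0, 0],
              [0, 1, 5]], dtype=np.int8)
)


def binarize(matrix):
    """Return a copy in which every stored non-zero count is replaced by 1."""
    out = matrix.copy()
    out.eliminate_zeros()   # drop explicitly stored zeros, if any
    out.data[:] = 1         # only the stored (non-zero) entries are touched
    return out


binary = binarize(counts)
is_binary = bool(np.all(binary.data == 1))
```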
  • Visualize the per-cell coverage and how widely features are shared across cells
In [14]:
# Histogram showing the number of cells with a given number of features
epi.pp.coverage_cells(adata, binary=True, bins=100)

# Plot the number of features on a log scale
#epi.pp.filter_cells(adata, min_features=1)
#epi.pp.coverage_cells(adata, binary=True, bins=100, log=True)
In [15]:
# Histogram showing in how many cells you can find the different features
epi.pp.commonness_features(adata, binary=True)

To plot the number of cells sharing a feature on a log scale, you first need to make sure that every feature is open in at least one cell.

In [16]:
# number of cells open for a feature in log scale
epi.pp.filter_features(adata, min_cells=1)
epi.pp.commonness_features(adata, binary=True, log=True)
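On a binary matrix, both histograms come down to row and column sums. The following sketch computes them by hand on a toy matrix (variable names are illustrative) and shows why features that are open in no cell have to be dropped before taking logarithms.

```python
import numpy as np

# Toy binary matrix: 4 cells (rows) x 5 features (columns).
X = np.array([[1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 1, 1, 0]])

coverage = X.sum(axis=1)     # number of open features per cell
commonness = X.sum(axis=0)   # number of cells in which each feature is open

# log10(0) is undefined, so never-open features are removed first.
log_commonness = np.log10(commonness[commonness > 0])
```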
  • What fraction of the peaks (among those open in at least one cell) is shared by at least 50 cells?
In [17]:
len(adata.var[adata.var["commonness"] >= 50])/len(adata.var)
Out[17]:
0.1706308929873323
  • How many cells have 500 or fewer covered features?
In [18]:
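# summing a sparse matrix returns a 2-D matrix; flatten it to one value per cell
adata.obs["sum_peaks"] = np.asarray(adata.X.sum(axis=1)).ravel()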
In [19]:
len(adata.obs[adata.obs["sum_peaks"] <= 500])
Out[19]:
415
  • Take your time to play around and explore!

Additional filtering

  • Remove low-quality cells and uninformative or noisy features
In [20]:
# removing features that are open in too few cells
epi.pp.filter_features(adata, min_cells=50)
adata
Out[20]:
AnnData object with n_obs × n_vars = 2953 × 83553 
    obs: 'cellID', 'line', 'celltype11', 'celltype8', 'ToKeep', 'nb_features', 'sum_peaks'
    var: 'region', 'commonness', 'n_cells'
In [22]:
# removing cells with too little chromatin information
epi.pp.filter_cells(adata, min_features=500)
adata
Out[22]:
AnnData object with n_obs × n_vars = 2469 × 83553 
    obs: 'cellID', 'line', 'celltype11', 'celltype8', 'ToKeep', 'nb_features', 'sum_peaks', 'n_features'
    var: 'region', 'commonness', 'n_cells'
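The order of the two filters matters: once features have been removed, the per-cell count is taken over the remaining features only. This may be why 484 cells are dropped here (2953 to 2469), more than the 415 counted before the feature filter. A sketch on a toy matrix, with filter_matrix as a hypothetical helper:

```python
import numpy as np


def filter_matrix(X, min_cells, min_features):
    """Filter features first, then cells, mirroring the two cells above."""
    keep_features = X.sum(axis=0) >= min_cells
    X = X[:, keep_features]
    # Coverage is recounted on the features that survived the first filter.
    keep_cells = X.sum(axis=1) >= min_features
    return X[keep_cells], keep_cells, keep_features


X = np.array([[1, 1, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 1, 0],
              [1, 1, 0, 0]])

filtered, keep_cells, keep_features = filter_matrix(X, min_cells=2, min_features=2)
```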
  • How does the feature space compare to scRNA-seq data?
  • How would you like to proceed with that matrix?

Feature space reduction

To limit memory usage and running time, it is useful to select the most variable features for later dimensionality reduction.

We arbitrarily chose the top 50000 most variable features. What happens if a different number of features is chosen?

In [24]:
# selecting the top 50000 most variable features
adata50 = epi.pp.select_var_feature(adata, nb_features=50000, copy=True)
In [25]:
adata50
Out[25]:
View of AnnData object with n_obs × n_vars = 2469 × 50138 
    obs: 'cellID', 'line', 'celltype11', 'celltype8', 'ToKeep', 'nb_features', 'sum_peaks', 'n_features'
    var: 'region', 'commonness', 'n_cells', 'prop_shared_cells', 'variability_score'
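For binary data, a feature is most informative when it is open in roughly half of the cells and least informative when it is open in nearly all or nearly none. The sketch below assumes a score of that form, computed from the proportion of cells sharing a feature; the exact definition behind 'variability_score' in epi.pp.select_var_feature may differ. It also shows one possible reason why slightly more than the requested number of features can be returned: features tied with the cut-off score are all kept.

```python
import numpy as np


def variability_score(n_cells_open, n_cells_total):
    """Assumed score in [0.5, 1]: highest when a feature is open in half the cells."""
    prop = np.asarray(n_cells_open) / n_cells_total
    return 1.0 - np.abs(prop - 0.5)


def select_top(scores, nb_features):
    """Keep every feature scoring at least as high as the nb_features-th best."""
    threshold = np.sort(scores)[::-1][nb_features - 1]
    return scores >= threshold


n_open = np.array([5, 50, 25, 75, 95, 10])   # toy counts out of 100 cells
scores = variability_score(n_open, 100)
mask = select_top(scores, nb_features=2)     # two requested, ties included
```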
In [26]:
# additional filtering - if necessary
epi.pp.filter_features(adata50, min_cells=100)
epi.pp.filter_cells(adata50, min_features=1000)
adata50
Trying to set attribute `.var` of view, making a copy.
Out[26]:
AnnData object with n_obs × n_vars = 2212 × 43592 
    obs: 'cellID', 'line', 'celltype11', 'celltype8', 'ToKeep', 'nb_features', 'sum_peaks', 'n_features'
    var: 'region', 'commonness', 'n_cells', 'prop_shared_cells', 'variability_score'
In [23]:
# save the temporary filtered matrix
adata50.write(DATADIR+'GSE96769_anndata_top50000features_filtered.h5ad')
In [ ]:
# load the filtered data
adata50 = ad.read_h5ad(DATADIR+'GSE96769_anndata_top50000features_filtered.h5ad')

Dimensionality reduction

  • What methods can you use to reduce dimensionality of your data?
In [27]:
# Principal Component Analysis
epi.pp.pca(adata50, n_comps=100)
  • What do PC1 and PC2 correspond to in the data?
In [29]:
epi.pl.pca_overview(adata50)
In [30]:
sc.pl.pca(adata50, color=['celltype11', 'celltype8', 'n_features'], wspace=0.4)
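One way to approach the question about PC1 is to correlate a component with a per-cell covariate such as coverage. The sketch below does this on simulated data in which cells differ mainly in coverage; the simulation and its parameters are assumptions for illustration and say nothing about the real dataset. On the real object, the analogous inputs would be adata50.obsm['X_pca'][:, 0] and adata50.obs['n_features'].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 cells x 500 features. Each cell has its own opening
# probability, so cells differ mainly in how many features they cover.
p_cell = rng.uniform(0.05, 0.5, size=200)
X = (rng.random((200, 500)) < p_cell[:, None]).astype(float)

# PCA via SVD of the column-centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = U[:, 0] * S[0]          # cell coordinates on the first component

coverage = X.sum(axis=1)
r = np.corrcoef(pc1, coverage)[0, 1]   # sign is arbitrary, magnitude matters
```

A correlation close to 1 in absolute value suggests that the component mostly captures coverage rather than biology.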
In [33]:
# computing a neighborhood graph 
epi.pp.neighbors(adata50)
# Embed the neighborhood graph using UMAP
epi.tl.umap(adata50)
WARNING: neighbors/connectivities have not been computed using umap
In [34]:
adata50
Out[34]:
AnnData object with n_obs × n_vars = 2212 × 43592 
    obs: 'cellID', 'line', 'celltype11', 'celltype8', 'ToKeep', 'nb_features', 'sum_peaks', 'n_features'
    var: 'region', 'commonness', 'n_cells', 'prop_shared_cells', 'variability_score'
    uns: 'pca', 'celltype11_colors', 'celltype8_colors', 'neighbors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
In [35]:
sc.pl.umap(adata50, color=['celltype11', 'celltype8', 'n_features'], wspace=0.4)
  • What to do next?
In [36]:
# Linear regression on the number of features per cell (remove coverage effect)
epi.pp.regress_out(adata50, "nb_features")
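Regressing out a covariate means fitting, for each feature, a linear model on that covariate and keeping only the residuals. A minimal numpy sketch of the idea follows; regress_out here is a hypothetical stand-alone function on simulated data, not the episcanpy implementation.

```python
import numpy as np


def regress_out(X, covariate):
    """Return the residuals of each column of X after a linear fit on the covariate."""
    design = np.column_stack([np.ones(len(covariate)), covariate])
    beta, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ beta


rng = np.random.default_rng(1)
depth = rng.uniform(500, 5000, size=100)     # toy per-cell coverage
signal = rng.normal(size=(100, 3))
X = signal + 0.001 * depth[:, None]          # coverage leaks into every feature

residuals = regress_out(X, depth)
max_corr = max(abs(np.corrcoef(residuals[:, j], depth)[0, 1]) for j in range(3))
```

After the fit, the residuals are uncorrelated with coverage (max_corr is zero up to rounding). Note that the residuals are no longer binary and can be negative.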
  • Louvain clustering and subsetting the matrix for tissues to identify cell types
In [37]:
# recompute PCA, neighborhood graph, tSNE and UMAP
epi.pp.lazy(adata50)
sc.pl.pca(adata50, color=['celltype11', 'celltype8', 'nb_features'])
sc.pl.umap(adata50, color=['celltype11', 'celltype8', 'nb_features'])
 NumbaWarning:/Users/anna.danese/anaconda3/lib/python3.7/site-packages/umap/umap_.py:349: 
Compilation is falling back to object mode WITH looplifting enabled because Function "fuzzy_simplicial_set" failed type inference due to: Untyped global name 'nearest_neighbors': cannot determine Numba type of <class 'function'>

File "../anaconda3/lib/python3.7/site-packages/umap/umap_.py", line 467:
def fuzzy_simplicial_set(
    <source elided>
    if knn_indices is None or knn_dists is None:
        knn_indices, knn_dists, _ = nearest_neighbors(
        ^

 NumbaWarning:/Users/anna.danese/anaconda3/lib/python3.7/site-packages/numba/compiler.py:725: Function "fuzzy_simplicial_set" was compiled in object mode without forceobj=True.

File "../anaconda3/lib/python3.7/site-packages/umap/umap_.py", line 350:
@numba.jit()
def fuzzy_simplicial_set(
^

 NumbaDeprecationWarning:/Users/anna.danese/anaconda3/lib/python3.7/site-packages/numba/compiler.py:734: 
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit

File "../anaconda3/lib/python3.7/site-packages/umap/umap_.py", line 350:
@numba.jit()
def fuzzy_simplicial_set(
^

Identify cell types

  • What do you need to identify cell types? Think of scRNA-seq. How can you relate this to scATAC-seq?

In scRNA-seq, you cluster the cells and then use marker genes to identify the cell types. To do the same with scATAC-seq, you have to rely on the assumption that chromatin openness relates to gene expression. For this specific data type, you need to know which promoters lie in which window, a table of the closest genes per promoter, and a list of marker genes for each cell type you are interested in. You could also construct the count matrix directly for promoters or enhancers.

In [44]:
epi.pp.pca(adata50, n_comps=100, svd_solver='arpack')
epi.pp.neighbors(adata50)
epi.tl.umap(adata50)
WARNING: neighbors/connectivities have not been computed using umap
In [45]:
# Louvain clustering
epi.tl.louvain(adata50, resolution=0.5)
In [46]:
sc.pl.umap(adata50, color=["louvain","celltype8"], wspace=0.4)
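Because the cells were sorted, the clusters can also be compared with the known labels numerically rather than only by eye. The sketch below builds a contingency table with pandas on a toy annotation (the labels are made up for illustration); on the real data the same call would take adata50.obs['louvain'] and adata50.obs['celltype8'].

```python
import pandas as pd

# Toy annotation: cluster assignment and sorted cell type per cell.
obs = pd.DataFrame({
    "louvain":   ["0", "0", "1", "1", "1", "2"],
    "celltype8": ["HSC", "HSC", "CMP", "CMP", "HSC", "other"],
})

# Rows: clusters, columns: cell types, values: number of cells.
table = pd.crosstab(obs["louvain"], obs["celltype8"])

# Fraction of each cluster made up by its most frequent cell type.
purity = table.max(axis=1) / table.sum(axis=1)
```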

Can you find differentially open peaks between your cell types?

  • Are they biologically relevant? Do they correspond to cell type identity? Do they correspond to something else?
In [47]:
epi.tl.rank_features(adata50, groupby="louvain", n_features=25)
 UserWarning:/Users/anna.danese/anaconda3/lib/python3.7/site-packages/episcanpy/tools/_features_selection.py:30: Attention: no omic specified. We used default settings of the original Scanpy function

    			When the parameters where not specified in input
 RuntimeWarning:/Users/anna.danese/anaconda3/lib/python3.7/site-packages/scanpy/tools/_rank_genes_groups.py:223: invalid value encountered in log2
Out[47]:
()
In [48]:
adata50
Out[48]:
AnnData object with n_obs × n_vars = 2212 × 43592 
    obs: 'cellID', 'line', 'celltype11', 'celltype8', 'ToKeep', 'nb_features', 'sum_peaks', 'n_features', 'louvain'
    var: 'region', 'commonness', 'n_cells', 'prop_shared_cells', 'variability_score'
    uns: 'pca', 'celltype11_colors', 'celltype8_colors', 'neighbors', 'louvain', 'louvain_colors', 'rank_features_groups'
    obsm: 'X_pca', 'X_umap', 'X_tsne'
    varm: 'PCs'
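A simple sanity check on the ranked features is to compare, on the binarized matrix, the fraction of cells in which a peak is open inside a cluster with the fraction outside it. The sketch below is a deliberately simple alternative on toy data, not the test used by epi.tl.rank_features; the pseudocount value is an arbitrary assumption that keeps log2 defined when a peak is closed in one group.

```python
import numpy as np


def fraction_open(X, labels, group):
    """Fraction of cells with each feature open, inside and outside a group."""
    mask = labels == group
    return X[mask].mean(axis=0), X[~mask].mean(axis=0)


# Toy binary matrix: 4 cells x 3 features, two clusters.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
labels = np.array(["0", "0", "1", "1"])

inside, outside = fraction_open(X, labels, "0")
pseudo = 0.01                                   # arbitrary small pseudocount
log2fc = np.log2((inside + pseudo) / (outside + pseudo))
top = int(np.argmax(log2fc))                    # feature most specific to cluster "0"
```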

Get a list of genes/promoters in windows and annotate the adata

This step requires loading a GTF file, which provides gene annotations for the genomic regions used in the adata. You will need more information on how to do this using Fastgenomics. Just ask!
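The core of the annotation step is an interval overlap between each window and the promoter of each gene. The sketch below shows that logic in plain Python. The genes, their coordinates and the promoter window size (2000 bp upstream, 500 bp downstream of the TSS) are made-up placeholders; in practice the gene positions would come from the GTF file.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    tss: int
    strand: str


def promoter_window(gene, upstream=2000, downstream=500):
    """Promoter interval around the TSS, oriented by strand."""
    if gene.strand == "+":
        return gene.tss - upstream, gene.tss + downstream
    return gene.tss - downstream, gene.tss + upstream


def genes_for_peak(peak_name, genes):
    """Names of genes whose promoter overlaps the peak 'chrom_start_end'."""
    chrom, start, end = peak_name.rsplit("_", 2)
    start, end = int(start), int(end)
    hits = []
    for gene in genes:
        lo, hi = promoter_window(gene)
        if gene.chrom == chrom and start < hi and lo < end:
            hits.append(gene.name)
    return hits


# Hypothetical genes, for illustration only.
genes = [Gene("GENE_A", "chr1", 11000, "+"), Gene("GENE_B", "chr1", 50000, "-")]
peaks = ["chr1_10279_10779", "chr1_29026_29526"]
annotation = {p: genes_for_peak(p, genes) for p in peaks}
```

For half a million windows a nested loop like this is slow; an interval-based tool or sorted lookup would be the usual choice, but the overlap test stays the same.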

In [ ]:
 

Can you trace lineage specification with this data?

In [49]:
sc.pl.umap(adata50, color=["louvain","celltype8"], wspace=0.4)

Partition-based graph abstraction

In [51]:
sc.tl.paga(adata50, groups='louvain')
In [52]:
sc.pl.paga(adata50)

Diffusion Map and Force-directed graph drawing

In [53]:
epi.tl.diffmap(adata50)
epi.tl.draw_graph(adata50)
In [57]:
sc.pl.diffmap(adata50, wspace=0.4, color=["louvain","celltype8"])
In [58]:
sc.pl.draw_graph(adata50, wspace=0.4,color=["louvain", 'celltype8'])

Diffusion pseudotime

In [59]:
# root on stem cell and progenitors
adata50.uns['iroot'] = np.flatnonzero(adata50.obs['louvain'] == '1')[0]
sc.tl.dpt(adata50)
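The root above is hard-coded to Louvain cluster '1', and cluster numbering can change when the clustering is rerun. Since the sorted labels are available, an alternative is to root on a cell labelled as a stem cell. The helper below is a hypothetical sketch on a toy label array; on the real object it could be called as pick_root(adata50.obs['celltype8'].to_numpy(), 'HSC'), assuming that label is present after filtering.

```python
import numpy as np


def pick_root(labels, root_label):
    """Index of the first cell carrying root_label; raise if the label is absent."""
    candidates = np.flatnonzero(np.asarray(labels) == root_label)
    if candidates.size == 0:
        raise ValueError(f"no cell labelled {root_label!r}")
    return int(candidates[0])


celltype = np.array(["CMP", "HSC", "HSC", "CMP"])   # toy labels
iroot = pick_root(celltype, "HSC")
```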
In [62]:
sc.pl.diffmap(adata50, color=["dpt_pseudotime", 'celltype8'])

Among the cells in the Buenrostro data, are there cell types that might be contaminants and that might affect your attempt to trace lineage specification?

In [ ]: