Run from Raw Reads(Emilson 2024): Automated CAZyme and Glycan Substrate Annotation in Microbiomes: A Step-by-Step Protocol
Example 2: Emilson 2024 Dataset
The Emilson 2024 dataset emilson_2024
This paper explores the distribution of carbohydrate-related genes by exploring soil microbiomes at different depths and hillsides.
Procedure
Software and versions
Anaconda (Anaconda, version 23.7.3)
MEGAHIT (MEGAHIT, version 1.2.9)
pyHMMER (pyHMMER, newest version)
DIAMOND (DIAMOND, version 2.1.8)
TrimGalore (TrimGalore, version 0.6.0)
Pyrodigal (Pyrodigal, newest version)
BWA (BWA, version 0.7.17)
Samtools (Samtools, version 1.7)
Kraken2 (Kraken2, version 2.1.1)
KrakenTools (KrakenTools, version 1.2)
run_dbcan (run_dbcan, version 5.0.3)
MMseqs2 (MMseqs2, release 15-6f452)
Pysam (Pysam, version 0.22.1)
conda env create -f CAZyme_annotation_em.yml
conda activate CAZyme_annotation_em
Module 1: Reads processing to obtain contigs
P1. Contamination Check (TIMING ~10min)
Download the Emilson 2024 dataset:
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/MAG_HS-05_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/MAG_HS-05_2.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/MAG_HS-60_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/MAG_HS-60_2.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-05_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-05_2.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-60_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-60_2.fastq.gz
These MAG DNA-seq raw data were originally downloaded from https://www.ncbi.nlm.nih.gov/sra/?term=SRR20140034 and https://www.ncbi.nlm.nih.gov/sra/?term=SRR20140038. The RNA-seq raw data were originally downloaded from https://www.ncbi.nlm.nih.gov/sra/?term=SRR20140125 and https://www.ncbi.nlm.nih.gov/sra/?term=SRR20140128. They are renamed to indicate their collected samples
Use kraken2 to check for contaminated reads:
kraken2 --threads 32 --quick --paired --db K2 --report MAG_HS-05.kreport --output MAG_HS-05.kraken.output MAG_HS-05_1.fastq.gz MAG_HS-05_2.fastq.gz
kraken2 --threads 32 --quick --paired --db K2 --report MAG_HS-60.kreport --output MAG_HS-60.kraken.output MAG_HS-60_1.fastq.gz MAG_HS-60_2.fastq.gz
kraken2 --threads 32 --quick --paired --db K2 --report RNA_HS-05.kreport --output RNA_HS-05.kraken.output RNA_HS-05_1.fastq.gz RNA_HS-05_2.fastq.gz
kraken2 --threads 32 --quick --paired --db K2 --report RNA_HS-60.kreport --output RNA_HS-60.kraken.output RNA_HS-60_1.fastq.gz RNA_HS-60_2.fastq.gz
Kraken2 found very little contamination in the Emilson2024 data. Consequently, there was no need for the contamination removal step.
If contamination is identified, users can align the reads to the reference genomes of potential contamination source organisms to remove the aligned reads (Box 1). The most common source in human microbiome studies is from human hosts.
Box 1: Example to remove contamination reads from human
Kraken2 will produce the following output files:
-rw-r--r-- 1 jinfang yinlab 702M May 30 00:23 MAG_HS-05.kraken.output
-rw-r--r-- 1 jinfang yinlab 819K May 30 00:23 MAG_HS-05.kreport
-rw-r--r-- 1 jinfang yinlab 1.5G May 30 00:22 MAG_HS-60.kraken.output
-rw-r--r-- 1 jinfang yinlab 903K May 30 00:22 MAG_HS-60.kreport
-rw-r--r-- 1 jinfang yinlab 702M May 30 00:23 RNA_HS-05.kraken.output
-rw-r--r-- 1 jinfang yinlab 819K May 30 00:23 RNA_HS-05.kreport
-rw-r--r-- 1 jinfang yinlab 1.5G May 30 00:22 RNA_HS-60.kraken.output
-rw-r--r-- 1 jinfang yinlab 903K May 30 00:22 RNA_HS-60.kreport
Suppose from these files, we have identified humans as the contamination source, we can use the following commands to remove the contamination reads by aligning reads to the human reference genome.
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bwa index -p hg38 Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bwa mem hg38 MAG_HS-05_1.fastq.gz MAG_HS-05_2.fastq.gz -t 32 -o MAG_HS-05.hg38.sam
bwa mem hg38 MAG_HS-60_1.fastq.gz MAG_HS-60_2.fastq.gz -t 32 -o MAG_HS-60.hg38.sam
samtools view -f 12 MAG_HS-05.hg38.sam > MAG_HS-05.hg38.unmap.bam
samtools view -f 12 MAG_HS-60.hg38.sam > MAG_HS-60.hg38.unmap.bam
samtools fastq -1 MAG_HS-05_1.clean.fq.gz -2 MAG_HS-05_2.clean.fq.gz MAG_HS-05.hg38.unmap.bam
samtools fastq -1 MAG_HS-60_1.clean.fq.gz -2 MAG_HS-60_2.clean.fq.gz MAG_HS-60.hg38.unmap.bam
Or by KrakenTool
extract_kraken_reads.py -k MAG_HS-05.kraken.output -taxid 9606 -exclude -s1 MAG_HS-05_1.fastq.gz -s2 MAG_HS-05_2.fastq.gz -o MAG_HS-05_1.clean.fastq -o2 MAG_HS-05_2.clean.fastq
gzip MAG_HS-05_1.clean.fastq
gzip MAG_HS-05_2.clean.fastq
extract_kraken_reads.py -k MAG_HS-60.kraken.output -taxid 9606 -exclude -s1 MAG_HS-60_1.fastq.gz -s2 MAG_HS-60_2.fastq.gz -o MAG_HS-60_1.clean.fastq -o2 MAG_HS-60_2.clean.fastq
gzip MAG_HS-60_1.clean.fastq
gzip MAG_HS-60_2.clean.fastq
P2. Trim adapter and low-quality reads (TIMING ~20min)
trim_galore --paired MAG_HS-60_1.fastq.gz MAG_HS-60_2.fastq.gz --illumina -j 36
trim_galore --paired MAG_HS-05_1.fastq.gz MAG_HS-05_2.fastq.gz --illumina -j 36
trim_galore --paired RNA_HS-60_1.fastq.gz RNA_HS-60_2.fastq.gz --illumina -j 36
trim_galore --paired RNA_HS-05_1.fastq.gz RNA_HS-05_2.fastq.gz --illumina -j 36
We specified –illumina to indicate that the reads were generated using the Illumina sequencing platform. Nonetheless, trim_galore can automatically detect adapters, providing flexibility for users who may know the specific sequencing platform. Details of trimming are available in the trimming report file (Box 2).
Box 2: Example output of trim_galore
In addition to the trimmed read files, Trim_galore` also generates a trimming report file. The trimming report contains details on read trimming, such as the number of trimmed reads.
-rw-r--r-- 1 jinfang yinlab 4.3K May 30 00:25 MAG_HS-05_1.fastq.gz_trimming_report.txt
-rw-r--r-- 1 jinfang yinlab 1.4G May 30 00:28 MAG_HS-05_1_val_1.fq.gz
-rw-r--r-- 1 jinfang yinlab 4.5K May 30 00:28 MAG_HS-05_2.fastq.gz_trimming_report.txt
-rw-r--r-- 1 jinfang yinlab 1.4G May 30 00:28 MAG_HS-05_2_val_2.fq.gz
-rw-r--r-- 1 jinfang yinlab 4.3K May 30 00:24 MAG_HS-60_1.fastq.gz_trimming_report.txt
-rw-r--r-- 1 jinfang yinlab 2.7G May 30 00:30 MAG_HS-60_1_val_1.fq.gz
-rw-r--r-- 1 jinfang yinlab 4.5K May 30 00:30 MAG_HS-60_2.fastq.gz_trimming_report.txt
-rw-r--r-- 1 jinfang yinlab 2.7G May 30 00:30 MAG_HS-60_2_val_2.fq.gz
-rw-r--r-- 1 jinfang yinlab 4.3K May 30 00:25 RNA_HS-05_1.fastq.gz_trimming_report.txt
-rw-r--r-- 1 jinfang yinlab 1.4G May 30 00:28 RNA_HS-05_1_val_1.fq.gz
-rw-r--r-- 1 jinfang yinlab 4.5K May 30 00:28 RNA_HS-05_2.fastq.gz_trimming_report.txt
-rw-r--r-- 1 jinfang yinlab 1.4G May 30 00:28 RNA_HS-05_2_val_2.fq.gz
-rw-r--r-- 1 jinfang yinlab 4.3K May 30 00:24 RNA_HS-60_1.fastq.gz_trimming_report.txt
-rw-r--r-- 1 jinfang yinlab 2.7G May 30 00:30 RNA_HS-60_1_val_1.fq.gz
-rw-r--r-- 1 jinfang yinlab 4.5K May 30 00:30 RNA_HS-60_2.fastq.gz_trimming_report.txt
-rw-r--r-- 1 jinfang yinlab 2.7G May 30 00:30 RNA_HS-60_2_val_2.fq.gz
Warning
During the trimming process, certain reads may be entirely removed due to low quality in its entirety.
Using the --retain_unpaired
parameter in trim_galore
allows for the preservation of single-end reads.
In this protocol, this option was not selected, so that both reads of a forward-revise pair were removed.
P3. Assemble reads into contigs (TIMING ~84min)
Use Megahit for assembling reads into contigs:
megahit -m 0.5 -t 32 -o megahit_MAG_HS-60 -1 MAG_HS-60_1_val_1.fq.gz -2 MAG_HS-60_2_val_2.fq.gz --out-prefix MAG_HS-60 --min-contig-len 1000
megahit -m 0.5 -t 32 -o megahit_MAG_HS-05 -1 MAG_HS-05_1_val_1.fq.gz -2 MAG_HS-05_2_val_2.fq.gz --out-prefix MAG_HS-05 --min-contig-len 1000
MEGAHIT generates two output folders: megahit_MAG_HS-05` and megahit_MAG_HS-60. Each contains five files and one sub-folder (Box 3). MAG_HS-60.contigs.fa is the final contig sequence file. We set –min-contig-len 1000, a common practice to retain all contigs longer than 1,000 base pairs.
Box 3: Example output of MEGAHIT
-rw-r--r-- 1 jinfang yinlab 262 May 30 03:09 checkpoints.txt
-rw-r--r-- 1 jinfang yinlab 0 May 30 03:09 done
drwxr-xr-x 2 jinfang yinlab 41K May 30 03:09 intermediate_contigs
-rw-r--r-- 1 jinfang yinlab 135M May 30 03:09 MAG_HS-60.contigs.fa
-rw-r--r-- 1 jinfang yinlab 205K May 30 03:09 MAG_HS-60.log
-rw-r--r-- 1 jinfang yinlab 1.1K May 30 00:30 options.json
P4. Predict genes by pyrodigal (TIMING ~1h)
pyrodigal -p meta -i megahit_MAG_HS-60/MAG_HS-60.contigs.fa -d MAG_HS-60.cds -a MAG_HS-60.faa -f gff -o MAG_HS-60.gff -j 36
pyrodigal -p meta -i megahit_MAG_HS-05/MAG_HS-05.contigs.fa -d MAG_HS-05.cds -a MAG_HS-05.faa -f gff -o MAG_HS-05.gff -j 36
The pyrodigal tool predicts genes from contigs. The -p meta parameter indicates that the input is metagenomic contigs. The output files include the coding sequence (CDS) file, protein sequence file, and gene location file (GFF) (Box 4).
Box 4: Example output of pyrodigal
-rw-r--r-- 1 jinfang yinlab 138M May 30 04:47 MAG_HS-60.cds
-rw-r--r-- 1 jinfang yinlab 56M May 30 04:47 MAG_HS-60.gff
-rw-r--r-- 1 jinfang yinlab 61M May 30 04:47 MAG_HS-60.faa
if applied prodigal, please use the following command to fix id mapping:
dbcan_utils gff_fix -i MAG_HS-60.faa -g MAG_HS-60.gff
dbcan_utils gff_fix -i MAG_HS-05.faa -g MAG_HS-05.gff
Module 2. run_dbcan annotation to obtain CAZymes, CGCs, and substrates
CRITICAL STEP
Users can skip P5 and P6, and directly run P7 (much slower though), if they want to predict not only CAZymes and CGCs, but also substrates.
P5. CAZyme annotation at the CAZyme family level (TIMING ~10min)
run_dbcan easy_CAZyme --input_raw_data MAG_HS-60.faa --mode protein --output_dir MAG_HS-60.CAZyme --db_dir db --input_format NCBI
run_dbcan easy_CAZyme --input_raw_data MAG_HS-05.faa --mode protein --output_dir MAG_HS-05.CAZyme --db_dir db --input_format NCBI
Two arguments are required for run_dbcan
: the input sequence file (faa files) and the sequence type (protein).
By default, run_dbcan
will use three methods (pyHMMER
vs dbCAN HMMdb
, DIAMOND
vs CAZy
, pyHMMER
vs dbCAN-sub HMMdb
) for
CAZyme annotation.
Box 5: CAZyme annotation with default setting
The sequence type can be protein, prok, meta. If the input sequence file contains metagenomic contig sequences (fna file), the sequence type has to be meta, and prodigal will be called to predict genes.
P6. CGC prediction (TIMING ~15 min)
The following commands will re-run run_dbcan to not only predict CAZymes but also CGCs with protein faa and gene location gff files.
run_dbcan easy_CGC --input_raw_data MAG_HS-60.faa --mode protein --output_dir MAG_HS-60.PUL --input_format NCBI --input_gff_format prodigal --input_gff MAG_HS-60.gff --db_dir db
run_dbcan easy_CGC --input_raw_data MAG_HS-05.faa --mode protein --output_dir MAG_HS-05.PUL --input_format NCBI --input_gff_format prodigal --input_gff MAG_HS-05.gff --db_dir db
Warning
Creating own gff file
If the users would like to create their own gff
file (instead of using Prokka or Prodigal),
it is important to make sure the value of ID attribute in the gff
file matches the protein ID in the protein faa
file.
[Troubleshooting]CGC not found
If no result is found in CGC output file, it is most likely because the sequence IDs in gff
file and faa
file do not match.
Another less likely reason is that the contigs are too short and fragmented and not suitable for CGC prediction.
P7. Substrate prediction for CAZymes and CGCs (TIMING ~5h)
The following commands will re-run run_dbcan to predict CAZymes, CGCs, and their substrates.
run_dbcan easy_substrate --input_raw_data MAG_HS-60.faa --mode protein --output_dir MAG_HS-60.PUL --input_format NCBI --input_gff_format prodigal --input_gff MAG_HS-60.gff --db_dir db
run_dbcan easy_substrate --input_raw_data MAG_HS-05.faa --mode protein --output_dir MAG_HS-05.PUL --input_format NCBI --input_gff_format prodigal --input_gff MAG_HS-05.gff --db_dir db
Box 6. Example output folder content of run_dbcan substrate prediction
In the MAG_HS-05.dbCAN_ <https://bcb.unl.edu/dbCAN_tutorial/dataset5-Emilson2024/MAG_HS-05.dbCAN/>_ directory, a total of 17 files and 1 folder are generated:
-rw-rw-r-- 1 jinfang jinfang 39M Nov 1 22:18 PUL_blast.out -rw-rw-r-- 1 jinfang jinfang 3.1M Nov 1 22:15 CGC.faa -rw-rw-r-- 1 jinfang jinfang 6.9M Nov 1 22:15 cgc.gff -rw-rw-r-- 1 jinfang jinfang 702K Nov 1 22:15 cgc.out -rw-rw-r-- 1 jinfang jinfang 321K Nov 1 22:15 cgc_standard.out -rw-rw-r-- 1 jinfang jinfang 1.5M Nov 1 22:15 cgc_standard.out.json -rw-rw-r-- 1 jinfang jinfang 556K Nov 1 22:14 dbcan-sub.hmm.out -rw-rw-r-- 1 jinfang jinfang 345K Nov 1 22:14 diamond.out -rw-rw-r-- 1 jinfang jinfang 455K Nov 1 22:14 dtemp.out -rw-rw-r-- 1 jinfang jinfang 298K Nov 1 22:14 hmmer.out -rw-rw-r-- 1 jinfang jinfang 270K Nov 1 22:15 overview.txt -rw-rw-r-- 1 jinfang jinfang 1.1M Nov 1 22:15 stp.out -rw-rw-r-- 1 jinfang jinfang 54K Nov 1 22:18 substrate.out drwxrwxr-x 2 jinfang jinfang 32K Nov 2 09:48 synteny.pdf -rw-rw-r-- 1 jinfang jinfang 288K Nov 1 22:14 tf-1.out -rw-rw-r-- 1 jinfang jinfang 237K Nov 1 22:14 tf-2.out -rw-rw-r-- 1 jinfang jinfang 804K Nov 1 22:15 tp.out -rw-rw-r-- 1 jinfang jinfang 31M Nov 1 21:07 uniInput
Module 3. Read mapping (Fig. 3) to calculate abundance for CAZyme families, subfamilies, CGCs, and substrates
P8. Read mapping to all contigs of each sample (TIMING ~10 min)
mkdir samfiles
bwa index MAG_HS-05/MAG_HS-05.contigs.fa
bwa index megahit_MAG_HS-60/MAG_HS-60.contigs.fa
bwa mem -t 32 -o samfiles/MAG_HS-05.sam megahit_MAG_HS-05/MAG_HS-05.contigs.fa MAG_HS-05_1_val_1.fq.gz MAG_HS-05_2_val_2.fq.gz
bwa mem -t 32 -o samfiles/MAG_HS-60.sam megahit_MAG_HS-60/MAG_HS-60.contigs.fa MAG_HS-60_1_val_1.fq.gz MAG_HS-60_2_val_2.fq.gz
bwa mem -t 32 -o samfiles/RNA_HS-05.sam megahit_MAG_HS-05/MAG_HS-05.contigs.fa RNA_HS-05_1_val_1.fq.gz RNA_HS-05_2_val_2.fq.gz
bwa mem -t 32 -o samfiles/RNA_HS-60.sam megahit_MAG_HS-60/MAG_HS-60.contigs.fa RNA_HS-60_1_val_1.fq.gz RNA_HS-60_2_val_2.fq.gz
P9. Read mapping to all contigs of each sample (TIMING ~10min)
cd samfiles
samtools sort -@ 32 -o MAG_HS-05.bam MAG_HS-05.sam
samtools sort -@ 32 -o MAG_HS-60.bam MAG_HS-60.sam
samtools sort -@ 32 -o RNA_HS-05.bam RNA_HS-05.sam
samtools sort -@ 32 -o RNA_HS-60.bam RNA_HS-60.sam
samtools index MAG_HS-05.bam
samtools index MAG_HS-60.bam
samtools index RNA_HS-05.bam
samtools index RNA_HS-60.bam
rm -rf *sam
cd ..
P10. Read count calculation for all proteins of each sample using dbcan_utils (TIMING ~2min)
mkdir MAG_HS-05_abund && cd MAG_HS-05_abund
dbcan_utils cal_coverage -g ../MAG_HS-05.fix.gff -i ../samfiles/MAG_HS-05.bam -o MAG_HS-05.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 ----identity 0.98
cd .. && mkdir MAG_HS-60_abund && cd MAG_HS-60_abund
dbcan_utils cal_coverage -g ../MAG_HS-60.fix.gff -i ../samfiles/MAG_HS-60.bam -o MAG_HS-60.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 --–-identity 0.98
cd .. && mkdir RNA_HS-05_abund && cd RNA_HS-05_abund
dbcan_utils cal_coverage -g ../MAG_HS-05.fix.gff -i ../samfiles/RNA_HS-05.bam -o RNA_HS-05.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 --–-identity 0.98
cd .. && mkdir RNA_HS-60_abund && cd MAG_HS-60_abund
dbcan_utils cal_coverage -g ../MAG_HS-60.fix.gff -i ../samfiles/RNA_HS-60.bam -o RNA_HS-60.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 -–--identity 0.98
cd ..
Note
How to filter the mapped reads are the key step of the abundance calculation. Several parameters can be set to filter the reads. The command dbcan_utils provides –overlap_base_ratio, –mapping_quality and –identity to filter the mapped reads which will may produce more reliable abundance. The default values for these three parameters are 0.2, 30 and 0.98, respectively. Read counts are saved in depth.txt files of each sample.
P11. Read count calculation for a given region of contigs using Samtools (TIMING ~2min)
cd MAG_HS-05_abund
samtools depth -r k141_282901:2644-8574 ../samfiles/MAG_HS-05.bam > MAG_HS-05.cgc.depth.txt
The parameter -r k141_282901:2644-8574 specifies a region in a contig. For any CGC, its positional range can be found in the file cgc_standard.out produced by run_dbcan. The cgc.depth.txt files contain the raw read counts for the specified region.
Warning
The contig IDs are automatically generated by MEGAHIT. There is a small chance that a same contig ID appears in both samples. However, the two contigs in the two samples do not match each other even the ID is the same. For example, the contig ID k141_282901 is most likely only found in the MAG_HS-05 sample. Even if there is a k141_282901 in MAG_HS-60, the actual contigs in two samples are different.
P12. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (TIMING ~1min)
dbcan_utils fam_abund -bt MAG_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM
dbcan_utils fam_substrate_abund -bt MAG_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM
dbcan_utils CGC_abund -bt MAG_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt MAG_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM
cd .. && cd RNA_HS-60_abund
dbcan_utils fam_abund -bt RNA_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM
dbcan_utils fam_substrate_abund -bt RNA_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM
dbcan_utils CGC_abund -bt RNA_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt RNA_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM
cd .. && cd MAG_HS-60_abund
dbcan_utils fam_abund -bt MAG_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM
dbcan_utils fam_substrate_abund -bt MAG_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM
dbcan_utils CGC_abund -bt MAG_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt MAG_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM
cd .. && cd RNA_HS-60_abund
dbcan_utils fam_abund -bt RNA_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM
dbcan_utils fam_substrate_abund -bt RNA_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM
dbcan_utils CGC_abund -bt RNA_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt RNA_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM
We developed a set of Python scripts as dbcan_utils
(included in the run_dbcan
package) to take the raw read counts for all genes as input and output the normalized abundances (refer to Box 7) of CAZyme families, subfamilies, CGCs, and substrates (see Fig. 4). The parameter -a TPM
can also be set to two other metrics: RPM, or RPKM61.
RPKM is calculated as the number of mapped reads to a gene G divided by [(total number of mapped reads to all genes / 10^6) x (gene G length / 1000)].
RPM is the number of mapped reads to a gene G divided by (total number of mapped reads to all genes / 10^6).
TPM is calculated as [number of mapped reads to a gene G / (gene G length / 1000)] divided by the sum of [number of mapped reads to each gene / (the gene length / 1000)].
Box 7. Example output of dbcan_utils
As an example, RNA_HS-05_abund <https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-05_abund/>_ folder has 7 TSV files:
-rw-rw-r-- 1 jinfang jinfang 178K Jan 2 04:08 CGC_abund.out
-rw-rw-r-- 1 jinfang jinfang 3.3K Jan 2 04:08 CGC_substrate_majority_voting.out
-rw-rw-r-- 1 jinfang jinfang 12K Jan 2 04:08 CGC_substrate_PUL_homology.out
-rw-rw-r-- 1 jinfang jinfang 2.5K Jan 2 04:08 EC_abund.out
-rw-rw-r-- 1 jinfang jinfang 4.1K Jan 2 04:08 fam_abund.out
-rw-rw-r-- 1 jinfang jinfang 42K Jan 2 04:08 fam_substrate_abund.out
-rw-rw-r-- 1 jinfang jinfang 26K Jan 2 04:08 subfam_abund.out
Explanation of columns in these TSV files is as follows:
fam_abund.out
: CAZy family (from HMMER vs dbCAN HMMdb), sum of TPM, number of CAZymes in the family.
subfam_abund.out
: eCAMI subfamily (from HMMER vs dbCAN-sub HMMdb), sum of TPM, number of CAZymes in the subfamily.
EC_abund.out
: EC number (extracted from dbCAN-sub subfamily), sum of TPM, number of CAZymes with the EC.
fam_substrate_abund.out
: Substrate (from HMMER vs dbCAN-sub HMMdb), sum of TPM (all CAZymes in this substrate group), GeneID (all CAZyme IDs in this substrate group).
CGC_abund.out
: CGC_ID (e.g., k141_338400|CGC1), mean of TPM (all genes in the CGC), Seq_IDs (IDs of all genes in the CGC), TPM (of all genes in the CGC), Families (CAZyme family or other signature gene type of all genes in the CGC).
CGC_substrate_PUL_homology.out
: Substrate (from dbCAN-PUL blast search), sum of TPM, CGC_IDs (all CGCs predicted to have the substrate from dbCAN-PUL blast search), TPM (of CGCs in this substrate group).
CGC_substrate_majority_voting.out
: Substrate (from dbCAN-sub majority voting), sum of TPM, CGC_IDs (all CGCs predicted to have the substrate from dbCAN-sub majority voting), TPM (of CGCs in this substrate group).
Module 4: dbcan_plot for data visualization (Fig. 3) of abundances of CAZymes, CGCs, and substrates (TIMING variable)
CRITICAL STEP
P13. Heatmap for CAZyme substrate abundance across samples (Fig. S4B) (TIMING 1min)
dbcan_plot heatmap_plot --samples MAG_HS-05,MAG_HS-60,RNA_HS-05,RNA_HS-60 -i MAG_HS-05_abund/fam_substrate_abund.out,MAG_HS-60_abund/fam_substrate_abund.out,RNA_HS-05_abund/fam_substrate_abund.out,RNA_HS-60_abund/fam_substrate_abund.out --show_abund --top 20
Here we plot the top 20 substrates in the two samples. The input files are the two CAZyme substrate abundance files calculated based on dbCAN-sub result. The default heatmap is ranked by substrate abundances. To rank the heatmap according to abundance profile using the function clustermap of seaborn package, users can invoke the --cluster_map
parameter.
P14. Barplot for CAZyme family/subfamily/EC abundance across samples (TIMING 1min)
dbcan_plot bar_plot --samples MAG_HS-05,MAG_HS-60,RNA_HS-05,RNA_HS-60 --vertical_bar --top 20 -i MAG_HS-05_abund/fam_abund.out,MAG_HS-60_abund/fam_abund.out,RNA_HS-05_abund/fam_abund.out,RNA_HS-60_abund/fam_abund.out --pdf fam.pdf
dbcan_plot bar_plot --samples MAG_HS-05,MAG_HS-60,RNA_HS-05,RNA_HS-60 --vertical_bar --top 20 -i MAG_HS-05_abund/subfam_abund.out,MAG_HS-60_abund/subfam_abund.out,RNA_HS-05_abund/subfam_abund.out,RNA_HS-60_abund/subfam_abund.out --pdf subfam.pdf
dbcan_plot bar_plot --samples MAG_HS-05,MAG_HS-60,RNA_HS-05,RNA_HS-60 --vertical_bar --top 20 -i MAG_HS-05_abund/EC_abund.out,MAG_HS-60_abund/EC_abund.out,RNA_HS-05_abund/EC_abund.out,RNA_HS-60_abund/EC_abund.out --pdf ec.pdf
Users can choose to generate a barplot instead of heatmap using the bar_plot
method.
P15. Synteny plot between a CGC and its best PUL hit with read mapping coverage to CGC (TIMING 1min)
dbcan_plot CGC_synteny_coverage_plot -i MAG_HS-05.dbCAN --readscount MAG_HS-05_abund/MAG_HS-05.cgc.depth.txt --cgcid 'k141_282901|CGC1'
The MAG_HS-05.dbCAN` folder contains the PUL.out` file. Using this file, the cgc_standard.out file, and the best PUL’s gff file in dbCAN-PUL.tar.gz, the CGC_synteny_plot method will create the CGC-PUL synteny plot. The –cgcid parameter is required to specify which CGC to plot (k141_282901|CGC1 in this example). The MAG_HS-05.cgc.depth.txt file is used to plot the read mapping coverage.
If users only want to plot the CGC structure:
dbcan_plot CGC_plot -i MAG_HS-05.dbCAN --cgcid 'k141_282901|CGC1'
If users only want to plot the CGC structure plus the read mapping coverage:
dbcan_plot CGC_coverage_plot -i MAG_HS-05.dbCAN --cgcid 'k141_282901|CGC1' --readscount MAG_HS-05_abund/MAG_HS-05.cgc.depth.txt
If users only want to plot the synteny between the CGC and PUL:
dbcan_plot CGC_synteny_plot -i MAG_HS-05.dbCAN --cgcid 'k141_282901|CGC1'
Warning
The CGC IDs in different samples do not match each other. For example, specifying -i MAG_HS-05.dbCAN
is to plot
the 'k141_282901|CGC1'
in the fefifo_8022_1 sample. The 'k141_282901|CGC1'
in the fefifo_8022_7 sample most likely does not exist,
and even it does, the CGC has a different sequence even if the ID is the same.