Run from Raw Reads(Emilson 2024): Automated CAZyme and Glycan Substrate Annotation in Microbiomes: A Step-by-Step Protocol

Example 2: Emilson 2024 Dataset

The Emilson 2024 dataset emilson_2024

This paper explores the distribution of carbohydrate-related genes by exploring soil microbiomes at different depths and hillsides.

Procedure

Software and versions

conda env create -f CAZyme_annotation_em.yml
conda activate CAZyme_annotation_em

Module 1: Reads processing to obtain contigs

P1. Contamination Check (TIMING ~10min)

Download the Emilson 2024 dataset:

wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/MAG_HS-05_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/MAG_HS-05_2.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/MAG_HS-60_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/MAG_HS-60_2.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-05_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-05_2.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-60_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-60_2.fastq.gz

These MAG DNA-seq raw data were originally downloaded from https://www.ncbi.nlm.nih.gov/sra/?term=SRR20140034 and https://www.ncbi.nlm.nih.gov/sra/?term=SRR20140038. The RNA-seq raw data were originally downloaded from https://www.ncbi.nlm.nih.gov/sra/?term=SRR20140125 and https://www.ncbi.nlm.nih.gov/sra/?term=SRR20140128. They are renamed to indicate their collected samples

Use kraken2 to check for contaminated reads:

kraken2 --threads 32 --quick --paired --db K2 --report MAG_HS-05.kreport --output MAG_HS-05.kraken.output MAG_HS-05_1.fastq.gz MAG_HS-05_2.fastq.gz

kraken2 --threads 32 --quick --paired --db K2 --report MAG_HS-60.kreport --output MAG_HS-60.kraken.output MAG_HS-60_1.fastq.gz MAG_HS-60_2.fastq.gz

kraken2 --threads 32 --quick --paired --db K2 --report RNA_HS-05.kreport --output RNA_HS-05.kraken.output RNA_HS-05_1.fastq.gz RNA_HS-05_2.fastq.gz

kraken2 --threads 32 --quick --paired --db K2 --report RNA_HS-60.kreport --output RNA_HS-60.kraken.output RNA_HS-60_1.fastq.gz RNA_HS-60_2.fastq.gz

Kraken2 found very little contamination in the Emilson2024 data. Consequently, there was no need for the contamination removal step.

If contamination is identified, users can align the reads to the reference genomes of potential contamination source organisms to remove the aligned reads (Box 1). The most common source in human microbiome studies is from human hosts.

Box 1: Example to remove contamination reads from human

Kraken2 will produce the following output files:

-rw-r--r-- 1 jinfang yinlab 702M May 30 00:23 MAG_HS-05.kraken.output
-rw-r--r-- 1 jinfang yinlab 819K May 30 00:23 MAG_HS-05.kreport
-rw-r--r-- 1 jinfang yinlab 1.5G May 30 00:22 MAG_HS-60.kraken.output
-rw-r--r-- 1 jinfang yinlab 903K May 30 00:22 MAG_HS-60.kreport
-rw-r--r-- 1 jinfang yinlab 702M May 30 00:23 RNA_HS-05.kraken.output
-rw-r--r-- 1 jinfang yinlab 819K May 30 00:23 RNA_HS-05.kreport
-rw-r--r-- 1 jinfang yinlab 1.5G May 30 00:22 RNA_HS-60.kraken.output
-rw-r--r-- 1 jinfang yinlab 903K May 30 00:22 RNA_HS-60.kreport

Suppose from these files, we have identified humans as the contamination source, we can use the following commands to remove the contamination reads by aligning reads to the human reference genome.

wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

bwa index -p hg38 Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

bwa mem hg38 MAG_HS-05_1.fastq.gz MAG_HS-05_2.fastq.gz -t 32 -o MAG_HS-05.hg38.sam

bwa mem hg38 MAG_HS-60_1.fastq.gz MAG_HS-60_2.fastq.gz -t 32 -o MAG_HS-60.hg38.sam

samtools view -f 12 MAG_HS-05.hg38.sam > MAG_HS-05.hg38.unmap.bam

samtools view -f 12 MAG_HS-60.hg38.sam > MAG_HS-60.hg38.unmap.bam

samtools fastq -1 MAG_HS-05_1.clean.fq.gz -2 MAG_HS-05_2.clean.fq.gz MAG_HS-05.hg38.unmap.bam

samtools fastq -1 MAG_HS-60_1.clean.fq.gz -2 MAG_HS-60_2.clean.fq.gz MAG_HS-60.hg38.unmap.bam

Or by KrakenTool

extract_kraken_reads.py -k MAG_HS-05.kraken.output -taxid 9606 -exclude -s1 MAG_HS-05_1.fastq.gz -s2 MAG_HS-05_2.fastq.gz -o MAG_HS-05_1.clean.fastq -o2 MAG_HS-05_2.clean.fastq

gzip MAG_HS-05_1.clean.fastq

gzip MAG_HS-05_2.clean.fastq

extract_kraken_reads.py -k MAG_HS-60.kraken.output -taxid 9606 -exclude -s1 MAG_HS-60_1.fastq.gz -s2 MAG_HS-60_2.fastq.gz -o MAG_HS-60_1.clean.fastq -o2 MAG_HS-60_2.clean.fastq

gzip MAG_HS-60_1.clean.fastq

gzip MAG_HS-60_2.clean.fastq

P2. Trim adapter and low-quality reads (TIMING ~20min)

trim_galore --paired MAG_HS-60_1.fastq.gz MAG_HS-60_2.fastq.gz --illumina -j 36

trim_galore --paired MAG_HS-05_1.fastq.gz MAG_HS-05_2.fastq.gz --illumina -j 36

trim_galore --paired RNA_HS-60_1.fastq.gz RNA_HS-60_2.fastq.gz --illumina -j 36

trim_galore --paired RNA_HS-05_1.fastq.gz RNA_HS-05_2.fastq.gz --illumina -j 36

We specified –illumina to indicate that the reads were generated using the Illumina sequencing platform. Nonetheless, trim_galore can automatically detect adapters, providing flexibility for users who may know the specific sequencing platform. Details of trimming are available in the trimming report file (Box 2).

Box 2: Example output of trim_galore

In addition to the trimmed read files, Trim_galore` also generates a trimming report file. The trimming report contains details on read trimming, such as the number of trimmed reads.

-rw-r--r-- 1 jinfang yinlab 4.3K May 30 00:25 MAG_HS-05_1.fastq.gz_trimming_report.txt

-rw-r--r-- 1 jinfang yinlab 1.4G May 30 00:28 MAG_HS-05_1_val_1.fq.gz

-rw-r--r-- 1 jinfang yinlab 4.5K May 30 00:28 MAG_HS-05_2.fastq.gz_trimming_report.txt

-rw-r--r-- 1 jinfang yinlab 1.4G May 30 00:28 MAG_HS-05_2_val_2.fq.gz

-rw-r--r-- 1 jinfang yinlab 4.3K May 30 00:24 MAG_HS-60_1.fastq.gz_trimming_report.txt

-rw-r--r-- 1 jinfang yinlab 2.7G May 30 00:30 MAG_HS-60_1_val_1.fq.gz

-rw-r--r-- 1 jinfang yinlab 4.5K May 30 00:30 MAG_HS-60_2.fastq.gz_trimming_report.txt

-rw-r--r-- 1 jinfang yinlab 2.7G May 30 00:30 MAG_HS-60_2_val_2.fq.gz

-rw-r--r-- 1 jinfang yinlab 4.3K May 30 00:25 RNA_HS-05_1.fastq.gz_trimming_report.txt

-rw-r--r-- 1 jinfang yinlab 1.4G May 30 00:28 RNA_HS-05_1_val_1.fq.gz

-rw-r--r-- 1 jinfang yinlab 4.5K May 30 00:28 RNA_HS-05_2.fastq.gz_trimming_report.txt

-rw-r--r-- 1 jinfang yinlab 1.4G May 30 00:28 RNA_HS-05_2_val_2.fq.gz

-rw-r--r-- 1 jinfang yinlab 4.3K May 30 00:24 RNA_HS-60_1.fastq.gz_trimming_report.txt

-rw-r--r-- 1 jinfang yinlab 2.7G May 30 00:30 RNA_HS-60_1_val_1.fq.gz

-rw-r--r-- 1 jinfang yinlab 4.5K May 30 00:30 RNA_HS-60_2.fastq.gz_trimming_report.txt

-rw-r--r-- 1 jinfang yinlab 2.7G May 30 00:30 RNA_HS-60_2_val_2.fq.gz

Warning

During the trimming process, certain reads may be entirely removed due to low quality in its entirety. Using the --retain_unpaired parameter in trim_galore allows for the preservation of single-end reads. In this protocol, this option was not selected, so that both reads of a forward-revise pair were removed.

P3. Assemble reads into contigs (TIMING ~84min)

Use Megahit for assembling reads into contigs:

megahit -m 0.5 -t 32 -o megahit_MAG_HS-60 -1 MAG_HS-60_1_val_1.fq.gz -2 MAG_HS-60_2_val_2.fq.gz --out-prefix MAG_HS-60 --min-contig-len 1000

megahit -m 0.5 -t 32 -o megahit_MAG_HS-05 -1 MAG_HS-05_1_val_1.fq.gz -2 MAG_HS-05_2_val_2.fq.gz --out-prefix MAG_HS-05 --min-contig-len 1000

MEGAHIT generates two output folders: megahit_MAG_HS-05` and megahit_MAG_HS-60. Each contains five files and one sub-folder (Box 3). MAG_HS-60.contigs.fa is the final contig sequence file. We set –min-contig-len 1000, a common practice to retain all contigs longer than 1,000 base pairs.

Box 3: Example output of MEGAHIT
-rw-r--r-- 1 jinfang yinlab  262 May 30 03:09 checkpoints.txt

-rw-r--r-- 1 jinfang yinlab    0 May 30 03:09 done

drwxr-xr-x 2 jinfang yinlab  41K May 30 03:09 intermediate_contigs

-rw-r--r-- 1 jinfang yinlab 135M May 30 03:09 MAG_HS-60.contigs.fa

-rw-r--r-- 1 jinfang yinlab 205K May 30 03:09 MAG_HS-60.log

-rw-r--r-- 1 jinfang yinlab 1.1K May 30 00:30 options.json

P4. Predict genes by pyrodigal (TIMING ~1h)

pyrodigal -p meta -i megahit_MAG_HS-60/MAG_HS-60.contigs.fa -d MAG_HS-60.cds -a MAG_HS-60.faa -f gff -o MAG_HS-60.gff -j 36

pyrodigal -p meta -i megahit_MAG_HS-05/MAG_HS-05.contigs.fa -d MAG_HS-05.cds -a MAG_HS-05.faa -f gff -o MAG_HS-05.gff -j 36

The pyrodigal tool predicts genes from contigs. The -p meta parameter indicates that the input is metagenomic contigs. The output files include the coding sequence (CDS) file, protein sequence file, and gene location file (GFF) (Box 4).

Box 4: Example output of pyrodigal
-rw-r--r-- 1 jinfang yinlab 138M May 30 04:47 MAG_HS-60.cds

-rw-r--r-- 1 jinfang yinlab  56M May 30 04:47 MAG_HS-60.gff

-rw-r--r-- 1 jinfang yinlab  61M May 30 04:47 MAG_HS-60.faa

if applied prodigal, please use the following command to fix id mapping:

dbcan_utils gff_fix -i MAG_HS-60.faa -g MAG_HS-60.gff

dbcan_utils gff_fix -i MAG_HS-05.faa -g MAG_HS-05.gff

Module 2. run_dbcan annotation to obtain CAZymes, CGCs, and substrates

CRITICAL STEP

Users can skip P5 and P6, and directly run P7 (much slower though), if they want to predict not only CAZymes and CGCs, but also substrates.

P5. CAZyme annotation at the CAZyme family level (TIMING ~10min)

run_dbcan easy_CAZyme --input_raw_data MAG_HS-60.faa --mode protein --output_dir MAG_HS-60.CAZyme --db_dir db --input_format NCBI
run_dbcan easy_CAZyme --input_raw_data MAG_HS-05.faa --mode protein --output_dir MAG_HS-05.CAZyme --db_dir db --input_format NCBI

Two arguments are required for run_dbcan: the input sequence file (faa files) and the sequence type (protein). By default, run_dbcan will use three methods (pyHMMER vs dbCAN HMMdb, DIAMOND vs CAZy, pyHMMER vs dbCAN-sub HMMdb) for CAZyme annotation.

Box 5: CAZyme annotation with default setting

The sequence type can be protein, prok, meta. If the input sequence file contains metagenomic contig sequences (fna file), the sequence type has to be meta, and prodigal will be called to predict genes.

P6. CGC prediction (TIMING ~15 min)

The following commands will re-run run_dbcan to not only predict CAZymes but also CGCs with protein faa and gene location gff files.

run_dbcan easy_CGC --input_raw_data MAG_HS-60.faa --mode protein --output_dir MAG_HS-60.PUL --input_format NCBI --input_gff_format prodigal --input_gff MAG_HS-60.gff --db_dir db
run_dbcan easy_CGC --input_raw_data MAG_HS-05.faa --mode protein --output_dir MAG_HS-05.PUL --input_format NCBI --input_gff_format prodigal --input_gff MAG_HS-05.gff --db_dir db

Warning

Creating own gff file If the users would like to create their own gff file (instead of using Prokka or Prodigal), it is important to make sure the value of ID attribute in the gff file matches the protein ID in the protein faa file.

[Troubleshooting]CGC not found If no result is found in CGC output file, it is most likely because the sequence IDs in gff file and faa file do not match. Another less likely reason is that the contigs are too short and fragmented and not suitable for CGC prediction.

P7. Substrate prediction for CAZymes and CGCs (TIMING ~5h)

The following commands will re-run run_dbcan to predict CAZymes, CGCs, and their substrates.

run_dbcan easy_substrate --input_raw_data MAG_HS-60.faa --mode protein --output_dir MAG_HS-60.PUL --input_format NCBI --input_gff_format prodigal --input_gff MAG_HS-60.gff --db_dir db
run_dbcan easy_substrate --input_raw_data MAG_HS-05.faa --mode protein --output_dir MAG_HS-05.PUL --input_format NCBI --input_gff_format prodigal --input_gff MAG_HS-05.gff --db_dir db
Box 6. Example output folder content of run_dbcan substrate prediction

In the MAG_HS-05.dbCAN_ <https://bcb.unl.edu/dbCAN_tutorial/dataset5-Emilson2024/MAG_HS-05.dbCAN/>_ directory, a total of 17 files and 1 folder are generated:

-rw-rw-r--  1 jinfang jinfang  39M Nov  1 22:18 PUL_blast.out
-rw-rw-r--  1 jinfang jinfang 3.1M Nov  1 22:15 CGC.faa
-rw-rw-r--  1 jinfang jinfang 6.9M Nov  1 22:15 cgc.gff
-rw-rw-r--  1 jinfang jinfang 702K Nov  1 22:15 cgc.out
-rw-rw-r--  1 jinfang jinfang 321K Nov  1 22:15 cgc_standard.out
-rw-rw-r--  1 jinfang jinfang 1.5M Nov  1 22:15 cgc_standard.out.json
-rw-rw-r--  1 jinfang jinfang 556K Nov  1 22:14 dbcan-sub.hmm.out
-rw-rw-r--  1 jinfang jinfang 345K Nov  1 22:14 diamond.out
-rw-rw-r--  1 jinfang jinfang 455K Nov  1 22:14 dtemp.out
-rw-rw-r--  1 jinfang jinfang 298K Nov  1 22:14 hmmer.out
-rw-rw-r--  1 jinfang jinfang 270K Nov  1 22:15 overview.txt
-rw-rw-r--  1 jinfang jinfang 1.1M Nov  1 22:15 stp.out
-rw-rw-r--  1 jinfang jinfang  54K Nov  1 22:18 substrate.out
drwxrwxr-x  2 jinfang jinfang  32K Nov  2 09:48 synteny.pdf
-rw-rw-r--  1 jinfang jinfang 288K Nov  1 22:14 tf-1.out
-rw-rw-r--  1 jinfang jinfang 237K Nov  1 22:14 tf-2.out
-rw-rw-r--  1 jinfang jinfang 804K Nov  1 22:15 tp.out
-rw-rw-r--  1 jinfang jinfang  31M Nov  1 21:07 uniInput

Module 3. Read mapping (Fig. 3) to calculate abundance for CAZyme families, subfamilies, CGCs, and substrates

P8. Read mapping to all contigs of each sample (TIMING ~10 min)

mkdir samfiles

bwa index MAG_HS-05/MAG_HS-05.contigs.fa

bwa index megahit_MAG_HS-60/MAG_HS-60.contigs.fa

bwa mem -t 32 -o samfiles/MAG_HS-05.sam megahit_MAG_HS-05/MAG_HS-05.contigs.fa MAG_HS-05_1_val_1.fq.gz MAG_HS-05_2_val_2.fq.gz

bwa mem -t 32 -o samfiles/MAG_HS-60.sam megahit_MAG_HS-60/MAG_HS-60.contigs.fa MAG_HS-60_1_val_1.fq.gz MAG_HS-60_2_val_2.fq.gz

bwa mem -t 32 -o samfiles/RNA_HS-05.sam megahit_MAG_HS-05/MAG_HS-05.contigs.fa RNA_HS-05_1_val_1.fq.gz RNA_HS-05_2_val_2.fq.gz

bwa mem -t 32 -o samfiles/RNA_HS-60.sam megahit_MAG_HS-60/MAG_HS-60.contigs.fa RNA_HS-60_1_val_1.fq.gz RNA_HS-60_2_val_2.fq.gz

P9. Read mapping to all contigs of each sample (TIMING ~10min)

cd samfiles

samtools sort -@ 32 -o MAG_HS-05.bam MAG_HS-05.sam

samtools sort -@ 32 -o MAG_HS-60.bam MAG_HS-60.sam

samtools sort -@ 32 -o RNA_HS-05.bam RNA_HS-05.sam

samtools sort -@ 32 -o RNA_HS-60.bam RNA_HS-60.sam

samtools index MAG_HS-05.bam

samtools index MAG_HS-60.bam

samtools index RNA_HS-05.bam

samtools index RNA_HS-60.bam

rm -rf *sam

cd ..

P10. Read count calculation for all proteins of each sample using dbcan_utils (TIMING ~2min)

mkdir MAG_HS-05_abund && cd MAG_HS-05_abund

dbcan_utils cal_coverage -g ../MAG_HS-05.fix.gff -i ../samfiles/MAG_HS-05.bam -o MAG_HS-05.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 ----identity 0.98

cd .. && mkdir MAG_HS-60_abund && cd MAG_HS-60_abund

dbcan_utils cal_coverage -g ../MAG_HS-60.fix.gff -i ../samfiles/MAG_HS-60.bam -o MAG_HS-60.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 --–-identity 0.98

cd .. && mkdir RNA_HS-05_abund && cd RNA_HS-05_abund

dbcan_utils cal_coverage -g ../MAG_HS-05.fix.gff -i ../samfiles/RNA_HS-05.bam -o RNA_HS-05.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 --–-identity 0.98

cd .. && mkdir RNA_HS-60_abund && cd MAG_HS-60_abund

dbcan_utils cal_coverage -g ../MAG_HS-60.fix.gff -i ../samfiles/RNA_HS-60.bam -o RNA_HS-60.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 -–--identity 0.98

cd ..

Note

How to filter the mapped reads are the key step of the abundance calculation. Several parameters can be set to filter the reads. The command dbcan_utils provides –overlap_base_ratio, –mapping_quality and –identity to filter the mapped reads which will may produce more reliable abundance. The default values for these three parameters are 0.2, 30 and 0.98, respectively. Read counts are saved in depth.txt files of each sample.

P11. Read count calculation for a given region of contigs using Samtools (TIMING ~2min)

cd MAG_HS-05_abund

samtools depth -r k141_282901:2644-8574 ../samfiles/MAG_HS-05.bam > MAG_HS-05.cgc.depth.txt

The parameter -r k141_282901:2644-8574 specifies a region in a contig. For any CGC, its positional range can be found in the file cgc_standard.out produced by run_dbcan. The cgc.depth.txt files contain the raw read counts for the specified region.

Warning

The contig IDs are automatically generated by MEGAHIT. There is a small chance that a same contig ID appears in both samples. However, the two contigs in the two samples do not match each other even the ID is the same. For example, the contig ID k141_282901 is most likely only found in the MAG_HS-05 sample. Even if there is a k141_282901 in MAG_HS-60, the actual contigs in two samples are different.

P12. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (TIMING ~1min)

dbcan_utils fam_abund -bt MAG_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM

dbcan_utils fam_substrate_abund -bt MAG_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM

dbcan_utils CGC_abund -bt MAG_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM

dbcan_utils CGC_substrate_abund -bt MAG_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM


cd .. && cd RNA_HS-60_abund

dbcan_utils fam_abund -bt RNA_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM

dbcan_utils fam_substrate_abund -bt RNA_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM

dbcan_utils CGC_abund -bt RNA_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM

dbcan_utils CGC_substrate_abund -bt RNA_HS-05.depth.txt -i ../MAG_HS-05.dbCAN -a TPM


cd .. && cd MAG_HS-60_abund

dbcan_utils fam_abund -bt MAG_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM

dbcan_utils fam_substrate_abund -bt MAG_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM

dbcan_utils CGC_abund -bt MAG_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM

dbcan_utils CGC_substrate_abund -bt MAG_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM

cd .. && cd RNA_HS-60_abund

dbcan_utils fam_abund -bt RNA_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM

dbcan_utils fam_substrate_abund -bt RNA_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM

dbcan_utils CGC_abund -bt RNA_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM

dbcan_utils CGC_substrate_abund -bt RNA_HS-60.depth.txt -i ../MAG_HS-60.dbCAN -a TPM

We developed a set of Python scripts as dbcan_utils (included in the run_dbcan package) to take the raw read counts for all genes as input and output the normalized abundances (refer to Box 7) of CAZyme families, subfamilies, CGCs, and substrates (see Fig. 4). The parameter -a TPM can also be set to two other metrics: RPM, or RPKM61.

  • RPKM is calculated as the number of mapped reads to a gene G divided by [(total number of mapped reads to all genes / 10^6) x (gene G length / 1000)].

  • RPM is the number of mapped reads to a gene G divided by (total number of mapped reads to all genes / 10^6).

  • TPM is calculated as [number of mapped reads to a gene G / (gene G length / 1000)] divided by the sum of [number of mapped reads to each gene / (the gene length / 1000)].

Box 7. Example output of dbcan_utils

As an example, RNA_HS-05_abund <https://bcb.unl.edu/dbCAN_tutorial/dataset6-Emilson2024/RNA_HS-05_abund/>_ folder has 7 TSV files:

-rw-rw-r--  1 jinfang jinfang 178K Jan  2 04:08 CGC_abund.out
-rw-rw-r--  1 jinfang jinfang 3.3K Jan  2 04:08 CGC_substrate_majority_voting.out
-rw-rw-r--  1 jinfang jinfang  12K Jan  2 04:08 CGC_substrate_PUL_homology.out
-rw-rw-r--  1 jinfang jinfang 2.5K Jan  2 04:08 EC_abund.out
-rw-rw-r--  1 jinfang jinfang 4.1K Jan  2 04:08 fam_abund.out
-rw-rw-r--  1 jinfang jinfang  42K Jan  2 04:08 fam_substrate_abund.out
-rw-rw-r--  1 jinfang jinfang  26K Jan  2 04:08 subfam_abund.out

Explanation of columns in these TSV files is as follows:

  • fam_abund.out: CAZy family (from HMMER vs dbCAN HMMdb), sum of TPM, number of CAZymes in the family.

  • subfam_abund.out: eCAMI subfamily (from HMMER vs dbCAN-sub HMMdb), sum of TPM, number of CAZymes in the subfamily.

  • EC_abund.out: EC number (extracted from dbCAN-sub subfamily), sum of TPM, number of CAZymes with the EC.

  • fam_substrate_abund.out: Substrate (from HMMER vs dbCAN-sub HMMdb), sum of TPM (all CAZymes in this substrate group), GeneID (all CAZyme IDs in this substrate group).

  • CGC_abund.out: CGC_ID (e.g., k141_338400|CGC1), mean of TPM (all genes in the CGC), Seq_IDs (IDs of all genes in the CGC), TPM (of all genes in the CGC), Families (CAZyme family or other signature gene type of all genes in the CGC).

  • CGC_substrate_PUL_homology.out: Substrate (from dbCAN-PUL blast search), sum of TPM, CGC_IDs (all CGCs predicted to have the substrate from dbCAN-PUL blast search), TPM (of CGCs in this substrate group).

  • CGC_substrate_majority_voting.out: Substrate (from dbCAN-sub majority voting), sum of TPM, CGC_IDs (all CGCs predicted to have the substrate from dbCAN-sub majority voting), TPM (of CGCs in this substrate group).

Module 4: dbcan_plot for data visualization (Fig. 3) of abundances of CAZymes, CGCs, and substrates (TIMING variable)

CRITICAL STEP

P13. Heatmap for CAZyme substrate abundance across samples (Fig. S4B) (TIMING 1min)

dbcan_plot heatmap_plot --samples MAG_HS-05,MAG_HS-60,RNA_HS-05,RNA_HS-60 -i MAG_HS-05_abund/fam_substrate_abund.out,MAG_HS-60_abund/fam_substrate_abund.out,RNA_HS-05_abund/fam_substrate_abund.out,RNA_HS-60_abund/fam_substrate_abund.out --show_abund --top 20

Here we plot the top 20 substrates in the two samples. The input files are the two CAZyme substrate abundance files calculated based on dbCAN-sub result. The default heatmap is ranked by substrate abundances. To rank the heatmap according to abundance profile using the function clustermap of seaborn package, users can invoke the --cluster_map parameter.

P14. Barplot for CAZyme family/subfamily/EC abundance across samples (TIMING 1min)

dbcan_plot bar_plot --samples MAG_HS-05,MAG_HS-60,RNA_HS-05,RNA_HS-60 --vertical_bar --top 20 -i MAG_HS-05_abund/fam_abund.out,MAG_HS-60_abund/fam_abund.out,RNA_HS-05_abund/fam_abund.out,RNA_HS-60_abund/fam_abund.out --pdf fam.pdf

dbcan_plot bar_plot --samples MAG_HS-05,MAG_HS-60,RNA_HS-05,RNA_HS-60 --vertical_bar --top 20 -i MAG_HS-05_abund/subfam_abund.out,MAG_HS-60_abund/subfam_abund.out,RNA_HS-05_abund/subfam_abund.out,RNA_HS-60_abund/subfam_abund.out --pdf subfam.pdf

dbcan_plot bar_plot --samples MAG_HS-05,MAG_HS-60,RNA_HS-05,RNA_HS-60 --vertical_bar --top 20 -i MAG_HS-05_abund/EC_abund.out,MAG_HS-60_abund/EC_abund.out,RNA_HS-05_abund/EC_abund.out,RNA_HS-60_abund/EC_abund.out --pdf ec.pdf

Users can choose to generate a barplot instead of heatmap using the bar_plot method.

P15. Synteny plot between a CGC and its best PUL hit with read mapping coverage to CGC (TIMING 1min)

dbcan_plot CGC_synteny_coverage_plot -i MAG_HS-05.dbCAN --readscount MAG_HS-05_abund/MAG_HS-05.cgc.depth.txt --cgcid 'k141_282901|CGC1'

The MAG_HS-05.dbCAN` folder contains the PUL.out` file. Using this file, the cgc_standard.out file, and the best PUL’s gff file in dbCAN-PUL.tar.gz, the CGC_synteny_plot method will create the CGC-PUL synteny plot. The –cgcid parameter is required to specify which CGC to plot (k141_282901|CGC1 in this example). The MAG_HS-05.cgc.depth.txt file is used to plot the read mapping coverage.

If users only want to plot the CGC structure:

dbcan_plot CGC_plot -i MAG_HS-05.dbCAN --cgcid 'k141_282901|CGC1'

If users only want to plot the CGC structure plus the read mapping coverage:

dbcan_plot CGC_coverage_plot -i MAG_HS-05.dbCAN --cgcid 'k141_282901|CGC1'  --readscount MAG_HS-05_abund/MAG_HS-05.cgc.depth.txt

If users only want to plot the synteny between the CGC and PUL:

dbcan_plot CGC_synteny_plot -i MAG_HS-05.dbCAN --cgcid 'k141_282901|CGC1'

Warning

The CGC IDs in different samples do not match each other. For example, specifying -i MAG_HS-05.dbCAN is to plot the 'k141_282901|CGC1' in the fefifo_8022_1 sample. The 'k141_282901|CGC1' in the fefifo_8022_7 sample most likely does not exist, and even it does, the CGC has a different sequence even if the ID is the same.