Run from Raw Reads(Priest 2023): Automated CAZyme and Glycan Substrate Annotation in Microbiomes: A Step-by-Step Protocol
Example 3: Priest2023 Dataset priest_2023
The Priest2023 dataset priest_2023 was published in 2023 from a study of glycan utilization in the particulate and dissolved organic matter pools (POM and DOM) in the ocean waters of the Arctic during late summer1. This study applied PacBio HiFi long read sequencing for metagenomes (MGs) of 8 samples (4 sites of two depth) and Illumina short read sequencing for metatranscriptomes (MTs) of 4 samples (4 sites). It investigated CAZyme abundance and transcription profiles as well as inferred glycan degradation activities by comparing MGs and MTs from samples of different sites and depth. The sites are categorized into above-slope, above-shelf and open-ocean groups, and the water depth includes surface water (SRF) and bottom of the surface mixed layer (BML). CAZyme abundance and expression showed different profiles for the degradation of complex algal glycans such as sulfated fucan, laminarin and xylan in different site groups and water depths.
For this protocol, we will select two MG samples (HiFi DNA long reads, FRAM_WSC20_S25_SRF_MG and FRAM_WSC20_S25_BML_MG) and two MT samples (FRAM_WSC20_S25_SRF_MT and FRAM_WSC20_S25_BML_MT, paired-end 2x151bp reads) for CAZyme annotation. We will assemble HiFi long MG reads into contigs, which will be used to predict CAZymes and glycan substrates. We will map the MT reads to the contigs to calculate expression levels. Procedure ———
Software and versions
Anaconda (Anaconda, version 23.7.3)
Flye (Flye, version 2.9.5)
pyHMMER (pyHMMER, newest version)
DIAMOND (DIAMOND, version 2.1.8)
TrimGalore (TrimGalore, version 0.6.0)
Pyrodigal (Pyrodigal, newest version)
BWA (BWA, version 0.7.17)
Samtools (Samtools, version 1.7)
Kraken2 (Kraken2, version 2.1.1)
KrakenTools (KrakenTools, version 1.2)
run_dbcan (run_dbcan, version 5.0.3)
MMseqs2 (MMseqs2, release 15-6f452)
Pysam (Pysam, version 0.22.1)
conda env create -f CAZyme_annotation_pr.yml
conda activate CAZyme_annotation_pr
To install the databases, execute the following commands:
Database Installation Command
prepare the conda environment (available at https://github.com/bcb-unl/run_dbcan_new/tree/master/envs):
conda env create -f environment.yml
conda activate run_dbcan_env
We provide command in the run_dbcan script:
run_dbcan database --db_dir db
Users could also download all database files from the dbCAN2 website (http://bcb.unl.edu/dbCAN2/download/Databases/), and then put them into the db directory.
wget -q https://bcb.unl.edu/dbCAN2/download/run_dbCAN_database_total/dbCAN_db.tar.gz -O db.tar.gz
tar -zxvf db.tar.gz
rm db.tar.gz
Module 1: Reads processing (Fig. 3) to obtain contigs
P1. Contamination Check (TIMING ~10min)
Download the Priest2023 dataset:
wget https://bcb.unl.edu/dbCAN_tutorial/dataset3-Priest2023/BML_MG.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset3-Priest2023/SRF_MG.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset3-Priest2023/BML_MT_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset3-Priest2023/BML_MT_2.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset3-Priest2023/SRF_MT_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset3-Priest2023/SRF_MT_2.fastq.gz
Use kraken2 to check for contaminated reads:
kraken2 --threads 32 --quick --paired --db K2 --report SRF_MT.kreport --output SRF_MT.kraken.output SRF_MT_1.fastq.gz SRF_MT_2.fastq.gz
kraken2 --threads 32 --quick --paired --db K2 --report BML_MT.kreport --output BML_MT.kraken.output BML_MT_1.fastq.gz BML_MT_2.fastq.gz
kraken2 --threads 32 --quick --paired --db K2 --report SRF_MG.kreport --output SRF_MG.kraken.output SRF_MG.fastq.gz
kraken2 --threads 32 --quick --paired --db K2 --report BML_MG.kreport --output BML_MG.kraken.output BML_MG.fastq.gz
Kraken2 found much contamination (Box 1) from human in the Priest2023 data. Consequently, human reads need to be removed before assembly.
Reads can be aligned to the reference genomes of potential contamination source organisms to remove the aligned reads. The most common source in microbiome studies is from human.
Box 1: Example of Kraken2 output files
The kreport files can be examined to identify potential contamination source organisms.
-rw-rw-r-- 1 jinfang jinfang 54M Dec 27 07:42 BML_MG.kraken.output -rw-rw-r-- 1 jinfang jinfang 1.2M Dec 27 07:42 BML_MG.kreport -rw-rw-r-- 1 jinfang jinfang 3.4G Dec 27 08:01 BML_MT.kraken.output -rw-rw-r-- 1 jinfang jinfang 1023K Dec 27 08:02 BML_MT.kreport -rw-rw-r-- 1 jinfang jinfang 61M Dec 27 07:39 SRF_MG.kraken.output -rw-rw-r-- 1 jinfang jinfang 1.2M Dec 27 07:39 SRF_MG.kreport -rw-rw-r-- 1 jinfang jinfang 2.6G Dec 27 07:50 SRF_MT.kraken.output -rw-rw-r-- 1 jinfang jinfang 1.1M Dec 27 07:51 SRF_MT.kreport
P2. Remove contamination reads from human (TIMING ~40min)
From the Kraken2 output files, we identified humans as the contamination source, we can use the following commands to remove the contamination reads by aligning reads to the human reference genome.
mkdir hg38 && cd hg38 && wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
cd .. && mkdir contamination && cd contamination
minimap2 -a -x map-hifi -MD -t 32 -o SRF_MG.hg38.sam ../hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ../SRF_MG.fastq.gz
minimap2 -a -x map-hifi -MD -t 32 -o BML_MG.hg38.sam ../hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ../BML_MG.fastq.gz
samtools fastq -f 4 -@ 32 -0 ../SRF_MG.clean.fq.gz SRF_MG.hg38.sam
samtools fastq -f 4 -@ 32 -0 ../BML_MG.clean.fq.gz BML_MG.hg38.sam
bwa mem ../hg38/hg38 ../SRF_MT_1.fastq.gz ../SRF_MT_2.fastq.gz -t 32 -o SRF_MT.hg38.sam
bwa mem ../hg38/hg38 ../BML_MT_1.fastq.gz ../BML_MT_2.fastq.gz -t 32 -o BML_MT.hg38.sam
samtools fastq -f 12 -@ 32 -1 ../SRF_MT_1.clean.fq.gz -2 ../SRF_MT_2.clean.fq.gz SRF_MT.hg38.sam
samtools fastq -f 12 -@ 32 -1 ../BML_MT_1.clean.fq.gz -2 ../BML_MT_2.clean.fq.gz BML_MT.hg38.sam
cd ..
Or we could use KrakenTools to extract reads.
extract_kraken_reads.py -k SRF_MT.kraken.output -taxid 9606 -exclude -s1 SRF_MT_1.fastq.gz -s2 SRF_MT_2.fastq.gz -o SRF_MT_1.clean.fastq -o2 SRF_MT_2.clean.fastq
gzip SRF_MT_1.clean.fastq
gzip SRF_MT_2.clean.fastq
extract_kraken_reads.py -k BML_MT.kraken.output -taxid 9606 -exclude -s1 BML_MT_1.fastq.gz -s2 BML_MT_2.fastq.gz -o BML_MT_1.clean.fastq -o2 BML_MT_2.clean.fastq
gzip BML_MT_1.clean.fastq
gzip BML_MT_2.clean.fastq
extract_kraken_reads.py -k BML_MG.kraken.output -taxid 9606 -exclude -s BML_MG_.fastq.gz -o BML_MG_.clean.fastq
gzip BML_MG.clean.fastq
extract_kraken_reads.py -k SRF_MG.kraken.output -taxid 9606 -exclude -s SRF_MG.fastq.gz -o SRF_MG.clean.fastq
gzip SRF_MG.clean.fastq
P3| Trim adapter and low-quality reads (TIMING ~20min)
trim_galore --illumina -j 8 --paired BML_MT_1.clean.fastq.gz BML_MT_2.clean.fastq.gz
trim_galore --illumina -j 8 --paired SRF_MT_1.clean.fastq.gz SRF_MT_2.clean.fastq.gz
The HiFi long reads do not need to be trimmed. Hence, this step only applies to MT illumina short read data. We specified –illumina to indicate that the reads were generated using the Illumina sequencing platform. Nonetheless, trim_galore possesses the ability to automatically detect the adapter, providing flexibility in adapter handling for users who may know the specific sequencing platform. Details of trimming are available in the trimming report file (Box 2).
Box 2: Example output of trim_galore
In addition to the trimmed read files, Trim_galore also generates a trimming report file. The trimming report contains details on read trimming, such as the number of trimmed reads.
-rw-rw-r-- 1 jinfang jinfang 4.2K Dec 28 21:56 BML_MT_1.clean.fq.gz_trimming_report.txt -rw-rw-r-- 1 jinfang jinfang 2.3G Dec 28 22:05 BML_MT_1.clean_val_1.fq.gz -rw-rw-r-- 1 jinfang jinfang 4.7K Dec 28 22:05 BML_MT_2.clean.fq.gz_trimming_report.txt -rw-rw-r-- 1 jinfang jinfang 3.0G Dec 28 22:05 BML_MT_2.clean_val_2.fq.gz -rw-rw-r-- 1 jinfang jinfang 4.9K Dec 28 10:07 SRF_MT_1.clean.fq.gz_trimming_report.txt -rw-rw-r-- 1 jinfang jinfang 2.7G Dec 28 10:19 SRF_MT_1.clean_val_1.fq.gz -rw-rw-r-- 1 jinfang jinfang 5.1K Dec 28 10:19 SRF_MT_2.clean.fq.gz_trimming_report.txt -rw-rw-r-- 1 jinfang jinfang 3.3G Dec 28 10:19 SRF_MT_2.clean_val_2.fq.gz
Warning
During the trimming process, certain reads may be entirely removed due to low quality in its entirety. Using the –retain_unpaired parameter in trim_galore allows for the preservation of single-end reads. In this protocol, this option was not selected, so that both reads of a forward-revise pair were removed.
P4. Assemble HiFi reads into metagenome (TIMING ~4h20min)
Flye was used to assemble the HiFi long reads into contigs.
flye --threads 32 --meta --pacbio-hifi BML_MG.clean.fq.gz --hifi-error 0.01 --keep-haplotypes --out-dir flye_BML_MG
flye --threads 32 --meta --pacbio-hifi SRF_MG.clean.fq.gz --hifi-error 0.01 --keep-haplotypes --out-dir flye_SRF_MG
Flye generates two folders flye_BML_MG and flye_SRF_MG. Each folder contains 6 files and 5 sub-folders (Box 3), among them assembly.fasta is the final contig sequence file. We set –hifi-error 0.01, a generally accepted error rate of HiFi sequencing. Parameter –meta is set to assemble reads into metagenomes.
Box 3: Example output of Flye
drwxrwxr-x 2 jinfang jinfang 4.0K Dec 27 20:15 00-assembly drwxrwxr-x 2 jinfang jinfang 4.0K Dec 27 20:43 10-consensus drwxrwxr-x 2 jinfang jinfang 4.0K Dec 27 21:14 20-repeat drwxrwxr-x 2 jinfang jinfang 4.0K Dec 27 21:16 30-contigger drwxrwxr-x 2 jinfang jinfang 4.0K Dec 27 22:06 40-polishing -rw-rw-r-- 1 jinfang jinfang 314M Dec 27 22:06 assembly.fasta -rw-rw-r-- 1 jinfang jinfang 311M Dec 27 22:06 assembly_graph.gfa -rw-rw-r-- 1 jinfang jinfang 6.6M Dec 27 22:06 assembly_graph.gv -rw-rw-r-- 1 jinfang jinfang 867K Dec 27 22:06 assembly_info.txt -rw-rw-r-- 1 jinfang jinfang 61M Dec 27 22:06 flye.log -rw-rw-r-- 1 jinfang jinfang 92 Dec 27 22:06 params.json
P5. Predict genes by Prodigal (~14min)
pyrodigal -p meta -i flye_BML_MG/assembly.fasta -d BML_MG.cds -a BML_MG.faa -f gff -o BML_MG.gff -j 36
pyrodigal -p meta -i flye_SRF_MG/assembly.fasta -d SRF_MG.cds -a SRF_MG.faa -f gff -o SRF_MG.gff -j 36
Box 3: Example output of Prokka
-rw-rw-r-- 1 jinfang jinfang 105M Dec 27 23:26 BML_MG.faa
-rw-rw-r-- 1 jinfang jinfang 314M Dec 27 22:06 BML_MG.cds
-rw-rw-r-- 1 jinfang jinfang 467M Dec 27 23:26 BML_MG.gff
Module 1: run_dbcan annotation to obtain CAZymes, CGCs, and substrates
Hint: If you apply prodigal
not pyrodigal
, use this to fix gene ID in gff files by dbcan_utils (TIMING ~1min)
dbcan_utils gff_fix -i BML_MG.faa -g BML_MG.gff
dbcan_utils gff_fix -i SRF_MG.faa -g SRF_MG.gff
Users can skip P6 and P7, and directly run P8 (much slower though), if they want to predict not only CAZymes and CGCs, but also substrates.
P6. CAZyme annotation at family level (TIMING ~10min)
run_dbcan easy_CAZyme --input_raw_data BML_MG.faa --mode protein --output_dir BML_MG.CAZyme --db_dir db --input_format NCBI
run_dbcan easy_CAZyme --input_raw_data SRF_MG.faa --mode protein --output_dir SRF_MG.CAZyme --db_dir db --input_format NCBI
Two arguments are required for run_dbcan
: the input sequence file (faa files) and the sequence type (protein).
By default, run_dbcan
will use three methods (pyHMMER
vs dbCAN HMMdb
, DIAMOND
vs CAZy
, pyHMMER
vs dbCAN-sub HMMdb
) for
CAZyme annotation.
The sequence type can be protein, prok, meta. If the input sequence file contains metagenomic contig sequences (fna file), the sequence type has to be meta, and prodigal will be called to predict genes.
P7. CGC prediction (TIMING ~15 min)
The following commands will re-run run_dbcan to not only predict CAZymes but also CGCs with protein faa and gene location gff files.
run_dbcan easy_CGC --input_raw_data BML_MG.faa --mode protein --output_dir BML_MG.PUL --input_format NCBI --input_gff_format prodigal --input_gff BML_MG.gff --db_dir db
run_dbcan easy_CGC --input_raw_data SRF_MG.faa --mode protein --output_dir SRF_MG.PUL --input_format NCBI --input_gff_format prodigal --input_gff SRF_MG.gff --db_dir db
Warning
Creating own gff file
If the users would like to create their own gff
file (instead of using Prokka or Prodigal),
it is important to make sure the value of ID attribute in the gff
file matches the protein ID in the protein faa
file.
[Troubleshooting]CGC not found
If no result is found in CGC output file, it is most likely because the sequence IDs in gff
file and faa
file do not match.
Another less likely reason is that the contigs are too short and fragmented and not suitable for CGC prediction.
P8. Substrate prediction for CAZymes and CGCs (TIMING ~5h)
The following commands will re-run run_dbcan to predict CAZymes, CGCs.
run_dbcan easy_substrate --input_raw_data BML_MG.faa --mode protein --output_dir BML_MG.PUL --input_format NCBI --input_gff_format prodigal --input_gff BML_MG.gff --db_dir db
run_dbcan easy_substrate --input_raw_data SRF_MG.faa --mode protein --output_dir SRF_MG.PUL --input_format NCBI --input_gff_format prodigal --input_gff SRF_MG.gff --db_dir db
Box 6: Example output folder content of run_dbcan substrate prediction
In the output directory (https://bcb.unl.edu/dbCAN_tutorial/dataset3-Priest2023/BML_MG.dbCAN/), a total of 17 files and 1 folder are generated:
-rw-rw-r-- 1 jinfang jinfang 9.6M Dec 28 10:18 PUL_blast.out -rw-rw-r-- 1 jinfang jinfang 1.8M Dec 28 10:18 CGC.faa -rw-rw-r-- 1 jinfang jinfang 26M Dec 28 10:18 cgc.gff -rw-rw-r-- 1 jinfang jinfang 450K Dec 28 10:18 cgc.out -rw-rw-r-- 1 jinfang jinfang 212K Dec 28 10:18 cgc_standard.out -rw-rw-r-- 1 jinfang jinfang 1005K Dec 28 10:18 cgc_standard.out.json -rw-rw-r-- 1 jinfang jinfang 406K Dec 28 10:11 dbcan-sub.hmm.out -rw-rw-r-- 1 jinfang jinfang 325K Dec 28 10:11 diamond.out -rw-rw-r-- 1 jinfang jinfang 332K Dec 28 10:11 dtemp.out -rw-rw-r-- 1 jinfang jinfang 220K Dec 28 10:11 hmmer.out -rw-rw-r-- 1 jinfang jinfang 240K Dec 28 10:18 overview.txt -rw-rw-r-- 1 jinfang jinfang 1.7M Dec 28 10:17 stp.out -rw-rw-r-- 1 jinfang jinfang 17K Dec 28 10:18 substrate.out drwxrwxr-x 2 jinfang jinfang 12K Dec 28 10:19 synteny.pdf -rw-rw-r-- 1 jinfang jinfang 293K Dec 28 10:13 tf-1.out -rw-rw-r-- 1 jinfang jinfang 222K Dec 28 10:15 tf-2.out -rw-rw-r-- 1 jinfang jinfang 1.7M Dec 28 10:17 tp.out -rw-rw-r-- 1 jinfang jinfang 105M Dec 28 05:57 uniInput
Descriptions of Output Files:
PUL_blast.out
: BLAST results between CGCs and PULs.
CGC.faa
: CGC Fasta sequences.
cgc.gff
: reformatted from the user input gff file by marking CAZymes, TFs, TCs, and STPs.
cgc.out
: raw output of CGC predictions.
Each entry in cgc.out includes:
CGC_id: CGC1
type: CAZyme
contig_id: contig_10157
gene_id: BML_MG_01992
start: 33003
end: 36077
strand: +
annotation: GH2
Explanation: the gene BML_MG_01992 encodes a GH2 CAZyme in the CGC1 of the contig contig_10157. CGC1 also has other genes, which are provided in other rows. BML_MG_01992 is on the positive strand of contig_10157 from 33003 to 36077. The type can be one of the four signature gene types (CAZymes, TCs, TFs, STPs) or the null type (not annotated as one of the four signature genes).
cgc_standard.out.json: JSON format of cgc_standard.out. dbcan-sub.hmm.out: HMMER search result against dbCAN-sub HMMdb, including a column with CAZyme substrates extracted from fam-substrate-mapping-08012023.tsv. diamond.out: DIAMOND search result against the CAZy annotated protein sequences (CAZyDB.07262023.fa). dtemp.out: temporary file. hmmer.out: HMMER search result against dbCAN HMMdb. overview.txt: summary of CAZyme annotation from three methods in TSV format. An example row has the following columns: 1. Gene_ID: BML_MG_01761 2. EC#: 2.4.99.-:5 3. dbCAN: GT112(19-370) 4. dbCAN_sub: GT112_e0 5. DIAMOND: GT112 6. #ofTools: 3 Explanation: the protein BML_MG_01761 is annotated by 3 tools to be a CAZyme: (1) GT112 (CAZy defined family GT112) by HMMER vs dbCAN HMMdb with a domain range from aa position 19 to 370, (2) GT112_e0 (eCAMI defined subfamily e0; e indicates it is from eCAMI not CAZy) by HMMER vs dbCAN-sub HMMdb (derived from eCAMI subfamilies), and (3) GT112 by DIAMOND vs CAZy annotated protein sequences. The second column 2.4.99.-:5 is extracted from eCAMI, meaning that the eCAMI subfamily GT112_e0 contains 5 member proteins which have an EC 2.4.99.- according to CAZy. In most cases, the 3 tools will have the same CAZyme family assignment. When they give different assignment. We recommend a preference order: dbCAN > eCAMI/dbCAN-sub > DIAMOND. See our dbCAN2 paper2, dbCAN3 paper3, and eCAMI4 for more details. Note: If users invoked the –use_signalP parameter when running run_dbcan, there will be an additional column called signal in the overview.txt. stp.out: HMMER search result against the MiST5 compiled signal transduction protein HMMs from Pfam. tf-1.out: HMMER search result against the DBD6 compiled transcription factor HMMs from Pfam 7. tf-2.out: HMMER search result against the DBD compiled transcription factor HMMs from Superfamily 8. tp.out: DIAMOND search result against the TCDB 9 annotated protein sequences. substrate.out: summary of substrate prediction results for CGCs in TSV format from two approaches3 (dbCAN-PUL blast search and dbCAN-sub majority voting). An example row has the following columns: 1. CGC_ID: contig_10778|CGC2 2. Best hit PUL_ID in dbCAN-PUL: PUL0400 3. Substrate of the hit PUL: alginate 4. Sum of bitscores for homologous gene pairs between CGC and PUL: 851.0 5. Types of homologous gene pairs: CAZyme-CAZyme;CAZyme-CAZyme;CAZyme-CAZyme;CAZyme-CAZyme 6. Substrate predicted by majority voting of CAZymes in CGC: alginate 7. Voting score: 2.0 Explanation: The CGC2 of contig_10778 has its best hit PUL0400 (from PUL_blast.out) with alginate as substrate (from dbCAN-PUL_12-12-2023.xlsx). Four signature genes are matched between contig_10778|CGC2 and PUL0400 (from PUL_blast.out): all the four are CAZymes. The sum of blast bitscores of the 4 homologous pairs (CAZyme-CAZyme;CAZyme-CAZyme;CAZyme-CAZyme;CAZyme-CAZyme) is 851.0. Hence, the substrate of contig_10778|CGC2 is predicted to be alginate according to dbCAN-PUL blast search. The last two columns are based on the dbCAN-sub result (dbcan-sub.hmm.out), according to which two CAZymes in contig_10778|CGC2 are predicted to have alginate substrate. The voting score is thus 2.0, so that according to the majority voting rule, contig_10778|CGC2 is predicted to have an alginate substrate. Note: for many CGCs, only one of the two approaches produces substrate prediction. In some cases, the two approaches produce different substrate assignments. We recommend a preference order: dbCAN-PUL blast search > dbCAN-sub majority voting. See our dbCAN3 paper3 for more details. synteny.pdf: a folder with syntenic block alignment plots between all CGCs and PULs. uniInput: renamed Fasta file from input protein sequence file.
Module 3. Read mapping (Fig. 3) to calculate abundance for CAZyme families, subfamilies, CGCs, and substrates
P9. Read mapping to all contigs of each sample (TIMING ~10 min)
mkdir samfiles
bwa index flye_BML_MG/assembly.fasta
bwa index flye_SRF_MG/assembly.fasta
bwa mem -t 32 -o samfiles/HCZGU.sam flye_BML_MG/assembly.fasta HCZGU_1_val_1.fq.gz HCZGU_2_val_2.fq.gz
bwa mem -t 32 -o samfiles/HTOOS.sam flye_SRF_MG/assembly.fasta HTOOS_1_val_1.fq.gz HTOOS_2_val_2.fq.gz
bwa mem -t 32 -o samfiles/BML_MT.sam flye_BML_MG/assembly.fasta BML_MT_1.clean_val_1.fq.gz BML_MT_2.clean_val_2.fq.gz
bwa mem -t 32 -o samfiles/SRF_MT.sam flye_SRF_MG/assembly.fasta SRF_MT_1.clean_val_1.fq.gz SRF_MT_2.clean_val_2.fq.gz
P10. HiFi MG long read mapping to all contigs of each sample (TIMING ~20min)
minimap2 -a -x map-hifi --MD -t 32 -o samfiles/BML_MG.sam flye_BML_MG/assembly.fasta BML_MG.clean.fq.gz
minimap2 -a -x map-hifi --MD -t 32 -o samfiles/SRF_MG.sam flye_SRF_MG/assembly.fasta SRF_MG.clean.fq.gz
The –MD parameter is required in this command, because step P12 will access the MD tag in SAM files generated in this step to calculate the coverage.
P11. Convert SAM to BAM and sort (TIMING ~5min)
cd samfiles
samtools sort -@ 32 -o SRF_MG.bam SRF_MG.sam
samtools sort -@ 32 -o BML_MG.bam BML_MG.sam
samtools sort -@ 32 -o BML_MT.bam BML_MT.sam
samtools sort -@ 32 -o SRF_MT.bam SRF_MT.sam
samtools index BML_MT.bam
samtools index SRF_MT.bam
samtools index BML_MG.bam
samtools index SRF_MG.bam
rm -rf *sam
cd ..
P12. Read count calculation for all proteins of each sample using dbcan_utilsBedtools (TIMING ~6min12min)
Calculate HiFi MG read abundance for proteins in the two samples:
mkdir SRF_MG_abund && cd SRF_MG_abund
dbcan_utils cal_coverage -g ../SRF_MG.fix.gff -i ../samfiles/SRF_MG.bam -o SRF_MG.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 –-identity 0.98 --hifi
cd ..
mkdir BML_MG_abund && cd BML_MG_abund
dbcan_utils cal_coverage -g ../BML_MG.fix.gff -i ../samfiles/BML_MG.bam -o BML_MG.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 –-identity 0.98 --hifi
cd ..
Calculate Illumina MT read abundance for proteins in the two samples:
mkdir BML_MT_abund && cd BML_MT_abund
dbcan_utils cal_coverage -g ../BML_MG.fix.gff -i ../samfiles/BML_MT.bam -o BML_MT.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 –-identity 0.98
cd ..
mkdir SRF_MT_abund && cd SRF_MT_abund
dbcan_utils cal_coverage -g ../SRF_MG.fix.gff -i ../samfiles/SRF_MT.bam -o SRF_MT.depth.txt -t 6 --overlap_base_ratio 0.2 --mapping_quality 30 –-identity 0.98
cd ..
Read counts are saved in depth.txt files of each sample.
P13. Read count calculation for a given region of contigs using Samtools (TIMING ~1min)
cd SRF_MG_abund
samtools depth -r contig_12725:10309-25115 ../samfiles/SRF_MT.bam > SRF_MT.cgc.depth.txt
cd .. && cd SRF_MT_abund
samtools depth -r contig_12725:10309-25115 ../samfiles/SRF_MG.bam > SRF_MG.cgc.depth.txt
cd ..
The parameter -r contig_12725:10309-25115contig_6487:2406-6727` specifies a region in a contig. For any CGC, its positional range can be found in the file cgc_standard output produced by run_dbcan. The depth.txt files contain the raw read counts for the specified region.
P14. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (TIMING ~1min)
As the Priest2023 dataset has both MG and MT data, for each sample we will have two abundance folders, e.g., BML_MG_abund, BML_MT_abund for BML sample, and SRF_MG_abund, SRF_MT_abund for SRF sample.
cd BML_MT_abund
dbcan_utils fam_abund -bt BML_MT.depth.txt -i ../BML_MG.dbCAN -a TPM
dbcan_utils fam_substrate_abund -bt BML_MT.depth.txt -i ../BML_MG.dbCAN -a TPM
dbcan_utils CGC_abund -bt BML_MT.depth.txt -i ../BML_MG.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt BML_MT.depth.txt -i ../BML_MG.dbCAN -a TPM
cd .. && cd SRF_MT_abund
dbcan_utils fam_abund -bt SRF_MT.depth.txt -i ../SRF_MG.dbCAN -a TPM
dbcan_utils fam_substrate_abund -bt SRF_MT.depth.txt -i ../SRF_MG.dbCAN -a TPM
dbcan_utils CGC_abund -bt SRF_MT.depth.txt -i ../SRF_MG.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt SRF_MT.depth.txt -i ../SRF_MG.dbCAN -a TPM
cd .. && cd BML_MG_abund
dbcan_utils fam_abund -bt BML_MG.depth.txt -i ../BML_MG.dbCAN -a TPM
dbcan_utils fam_substrate_abund -bt BML_MG.depth.txt -i ../BML_MG.dbCAN -a TPM
dbcan_utils CGC_abund -bt BML_MG.depth.txt -i ../BML_MG.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt BML_MG.depth.txt -i ../BML_MG.dbCAN -a TPM
cd .. && cd SRF_MG_abund
dbcan_utils fam_abund -bt SRF_MG.depth.txt -i ../SRF_MG.dbCAN -a TPM
dbcan_utils fam_substrate_abund -bt SRF_MG.depth.txt -i ../SRF_MG.dbCAN -a TPM
dbcan_utils CGC_abund -bt SRF_MG.depth.txt -i ../SRF_MG.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt SRF_MG.depth.txt -i ../SRF_MG.dbCAN -a TPM
cd ..
We developed a set of Python scripts as dbcan_utils
(included in the run_dbcan
package) to take the raw read counts for all genes as input and output the normalized abundances (refer to Box 7) of CAZyme families, subfamilies, CGCs, and substrates (see Fig. 4). The parameter -a TPM
can also be set to two other metrics: RPM, or RPKM61.
RPKM is calculated as the number of mapped reads to a gene G divided by [(total number of mapped reads to all genes / 10^6) x (gene G length / 1000)].
RPM is the number of mapped reads to a gene G divided by (total number of mapped reads to all genes / 10^6).
TPM is calculated as [number of mapped reads to a gene G / (gene G length / 1000)] divided by the sum of [number of mapped reads to each gene / (the gene length / 1000)].
Box 7. Example output of dbcan_utils
As an example, SRF_MG_abund <https://bcb.unl.edu/dbCAN_tutorial/dataset3-Priest2023/SRF_MG_abund/>_ folder has 7 TSV files:
-rw-rw-r-- 1 jinfang jinfang 122K Jun 19 07:30 CGC_abund.out
-rw-rw-r-- 1 jinfang jinfang 835 Jun 19 07:30 CGC_substrate_majority_voting.out
-rw-rw-r-- 1 jinfang jinfang 4.5K Jun 19 07:30 CGC_substrate_PUL_homology.out
-rw-rw-r-- 1 jinfang jinfang 2.6K Jun 19 07:30 EC_abund.out
-rw-rw-r-- 1 jinfang jinfang 3.3K Jun 19 07:30 fam_abund.out
-rw-rw-r-- 1 jinfang jinfang 18K Jun 19 07:30 fam_substrate_abund.out
-rw-rw-r-- 1 jinfang jinfang 6.9M Jun 19 04:07 SRF_MG.depth.txt
-rw-rw-r-- 1 jinfang jinfang 19K Jun 19 07:30 subfam_abund.out
Explanation of columns in these TSV files is as follows:
fam_abund.out
: CAZy family (from HMMER vs dbCAN HMMdb), sum of TPM, number of CAZymes in the family.
subfam_abund.out
: eCAMI subfamily (from HMMER vs dbCAN-sub HMMdb), sum of TPM, number of CAZymes in the subfamily.
EC_abund.out
: EC number (extracted from dbCAN-sub subfamily), sum of TPM, number of CAZymes with the EC.
fam_substrate_abund.out
: Substrate (from HMMER vs dbCAN-sub HMMdb), sum of TPM (all CAZymes in this substrate group), GeneID (all CAZyme IDs in this substrate group).
CGC_abund.out
: CGC_ID (e.g., k141_338400|CGC1), mean of TPM (all genes in the CGC), Seq_IDs (IDs of all genes in the CGC), TPM (of all genes in the CGC), Families (CAZyme family or other signature gene type of all genes in the CGC).
CGC_substrate_PUL_homology.out
: Substrate (from dbCAN-PUL blast search), sum of TPM, CGC_IDs (all CGCs predicted to have the substrate from dbCAN-PUL blast search), TPM (of CGCs in this substrate group).
CGC_substrate_majority_voting.out
: Substrate (from dbCAN-sub majority voting), sum of TPM, CGC_IDs (all CGCs predicted to have the substrate from dbCAN-sub majority voting), TPM (of CGCs in this substrate group).
Module 4: dbcan_plot for data visualization (Fig. 3) of abundances of CAZymes, CGCs, and substrates (TIMING variable)
P15. Heatmap for CAZyme substrate abundance across samples (TIMING 1min)
dbcan_plot heatmap_plot --show_abund --top 20 --samples BML_MG,BML_MT,SRF_MG,SRF_MT -i BML_MG_abund/fam_substrate_abund.out,BML_MT_abund/fam_substrate_abund.out,SRF_MG_abund/fam_substrate_abund.out,SRF_MT_abund/fam_substrate_abund.out
P16. Barplot for CAZyme family/subfamily/EC abundance across samples (Fig. S5A-C) (TIMING 1min)
dbcan_plot bar_plot --samples BML_MG,BML_MT,SRF_MG,SRF_MT --vertical_bar --top 20 -i BML_MG_abund/fam_abund.out,BML_MT_abund/fam_abund.out,SRF_MG_abund/fam_abund.out,SRF_MT_abund/fam_abund.out --pdf fam.pdf
dbcan_plot bar_plot --samples BML_MG,BML_MT,SRF_MG,SRF_MT --vertical_bar --top 20 -i BML_MG_abund/subfam_abund.out,BML_MT_abund/subfam_abund.out,SRF_MG_abund/subfam_abund.out,SRF_MT_abund/subfam_abund.out --pdf subfam.pdf
dbcan_plot bar_plot --samples BML_MG,BML_MT,SRF_MG,SRF_MT --vertical_bar --top 20 -i BML_MG_abund/EC_abund.out,BML_MT_abund/EC_abund.out,SRF_MG_abund/EC_abund.out,SRF_MT_abund/EC_abund.out --pdf ec.pdf
P17. Synteny plot between a CGC and its best PUL hit with read mapping coverage to CGC(use –db_dir if needed) (Fig. S5E) (TIMING 1min)
dbcan_plot CGC_synteny_coverage_plot -i SRF_MG.dbCAN --cgcid 'contig_12725|CGC1' --readscount SRF_MG_abund/SRF_MG.cgc.depth.txt
dbcan_plot CGC_synteny_coverage_plot -i SRF_MG.dbCAN --cgcid 'contig_12725|CGC1' --readscount SRF_MT_abund/SRF_MT.cgc.depth.txt
If users only want to plot the CGC structure:
dbcan_plot CGC_plot -i SRF_MG.dbCAN --cgcid 'contig_12725|CGC1'
If users only want to plot the CGC structure plus the read mapping coverage:
dbcan_plot CGC_coverage_plot -i SRF_MG.dbCAN --cgcid 'contig_12725|CGC1' --readscount SRF_MG_abund/SRF_MG.cgc.depth.txt
If users only want to plot the synteny between the CGC and PUL:
dbcan_plot CGC_synteny_plot -i SRF_MG.dbCAN --cgcid 'contig_12725|CGC1'