Run from Raw Reads(Cater 2023): Supplementary Protocol for assembly-free

Introduction

Overview

There are three sections in this document:

Procedure of assembly-free

Example : Carter 2023 Dataset Carter 2023

S1. Download Carter2023 raw reads (~10min)

To download the required raw reads, use the following wget commands:

wget https://bcb.unl.edu/dbCAN_tutorial/dataset1-Carter2023/individual_assembly/Dry2014_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset1-Carter2023/individual_assembly/Dry2014_2.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset1-Carter2023/individual_assembly/Wet2014_1.fastq.gz
wget https://bcb.unl.edu/dbCAN_tutorial/dataset1-Carter2023/individual_assembly/Wet2014_2.fastq.gz

These raw data were originally downloaded from https://www.ncbi.nlm.nih.gov/sra/?term=ERR7745896 and https://www.ncbi.nlm.nih.gov/sra/?term=ERR7738162 and renamed to indicate their collected seasons.

S2. Install Anaconda (~3min)

Download and install the latest version of Anaconda for Linux from https://www.anaconda.com/download#downloads. Once Anaconda is successfully installed, proceed to create a dedicated conda environment named CAZyme_annotation and activate it. Subsequently, all the required tools can be seamlessly installed within this environment (yml file is provided on github).

conda env create -f CAZyme_annotation_main.yml
conda activate CAZyme_annotation_main

S4. Configure databases required by run_dbcan (~2h)

To install the databases, execute the following commands:

Database Installation Command

prepare the conda environment (available at https://github.com/bcb-unl/run_dbcan_new/tree/master/envs):

conda env create -f environment.yml
conda activate run_dbcan_env
  1. We provide command in the run_dbcan script:

run_dbcan database --db_dir db
  1. Users could also download all database files from the dbCAN2 website (http://bcb.unl.edu/dbCAN2/download/Databases/), and then put them into the db directory.

wget -q https://bcb.unl.edu/dbCAN2/download/run_dbCAN_database_total/dbCAN_db.tar.gz -O db.tar.gz
tar -zxvf db.tar.gz
rm db.tar.gz

Download database required by Kraken2 (very slow; can be skipped if users do not intend to run Kraken2):

kraken2-build --standard --db K2

or could download database directly from the https://benlangmead.github.io/aws-indexes/k2, and extract the tarball to the db folder, here we use standard database as an example:

wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240605.tar.gz
mkdir K2
tar -xvf k2_standard_20240605.tar.gz -C K2

CRITICAL STEP

The downloaded files must be all in the right location (the db folder).

The CAZy.dmnd file is needed for DIAMOND search.

The dbCAN.hmm and dbCAN_sub.hmm files are for HMMER search.

The tcdb.dmnd, TF.HMM, and STP.hmm files are for CGC prediction.

The PUL.faa file consists of protein sequences from experimentally validated PULs for BLAST search to predict substrates for CGCs.

The dbCAN-PUL_12-12-2023.txt and dbCAN-PUL_12-12-2023.xlsx files contain PUL-substrate mapping curated from literature.

Lastly, the fam-substrate-mapping.tsv file is the family-EC-substrate mapping table for the prediction of CAZyme substrates.

Warning

Users should use a clean version of Anaconda. If the above steps failed, we suggest users reinstall their Anaconda. The Anaconda installation and configuration step may experience prolonged time while resolving environment dependencies. Users should be patient during this process. Alternatively, users may consider “mamba”, another Python package manager that offers similar functionality to Anaconda. Information and access to mamba software can be found at https://github.com/mamba-org/mamba.

Procedure

Module 1: Reads processing (Fig. 2) to obtain contigs

P1. Contamination Check

Use kraken2 to check for contaminated reads:

kraken2 --threads 32 --quick --paired --db K2 --report Wet2014.kreport --output Wet2014.kraken.output Wet2014_1.fastq.gz Wet2014_2.fastq.gz
kraken2 --threads 32 --quick --paired --db K2 --report Dry2014.kreport --output Dry2014.kraken.output Dry2014_1.fastq.gz Dry2014_2.fastq.gz

Kraken2 found very little contamination in the Carter2023 data. Consequently, there was no need for the contamination removal step. For Kraken2 visualization, users can use pavian (https://github.com/fbreitwieser/pavian?tab=readme-ov-file).

If contamination is identified, users can align the reads to the reference genomes of potential contamination source organisms to remove the aligned reads (Box 1). The most common source in human microbiome studies is from human hosts.

Box 1: Removing Contamination Reads from Humans

Kraken2 will produce the following output files.

-rw-rw-r-- 1 jinfang jinfang 2.0G Dec 12 10:24 Dry2014.kraken.output
-rw-rw-r-- 1 jinfang jinfang 1.2M Dec 12 10:25 Dry2014.kreport
-rw-rw-r-- 1 jinfang jinfang 5.1G Dec 12 09:47 Wet2014.kraken.output
-rw-rw-r-- 1 jinfang jinfang 1.1M Dec 12 09:48 Wet2014.kreport

Suppose from these files, we have identified humans as the contamination source, we can use the following commands to remove the contamination reads by aligning reads to the human reference genome.

wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bwa index -p hg38 Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bwa mem hg38 Wet2014_1.fastq.gz Wet2014_2.fastq.gz -t 32 -o Wet2014.hg38.sam
bwa mem hg38 Dry2014_1.fastq.gz Dry2014_2.fastq.gz -t 32 -o Dry2014.hg38.sam
samtools view -f 12 Wet2014.hg38.sam > Wet2014.hg38.unmap.bam
samtools view -f 12 Dry2014.hg38.sam > Dry2014.hg38.unmap.bam
samtools fastq -1 Wet2014_1.clean.fq.gz -2 Wet2014_2.clean.fq.gz Wet2014.hg38.unmap.bam
samtools fastq -1 Dry2014_1.clean.fq.gz -2 Dry2014_2.clean.fq.gz Dry2014.hg38.unmap.bam

KrakenTools could also extract host reads quickly and easied which is recommended. We use tax 2759 (plant) as an example. Please read KrakenTools README for more information (https://github.com/jenniferlu717/KrakenTools?tab=readme-ov-file).

extract_kraken_reads.py \
-k Dry2014.kraken.output \
-s1 Dry2014_1.fastq.gz -s2 Dry2014_2.fastq.gz \
--fastq-output --exclude \
--taxid 9606 \
-o Dry2014_1.clean.fq.gz -o2 Dry2014_2.clean.fq.gz

P2. Trim adapter and low-quality reads (TIMING ~20min)

trim_galore --paired Wet2014_1.fastq.gz Wet2014_2.fastq.gz --illumina -j 36
trim_galore --paired Dry2014_1.fastq.gz Dry2014_2.fastq.gz --illumina -j 36

We specified –illumina to indicate that the reads were generated using the Illumina sequencing platform. Nonetheless, trim_galore can automatically detect adapters, providing flexibility for users who may know the specific sequencing platform. Details of trimming are available in the trimming report file (Box 2).

Box 2: Example output of trim_galore

In addition to the trimmed read files, Trim_galore` also generates a trimming report file. The trimming report contains details on read trimming, such as the number of trimmed reads.

-rw-rw-r-- 1 jinfang jinfang 4.2K Dec 13 01:48 Dry2014_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 jinfang jinfang 2.0G Dec 13 01:55 Dry2014_1_val_1.fq.gz
-rw-rw-r-- 1 jinfang jinfang 4.4K Dec 13 01:55 Dry2014_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 jinfang jinfang 2.4G Dec 13 01:55 Dry2014_2_val_2.fq.gz
-rw-rw-r-- 1 jinfang jinfang 4.4K Dec 13 01:30 Wet2014_1.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 jinfang jinfang 3.4G Dec 13 01:46 Wet2014_1_val_1.fq.gz
-rw-rw-r-- 1 jinfang jinfang 4.6K Dec 13 01:46 Wet2014_2.fastq.gz_trimming_report.txt
-rw-rw-r-- 1 jinfang jinfang 3.7G Dec 13 01:46 Wet2014_2_val_2.fq.gz

Warning

During the trimming process, certain reads may be entirely removed due to low quality in its entirety. Using the --retain_unpaired parameter in trim_galore allows for the preservation of single-end reads. In this protocol, this option was not selected, so that both reads of a forward-revise pair were removed.

P3. Covert the fastq file to fasta (~9min)

Need to install seqtk` first

seqtk seq -a Dry2014_1_val_1.fq.gz > Dry2014_1.fna

seqtk seq -a Dry2014_2_val_2.fq.gz > Dry2014_2.fna

seqtk seq -a Wet2014_1_val_1.fq.gz > Wet2014_1.fna

seqtk seq -a Wet2014_2_val_2.fq.gz > Wet2014_2.fna

Module 2. Read mapping (Fig. 2) to calculate abundance for CAZyme families, subfamilies, CGCs, and substrates

P4. Map reads to proteins in CAZyDB by using DIAMOND blastx (~5h10min)

ln -s db/CAZy.dmnd .

diamond blastx --db CAZy --out Dry2014_1.blastx --threads 32 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp qlen slen --id 80 --query-cover 90 --query Dry2014_1.fna --max-target-seqs 1 --quiet

diamond blastx --db CAZy --out Dry2014_2.blastx --threads 32 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp qlen slen --id 80 --query-cover 90 --query Dry2014_2.fna --max-target-seqs 1 --quiet

diamond blastx --db CAZy --out Wet2014_1.blastx --threads 32 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp qlen slen --id 80 --query-cover 90 --query Wet2014_1.fna --max-target-seqs 1 --quiet

diamond blastx --db CAZy --out Wet2014_2.blastx --threads 32 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp qlen slen --id 80 --query-cover 90 --query Wet2014_2.fna --max-target-seqs 1 --quiet

The parameters –query-cover 90` and –id 80` are required to align the reads to CAZy database. The thresholds of these parameters will significantly impact the results. The –max-target-seqs 1` parameter specify to only keep the best CAZy hit, which means each read will be mapped to only one CAZyme protein in the CAZyDB.

P5. dbcan_asmfree to calculate the abundance of CAZyme families, subfamily, EC and substrate (~11min)

We developed a Python script as dbcan_asmfree (included in the run_dbcan package) to take the DIAMOND blastx results from pair-end reads and the raw reads as inputs. With the diamond_fam_abund method, it will output the CAZyme family abundance, with the diamond_subfam_abund method, it will output the eCAMI subfamily abundance, with the diamond_EC_abund method, it will output the EC abundance and with the diamond_substrate_abund method, it will output the substrate abundance.

dbcan_asmfree diamond_fam_abund -paf1 Dry2014_1.blastx -paf2 Dry2014_2.blastx --raw_reads Dry2014_1_val_1.fq.gz  -n FPKM -o Dry2014_fam_abund

dbcan_asmfree diamond_fam_abund -paf1 Wet2014_1.blastx -paf2 Wet2014_2.blastx --raw_reads Wet2014_1_val_1.fq.gz -n FPKM -o Wet2014_fam_abund



dbcan_asmfree diamond_subfam_abund -paf1 Dry2014_1.blastx -paf2 Dry2014_2.blastx --raw_reads Dry2014_1_val_1.fq.gz -o Dry2014_subfam_abund -n FPKM

dbcan_asmfree diamond_subfam_abund -paf1 Wet2014_1.blastx -paf2 Wet2014_2.blastx --raw_reads Wet2014_1_val_1.fq.gz -o Wet2014_subfam_abund -n FPKM



dbcan_asmfree diamond_EC_abund -i Dry2014_subfam_abund -o Dry2014_EC_abund

dbcan_asmfree diamond_EC_abund -i Wet2014_subfam_abund -o Wet2014_EC_abund



dbcan_asmfree diamond_substrate_abund -i Dry2014_subfam_abund -o Dry2014_substrate_abund

dbcan_asmfree diamond_substrate_abund -i Wet2014_subfam_abund -o Wet2014_substrate_abund

The substrate prediction in the assemble-free route is based on the fam-substrate-mapping-08012023.tsv mapping table. Specifically, from the blastx result files, reads are first grouped according to their best CAZyDB proteins. Then, reads are further grouped according to the CAZyme families and eCAMI subfamilies where the best CAZyDB proteins belong to. Lastly, the fam-substrate-mapping-08012023.tsv mapping table is used to map CAZyme families and ECs (from eCAMI subfamilies) to substrates. Therefore, the substrate abundance is calculated in a similar way as implemented in dbCAN-sub.

Module 4: dbcan_plot for data visualization (Fig. 2) of abundances of CAZymes, CGCs, and substrates (TIMING variable)

P6. Barplot for CAZyme family, subfamily, and EC abundance across samples (TIMING 1min)

dbcan_plot bar_plot --samples Wet2014,Dry2014 --vertical_bar --top 20 -i Wet2014_fam_abund,Dry2014_fam_abund --pdf fam.pdfDry2014.blastx.CAZy.FPKM.tsv,Wet2014.blastx.CAZy.FPKM.tsv

dbcan_plot bar_plot --samples Wet2014,Dry2014 --vertical_bar --top 20 -i Wet2014_subfam_abund,Dry2014_subfam_abund --pdf subfam.pdf

dbcan_plot bar_plot --samples Wet2014,Dry2014 --vertical_bar --top 20 -i Wet2014_fam_abund,Dry2014_fam_abund --pdf ec.pdf

Here we plot the top 20 substrates in the two samples. The input files are the two CAZyme substrate abundance files calculated based on dbCAN-sub result. The default heatmap is ranked by substrate abundances. To rank the heatmap according to abundance profile using the clustermap function of the seaborn package (https://github.com/mwaskom/seaborn), users can invoke the --cluster_map parameter.

P7. Heatmap for CAZyme substrate abundance across samples (Fig. S2D) (TIMING 1min)

dbcan_plot heatmap_plot --samples Wet2014,Dry2014 --show_abund --top 20 -i  Wet2014_substrate_abund,Dry2014_substrate_abund