Quick Start
This section provides a quick guide to running the run_dbcan tool suite with example data and explains the output files generated.
The updated run_dbcan offers two approaches:
1. Automated analysis that runs with a single command.
2. Step-by-step commands, which make it easier to troubleshoot individual steps or to modify intermediate results.
This section shows the single-command approach.
We provide multiple example data sets for users to test the tool suite. The example data sets are available in the example_data directory.
Installation is currently supported through Anaconda and pip.
Hint
You do not need to git clone the whole repository from GitHub, because run_dbcan is published on PyPI and can be installed with pip. You do need the conda environment files, which are provided in the envs folder of the GitHub repository (available at https://github.com/bcb-unl/run_dbcan_new/tree/master/envs).
conda env create -f environment.yml
conda activate run_dbcan_env
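Before running the examples, it can help to confirm that the run_dbcan command is visible in the activated environment. The helper below is a minimal sketch; check_tool is a hypothetical name introduced here for illustration and is not part of run_dbcan.

```shell
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper, not part of run_dbcan.
# Usage: check_tool run_dbcan
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found (is the conda environment activated?)"
        return 1
    fi
}
```

Calling `check_tool run_dbcan` after `conda activate run_dbcan_env` should report that the command was found if the installation succeeded.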
1. Running Example Data for CAZyme Annotation
To run the dbCAN tool suite on the Escherichia coli strain MG1655 example data, use the following command. The input file EscheriaColiK12MG1655.fna contains the complete genome DNA sequence in FASTA format.
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.fna -O EscheriaColiK12MG1655.fna
run_dbcan easy_CAZyme --input_raw_data EscheriaColiK12MG1655.fna --mode prok --output_dir output_EscheriaColiK12MG1655_fna --db_dir db
For protein sequence input, use the following command. Note that --input_format is needed only for protein sequence input: NCBI denotes FASTA IDs such as “>WP_000002088.1”, and JGI denotes FASTA IDs such as “>jgi|Xylhe1|242238|”.
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.faa -O EscheriaColiK12MG1655.faa
run_dbcan easy_CAZyme --input_raw_data EscheriaColiK12MG1655.faa --mode protein --output_dir output_EscheriaColiK12MG1655_faa --db_dir db --input_format NCBI
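If you are unsure which --input_format value fits a protein file, the first FASTA header can be inspected. The sketch below assumes that any header starting with ">jgi|" is JGI style and treats everything else as NCBI style, which is a simplification of the two ID formats described above; guess_input_format is a hypothetical helper, not a run_dbcan command.

```shell
# Guess the --input_format value from the first FASTA header of a file.
# Assumption: headers starting with ">jgi|" are JGI style; all others are
# treated as NCBI style. guess_input_format is a hypothetical helper.
# Usage: guess_input_format proteins.faa
guess_input_format() {
    # Take the first header line only.
    header=$(awk '/^>/ { print; exit }' "$1")
    if [ -z "$header" ]; then
        echo "no FASTA header found in $1" >&2
        return 1
    fi
    case "$header" in
        '>jgi|'*) echo "JGI" ;;
        *)        echo "NCBI" ;;
    esac
}
```

The printed value can then be passed to the command, for example `--input_format "$(guess_input_format EscheriaColiK12MG1655.faa)"`.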
We also provide eukaryotic example data sets. For example, to run the dbCAN tool suite on the Xylona heveae TC161 example data, use the following command:
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.faa -O Xylona_heveae_TC161.faa
run_dbcan easy_CAZyme --input_raw_data Xylona_heveae_TC161.faa --mode protein --output_dir output_Xylona_heveae_TC161_faa --db_dir db --input_format NCBI
And for the JGI data set Xylhe1:
wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.aa.fasta -O Xylhe1_GeneCatalog_proteins_20130827.aa.fasta
run_dbcan easy_CAZyme --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta --mode protein --output_dir output_Xylhe1_faa --db_dir db --input_format JGI
2. Understanding the Output
After running the tool, several output files are generated in the output folder, each with specific information:
- uniInput.faa
The unified input file for subsequent tools, created by Prodigal if a nucleotide sequence is used, or provided by the user as protein sequence.
- dbCAN-sub.substrate.tsv
Output from the pyHMMER using dbCAN_sub-HMM.
- diamond_results.tsv
Results from the Diamond BLAST using CAZy.faa.
- dbCAN_hmm_results.tsv
Output from the pyHMMER using dbCAN-HMM.
- overview.tsv
Summarizes CAZyme predictions across tools. We recommend using results supported by at least two tools (shown as the “Recommend Results”).
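To keep only predictions supported by at least two tools, overview.tsv can be filtered on its tool-count column. The sketch below assumes the file is tab-separated with a header row and that the count column is named "#ofTools"; that column name is an assumption, so check the header of your own file and pass a different name if needed. filter_overview is a hypothetical helper.

```shell
# Print the header plus every row whose tool-count column is >= a threshold.
# Assumptions: tab-separated file with a header row; the count column is
# named "#ofTools" unless another name is given as the third argument.
# Usage: filter_overview overview.tsv [min_tools] [column_name]
filter_overview() {
    min=${2:-2}
    col=${3:-}
    [ -n "$col" ] || col='#ofTools'
    awk -F '\t' -v min="$min" -v col="$col" '
        NR == 1 {
            # Locate the count column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Keep rows whose numeric count meets the threshold.
        $idx + 0 >= min + 0
    ' "$1"
}
```

For example, `filter_overview output_EscheriaColiK12MG1655_fna/overview.tsv 2 > overview_2tools.tsv` writes the filtered table to a new file.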
3. Running Example Data for CGC Annotation (the example FASTA files are downloaded in the previous steps and are not downloaded again here; only the GFF files are downloaded below.)
run_dbcan easy_CGC --input_raw_data EscheriaColiK12MG1655.fna --mode prok --output_dir output_EscheriaColiK12MG1655_fna_CGC --db_dir db --input_gff gff --input_gff_format prodigal
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.gff -O EscheriaColiK12MG1655.gff
run_dbcan easy_CGC --input_raw_data EscheriaColiK12MG1655.faa --mode protein --output_dir output_EscheriaColiK12MG1655_faa_CGC --db_dir db --input_format NCBI --input_gff EscheriaColiK12MG1655.gff --input_gff_format NCBI_prok
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.gff -O Xylona_heveae_TC161.gff
run_dbcan easy_CGC --input_raw_data Xylona_heveae_TC161.faa --mode protein --output_dir output_Xylona_heveae_TC161_faa_CGC --db_dir db --input_format NCBI --input_gff Xylona_heveae_TC161.gff --input_gff_format NCBI_euk
wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.gff -O Xylhe1_GeneCatalog_proteins_20130827.gff
run_dbcan easy_CGC --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta --mode protein --output_dir output_Xylhe1_faa_CGC --db_dir db --input_format JGI --input_gff Xylhe1_GeneCatalog_proteins_20130827.gff --input_gff_format JGI
4. Understanding the Output
The output includes the files from the previous step plus the following new files:
- non_CAZyme.faa
The non-CAZyme protein sequences extracted from uniInput.faa, which is based on the overview results.
- TC_results.tsv
Results from the Diamond BLAST using TCDB to annotate transporter protein.
- TF_results.tsv
Results from the pyHMMER using TF-HMM to annotate transcription factor protein.
- STP_results.tsv
Results from the pyHMMER using STP-HMM to annotate signal transduction protein.
- total_cgc_info.tsv
The total annotation of all signature proteins, combining TC, TF, and STP. Overlapping hits are filtered with the same method used for CAZyme annotation.
- cgc.gff
The input file of CGCFinder in gff format. This is generated by the tool suite based on the input_gff file and “total_cgc_info.tsv”.
- cgc_standard_out.tsv
The standard output of CGCFinder.
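After a run, a quick check that the files listed above were actually written can catch a failed step early. The helper below is a minimal sketch; missing_outputs is a hypothetical name, and the list of files to check is supplied by the caller.

```shell
# Report which of the given files are missing from an output directory.
# Returns non-zero if at least one file is missing.
# missing_outputs is a hypothetical helper, not part of run_dbcan.
# Usage: missing_outputs OUTPUT_DIR FILE...
missing_outputs() {
    dir=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $name"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_outputs output_EscheriaColiK12MG1655_faa_CGC overview.tsv total_cgc_info.tsv cgc.gff cgc_standard_out.tsv` prints nothing when all four files are present.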
5. Running Example Data for Substrate Prediction (the example FASTA and GFF files are downloaded in the previous steps and are not downloaded again here.)
run_dbcan easy_substrate --input_raw_data EscheriaColiK12MG1655.fna --mode prok --output_dir output_EscheriaColiK12MG1655_fna_sub --db_dir db --input_gff gff --input_gff_format prodigal
run_dbcan easy_substrate --input_raw_data EscheriaColiK12MG1655.faa --mode protein --output_dir output_EscheriaColiK12MG1655_faa_sub --db_dir db --input_format NCBI --input_gff EscheriaColiK12MG1655.gff --input_gff_format NCBI_prok
run_dbcan easy_substrate --input_raw_data Xylona_heveae_TC161.faa --mode protein --output_dir output_Xylona_heveae_TC161_faa_sub --db_dir db --input_format NCBI --input_gff Xylona_heveae_TC161.gff --input_gff_format NCBI_euk
run_dbcan easy_substrate --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta --mode protein --output_dir output_Xylhe1_faa_sub --db_dir db --input_format JGI --input_gff Xylhe1_GeneCatalog_proteins_20130827.gff --input_gff_format JGI
6. Understanding the Output
The output includes the files from the previous step plus the following new files:
- substrate.out
The final output of substrate prediction, which includes the substrate prediction results of each CAZyme gene cluster.
- PUL_blast.out
The DIAMOND blastp results of CGCs against dbCAN-PULs.
- synteny_pdf/
The folder of synteny plots for the predicted results. Each plot shows the gene cluster mapping between PULs and CGCs.