1. Usage Information

1.1. CpG_anno_probe.py

This program adds comprehensive annotation information to each 450K/850K array probe ID. It appends 17 columns to the original input data file. From left to right, these columns are:

Header Name

Description

hg19_pos

The genomic position of the CpG on human genome assembly hg19 (or GRCh37)

hg38_pos

The genomic position of the CpG on human genome assembly hg38 (or GRCh38).

strand

Strand of the CpG. Values: “R” (reverse strand) or “F” (forward strand).

geneSymbol

Genes the CpG has been assigned to. “N/A” indicates no genes were found. This is retrieved from the Illumina MethylationEpic v1.0 B4 manifest file.

CpGisland

The CpG island (CGI) that overlaps with this CpG. “N/A” indicates no CGIs were found.

with_450K

Boolean indicating whether this CpG probe is also included in 450K. “0” - No, “1”- Yes.

SNP_ID

SNPs (rsID) that are close to this CpG. Multiple SNPs are separated by “;”. “N/A” indicates no SNPs were found.

SNP_distance

The nucleotide distances between SNPs and the CpG.

SNP_MAF

The minor allele frequencies (MAF) of SNPs.

Cross_Reactive

Boolean (“0” - No, “1”- Yes) indicating whether this CpG could be affected by cross-hybridisation or underlying genetic variation as reported by this paper.

ENCODE_TF_ChIP

Transcription factor (TF) binding sites identified from ChIP-seq experiments performed by the ENCODE project. Peaks from 1264 experiments representing 338 transcription factors in 130 cell types are combined (N = 10,560,472). The BED format file was downloaded from the UCSC Table Browser, and a detailed description is provided here.

ENCODE_DNaseI

DNase I hypersensitivity sites identified from ENCODE DNase-seq experiments. Peaks from 125 cell types are combined (N = 1,867,665). The BED format file was downloaded from the UCSC Table Browser, and a detailed description is provided here.

ENCODE_H3K27ac_ChIP

H3K27ac peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 665,650).

ENCODE_H3K4me1_ChIP

H3K4me1 peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 1,435,550).

ENCODE_H3K4me3_ChIP

H3K4me3 peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 525,824).

ENCODE_chromHMM

Chromatin state segmentation by chromHMM from ENCODE. Chromatin states across 9 cell types (GM12878, H1-hESC, K562, HepG2, HUVEC, HMEC, HSMM, NHEK, NHLF) were learned computationally by integrating 9 factors (CTCF, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H4K20me1) plus input. A total of 15 states were identified: state-1 (Active Promoter), state-2 (Weak Promoter), state-3 (Inactive/poised Promoter), state-4 and 5 (Strong enhancer), state-6 and 7 (Weak/poised enhancer), state-8 (Insulator), state-9 (Transcriptional transition), state-10 (Transcriptional elongation), state-11 (Weak transcribed), state-12 (Polycomb-repressed), state-13 (Heterochromatin or low signal), state-14 and 15 (Repetitive/Copy Number Variation). The original chromatin state BED file was downloaded from the UCSC Table Browser, and a detailed description is provided here.

FANTOM_enhancer

FANTOM5 human enhancers downloaded from here.

Notes

  • For peaks identified from ENCODE ChIP-seq and DNase-seq (ENCODE_TF_ChIP, ENCODE_H3K27ac_ChIP, ENCODE_H3K4me1_ChIP, ENCODE_H3K4me3_ChIP and ENCODE_DNaseI), the probe must be located in the 100 bp window centered on the middle of the peak.
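
The 100 bp window rule can be sketched as follows. This is a minimal illustration rather than the program's own code: the function name `probe_in_peak_window` is hypothetical, and 0-based, half-open (BED-style) coordinates are assumed.

```python
def probe_in_peak_window(probe_pos, peak_start, peak_end, window=100):
    """Return True if a probe lies in the window centered on the peak midpoint.

    Coordinates are assumed to be 0-based and half-open (BED style).
    """
    midpoint = (peak_start + peak_end) // 2
    half = window // 2
    # The window covers [midpoint - half, midpoint + half)
    return midpoint - half <= probe_pos < midpoint + half


# A 400 bp peak: only probes within 50 bp of its center are annotated.
print(probe_in_peak_window(1190, 1000, 1400))  # near the center
print(probe_in_peak_window(1010, 1000, 1400))  # inside the peak, outside the window
```

Under this rule a probe can sit inside a broad peak and still not be annotated with it.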

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input_file=INPUT_FILE

Input data file (tab-separated) in which one column contains 450K/850K array CpG IDs. This file can be a regular text file or a compressed file (.gz, .bz2).

-a ANNO_FILE, --annotation=ANNO_FILE

Annotation file. This file can be a regular text file or a compressed file (.gz, .bz2).

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

-p PROBE_COL, --probe_column=PROBE_COL

The number specifying which column contains probe IDs. Note: the column index starts with 0. default=0.

-l, --header

Input data file has a header row.

Input files

Command

# probe IDs are located in the 4th column (-p 3)

$CpG_anno_probe.py -p 3 -l -a MethylationEPIC_CpGtools.tsv -i test_01.hg19.bed6 -o output

or (take gzipped files as input)

$CpG_anno_probe.py -p 3 -l -a MethylationEPIC_CpGtools.tsv.gz -i test_01.hg19.bed6.gz -o output

@ 2019-06-28 09:12:41: Read annotation file "../epic/MethylationEPIC_CpGtools.tsv" ...
@ 2019-06-28 09:12:52: Add annotation information to "test_01.hg19.bed6" ...

Output files

  • output.anno.txt

1.2. CpG_aggregation.py

Aggregate proportion values of a list of CpGs that are located in given genomic regions (e.g., CpG islands, promoters, exons).

Example of input file

Chrom  Start   End     score
chr1   100017748       100017749       3,10
chr1   100017769       100017770       0,10
chr1   100017853       100017854       16,21

Notes

  • An outlier CpG will be removed if the probability of observing its proportion value is less than the p-cutoff. For example, if alpha is set to 0.05 and there are 10 CpGs (n = 10) located in a particular genomic region, the p-cutoff of this genomic region is 0.005 (0.05/10). Suppose the total number of reads mapped to this region is 100, of which 25 are methylated reads (i.e., the regional methylation level (beta) = 25/100 = 0.25).

    The probability of observing CpG (3,10) is:

    pbinom(q=3, size=10, prob=0.25) = 0.7759

    The probability of observing CpG (0,10) is:

    pbinom(q=0, size=10, prob=0.25) = 0.05631

    The probability of observing CpG (16,21) is:

    pbinom(q=16, size=21, prob=0.25, lower.tail=FALSE) = 1.19e-07 (outlier)
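
The three probabilities in this example can be reproduced with a small stand-alone sketch. The `pbinom` helper below is a hypothetical pure-Python stand-in for the R function of the same name, and the tail used for each CpG is copied from the example instead of being derived, because the text does not state how the program chooses the tail.

```python
from math import comb


def pbinom(q, size, prob, lower_tail=True):
    """Binomial tail probability, mirroring R's pbinom: P(X <= q) or P(X > q)."""
    ks = range(q + 1) if lower_tail else range(q + 1, size + 1)
    return sum(comb(size, k) * prob**k * (1 - prob) ** (size - k) for k in ks)


alpha, n_cpgs = 0.05, 10
p_cutoff = alpha / n_cpgs   # Bonferroni-style cutoff for this region
beta = 25 / 100             # regional methylation level

# (methylated reads, total reads, use lower tail?) as in the example above
cpgs = [(3, 10, True), (0, 10, True), (16, 21, False)]
for methyl, total, lower in cpgs:
    p = pbinom(methyl, total, beta, lower_tail=lower)
    flag = "outlier" if p < p_cutoff else "kept"
    print(f"({methyl},{total}): p = {p:.4g} -> {flag}")
```

Only the third CpG falls below the cutoff of 0.005, which matches the "(outlier)" label in the example.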

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input=INPUT_FILE

Input CpG file in BED format. The first 3 columns contain “Chrom”, “Start”, and “End”. The 4th column contains proportion values.

-a ALPHA_CUT, --alpha=ALPHA_CUT

The chance of mistakenly assigning a particular CpG as an outlier for each genomic region. default=0.05

-b BED_FILE, --bed=BED_FILE

BED3+ file specifying the genomic regions.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$CpG_aggregation.py -b hg19.RefSeq.union.1Kpromoter.bed.gz  -i 0_du145_133_glp_sh1.bed -o out

Output

chr1    567292  568293  3       0       93      3       0       93
chr1    713567  714568  6       0       100     6       0       100
chr1    762401  763402  7       0       110     7       0       110
chr1    762470  763471  10      0       158     10      0       158
chr1    854571  855572  2       12      16      2       12      16
chr1    860620  861621  16      91      232     16      91      232
chr1    894178  895179  12      151     229     41      506     735
Column1-3:

Genome coordinates

Column4-6:

numbers of “CpG”, “aggregated methyl reads”, and “aggregate total reads” after outlier filtering

Column7-9:

numbers of “CpG”, “aggregated methyl reads”, and “aggregate total reads” before outlier filtering

1.3. CpG_distrb_chrom.py

This program calculates the distribution of CpGs over chromosomes.

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILES, --input-files=INPUT_FILES

Input CpG file(s) in BED3+ format. Multiple BED files should be separated by “,” (e.g., “-i file_1.bed,file_2.bed,file_3.bed”). Each BED file can be a regular text file or a compressed file (.gz, .bz2). The barplot figures will NOT be generated if you provide more than 12 samples (BED files). [required]

-n FILE_NAMES, --names=FILE_NAMES

Shorter and meaningful names to label samples. Should be separated by “,” and match CpG BED files in number. If not provided, basenames of CpG BED files will be used to label samples. [optional]

-s CHROM_SIZE, --chrom-size=CHROM_SIZE

Chromosome size file. Tab- or space-separated text file with 2 columns: the first column is the chromosome name/ID, the second column is the chromosome size. This file determines: (1) which chromosomes are included in the final barplots, so do NOT include ‘unplaced’ or ‘alternative’ contigs in this file; (2) the order of chromosomes in the final barplots. [required]

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file. [required]

Input files

Command

$ CpG_distrb_chrom.py -i 450K_probe.hg19.bed3.gz,850K_probe.hg19.bed3.gz -n 450K,850K \
  -s hg19.chrom.sizes -o chromDist

Output files

  • chromDist.txt

  • chromDist.r

  • chromDist.CpG_total.pdf

  • chromDist.CpG_percent.pdf

  • chromDist.CpG_perMb.pdf

Total CpG count per chromosome

_images/chromDist.CpG_total.png

CpG percent on each chromosome (normalized to total CpGs)

_images/chromDist.CpG_percent.png

CpG per Mb (normalized to chromosome size)

_images/chromDist.CpG_perMb.png

1.4. CpG_distrb_gene_centered.py

This program calculates the distribution of CpGs over gene-centered genomic regions, including ‘Coding exons’, ‘UTR exons’, ‘Introns’, ‘Upstream intergenic regions’, and ‘Downstream intergenic regions’.

Notes

Please note that a particular genomic region can be assigned to more than one of the groups listed above, because most genes have multiple transcripts and different genes can overlap on the genome. For example, an exon of gene A could be located in an intron of gene B. To address this issue, we define the priority order as below:

  1. Coding exons

  2. UTR exons

  3. Introns

  4. Upstream intergenic regions

  5. Downstream intergenic regions

A higher-priority group overrides a lower-priority group. For example, if part of an intron overlaps an exon of another transcript or gene, the overlapping part is considered exon (i.e., removed from the intron), because “exon” has higher priority.
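
The priority rule can be illustrated with a minimal sketch. It is not the program's implementation: `assign_group`, the interval layout, and the example coordinates are all hypothetical, and intervals are assumed to be 0-based and half-open on a single chromosome.

```python
PRIORITY = [
    "Coding exons",
    "UTR exons",
    "Introns",
    "Upstream intergenic regions",
    "Downstream intergenic regions",
]


def assign_group(pos, regions):
    """Return the highest-priority group whose intervals contain pos.

    regions maps a group name to a list of (start, end) half-open intervals.
    Returns None if pos falls in no group.
    """
    for group in PRIORITY:
        for start, end in regions.get(group, []):
            if start <= pos < end:
                return group
    return None


# Hypothetical case: an exon of gene A sits inside an intron of gene B.
regions = {
    "Coding exons": [(150, 200)],
    "Introns": [(100, 400)],
}
print(assign_group(160, regions))  # counted as coding exon, not intron
print(assign_group(300, regions))  # intron only
```

Scanning the groups in priority order has the same effect as removing the overlapped part from the lower-priority group.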

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

BED file specifying the C position. This BED file should have at least 3 columns (Chrom, ChromStart, ChromEnd). Note: the first base in a chromosome is numbered 0. This file can be a regular text file or a compressed file (.gz, .bz2).

-r GENE_FILE, --refgene=GENE_FILE

Reference gene model in standard BED-12 format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1).

-d DOWNSTREAM_SIZE, --downstream=DOWNSTREAM_SIZE

Size of the downstream intergenic region w.r.t. TES (transcription end site). default=2000 (bp)

-u UPSTREAM_SIZE, --upstream=UPSTREAM_SIZE

Size of the upstream intergenic region w.r.t. TSS (transcription start site). default=2000 (bp)

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

  • 850K_probe.hg19.bed3.gz

  • hg19.RefSeq.union.bed.gz

Command

$ CpG_distrb_gene_centered.py -i 850K_probe.hg19.bed3.gz -r hg19.RefSeq.union.bed.gz -o geneDist

Output files

  • geneDist.tsv

  • geneDist.r

  • geneDist.pdf

_images/geneDist.png

1.5. CpG_distrb_region.py

This program calculates the distribution of CpG over user-specified genomic regions.

Notes

  • A maximum of 10 BED files (defining 10 different genomic regions) can be analyzed together.

  • The order of BED files is important (i.e., it is treated as the “priority order”). Overlapping genomic regions are kept in the BED file with the highest priority and removed from BED files of lower priority. For example, if 3 BED files are provided via “-i promoters.bed,enhancers.bed,intergenic.bed” and an enhancer region overlaps a promoter, the overlapping part is removed from “enhancers.bed”.

  • BED files can be regular text files or compressed with ‘gzip’ or ‘bzip2’.
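
The removal of overlapping parts from lower-priority files can be sketched as interval subtraction. This is an illustrative sketch only: `subtract` and `apply_priority` are hypothetical helper names, and intervals are assumed to be 0-based, half-open, and on a single chromosome.

```python
def subtract(intervals, blockers):
    """Remove every part of `intervals` that overlaps any interval in `blockers`."""
    result = []
    for start, end in intervals:
        pieces = [(start, end)]
        for b_start, b_end in blockers:
            next_pieces = []
            for s, e in pieces:
                if b_end <= s or b_start >= e:   # no overlap, keep as is
                    next_pieces.append((s, e))
                    continue
                if s < b_start:                   # part left of the blocker
                    next_pieces.append((s, b_start))
                if b_end < e:                     # part right of the blocker
                    next_pieces.append((b_end, e))
            pieces = next_pieces
        result.extend(pieces)
    return result


def apply_priority(region_sets):
    """region_sets is ordered from highest to lowest priority."""
    claimed, cleaned = [], []
    for intervals in region_sets:
        kept = subtract(intervals, claimed)
        cleaned.append(kept)
        claimed.extend(kept)
    return cleaned


promoters = [(100, 200)]
enhancers = [(150, 300)]
print(apply_priority([promoters, enhancers]))
```

In this example the enhancer interval is trimmed to (200, 300), because positions 150 to 200 already belong to the higher-priority promoter set.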

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i CPG_FILE, --cpg=CPG_FILE

BED file specifying the C position. This BED file should have at least 3 columns (Chrom, ChromStart, ChromEnd). Note: the first base in a chromosome is numbered 0. This file can be a regular text file or a compressed file (.gz, .bz2).

-b BED_FILES, --bed=BED_FILES

List of BED files specifying the genomic regions.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

# check the distribution of 850K probes in 4 genomic regions (CpG islands, Promoters,
# Bivalent promoters, and Heterochromatin regions)

$CpG_distrb_region.py -i 850K_probe.hg19.bed3.gz -b  hg19_H3K4me3.bed4,hg19_CGI.bed4,\
 hg19_H3K27ac_with_H3K4me1.bed4,hg19_H3K27me3.bed4 -o regionDist

Output files

  • regionDist.tsv

  • regionDist.r

  • regionDist.pdf

_images/regionDist.png

1.6. CpG_logo.py

This program generates a DNA motif logo for a given set of CpGs, to answer the question “what is the genomic context of a given list of CpGs?”. It first extracts the genomic sequences around each C position and then generates motif matrices, including:

  • position frequency matrix (PFM)

  • position probability matrix (PPM)

  • position weight matrix (PWM)

  • MEME format matrix

  • Jaspar format matrix

It also generates a motif logo using weblogo.
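
As a rough illustration of how the first three matrices relate to each other, the sketch below builds them from a handful of made-up sequences. It uses the textbook definitions (PPM = column-wise frequencies, PWM = log2 of probability over background) and is not taken from the program; `motif_matrices`, the uniform background of 0.25, and the pseudocount parameter are assumptions.

```python
from math import log2

BASES = "ACGT"


def motif_matrices(seqs, pseudocount=0.0, background=0.25):
    """Build PFM, PPM and PWM (log2 odds) from equal-length DNA sequences.

    Assumes every column contains at least one A/C/G/T.
    """
    length = len(seqs[0])
    pfm = [{b: 0 for b in BASES} for _ in range(length)]
    for seq in seqs:
        for i, base in enumerate(seq.upper()):
            if base in pfm[i]:               # skip N and other ambiguity codes
                pfm[i][base] += 1
    ppm, pwm = [], []
    for col in pfm:
        total = sum(col.values()) + 4 * pseudocount
        probs = {b: (col[b] + pseudocount) / total for b in BASES}
        ppm.append(probs)
        pwm.append({b: log2(p / background) if p > 0 else float("-inf")
                    for b, p in probs.items()})
    return pfm, ppm, pwm


# Made-up sequences with an invariant CG at positions 1-2
pfm, ppm, pwm = motif_matrices(["ACGTA", "TCGAA", "GCGCA"])
print(pfm[1], ppm[1], pwm[1])
```

A non-zero pseudocount avoids the negative-infinity weights that appear when a base is never observed at a position.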

Notes

  • The input BED file must have strand information.

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

BED file specifying the C position. This BED file should have at least 6 columns (Chrom, ChromStart, ChromEnd, name, score, strand). Note: correct strand information must be provided. This file can be a regular text file or a compressed file (.gz, .bz2).

-r GENOME_FILE, --refgenome=GENOME_FILE

Reference genome sequences in FASTA format. Must be indexed using the samtools “faidx” command.

-e EXTEND_SIZE, --extend=EXTEND_SIZE

Number of bases extended to up- and down-stream. default=5 (bp)

-n MOTIF_NAME, --name=MOTIF_NAME

Motif name. default=motif

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$CpG_logo.py -i 450_CH.hg19.bed.gz -r hg19.fa -o 450_CH

Output files

  • 450_CH.logo.fa

  • 450_CH.logo.jaspar

  • 450_CH.logo.meme

  • 450_CH.logo.pfm

  • 450_CH.logo.ppm

  • 450_CH.logo.pwm

  • 450_CH.logo.logo.pdf

_images/450_CH.logo.png

1.7. CpG_to_gene.py

This program annotates CpGs by assigning them to their putative target genes. It follows the “Basal plus extension” rules used by GREAT.

Basal regulatory domain is a user-defined genomic region around the TSS (transcription start site). By default, the region from 5 Kb upstream to 1 Kb downstream of the TSS is considered the gene’s basal regulatory domain. When defining a gene’s basal regulatory domain, other nearby genes are ignored (which means the basal regulatory domains of different genes can overlap).

Extended regulatory domain is a genomic region that is further extended from the basal regulatory domain in both directions to the nearest gene’s basal regulatory domain, but by no more than the maximum extension (specified by ‘-e’, default = 1000 kb) in one direction. In other words, the “extension” stops when it reaches another gene’s “basal regulatory domain” or the extension limit, whichever comes first.

The basal regulatory domain and the extended regulatory domain are illustrated in the diagram below.

_images/gene_domain.png
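
The two domain definitions can be sketched in a few lines. This is a simplified illustration for genes on a single chromosome, not the program's code: `basal_domain` and `extended_domain` are hypothetical names, the TSS coordinates are made up, and the extension limit is measured from the TSS as the ‘-e’ option description suggests.

```python
def basal_domain(tss, strand, up=5000, down=1000):
    """Basal regulatory domain (start, end), oriented by strand."""
    if strand == "+":
        return max(0, tss - up), tss + down
    return max(0, tss - down), tss + up


def extended_domain(i, genes, max_ext=1_000_000):
    """Extend gene i's basal domain until a neighbor's basal domain or max_ext."""
    basals = [basal_domain(tss, strand) for tss, strand in genes]
    tss = genes[i][0]
    start, end = basals[i]
    left_limit = max(0, tss - max_ext)
    right_limit = tss + max_ext
    for j, (b_start, b_end) in enumerate(basals):
        if j == i:
            continue
        if b_start <= start:   # neighbor on the left (or overlapping the left edge)
            left_limit = max(left_limit, min(b_end, start))
        if b_end >= end:       # neighbor on the right (or overlapping the right edge)
            right_limit = min(right_limit, max(b_start, end))
    return min(start, left_limit), max(end, right_limit)


# Hypothetical genes given as (TSS, strand)
genes = [(10_000, "+"), (50_000, "+"), (3_000_000, "-")]
print(basal_domain(*genes[1]))     # basal domain of the second gene
print(extended_domain(1, genes))   # stops at gene 1's neighbor on the left, at the limit on the right
```

For the second gene, the left extension stops at the first gene's basal domain, while the right extension stops at the 1000 kb limit because the third gene is farther away than that.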

Notes

  • Which genes are assigned to a particular CpG largely depends on gene annotation. A “conservative” gene model (such as Refseq curated protein coding genes) is recommended.

  • In the refgene file, multiple isoforms should be merged into a single gene.

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

BED3+ file specifying the C position. The BED3+ file can be a regular text file or a compressed file (.gz, .bz2). [required]

-r GENE_FILE, --refgene=GENE_FILE

Reference gene model in BED12 format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1). “One gene one transcript” is recommended. Since most genes have multiple transcripts, one can collapse multiple transcripts of the same gene into a single super transcript or select the canonical transcript.

-u BASAL_UP_SIZE, --basal-up=BASAL_UP_SIZE

Size of extension to upstream of TSS (used to define the gene’s “basal regulatory domain”). default=5000 (bp)

-d BASAL_DOWN_SIZE, --basal-down=BASAL_DOWN_SIZE

Size of extension to downstream of TSS (used to define the gene’s “basal regulatory domain”). default=1000 (bp)

-e EXTENSION_SIZE, --extension=EXTENSION_SIZE

Size of extension to both up- and down-stream of TSS (used to define the gene’s “extended regulatory domain”). default=1000000 (bp)

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file. Two additional columns will be appended to the original BED file, with the last column indicating “genes whose extended regulatory domains overlap the CpG” and the second-to-last column indicating “genes whose basal regulatory domains overlap the CpG”. [required]

Input files

Command

$CpG_to_gene.py -i  850K_probe.hg19.bed3.gz -r hg19.RefSeq.union.bed.gz -o output

Output files

  • output.associated_genes.txt

1.8. beta_PCA.py

This program performs PCA (principal component analysis) for samples.

Example of input data file

ID     Sample_01       Sample_02       Sample_03       Sample_04
cg_001 0.831035        0.878022        0.794427        0.880911
cg_002 0.249544        0.209949        0.234294        0.236680
cg_003 0.845065        0.843957        0.840184        0.824286
...

Example of input group file

Sample,Group
Sample_01,normal
Sample_02,normal
Sample_03,tumor
Sample_04,tumor
...

Notes

  • Rows with missing values will be removed

  • Beta values will be standardized into z scores

  • Only the first two components will be visualized

  • The percentage of variance explained by each component is printed to the screen

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input=INPUT_FILE

Tab-separated data frame file containing beta values, with the 1st row containing sample IDs and the 1st column containing CpG IDs.

-g GROUP_FILE, --group=GROUP_FILE

Comma-separated group file defining the biological group of each sample. Different groups will be colored differently in the PCA plot.

-n N_COMPONENTS, --ncomponent=N_COMPONENTS

Number of components. default=2

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$beta_PCA.py -i cirrHCV_vs_normal.data.tsv -g cirrHCV_vs_normal.grp.csv -o HCV_vs_normal

Output files

  • HCV_vs_normal.PCA.r

  • HCV_vs_normal.PCA.tsv

  • HCV_vs_normal.PCA.pdf

_images/HCV_vs_normal.PCA.png

1.9. beta_jitter_plot.py

This program generates a jitter plot (a.k.a. strip chart) and a bean plot for each sample (column).

Example of input

CpG_ID  Sample_01       Sample_02       Sample_03       Sample_04
cg_001  0.831035        0.878022        0.794427        0.880911
cg_002  0.249544        0.209949        0.234294        0.236680
cg_003  0.845065        0.843957        0.840184        0.824286

Notes

  • User must install the beanplot R library.

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input=INPUT_FILE

Tab-separated data frame file containing beta values, with the 1st row containing sample IDs and the 1st column containing CpG IDs.

-f FRACTION, --fraction=FRACTION

Fraction of total data points (CpGs) used to generate the jitter plot. Decrease this number if the jitter plot is over-crowded. default=0.5

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$beta_jitter_plot.py -f 1 -i test_05_TwoGroup.tsv.gz -o Jitter

Output files

  • Jitter.r

  • Jitter.pdf

_images/Jitter.png

1.10. beta_m_conversion.py

Convert Beta-values into M-values or vice versa.

Example of input (beta)

CpG_ID  Sample_01       Sample_02       Sample_03       Sample_04
cg_001  0.831035        0.878022        0.794427        0.880911
cg_002  0.249544        0.209949        0.234294        0.236680
cg_003  0.845065        0.843957        0.840184        0.824286
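
For orientation, the commonly used relationship between the two scales is M = log2(beta / (1 - beta)) and beta = 2^M / (2^M + 1). The sketch below applies it to single values; whether the program adds an offset or clips values at 0 and 1 is not stated here, so the function names and the absence of any offset are assumptions.

```python
from math import log2


def beta_to_m(beta):
    """M = log2(beta / (1 - beta)); undefined for beta exactly 0 or 1."""
    return log2(beta / (1.0 - beta))


def m_to_beta(m):
    """beta = 2^M / (2^M + 1)."""
    return 2.0**m / (2.0**m + 1.0)


# Values taken from the example input above
for beta in (0.831035, 0.249544, 0.845065):
    m = beta_to_m(beta)
    print(f"beta = {beta:.6f} -> M = {m:.4f} -> beta = {m_to_beta(m):.6f}")
```

A beta value of 0.5 maps to M = 0, and the two functions are inverses of each other on the open interval (0, 1).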

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input=INPUT_FILE

Tab-separated data frame file containing beta values, with the 1st row containing sample IDs and the 1st column containing CpG IDs. This file can be a regular text file, a compressed file (.gz, .bz2), or an accessible URL.

-d DATA_TYPE, --dtype=DATA_TYPE

Input data type, either “Beta” or “M”.

-o OUT_FILE, --output=OUT_FILE

Output file.

1.11. beta_profile_gene_centered.py

This program calculates the methylation profile (i.e. average beta value) for genomic regions around genes. These genomic regions include:

  • 5’UTR exon

  • CDS exon

  • 3’UTR exon

  • first intron

  • internal intron

  • last intron

  • up-stream intergenic

  • down-stream intergenic

Example of input (BED6+)

chr22   44021512        44021513        cg24055475      0.9231  -
chr13   111568382       111568383       cg06540715      0.1071  +
chr20   44033594        44033595        cg21482942      0.6122  -

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

BED6+ file specifying the C position. This BED file should have at least 6 columns (Chrom, ChromStart, ChromEnd, Name, Beta_value, Strand). The BED6+ file can be a regular text file or a compressed file (.gz, .bz2).

-r GENE_FILE, --refgene=GENE_FILE

Reference gene model in standard BED12 format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1). The “Strand” column must exist in order to determine 5’ and 3’ UTRs and up- and down-stream intergenic regions.

-d DOWNSTREAM_SIZE, --downstream=DOWNSTREAM_SIZE

Size of down-stream genomic region added to gene. default=2000 (bp)

-u UPSTREAM_SIZE, --upstream=UPSTREAM_SIZE

Size of up-stream genomic region added to gene. default=2000 (bp)

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Command

$beta_profile_gene_centered.py -i test_02.bed6.gz  -r hg19.RefSeq.union.bed.gz -o gene_profile

Output files

  • gene_profile.txt

  • gene_profile.r

  • gene_profile.pdf

_images/gene_profile.png

1.12. beta_profile_region.py

This program calculates methylation profile (i.e. average beta value) around user specified genomic regions.

Example of input

# BED6 format (INPUT_FILE)
chr22   44021512        44021513        cg24055475      0.9231  -
chr13   111568382       111568383       cg06540715      0.1071  +
chr20   44033594        44033595        cg21482942      0.6122  -

# BED3 format (REGION_FILE)
chr1    15864   15865
chr1    18826   18827
chr1    29406   29407

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

BED6+ file specifying the C position. This BED file should have at least 6 columns (Chrom, ChromStart, ChromEnd, Name, Beta_value, Strand). The BED6+ file can be a regular text file or a compressed file (.gz, .bz2).

-r REGION_FILE, --region=REGION_FILE

BED3+ file of genomic regions. This BED file should have at least 3 columns (Chrom, ChromStart, ChromEnd). If the 6th column does not exist, all regions will be treated as being on the “+” strand.

-d DOWNSTREAM_SIZE, --downstream=DOWNSTREAM_SIZE

Size of extension to downstream. default=2000 (bp)

-u UPSTREAM_SIZE, --upstream=UPSTREAM_SIZE

Size of extension to upstream. default=2000 (bp)

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

  • test_02.bed6.gz

  • hg19.RefSeq.union.1Kpromoter.bed

Command

$beta_profile_region.py -r hg19.RefSeq.union.1Kpromoter.bed.gz -i test_02.bed6.gz -o region_profile

Output files

  • region_profile.txt

  • region_profile.r

  • region_profile.pdf

_images/region_profile.png

1.13. beta_stacked_barplot.py

This program creates a stacked barplot for each sample. The stacked barplot shows the proportions of CpGs whose beta values fall into these 4 ranges:

  1. [0.00, 0.25] #first quantile

  2. [0.25, 0.50] #second quantile

  3. [0.50, 0.75] #third quantile

  4. [0.75, 1.00] #fourth quantile
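
The binning behind the barplot can be sketched as below. The ranges as written share their boundary values, so this sketch assumes a value that falls exactly on a boundary goes to the lower range; `bin_proportions` is a hypothetical name and the program's actual tie handling may differ.

```python
from bisect import bisect_left

EDGES = [0.25, 0.50, 0.75]   # upper boundaries of the first three ranges


def bin_proportions(betas):
    """Fraction of beta values falling into each of the four ranges."""
    counts = [0, 0, 0, 0]
    for b in betas:
        # bisect_left sends a boundary value (e.g. exactly 0.25) to the lower range
        counts[bisect_left(EDGES, b)] += 1
    total = len(betas)
    return [c / total for c in counts]


# One sample (column) with four hypothetical beta values
print(bin_proportions([0.10, 0.30, 0.60, 0.90]))
```

Each sample yields four proportions that sum to 1, which become the four segments of that sample's bar.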

Example of input file

CpG_ID  Sample_01       Sample_02       Sample_03       Sample_04
cg_001  0.831035        0.878022        0.794427        0.880911
cg_002  0.249544        0.209949        0.234294        0.236680

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Data frame file containing beta values, with the 1st row containing sample IDs and the 1st column containing CpG IDs.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$beta_stacked_barplot.py -i cirrHCV_vs_normal.data.tsv -o stacked_bar

Output files

  • stacked_bar.r

  • stacked_bar.pdf

_images/stacked_bar.png

1.14. beta_stats.py

This program gives basic information of CpGs located in each genomic region. It adds 6 columns to the input BED file:

  1. Number of CpGs detected in the genomic region

  2. Min methylation level

  3. Max methylation level

  4. Average methylation level across all CpGs

  5. Median methylation level across all CpGs

  6. Standard deviation
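
The six values can be computed per region as in the sketch below. It assumes the beta values of the CpGs inside one region have already been collected; `region_stats` is a hypothetical helper, and the use of the sample (n - 1) standard deviation is an assumption because the text does not say which estimator is used.

```python
from statistics import mean, median, stdev


def region_stats(betas):
    """Return (count, min, max, mean, median, stdev) for one region's beta values."""
    n = len(betas)
    if n == 0:
        return (0, None, None, None, None, None)
    sd = stdev(betas) if n > 1 else None   # sample SD needs at least two values
    return (n, min(betas), max(betas), mean(betas), median(betas), sd)


# Beta values of three CpGs falling in one hypothetical region
print(region_stats([0.9231, 0.1071, 0.6122]))
```

Regions without any CpG still get a row, with a count of 0 and no summary values.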

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

BED6+ file specifying the C position. This BED file should have at least 6 columns (Chrom, ChromStart, ChromEnd, Name, Beta_value, Strand). Note: the first base in a chromosome is numbered 0. This file can be a regular text file or a compressed file (.gz, .bz2).

-r REGION_FILE, --region=REGION_FILE

BED3+ file of genomic regions. This BED file should have at least 3 columns (Chrom, ChromStart, ChromEnd).

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$beta_stats.py -r hg19.RefSeq.union.1Kpromoter.bed.gz -i test_02.bed6.gz -o region_stats

Output files

  • region_stats.txt

1.15. beta_topN.py

This program picks the top N rows (ranked by standard deviation) from the input file. The resulting file can be used for clustering or PCA.
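
The selection step amounts to ranking rows by their standard deviation across samples and keeping the first N. The sketch below shows the idea on in-memory rows; `top_n_by_sd`, the row layout, and the beta values are hypothetical, and the sample standard deviation is assumed.

```python
from statistics import stdev


def top_n_by_sd(rows, n):
    """rows: list of (cpg_id, [beta values]); return the n most variable rows."""
    ranked = sorted(rows, key=lambda row: stdev(row[1]), reverse=True)
    return ranked[:n]


rows = [
    ("cg_001", [0.10, 0.90, 0.50]),   # highly variable across samples
    ("cg_002", [0.50, 0.50, 0.50]),   # constant
    ("cg_003", [0.40, 0.60, 0.50]),   # mildly variable
]
for cpg_id, betas in top_n_by_sd(rows, 2):
    print(cpg_id, round(stdev(betas), 4))
```

Constant or near-constant CpGs carry little information for clustering or PCA, which is why they are ranked last.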

Example of input

CpG_ID  Sample_01       Sample_02       Sample_03       Sample_04
cg_001  0.831035        0.878022        0.794427        0.880911
cg_002  0.249544        0.209949        0.234294        0.236680
cg_003  0.845065        0.843957        0.840184        0.824286

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Tab-separated data frame file containing beta values, with the 1st row containing sample IDs and the 1st column containing CpG IDs.

-c CPG_COUNT, --count=CPG_COUNT

Number of most variable CpGs (ranked by standard deviation) to keep. default=1000

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$beta_topN.py -i test_05_TwoGroup.tsv.gz -c 500 -o test_05_TwoGroup

Output files

  • test_05_TwoGroup.sortedStdev.tsv

  • test_05_TwoGroup.sortedStdev.topN.tsv

1.16. beta_trichotmize.py

Rather than using a hard threshold to call “methylated” or “unmethylated” CpGs or regions, this program uses a probabilistic approach (Bayesian Gaussian Mixture model) to trichotomize beta values into three states, plus an “unassigned” label for ambiguous CpGs:

  • Un-methylated (labeled as “0” in result file)

  • Semi-methylated (labeled as “1” in result file)

  • Full-methylated (labeled as “2” in result file)

  • unassigned (labeled as “-1” in result file)

Basically, the GMM first calculates the probabilities p0, p1, and p2 for each CpG based on its beta value:

p0

the probability that the CpG is un-methylated

p1

the probability that the CpG is semi-methylated

p2

the probability that the CpG is full-methylated

The classification is made using the following rules:

if p0 == max(p0, p1, p2):
       un-methylated
elif p2 == max(p0, p1, p2):
       full-methylated
elif p1 == max(p0, p1, p2):
       if p1 >= prob_cutoff:
               semi-methylated
       else:
               unknown/unassigned
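
The rules above translate directly into a small runnable function. This is a sketch of the decision step only: it assumes p0, p1, and p2 have already been estimated by the mixture model, and `classify` is a hypothetical name. The return codes follow the labels used in the result file.

```python
def classify(p0, p1, p2, prob_cutoff):
    """Return 0 (un-), 1 (semi-), 2 (full-methylated) or -1 (unassigned)."""
    top = max(p0, p1, p2)
    if p0 == top:
        return 0
    if p2 == top:
        return 2
    # p1 is the largest probability at this point
    return 1 if p1 >= prob_cutoff else -1


# Hypothetical component probabilities for four CpGs, with a cutoff of 0.99
for probs in [(0.90, 0.05, 0.05), (0.01, 0.02, 0.97),
              (0.0, 0.995, 0.005), (0.2, 0.6, 0.2)]:
    print(probs, "->", classify(*probs, prob_cutoff=0.99))
```

Note that the cutoff applies only to the semi-methylated call, so the “unassigned” label can arise only when p1 is the largest of the three probabilities.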

Input files

Command

$beta_trichotmize.py -i test_05_TwoGroup.tsv -r

The histogram and pie chart below show the proportions of CpGs assigned to “Un-methylated”, “Semi-methylated” and “Full-methylated”.

_images/trichotmize.png

1.17. dmc_ttest.py

Differential CpG analysis using the t-test for two-group comparison or ANOVA for multiple-group comparison.

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Data file containing beta values, with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). Except for the 1st row and 1st column, any non-numerical values will be considered “missing values” and ignored. This file can be a regular text file or a compressed file (.gz, .bz2).

-g GROUP_FILE, --group=GROUP_FILE

Group file defining the biological group of each sample. It is a comma-separated, 2-column file with the 1st column containing sample IDs and the 2nd column containing group IDs. It must have a header row. Sample IDs should match those in the “Data file”. Note: the program automatically switches to ANOVA if more than 2 groups are defined in this file.

-p, --paired

If the ‘-p/--paired’ flag is specified, use the paired t-test, which requires an equal number of samples in both groups. Paired samples are matched by order. This option is ignored for multiple-group analysis.

-w, --welch

If the ‘-w/--welch’ flag is specified, use Welch’s t-test, which does not assume the two samples have equal variance. If omitted, use the standard two-sample t-test (i.e., assuming the two samples have equal variance). This option is ignored for the paired t-test and multiple-group analysis.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

#Two group comparison. Compare normal livers to HCV-related cirrhosis livers
$dmc_ttest.py -i test_05_TwoGroup.tsv.gz -g test_05_TwoGroup.grp.csv -o ttest_2G

#Three group comparison. Compare normal livers, HCV-related cirrhosis livers, and liver cancers
$dmc_ttest.py -i test_06_ThreeGroup.tsv.gz -g test_06_ThreeGroup.grp.csv -o ttest_3G

Output files

  • ttest_2G.pval.txt

  • ttest_3G.pval.txt

1.18. dmc_glm.py

This program performs differential CpG analysis using a generalized linear model. It allows covariates to be included in the analysis.

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Data file containing beta values, with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). This file can be a regular text file or a compressed file (.gz, .bz2).

-g GROUP_FILE, --group=GROUP_FILE

Group file defining the biological group of each sample as well as other covariates such as gender and age. The first variable is the grouping variable (must be categorical); all other variables are considered covariates (can be categorical or continuous). Sample IDs should match those in the “Data file”.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$dmc_glm.py  -i test_05_TwoGroup.tsv.gz -g test_05_TwoGroup.grp.csv -o GLM_2G

$dmc_glm.py  -i test_05_TwoGroup.tsv.gz -g test_05_TwoGroup.grp2.csv -o GLM_2G

Output files

  • GLM_2G.results.txt

  • GLM_2G.r

  • GLM_2G.pval.txt (final results)

1.19. dmc_nonparametric.py

This program performs differential CpG analysis using the Mann-Whitney U test for two-group comparison and the Kruskal-Wallis H-test for multiple-group comparison.

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Data file containing beta values with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). Except for the 1st row and 1st column, any non-numerical values will be considered as “missing values” and ignored. This file can be a regular text file or compressed file (.gz, .bz2).

-g GROUP_FILE, --group=GROUP_FILE

Group file defining the biological group of each sample. It is a comma-separated, two-column file with the 1st column containing sample IDs and the 2nd column containing group IDs. It must have a header row. Sample IDs should match those in the “Data file”. Note: the program automatically switches to the Kruskal-Wallis H-test if more than two groups are defined in this file.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.
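
The group file format and the automatic switch between the two tests can be sketched as follows. This is an illustration under stated assumptions, not the program's code: read_groups and choose_test are hypothetical names, and the header labels and sample IDs in the example are made up.

```python
import csv
import io


def read_groups(handle):
    """Map each group ID to its sample IDs from a two-column CSV with a header row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    groups = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete lines
        sample_id, group_id = row[0].strip(), row[1].strip()
        groups.setdefault(group_id, []).append(sample_id)
    return groups


def choose_test(groups):
    """Pick the test by the number of groups defined in the group file."""
    if len(groups) < 2:
        raise ValueError("at least two groups are required")
    return "Mann-Whitney U" if len(groups) == 2 else "Kruskal-Wallis H"


# Hypothetical group file content.
example = io.StringIO("Sample,Group\nS1,Normal\nS2,Normal\nS3,Tumor\nS4,Tumor\n")
groups = read_groups(example)
print(groups, choose_test(groups))
```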

Input files

Command

$dmc_nonparametric.py -i test_05_TwoGroup.tsv.gz -g test_05_TwoGroup.grp.csv -o U_test

$dmc_nonparametric.py -i test_06_ThreeGroup.tsv.gz -g test_06_ThreeGroup.grp.csv -o H_test

1.20. dmc_Bayes.py

Unlike statistical testing, this program estimates “how different the means of the two groups are” using a Bayesian approach. An MCMC sampler is used to estimate the “means”, the “difference of means”, the “95% HDI (highest posterior density interval)”, and the posterior probability that the HDI does NOT include “0”.

It is similar to John Kruschke’s BEST algorithm (Bayesian Estimation Supersedes the t Test).

Notes

  • This program is much slower than the t-test because of the MCMC (Markov chain Monte Carlo) step. Running it with multiple processes (see -p) is highly recommended.
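
To make the MCMC step concrete, the sketch below draws posterior samples of each group mean with a random-walk Metropolis-Hastings sampler and summarizes their difference. It is a deliberately simplified toy, not the model used by dmc_Bayes.py or by BEST: the likelihood is normal, the standard deviation is fixed at the sample value, the prior on the mean is flat, and all function names are hypothetical. The reported probability is the fraction of draws on the same side of zero as the mean difference, which is only one possible reading of such a summary.

```python
import math
import random
import statistics


def log_likelihood(mu, data, sigma):
    """Log-likelihood of data under Normal(mu, sigma), up to an additive constant."""
    return -sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)


def sample_mean(data, n_iter, n_burn, rng, step=0.02):
    """Metropolis-Hastings draws of one group's mean; burn-in draws are dropped."""
    sigma = statistics.stdev(data)  # simplification: held fixed, must be > 0
    current = statistics.mean(data)
    current_ll = log_likelihood(current, data, sigma)
    draws = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)  # symmetric proposal
        proposal_ll = log_likelihood(proposal, data, sigma)
        # Accept with probability min(1, likelihood ratio).
        if rng.random() < math.exp(min(0.0, proposal_ll - current_ll)):
            current, current_ll = proposal, proposal_ll
        draws.append(current)
    return draws[n_burn:]


def posterior_difference(group1, group2, n_iter=5000, n_burn=500, seed=99):
    """Return (mu1, mu2, mu_diff, probability) from the posterior draws."""
    rng = random.Random(seed)
    draws1 = sample_mean(group1, n_iter, n_burn, rng)
    draws2 = sample_mean(group2, n_iter, n_burn, rng)
    diffs = [a - b for a, b in zip(draws1, draws2)]
    mu_diff = statistics.mean(diffs)
    same_side = sum(1 for d in diffs if (d > 0) == (mu_diff > 0))
    return (statistics.mean(draws1), statistics.mean(draws2),
            mu_diff, same_side / len(diffs))


# Hypothetical beta values of one CpG in two groups.
print(posterior_difference([0.80, 0.82, 0.79, 0.81, 0.83],
                           [0.50, 0.52, 0.49, 0.51, 0.53]))
```

The defaults for the number of iterations, the burn-in, and the seed mirror the values documented for -n, -b, and -s. Because every CpG needs its own chain, the run time grows with both the iteration count and the number of CpGs.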

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Data file containing beta values with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). Except for the 1st row and 1st column, any non-numerical values will be considered as “missing values” and ignored. This file can be a regular text file or compressed file (.gz, .bz2).

-g GROUP_FILE, --group=GROUP_FILE

Group file defining the biological group of each sample. It is a comma-separated, two-column file with the 1st column containing sample IDs and the 2nd column containing group IDs. It must have a header row. Sample IDs should match those in the “Data file”. Note: this program supports two-group comparisons only.

-n N_ITER, --niter=N_ITER

Number of iterations of the MCMC Metropolis-Hastings algorithm used to draw samples from the posterior distribution. default=5000

-b N_BURN, --burnin=N_BURN

Number of initial samples to discard. These samples are usually not valid draws because the Markov chain has not yet converged to its stationary distribution. default=500

-p N_PROCESS, --processor=N_PROCESS

Number of processes. default=1

-s SEED, --seed=SEED

The seed used by the random number generator. default=99

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$  dmc_Bayes.py -i test_05_TwoGroup.tsv.gz -g test_05_TwoGroup.grp.csv.gz -p 10 -o dmc_output

Output files

  • dmc_output.bayes.tsv: this file consists of 6 columns:

  1. ID : CpG ID

  2. mu1 : Mean methylation level estimated from group1

  3. mu2 : Mean methylation level estimated from group2

  4. mu_diff : Difference between mu1 and mu2

  5. mu_diff (95% HDI) : The 95% “highest density interval” of mu_diff. The HDI indicates which values of a distribution are most credible; this interval covers 95% of the posterior distribution of mu_diff.

  6. Probability : The probability that mu1 and mu2 are different.

$head -10 dmc_output.bayes.tsv

ID     mu1     mu2     mu_diff mu_diff (95% HDI)       Probability
cg00001099     0.775209        0.795404        -0.020196       (-0.065148,0.023974)    0.811024
cg00000363     0.610565        0.469523        0.141042        (0.030769,0.232965)     0.994665
cg00000884     0.845973        0.873761        -0.027787       (-0.051976,-0.004398)   0.984882
cg00000714     0.190868        0.199233        -0.008365       (-0.030071,0.014006)    0.816141
cg00000957     0.772905        0.827528        -0.054623       (-0.092116,-0.016465)   0.995327
cg00000292     0.748394        0.766326        -0.017932       (-0.051286,0.012583)    0.889729
cg00000807     0.729162        0.683732        0.045430        (-0.001523,0.086588)    0.981551
cg00000721     0.935903        0.935080        0.000823        (-0.013210,0.018628)    0.508686
cg00000948     0.898609        0.897536        0.001073        (-0.020663,0.026813)    0.518238
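
Given a set of posterior draws, a 95% HDI can be found as the narrowest interval that contains 95% of them. The sketch below shows that idea; the function names are hypothetical, and whether dmc_Bayes.py computes its interval in exactly this way is an assumption.

```python
import math


def hdi(samples, mass=0.95):
    """Return the narrowest interval containing the given fraction of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = max(1, math.ceil(mass * n))  # number of draws inside the interval
    best_low, best_high = ordered[0], ordered[window - 1]
    # Slide a window of fixed size over the sorted draws and keep the narrowest.
    for start in range(1, n - window + 1):
        low, high = ordered[start], ordered[start + window - 1]
        if high - low < best_high - best_low:
            best_low, best_high = low, high
    return best_low, best_high


def excludes_zero(low, high):
    """True when an interval lies entirely on one side of zero."""
    return low > 0 or high < 0


# Intervals copied from the listing above.
print(excludes_zero(0.030769, 0.232965))   # cg00000363
print(excludes_zero(-0.065148, 0.023974))  # cg00001099
```

An interval that excludes zero, as for cg00000363, suggests a credible difference between the two group means.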

1.21. dmc_fisher.py

This program performs differential CpG analysis using Fisher’s exact test on proportion values. It applies to two-sample comparisons with no biological/technical replicates. If biological/technical replicates are provided, the methyl reads and total reads of all replicates are merged (i.e., biological/technical variation is ignored).

Input file format

# number before "," indicates number of methyl reads, and number after "," indicates
# number of total reads
cgID        sample_1    sample_2
CpG_1       129,170     166,178
CpG_2       24,77       67,99
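
For a single CpG, the “methyl,total” cells of the two samples form a 2x2 table of methylated and unmethylated read counts. The sketch below computes the odds ratio and a two-sided Fisher’s exact p-value from such cells using only the standard library. It is illustrative rather than the program's implementation: the function names are hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is a common convention assumed for the example.

```python
from math import comb


def parse_counts(cell):
    """Split a 'methyl,total' cell into (methylated, unmethylated) read counts."""
    methyl, total = (int(x) for x in cell.split(","))
    return methyl, total - methyl


def merge_replicates(cells):
    """Pool replicates by summing methyl reads and total reads."""
    methyl = sum(int(c.split(",")[0]) for c in cells)
    total = sum(int(c.split(",")[1]) for c in cells)
    return "%d,%d" % (methyl, total)


def fisher_exact(cell1, cell2):
    """Return (odds_ratio, two_sided_p) for two 'methyl,total' cells."""
    a, b = parse_counts(cell1)  # sample 1: methylated, unmethylated
    c, d = parse_counts(cell2)  # sample 2: methylated, unmethylated
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x methylated reads in sample 1.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    observed = prob(a)
    lowest, highest = max(0, row1 - (n - col1)), min(row1, col1)
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= observed * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)


# CpG_1 from the input example above.
print(fisher_exact("129,170", "166,178"))
```

Pooling replicates with merge_replicates before calling fisher_exact mirrors the merging behaviour described above, at the cost of ignoring replicate-to-replicate variation.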

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Data file containing methylation proportions (represented by “methyl_count,total_count”, e.g. “20,30”) with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). This file can be a regular text file or compressed file (.gz, .bz2).

-g GROUP_FILE, --group=GROUP_FILE

Group file defining the biological group of each sample. It is a comma-separated, two-column file with the 1st column containing sample IDs and the 2nd column containing group IDs. It must have a header row. Sample IDs should match those in the “Data file”.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Output

  • Three columns (“Odds ratio”, “pvalue” and “FDR adjusted pvalue”) are appended to the original table.
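
The documentation does not state which FDR procedure is used. Assuming the widely used Benjamini-Hochberg method, the adjusted p-values could be derived from the raw p-values as in this sketch (the function name is hypothetical):

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four CpGs.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```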

1.22. dmc_logit.py

This program performs differential CpG analysis using a logistic regression model based on proportion values. It allows for covariate analysis. Users can choose either the “binomial” or the “quasibinomial” family to model the data. The quasibinomial family estimates an additional parameter indicating the amount of overdispersion.
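
As an illustration of the extra quasibinomial parameter, consider the simplest model with the group factor only and no covariates. In that case the fitted proportion of each group equals its pooled proportion, and the dispersion can be estimated as the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below works through that special case; the function name and the read counts are hypothetical, and models with covariates require an actual GLM fit.

```python
def pearson_dispersion(groups):
    """Estimate overdispersion for a group-only binomial model.

    groups maps each group ID to a list of (methyl_reads, total_reads) tuples.
    """
    chi_square = 0.0
    n_samples = 0
    for counts in groups.values():
        methyl_sum = sum(m for m, _ in counts)
        total_sum = sum(t for _, t in counts)
        p = methyl_sum / total_sum  # fitted proportion; must not be 0 or 1
        for methyl, total in counts:
            expected = total * p
            variance = total * p * (1.0 - p)  # binomial variance
            chi_square += (methyl - expected) ** 2 / variance
        n_samples += len(counts)
    residual_df = n_samples - len(groups)  # one parameter per group
    return chi_square / residual_df


# Hypothetical read counts of one CpG: two samples in each of two groups.
print(pearson_dispersion({"A": [(10, 20), (10, 20)], "B": [(5, 20), (15, 20)]}))
```

A value near 1 is what the binomial family assumes; values well above 1 indicate more sample-to-sample variation than binomial sampling alone would produce.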

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Data file containing methylation proportions (represented by “methyl_count,total_count”, e.g. “20,30”) with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). This file can be a regular text file or compressed file (.gz, .bz2).

-g GROUP_FILE, --group=GROUP_FILE

Group file defining the biological group of each sample as well as other covariates such as gender and age. The first variable is the grouping variable (must be categorical); all other variables are treated as covariates (can be categorical or continuous). Sample IDs should match those in the “Data file”.

-f FAMILY_FUNC, --family=FAMILY_FUNC

Error distribution and link function to be used in the GLM model. Can be the integer 1 or 2, where 1 = “quasibinomial” and 2 = “binomial”. Default=1.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$ dmc_logit.py -i test_04_TwoGroup.tsv.gz -g test_04_TwoGroup.grp.csv -o output_quasibin
$ dmc_logit.py -i test_04_TwoGroup.tsv.gz -g test_04_TwoGroup.grp.csv -f 2  -o output_bin

1.23. dmc_bb.py

This program performs differential CpG analysis using a “beta binomial” model on proportion values. It allows for covariate analysis.

Notes - You must install the R package aod before running this program.
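
In a beta-binomial model, the methylation proportion of a CpG is itself a random variable that follows a beta distribution, which lets the read counts vary more than a plain binomial allows. The sketch below evaluates the beta-binomial probability of observing k methylated reads out of n. It uses the standard alpha/beta parameterization for illustration; the function names are hypothetical, and nothing here is taken from the aod package, whose model fitting is not reproduced.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(k methylated reads out of n) when the proportion follows Beta(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


# Hypothetical example: 20 of 30 reads methylated, proportion ~ Beta(2, 1).
print(beta_binomial_pmf(20, 30, 2.0, 1.0))
```

Working with lgamma keeps the calculation stable for the large read counts typical of sequencing data.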

Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input-file=INPUT_FILE

Data file containing methylation proportions (represented by “methyl_count,total_count”, e.g. “20,30”) with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). This file can be a regular text file or compressed file (.gz, .bz2).

-g GROUP_FILE, --group=GROUP_FILE

Group file defining the biological group of each sample as well as other covariates such as gender and age. The first variable is the grouping variable (must be categorical); all other variables are treated as covariates (can be categorical or continuous). Sample IDs should match those in the “Data file”.

-o OUT_FILE, --output=OUT_FILE

Prefix of the output file.

Input files

Command

$ python3 ../bin/dmc_bb.py -i test_04_TwoGroup.tsv.gz -g test_04_TwoGroup.grp.csv -o OUT_bb