--- title: SEQLinkage keywords: fastai sidebar: home_sidebar summary: "Collapsed Haplotype Pattern Method for Linkage Analysis of Next-Generation Sequencing Data" description: "Collapsed Haplotype Pattern Method for Linkage Analysis of Next-Generation Sequencing Data" nb_path: "nbs/index.ipynb" ---
{% raw %}
{% endraw %}

Features

  • It can do linkage analysis on single variants and CHP markers.
  • It can deal with families from different population.
  • It can handle large-scale whole-genome linkage analysis.

Pre-requisites

Make sure you install the pre-requisited before running seqlink:

#install cstatgen
conda install -c conda-forge xeus-cling
conda install -c anaconda swig 
conda install -c conda-forge gsl
pip install egglib
git clone https://github.com/statgenetics/cstatgen.git
cd cstatgen
python setup.py install

#install paramlink2
R
install.packages("paramlink2")

Install

pip install SEQLinkage

How to use

{% raw %}
!seqlink --help
usage: seqlink [-h] [--chp] --fam FILE --vcf FILE [--anno FILE] [--pop FILE]
               [--included-vars FILE] [-b FILE] [-c P] [-o Name]
               [--build STRING] [--freq INFO] [--chrom-prefix STRING]
               [--run-linkage] [-K FLOAT] [--moi STRING] [-W FLOAT] [-M FLOAT]
               [--theta-max FLOAT] [--theta-inc FLOAT]

SEQLinkage V2, linkage analysis using sequence data

options:
  -h, --help            show this help message and exit

Collapsed haplotype pattern method arguments:
  --chp                 Generate CHP markers.
  --fam FILE            Input pedigree and phenotype information in FAM
                        format.
  --vcf FILE            Input VCF file, bgzipped.
  --anno FILE           Input annotation file from annovar.
  --pop FILE            Input two columns file, first column is family ID,
                        second column population information.
  --included-vars FILE  Variants to be included for linkage analysis, if None,
                        the analysis won't filter any variants. But you can
                        still set AF cutoff by -c argment.
  -b FILE, --blueprint FILE
                        Blueprint file that defines regional marker (format:
                        "chr startpos endpos name avg.distance male.distance
                        female.distance").
  -c P, --maf-cutoff P  MAF cutoff to define variants to be excluded from
                        analyses. this is useful, if you need to analyse
                        multiple population together.
  -o Name, --output Name
                        Output name prefix.
  --build STRING        Reference genome version for VCF file.
  --freq INFO           Info field name for allele frequency in VCF file.
  --chrom-prefix STRING
                        Prefix to chromosome name in VCF file if applicable,
                        e.g. "chr".

LINKAGE options:
  --run-linkage         Perform Linkage analysis.
  -K FLOAT, --prevalence FLOAT
                        Disease prevalence. Default to 0.001.
  --moi STRING          Mode of inheritance, AD/AR: autosomal
                        dominant/recessive. Default to AD.
  -W FLOAT, --wt-pen FLOAT
                        Penetrance for wild type. Default to 0.01.
  -M FLOAT, --mut-pen FLOAT
                        Penetrance for mutation. Default to 0.9.
  --theta-max FLOAT     Theta upper bound. Default to 0.5.
  --theta-inc FLOAT     Theta increment. Default to 0.05.
{% endraw %}

Linkage analysis on specific regions

Normally, the regions are gene regions. you can also use self-defined regions, such as promoter regions, enhancer regions.

{% raw %}
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_chp --chp --run-linkage
MESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam]
MESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_chp', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)
MESSAGE: 18 samples found in FAM file but not in VCF file:

MESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz]
MESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...
MESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units
SNVHap MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
MESSAGE: write to pickle: data/test_chp/chr1result/chr1result0.pickle,Gene number:4,Time:0.00012452216405007575
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.3767304867506027
create data/test_chp/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
MESSAGE: ============= Finish analysis ==============
{% endraw %}
{% raw %}
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_var --run-linkage
MESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam]
MESSAGE: Namespace(chp=False, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_var', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)
MESSAGE: 18 samples found in FAM file but not in VCF file:

MESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz]
MESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...
MESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units
MESSAGE: write to pickle: data/test_var/chr1result/chr1result0.pickle,Gene number:4,Time:3.229235919813315e-05
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.6164871137589216
create data/test_var/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
MESSAGE: ============= Finish analysis ==============
{% endraw %}

No annotation

If you don't have the annotation file. there is no need to add --pop. And --freq should be setted based on the INFO column in vcf file.

{% raw %}
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --freq='AF' --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_chp_na --chp --run-linkage
MESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam]
MESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno=None, pop=None, included_vars=None, blueprint='testdata/test_blueprint_ext.txt', maf_cutoff=0.05, output='data/test_chp_na', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)
MESSAGE: 18 samples found in FAM file but not in VCF file:

MESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz]
MESSAGE: Loading marker map from [testdata/test_blueprint_ext.txt] ...
MESSAGE: 6 families with a total of 18 samples will be scanned for 12 pre-defined units
SNVHap MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
MESSAGE: write to pickle: data/test_chp_na/chrallresult/chrallresult0.pickle,Gene number:4,Time:6.505574979301956e-05
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.input
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.lods
0.3394373469054699
create data/test_chp_na/chrallresult/chrallresult0_AFcutoff0.05_linkage.besthlod
MESSAGE: ============= Finish analysis ==============
{% endraw %}

Whole-genome linkage analysis

if --blueprint is not provided, the genomic region will be seperated to pseudogenes with 1000 variants.

{% raw %}
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt -c 0.05 -o data/test_chp_wg --chp --run-linkage
MESSAGE: Binary trait detected in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam]
MESSAGE: Generate regions by annotation
MESSAGE: Namespace(chp=True, tfam='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_ped.fam', vcf='/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz', anno='testdata/test_chr1_anno.csv', pop='testdata/test_fam_pop.txt', included_vars=None, blueprint=None, maf_cutoff=0.05, output='data/test_chp_wg', build='hg38', freq='AF', chr_prefix=None, run_linkage=True, prevalence=0.001, inherit_mode='AD', wild_pen=0.01, muta_pen=0.9, theta_max=0.5, theta_inc=0.05)
MESSAGE: 18 samples found in FAM file but not in VCF file:

MESSAGE: 18 samples found in [/mnt/vast/hpc/csg/yin/Github/linkage/SEQpy3/nbs/testdata/test_snps.vcf.gz]
MESSAGE: separate chromosome to regions
MESSAGE: 6 families with a total of 18 samples will be scanned for 1 pre-defined units
MESSAGE: write to pickle: data/test_chp_wg/chr1result/chr1result0.pickle,Gene number:1,Time:0.00011511449753824207
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.input
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.lods
0.23506776429712772
create data/test_chp_wg/chr1result/chr1result0_AFcutoff0.05_linkage.besthlod
MESSAGE: ============= Finish analysis ==============
{% endraw %}

Input format

  • --fam, Fam file (required, format: "fid iid fathid mothid sex trait[1 control, 2 case, -9 or 0 missing]")
{% raw %}
%%writefile testdata/test_ped.fam
1033    1033_2  0       0       2       -9
1033    1033_1  0       0       1       -9
1033    1033_99 1033_1  1033_2  2       1
1033    1033_9  1033_1  1033_2  2       1
1033    1033_3  1033_1  1033_2  2       2
1036    1036_99 1036_1  1036_2  2       2
1036    1036_6  0       0       1       2
1036    1036_1  0       0       1       -9
1036    1036_3  1036_6  1036_99 2       1
1036    1036_4  1036_6  1036_99 2       1
1036    1036_2  0       0       2       -9
1036    1036_5  1036_6  1036_99 1       1
10J_103 10J_103_10      0       0       1       -9
10J_103 10J_103_4       0       0       1       -9
10J_103 10J_103_3       0       0       2       -9
10J_103 10J_103_2       10J_103_4       10J_103_3       2       2
10J_103 10J_103_1       10J_103_10      10J_103_3       1       2
10J_109 10J_109_2       10J_109_6       10J_109_5       1       2
10J_109 10J_109_3       10J_109_6       10J_109_5       1       2
10J_109 10J_109_4       10J_109_6       10J_109_5       1       2
10J_109 10J_109_6       0       0       1       -9
10J_109 10J_109_1       10J_109_6       10J_109_5       1       2
10J_109 10J_109_5       0       0       2       2
10J_109 10J_109_7       10J_109_6       10J_109_5       1       1
10J_112 10J_112_3       0       0       1       1
10J_112 10J_112_5       10J_112_3       10J_112_2       1       2
10J_112 10J_112_1       10J_112_3       10J_112_2       2       1
10J_112 10J_112_7       10J_112_3       10J_112_2       1       1
10J_112 10J_112_2       0       0       2       2
10J_119 10J_119_2       0       0       1       1
10J_119 10J_119_5       0       0       2       1
10J_119 10J_119_4       0       0       1       1
10J_119 10J_119_6       10J_119_4       10J_119_5       1       2
10J_119 10J_119_7       10J_119_4       10J_119_5       2       2
10J_119 10J_119_1       10J_119_4       10J_119_5       2       2
10J_119 10J_119_3       10J_119_2       10J_119_1       1       1
Overwriting ../testdata/test_ped.fam
{% endraw %}
  • --vcf, VCF file (required, vcf.gz with vcf.gz.tbi)
    bgzip -c file.vcf > file.vcf.gz
    tabix -p vcf file.vcf.gz
  • --anno, Annotation file from ANNOVAR, It must contains the allele frequency for population in the file of family population information. For example in here, The annotation file must have AF_amr, AF_afr, AF_nfe columns.
{% raw %}
anno=pd.read_csv('testdata/test_chr1_anno.csv')
{% endraw %} {% raw %}
anno
Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene ... CLNDISDB CLNREVSTAT CLNSIG DN ID Patient ID Phenotype Platform Study Pubmed ID Otherinfo1
0 1 10140 10147 ACCCTAAC A intergenic NONE;DDX11L1 dist=NONE;dist=1727 . . ... . . . . . . . . . chr1:10140:ACCCTAAC:A
1 1 10146 10147 AC A intergenic NONE;DDX11L1 dist=NONE;dist=1727 . . ... . . . . . . . . . chr1:10146:AC:A
2 1 10146 10148 ACC * . . . . . ... . . . . . . . . . chr1:10146:ACC:*
3 1 10150 10151 CT C intergenic NONE;DDX11L1 dist=NONE;dist=1723 . . ... . . . . . . . . . chr1:10150:CT:C
4 1 10172 10177 CCCTAA C intergenic NONE;DDX11L1 dist=NONE;dist=1697 . . ... . . . . . . . . . chr1:10172:CCCTAA:C
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 1 66479 66487 TATTTATAG * . . . . . ... . . . . . . . . . chr1:66479:TATTTATAG:*
996 1 66480 66481 AT A intergenic FAM138A;OR4F5 dist=30399;dist=2610 . . ... . . . . . . . . . chr1:66480:AT:A
997 1 66480 66483 ATTT * . . . . . ... . . . . . . . . . chr1:66480:ATTT:*
998 1 66481 66488 TTTATAGA T intergenic FAM138A;OR4F5 dist=30400;dist=2603 . . ... . . . . . . . . . chr1:66481:TTTATAGA:T
999 1 66481 66488 TTTATAGA * . . . . . ... . . . . . . . . . chr1:66481:TTTATAGA:*

1000 rows × 152 columns

{% endraw %} {% raw %}
anno.loc[:,['AF_amr', 'AF_afr', 'AF_nfe']]
AF_amr AF_afr AF_nfe
0 0.0006 0.0008 0.0007
1 0.6380 0.6300 0.6413
2 . . .
3 0.0357 0.0426 0.0370
4 0.0086 0.0097 0.0084
... ... ... ...
995 . . .
996 0.0039 0.0036 0.0076
997 . . .
998 0.0163 0.0410 0.0279
999 . . .

1000 rows × 3 columns

{% endraw %}

Or, create a self-defined annotation file like this:

    Chr Start   AF_amr  AF  AF_nfe  AF_afr
chr1:10140:ACCCTAAC:A   1   10140   0.0006  0.0007  0.0007  0.0008
chr1:10146:AC:A 1   10146   0.638   0.6328  0.6413  0.63
chr1:10150:CT:C 1   10150   0.0357  0.0375  0.037   0.0426
chr1:10172:CCCTAA:C 1   10172   0.0086  0.0082  0.0084  0.0097
chr1:10178:CCTAA:C  1   10178   0.5 0.3333  0.2955  0.4375
chr1:10198:TAACCC:T 1   10198   0.0 0.0 0.0 0.0
chr1:10231:C:A  1   10231   0.2 0.0366  0.0 0.05
chr1:10236:AACCCT:A 1   10236   0.0 0.0 0.0 0.0
chr1:10247:TAAACCCTA:T  1   10247   0.2222  0.2089  0.1429  0.4211

The index must match with the ID in vcf file.

  • --pop, The file of family population information
{% raw %}
%%writefile testdata/test_fam_pop.txt
1033 AF_amr
1036 AF_amr
10J_103 AF_afr
10J_109 AF_nfe
10J_112 AF_nfe
10J_119 AF_nfe
Writing ../testdata/test_fam_pop.txt
{% endraw %}
  • --included-vars, The file with one column of variants For example:
    chr1:10140:ACCCTAAC:A
    chr1:10172:CCCTAA:C
    chr1:10198:TAACCC:T
    chr1:10236:AACCCT:A
    chr1:10261:T:TA
    chr1:10262:AACCCT:A
  • --blueprint, The blueprint file that defines regional marker (format: "chr startpos endpos name avg.distance male.distance female.distance"). The first four columns are required.
{% raw %}
%%writefile testdata/test_blueprint_ext.txt
1       11868   14362   LOC102725121@1  9.177127474362311e-07   1.1657192989882668e-06  6.814189157634088e-07
1       11873   14409   DDX11L1 9.195320788455595e-07   1.1680302941673515e-06  6.82769803434766e-07
1       14361   29370   WASH7P  1.5299877409602128e-06  1.94345806118021e-06    1.136044574393209e-06
1       17368   17436   MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1 1.217692507120495e-06   1.5467668502473368e-06  9.041595098829462e-07
1       30365   30503   MIR1302-10@1,MIR1302-11@1,MIR1302-2@1,MIR1302-9@1       2.1295973889038703e-06  2.705108741548526e-06   1.5812659765416382e-06
1       34610   36081   FAM138A@1,FAM138C@1,FAM138F@1   2.4732411024120156e-06  3.1416201771056266e-06  1.8364278747737466e-06
1       69090   70008   OR4F5   4.866641545668504e-06   6.181823219621424e-06   3.6135725636621673e-06
1       134772  140566  LOC729737       9.633289838108921e-06   1.2236630588823159e-05  7.152898262617822e-06
1       490755  495445  LOC100132062@1,LOC100132287@1   2.2828130832833112e-05  2.8997300893994373e-05  1.6950315013571593e-05
1       450739  451678  OR4F16@1,OR4F29@1,OR4F3@1       2.575942360468604e-05   3.2720758549649544e-05  1.912685483821856e-05
1       627379  629009  LOC101928626    3.943568768003252e-05   5.009295373297854e-05   2.9281737249900675e-05
1       632614  632685  MIR12136        3.974742311959244e-05   5.048893386847169e-05   2.9513206656665908e-05
Overwriting testdata/test_blueprint_ext.txt
{% endraw %}

Or

{% raw %}
%%writefile testdata/test_blueprint.txt
1       11868   14362   LOC102725121@1
1       11873   14409   DDX11L1
1       14361   29370   WASH7P
1       17368   17436   MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
1       30365   30503   MIR1302-10@1,MIR1302-11@1,MIR1302-2@1,MIR1302-9@1
1       34610   36081   FAM138A@1,FAM138C@1,FAM138F@1
1       69090   70008   OR4F5
1       134772  140566  LOC729737
1       490755  495445  LOC100132062@1,LOC100132287@1
1       450739  451678  OR4F16@1,OR4F29@1,OR4F3@1
1       627379  629009  LOC101928626
1       632614  632685  MIR12136
Overwriting testdata/test_blueprint.txt
{% endraw %}

Output format