---
title: Title
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/06_SNPmatch.ipynb"
---
{% raw %}
{% endraw %}

pip install biopython

Two scenarios

  1. One dataset (VCF) is used by different software (regenie and fine-mapping). We only need to consider allele flips, and all SNPs in the sumstats must be present in the genotype data.
  2. Two datasets. They may have strand flips, reversed reference alleles, or both, so every kind of mismatch has to be considered. I think the best way is to show all potential results (including multiple SNPs at one position and variants longer than one base) and print a summary, which may contain very important information.

For the region extraction case, I think we only need to consider scenario 1: the genotype coding is made consistent across software by flipping the sign of beta in the sumstats. For merging exome and imputed data, it is not necessary to consider SNPs that overlap between the two datasets. If two SNPs are the same, both of them will be shown in a credible set.
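The sign flip described for scenario 1 can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name `align_beta_to_geno` is hypothetical, and it assumes that the sumstats and the genotype table share the same index and use the column names `REF`, `ALT`, `BETA` and `a0`, `a1`.

```python
import pandas as pd

def align_beta_to_geno(ss: pd.DataFrame, bim: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: make sumstats alleles agree with genotype alleles.

    Assumes both frames share the same index (one row per variant), that `ss`
    has REF, ALT and BETA columns and that `bim` has a0 and a1 columns.
    """
    out = ss.copy()
    # A variant is "swapped" when REF/ALT in the sumstats are a1/a0 in the genotype data.
    swapped = (ss["REF"] == bim["a1"]) & (ss["ALT"] == bim["a0"])
    # Adopt the genotype coding and negate the effect size for swapped variants.
    out.loc[swapped, "REF"] = bim.loc[swapped, "a0"]
    out.loc[swapped, "ALT"] = bim.loc[swapped, "a1"]
    out.loc[swapped, "BETA"] = -out.loc[swapped, "BETA"]
    return out

# Toy data: v1 has swapped alleles, v2 already agrees.
ss = pd.DataFrame({"REF": ["A", "G"], "ALT": ["G", "C"], "BETA": [1.5, -0.2]},
                  index=["v1", "v2"])
bim = pd.DataFrame({"a0": ["G", "G"], "a1": ["A", "C"]}, index=["v1", "v2"])
aligned = align_beta_to_geno(ss, bim)
print(aligned)
```

Using `.loc` with a boolean mask keeps the assignment on a single object and avoids chained indexing.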

{% raw %}
{% endraw %} {% raw %}

check_indels[source]

check_indels(query)

{% endraw %} {% raw %}

load_yaml[source]

load_yaml(yaml_file)

{% endraw %} {% raw %}

parse_input[source]

parse_input(yml_input)

{% endraw %} {% raw %}
{% endraw %} {% raw %}

match_ss_with_bim[source]

match_ss_with_bim(query, subject)

SNP matching within a single dataset.

{% endraw %} {% raw %}
{% endraw %} {% raw %}

check_ss[source]

check_ss(ss, bim)

{% endraw %} {% raw %}
{% endraw %} {% raw %}

compare_snps[source]

compare_snps(query, subject, only_match=True)

Input: query and subject, two data frames whose first five columns are chr, pos, snp, a0, a1. Output: a data frame with six boolean columns (keep, exact, flip, reverse, both, complement) and two columns holding the indices of query and subject.

{% endraw %} {% raw %}

allele_match[source]

allele_match(a0, a1, ref0, ref1)

Input: a0 and a1 are the alleles of the first SNP; ref0 and ref1 are the alleles of the second SNP. Output: six boolean values (keep, exact, flip, reverse, both, complement).

{% endraw %} {% raw %}
{% endraw %} {% raw %}

check_ss1[source]

check_ss1(ss, bim, keep_ambiguous=True)

Index by chr + pos + ordered ref and alt alleles.
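A key of this kind can be sketched as below. The function `ordered_allele_key` is a hypothetical stand-in, not the package's `namebyordA0_A1`; it assumes the `chr:pos_allele:allele` layout that appears in the index outputs later in this notebook, with the two alleles sorted in descending string order so that swapped REF/ALT pairs map to the same key.

```python
import pandas as pd

def ordered_allele_key(df, cols=("CHR", "POS", "A0", "A1")):
    """Hypothetical sketch: build 'chr:pos_allele:allele' keys with ordered alleles."""
    chrom, pos, a0, a1 = (df[c].astype(str) for c in cols)
    keys = []
    for c, p, x, y in zip(chrom, pos, a0, a1):
        # Sort the two alleles so that A/G and G/A give the same key.
        hi, lo = sorted((x, y), reverse=True)
        keys.append(f"{c}:{p}_{hi}:{lo}")
    return pd.Index(keys)

# Toy rows (made up for illustration): one SNP and one deletion-like variant.
toy = pd.DataFrame({"CHR": [5, 5], "POS": [272741, 432782],
                    "REF": ["A", "T"], "ALT": ["G", "TC"]})
keys = ordered_allele_key(toy, cols=("CHR", "POS", "REF", "ALT"))
print(list(keys))
```

Because the key ignores which allele is the reference, it is only suitable for pairing rows; the REF/ALT orientation still has to be compared afterwards to decide whether beta needs a sign change.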

{% endraw %} {% raw %}
{% endraw %} {% raw %}

pair_match[source]

pair_match(a1, a2, ref1, ref2)

{% endraw %} {% raw %}

strand_flip[source]

strand_flip(s)

{% endraw %} {% raw %}

namebyordA0_A1[source]

namebyordA0_A1(df, cols=['CHR', 'POS', 'A0', 'A1'])

{% endraw %} {% raw %}
{% endraw %} {% raw %}

snps_match[source]

snps_match(query, subject, keep_ambiguous=True)

{% endraw %} {% raw %}
{% endraw %} {% raw %}

snps_match_dup[source]

snps_match_dup(query, subject, keep_ambiguous=True)

{% endraw %} {% raw %}

snps_match_nodup[source]

snps_match_nodup(query, subject, keep_ambiguous=True)

{% endraw %} {% raw %}
{% endraw %}

2. Testing snps_match

1. Testing the check_ss1 function

{% raw %}
region = [5,272741,1213528]
geno_path = '../MWE_region_extraction/ukb23156_c5.merged.filtered.5_272741_1213528.bed'
sumstats_path = '../MWE_region_extraction/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2_f3393.regenie.snp_stats'
pheno_path = None
unr_path = 'MWE_region_extraction/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.white_europeans.filtered.092821_ldprun_unrelated.filtered.prune.txt'
imp_geno_path = '../MWE_region_extraction/ukb_imp_chr5_v3_05_272856_1213643.bgen'
imp_sumstats_path = '../MWE_region_extraction/100521_UKBB_Hearing_aid_f3393_expandedwhite_15601cases_237318ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.gz'
imp_ref = 'hg19'
bgen_sample_path = '../MWE_region_extraction/ukb_imp_chr5_v3_05_272856_1213643.sample'
output_sumstats = 'test.snp_stats.gz'
output_LD = 'test_corr.csv.gz'

#main(region,geno_path,sumstats_path,pheno_path,unr_path,imp_geno_path,imp_sumstats_path,imp_ref,output_sumstats,output_LD)
{% endraw %} {% raw %}
imp_geno_path = '/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr5_v3.bgen'
{% endraw %} {% raw %}
exome_sumstats = Sumstat(sumstats_path)
exome_sumstats.extractbyregion(region)
this region [5, 272741, 1213528] has 3884 SNPs in Sumstat
{% endraw %} {% raw %}
ss.index = range(len(ss))
{% endraw %} {% raw %}
ss.loc[ss.index[:10]]
              CHR     POS  REF  ALT              SNP       BETA        SE             P
5:272741_G:A    5  272741    A    G  chr5:272741:A:G   1.789340  0.250526  7.055048e-11
5:272748_G:C    5  272748    G    C  chr5:272748:G:C   1.698750  0.279645  2.479648e-08
5:272755_G:A    5  272755    A    G  chr5:272755:A:G   1.708990  0.242533  1.335826e-10
5:272758_G:A    5  272758    A    G  chr5:272758:A:G  -1.285990  1.782120  4.705362e-01
5:272771_G:C    5  272771    C    G  chr5:272771:C:G   1.620460  0.466804  1.956681e-03
5:272791_C:A    5  272791    A    C  chr5:272791:A:C   0.014049  0.071979  8.452456e-01
5:272816_T:C    5  272816    C    T  chr5:272816:C:T  -0.259206  0.276487  3.485018e-01
5:272822_T:G    5  272822    T    G  chr5:272822:T:G  -1.267800  1.258490  3.137451e-01
5:272829_T:C    5  272829    T    C  chr5:272829:T:C   0.301320  0.878689  7.316595e-01
5:275930_G:A    5  275930    G    A  chr5:275930:G:A  -0.379434  0.533934  4.773083e-01
{% endraw %} {% raw %}
tmp = namebyordA0_A1(exome_sumstats.ss[['CHR','POS','REF','ALT']])
{% endraw %} {% raw %}
tmp = pd.Series(tmp).str.split('_')
{% endraw %} {% raw %}
tmp
0        [5:272741, G:A]
1        [5:272748, G:C]
2        [5:272755, G:A]
3        [5:272758, G:A]
4        [5:272771, G:C]
              ...       
3879    [5:1213517, T:C]
3880    [5:1213518, G:A]
3881    [5:1213524, T:C]
3882    [5:1213527, T:C]
3883    [5:1213528, G:A]
Length: 3884, dtype: object
{% endraw %} {% raw %}
ss = exome_sumstats.ss
{% endraw %} {% raw %}
ss.index = pd.Series(tmp)
{% endraw %} {% raw %}
ss.index
Index(['5:272741_G:A', '5:272748_G:C', '5:272755_G:A', '5:272758_G:A',
       '5:272771_G:C', '5:272791_C:A', '5:272816_T:C', '5:272822_T:G',
       '5:272829_T:C', '5:275930_G:A',
       ...
       '5:1213465_T:C', '5:1213471_T:C', '5:1213482_T:C', '5:1213483_G:A',
       '5:1213495_T:C', '5:1213517_T:C', '5:1213518_G:A', '5:1213524_T:C',
       '5:1213527_T:C', '5:1213528_G:A'],
      dtype='object', length=3884)
{% endraw %} {% raw %}
tmp = ss.index.to_series().apply(lambda x: x.split(':')[0])
{% endraw %} {% raw %}
tmp.unique()
array(['5'], dtype=object)
{% endraw %} {% raw %}
len(ss.index[0].split('_')[0].split(':'))
2
{% endraw %} {% raw %}
'ab_c'.split('_')[0]
'ab'
{% endraw %} {% raw %}
ss.index.isin(ss.index)
array([ True,  True,  True, ...,  True,  True,  True])
{% endraw %} {% raw %}
ss.index[ss.index.duplicated(keep=False)]
Index(['5:432782_TC:T', '5:432782_TC:T', '5:473309_AGCG:A', '5:473309_AGCG:A',
       '5:612613_AC:A', '5:612613_AC:A', '5:665193_GCCTTGC:G',
       '5:665193_GCCTTGC:G', '5:692497_GC:G', '5:692497_GC:G', '5:889546_AC:A',
       '5:889546_AC:A', '5:891687_ACTT:A', '5:891687_ACTT:A', '5:912005_AT:A',
       '5:912005_AT:A', '5:1035376_AC:A', '5:1035376_AC:A',
       '5:1038307_GCACGAGCACCACCACCAC:G', '5:1038307_GCACGAGCACCACCACCAC:G',
       '5:1038331_GCAC:G', '5:1038331_GCACCAC:G', '5:1038331_GCAC:G',
       '5:1038331_GCACCACCAC:G', '5:1038331_GCACCAC:G',
       '5:1038331_GCACCACCAC:G', '5:1056514_CCA:C', '5:1056514_CCA:C',
       '5:1112017_TCCCGG:T', '5:1112017_TCCCGG:T'],
      dtype='object')
{% endraw %} {% raw %}
list(tmp)
[['5:272741', 'G:A'],
 ['5:272748', 'G:C'],
 ['5:272755', 'G:A'],
 ['5:272758', 'G:A'],
 ['5:272771', 'G:C'],
 ['5:272791', 'C:A'],
 ['5:272816', 'T:C'],
 ['5:272822', 'T:G'],
 ['5:272829', 'T:C'],
 ['5:275930', 'G:A']]
{% endraw %} {% raw %}
tmp.apply(lambda x: x[0])
0    5:272741
1    5:272748
2    5:272755
3    5:272758
4    5:272771
5    5:272791
6    5:272816
7    5:272822
8    5:272829
9    5:275930
dtype: object
{% endraw %} {% raw %}
exome_geno = Genodata(geno_path)
{% endraw %} {% raw %}
bim = exome_geno.bim
{% endraw %} {% raw %}
check_ss1(exome_sumstats.ss,bim)
Paired SNPs 3847
Warning: there are 402 ambiguous SNPs
Overlap SNPs 3847
/tmp/2206559.1.plot.q/ipykernel_14021/227429081.py:31: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_query.BETA[pm.sign_flip] = -new_query.BETA[pm.sign_flip]
               CHR      POS  REF  ALT               SNP       BETA        SE             P
5:272741:G:A     5   272741    G    A   chr5:272741:G:A  -1.789340  0.250526  7.055048e-11
5:272748:G:C     5   272748    C    G   chr5:272748:C:G  -1.698750  0.279645  2.479648e-08
5:272755:G:A     5   272755    G    A   chr5:272755:G:A  -1.708990  0.242533  1.335826e-10
5:272758:G:A     5   272758    G    A   chr5:272758:G:A   1.285990  1.782120  4.705362e-01
5:272771:G:C     5   272771    G    C   chr5:272771:G:C  -1.620460  0.466804  1.956681e-03
...            ...      ...  ...  ...               ...        ...       ...           ...
5:1213517:T:C    5  1213517    T    C  chr5:1213517:T:C   1.093870  1.988000  5.821595e-01
5:1213518:G:A    5  1213518    A    G  chr5:1213518:A:G   1.221880  1.772910  4.906999e-01
5:1213524:T:C    5  1213524    C    T  chr5:1213524:C:T  -0.151751  0.664055  8.192405e-01
5:1213527:T:C    5  1213527    T    C  chr5:1213527:T:C   1.080770  1.865680  5.623918e-01
5:1213528:G:A    5  1213528    A    G  chr5:1213528:A:G   1.057160  2.654680  6.904639e-01

3847 rows × 8 columns

{% endraw %} {% raw %}
imput_sumstats = Sumstat(imp_sumstats_path)
imput_geno = Genodata(imp_geno_path,bgen_sample_path)
{% endraw %} {% raw %}
check_ss1(imput_sumstats.ss,imput_geno.bim)
Paired SNPs 632
Warning: there are 85 ambiguous SNPs
Overlap SNPs 632
/tmp/2206559.1.plot.q/ipykernel_3137/3698415862.py:34: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_query.BETA[pm.sign_flip] = -new_query.BETA[pm.sign_flip]
CHR POS REF ALT SNP BETA SE P
5:73072354:T:C 5 73072354 T C chr5:73072354:T:C 0.101991 0.012277 1.073001e-16
5:73072504:C:A 5 73072504 A C chr5:73072504:A:C 0.015024 0.036310 6.790301e-01
5:73072530:G:A 5 73072530 A G chr5:73072530:A:G 0.095930 0.050772 5.883557e-02
5:73072863:G:A 5 73072863 G A chr5:73072863:G:A -0.072351 0.667580 9.136960e-01
5:73072873:G:A 5 73072873 A G chr5:73072873:A:G 0.042773 0.543868 9.373146e-01
... ... ... ... ... ... ... ... ...
5:73144301:T:C 5 73144301 T C chr5:73144301:T:C 0.079786 0.012277 8.143292e-11
5:73144431:TAG:T 5 73144431 TAG T chr5:73144431:TAG:T 0.029964 0.023528 2.028154e-01
5:73144432:G:A 5 73144432 A G chr5:73144432:A:G 0.088728 0.012569 1.627796e-12
5:73144716:G:A 5 73144716 G A chr5:73144716:G:A 0.300375 0.575205 6.015282e-01
5:73144845:G:A 5 73144845 A G chr5:73144845:A:G 0.090138 0.012494 5.022269e-13

632 rows × 8 columns

{% endraw %} {% raw %}
ss
CHR POS REF ALT SNP BETA SE P
5:73072354:T:C 5 73072354 T C chr5:73072354:T:C 0.101991 0.012277 1.073001e-16
5:73072504:C:A 5 73072504 A C chr5:73072504:A:C 0.015024 0.036310 6.790301e-01
5:73072530:G:A 5 73072530 A G chr5:73072530:A:G 0.095930 0.050772 5.883557e-02
5:73072863:G:A 5 73072863 G A chr5:73072863:G:A -0.072351 0.667580 9.136960e-01
5:73072873:G:A 5 73072873 A G chr5:73072873:A:G 0.042773 0.543868 9.373146e-01
... ... ... ... ... ... ... ... ...
5:73144301:T:C 5 73144301 T C chr5:73144301:T:C 0.079786 0.012277 8.143292e-11
5:73144431:TAG:T 5 73144431 TAG T chr5:73144431:TAG:T 0.029964 0.023528 2.028154e-01
5:73144432:G:A 5 73144432 A G chr5:73144432:A:G 0.088728 0.012569 1.627796e-12
5:73144716:G:A 5 73144716 G A chr5:73144716:G:A 0.300375 0.575205 6.015282e-01
5:73144845:G:A 5 73144845 A G chr5:73144845:A:G 0.090138 0.012494 5.022269e-13

632 rows × 8 columns

{% endraw %} {% raw %}
bim
                chrom                snp   cm       pos  a0  a1        i
5:73072354:T:C      5  chr5:73072354:T:C  0.0  73072354   T   C  2401738
5:73072371:T:C      5  chr5:73072371:C:T  0.0  73072371   C   T  2401739
5:73072425:G:A      5  chr5:73072425:G:A  0.0  73072425   G   A  2401740
5:73072501:T:G      5  chr5:73072501:G:T  0.0  73072501   G   T  2401741
5:73072504:C:A      5  chr5:73072504:A:C  0.0  73072504   A   C  2401742
...               ...                ...  ...       ...  ..  ..      ...
5:73144742:G:A      5  chr5:73144742:G:A  0.0  73144742   G   A  2404138
5:73144750:G:A      5  chr5:73144750:A:G  0.0  73144750   A   G  2404139
5:73144765:G:A      5  chr5:73144765:A:G  0.0  73144765   A   G  2404140
5:73144823:G:A      5  chr5:73144823:G:A  0.0  73144823   G   A  2404141
5:73144845:G:A      5  chr5:73144845:A:G  0.0  73144845   A   G  2404142

2405 rows × 7 columns

{% endraw %} {% raw %}
imput_sumstats
sumstat:                  CHR       POS  REF ALT                  SNP      BETA  \
5:73072354:T:C      5  73072354    T   C    chr5:73072354:T:C  0.101991   
5:73072504:C:A      5  73072504    A   C    chr5:73072504:A:C  0.015024   
5:73072530:G:A      5  73072530    A   G    chr5:73072530:A:G  0.095930   
5:73072863:G:A      5  73072863    G   A    chr5:73072863:G:A -0.072351   
5:73072873:G:A      5  73072873    A   G    chr5:73072873:A:G  0.042773   
...               ...       ...  ...  ..                  ...       ...   
5:73144301:T:C      5  73144301    T   C    chr5:73144301:T:C  0.079786   
5:73144431:TAG:T    5  73144431  TAG   T  chr5:73144431:TAG:T  0.029964   
5:73144432:G:A      5  73144432    A   G    chr5:73144432:A:G  0.088728   
5:73144716:G:A      5  73144716    G   A    chr5:73144716:G:A  0.300375   
5:73144845:G:A      5  73144845    A   G    chr5:73144845:A:G  0.090138   

                        SE             P  
5:73072354:T:C    0.012277  1.073001e-16  
5:73072504:C:A    0.036310  6.790301e-01  
5:73072530:G:A    0.050772  5.883557e-02  
5:73072863:G:A    0.667580  9.136960e-01  
5:73072873:G:A    0.543868  9.373146e-01  
...                    ...           ...  
5:73144301:T:C    0.012277  8.143292e-11  
5:73144431:TAG:T  0.023528  2.028154e-01  
5:73144432:G:A    0.012569  1.627796e-12  
5:73144716:G:A    0.575205  6.015282e-01  
5:73144845:G:A    0.012494  5.022269e-13  

[632 rows x 8 columns]
{% endraw %} {% raw %}
region = [5, 73776529, 73849020]
hg38toimpref = Liftover('hg38','hg19')
imp_region = hg38toimpref.region_liftover(region)
imput_sumstats.extractbyregion(imp_region)
imput_geno.extractbyregion(imp_region)
imput_sumstats.match_ss(imput_geno.bim)
imput_geno.geno_in_stat(imput_sumstats.ss)
this region (5, 73072354, 73144845) has 632 SNPs in Sumstat
this region (5, 73072354, 73144845) has 2405 SNPs in Genodata
Paired SNPs 632
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/2206559.1.plot.q/ipykernel_3137/2846184474.py in <module>
      4 imput_sumstats.extractbyregion(imp_region)
      5 imput_geno.extractbyregion(imp_region)
----> 6 imput_sumstats.match_ss(imput_geno.bim)
      7 imput_geno.geno_in_stat(imput_sumstats.ss)

/mnt/mfs/statgen/yin/Github/LDtools/LDtools/sumstat.py in match_ss(self, bim)
     61 
     62     def match_ss(self,bim):
---> 63         self.ss = check_ss1(self.ss,bim)
     64 

/mnt/mfs/statgen/yin/Github/LDtools/LDtools/utils.py in check_ss1(ss, bim, keep_ambiguous)
    153     print("Paired SNPs",ss.shape[0])
    154     #input paired ss and bim
--> 155     pm = pair_match(ss.REF,ss.ALT,bim.a0,bim.a1)
    156     if keep_ambiguous:
    157         print('Warning: there are',sum(pm.ambiguous),'ambiguous SNPs')

/mnt/mfs/statgen/yin/Github/LDtools/LDtools/utils.py in pair_match(a1, a2, ref1, ref2)
    183     result["ambiguous"] = ((a1=="A") & (a2=="T")) | ((a1=="T") & (a2=="A")) | ((a1=="C") & (a2=="G")) | ((a1=="G") & (a2=="C"))
    184     # as long as scenario 1 is involved, sign_flip will return TRUE
--> 185     result["sign_flip"] = ((a1==ref2) & (a2==ref1)) | ((a1==flip2) & (a2==flip1))
    186     # as long as scenario 2 is involved, strand_flip will return TRUE
    187     result["strand_flip"] = ((a1==flip1) & (a2==flip2)) | ((a1==flip2) & (a2==flip1))

~/miniconda3/lib/python3.8/site-packages/pandas/core/ops/common.py in new_method(self, other)
     67         other = item_from_zerodim(other)
     68 
---> 69         return method(self, other)
     70 
     71     return new_method

~/miniconda3/lib/python3.8/site-packages/pandas/core/arraylike.py in __eq__(self, other)
     30     @unpack_zerodim_and_defer("__eq__")
     31     def __eq__(self, other):
---> 32         return self._cmp_method(other, operator.eq)
     33 
     34     @unpack_zerodim_and_defer("__ne__")

~/miniconda3/lib/python3.8/site-packages/pandas/core/series.py in _cmp_method(self, other, op)
   5494 
   5495         if isinstance(other, Series) and not self._indexed_same(other):
-> 5496             raise ValueError("Can only compare identically-labeled Series objects")
   5497 
   5498         lvalues = self._values

ValueError: Can only compare identically-labeled Series objects
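
pandas raises this `ValueError` whenever two Series whose index labels differ (or appear in a different order) are compared with `==`. A minimal sketch of the failure and one possible remedy follows. The toy Series are made up, and the idea that `ss` and `bim` are indexed differently at this point is an assumption that the traceback alone does not confirm.

```python
import pandas as pd

# Two allele columns with the same labels in a different order.
ref = pd.Series(["A", "C"], index=["5:100", "5:200"])
a0 = pd.Series(["C", "A"], index=["5:200", "5:100"])

try:
    ref == a0
    raised = False
except ValueError as err:
    # "Can only compare identically-labeled Series objects"
    raised = True
    print("comparison failed:", err)

# Reordering one side to the other's index makes the comparison well defined.
a0_aligned = a0.reindex(ref.index)
same = ref == a0_aligned
print(same)
```

If the two tables do not contain exactly the same variants, `reindex` introduces missing values for the absent labels, so restricting both tables to their shared index first is usually the safer order of operations.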
{% endraw %} {% raw %}
a = exome_sumstats.ss.sample(n=1000)
{% endraw %} {% raw %}
a = exome_sumstats.ss
{% endraw %} {% raw %}
a = a.sort_index()
{% endraw %} {% raw %}
aa = a.copy()
{% endraw %} {% raw %}
aa.REF = list(a.ALT)
aa.ALT = list(a.REF)
{% endraw %} {% raw %}
tmp =compare_snps(imput_sumstats.ss,exome_sumstats.ss)
keep   exact  flip   reverse  both   complement
False  False  False  False    False  False         8439
True   False  False  True     False  False           13
       True   False  False    False  False            9
       False  True   False    False  False            5
False  False  False  False    False  True             4
True   False  False  False    True   False            2
dtype: int64
{% endraw %} {% raw %}
tmp
keep exact flip reverse both complement qidx sidx
0 False False False False False False 6767726 -1
1 False False False False False False 6767727 -1
2 False False False False False False 6767728 -1
3 False False False False False False 6767729 -1
4 False False False False False False 6767730 -1
... ... ... ... ... ... ... ... ...
8467 False False False False False False 6776191 -1
8468 False False False False False False 6776192 -1
8469 False False False False False False 6776193 -1
8470 False False False False False False 6776194 -1
8471 False False False False False False 6776195 -1

8472 rows × 8 columns

{% endraw %} {% raw %}
imput_sumstats.ss.loc[tmp.qidx[tmp.exact==False].drop_duplicates()]
CHR POS REF ALT SNP BETA SE P
6767726 5 272851 A G chr5:272851:A:G 0.357496 0.888197 0.687318
6767727 5 272906 A C chr5:272906:A:C -0.003007 0.019764 0.879070
6767728 5 273143 A G chr5:273143:A:G -0.013693 0.016716 0.412684
6767729 5 273160 G C chr5:273160:G:C 0.235713 0.348772 0.499145
6767730 5 273534 C T chr5:273534:C:T 0.050095 0.139496 0.719509
... ... ... ... ... ... ... ... ...
6776191 5 1213094 C T chr5:1213094:C:T -0.015881 0.023298 0.495462
6776192 5 1213134 G A chr5:1213134:G:A -1.142280 1.344380 0.395509
6776193 5 1213223 C T chr5:1213223:C:T -0.003009 0.013631 0.825270
6776194 5 1213404 T TC chr5:1213404:T:TC -0.039146 0.117837 0.739735
6776195 5 1213510 C T chr5:1213510:C:T 0.009318 0.012922 0.470845

8461 rows × 8 columns

{% endraw %} {% raw %}
imput_sumstats.ss.loc[tmp.qidx[tmp.exact==False]]
CHR POS REF ALT SNP BETA SE P
6767726 5 272851 A G chr5:272851:A:G 0.357496 0.888197 0.687318
6767727 5 272906 A C chr5:272906:A:C -0.003007 0.019764 0.879070
6767728 5 273143 A G chr5:273143:A:G -0.013693 0.016716 0.412684
6767729 5 273160 G C chr5:273160:G:C 0.235713 0.348772 0.499145
6767730 5 273534 C T chr5:273534:C:T 0.050095 0.139496 0.719509
... ... ... ... ... ... ... ... ...
6776191 5 1213094 C T chr5:1213094:C:T -0.015881 0.023298 0.495462
6776192 5 1213134 G A chr5:1213134:G:A -1.142280 1.344380 0.395509
6776193 5 1213223 C T chr5:1213223:C:T -0.003009 0.013631 0.825270
6776194 5 1213404 T TC chr5:1213404:T:TC -0.039146 0.117837 0.739735
6776195 5 1213510 C T chr5:1213510:C:T 0.009318 0.012922 0.470845

8463 rows × 8 columns

{% endraw %} {% raw %}
print(tmp.iloc[:,:6].value_counts())
keep   exact  flip   reverse  both   complement
False  False  False  False    False  False         118
True   False  True   False    False  False           1
dtype: int64
{% endraw %} {% raw %}
sum(tmp['query']==-1)
0
{% endraw %} {% raw %}
tmp[tmp.reverse]
keep exact flip reverse both complement query subject
2 True False True True False True 811358 811358
14 True False True True False True 811368 811368
27 True False True True False True 811381 811381
41 True False True True False True 811395 811395
46 True False True True False True 811400 811400
50 True False True True False True 811402 811402
65 True False True True False True 811415 811415
74 True False True True False True 811422 811422
75 True False True True False True 811423 811423
83 True False True True False True 811431 811431
91 True False True True False True 811439 811439
{% endraw %} {% raw %}
tmp[tmp.both]
keep exact flip reverse both complement query subject
{% endraw %} {% raw %}
tmp.flip.value_counts()
True     96
False    10
Name: flip, dtype: int64
{% endraw %} {% raw %}
tmp[tmp.flip==False]
keep exact flip reverse both complement query subject
12 False False False False False False 811367 811368
13 False False False False False True 811368 811367
48 False False False False False True 811400 811401
49 False False False False False False 811401 811400
65 False False False False False False 811414 811415
66 False False False False False True 811415 811414
72 False False False False False False 811418 811419
73 False False False False False False 811419 811418
97 False False False False False False 811440 811441
98 False False False False False False 811441 811440
{% endraw %} {% raw %}
a[10:]
CHR POS REF ALT SNP BETA SE P
811366 5 275982 C T chr5:275982:C:T -1.158610 1.623030 0.475317
811367 5 275984 A G chr5:275984:A:G -1.215220 0.968170 0.209418
811368 5 275984 A T chr5:275984:A:T -1.023460 3.179240 0.747513
811369 5 275989 T G chr5:275989:T:G -1.059680 1.854150 0.567647
811370 5 275996 C T chr5:275996:C:T 1.908060 1.769520 0.280905
... ... ... ... ... ... ... ... ...
811447 5 306692 G A chr5:306692:G:A 0.974559 0.321401 0.005570
811448 5 306695 C T chr5:306695:C:T -1.051520 1.797830 0.558627
811449 5 306703 C T chr5:306703:C:T 0.645902 1.310650 0.622147
811450 5 306732 C T chr5:306732:C:T -1.033180 3.113500 0.740011
811451 5 306748 C T chr5:306748:C:T 1.738350 0.358186 0.000012

86 rows × 8 columns

{% endraw %} {% raw %}
tmp[tmp.exact==True]
keep exact flip reverse both complement query subject
47 True True False False False False 811401 811367
64 True True False False True True 811415 811400
{% endraw %} {% raw %}
sum(tmp['query'] == tmp['subject'])
91
{% endraw %} {% raw %}
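tmp.query == tmp.subject  # note: tmp.query is the DataFrame.query method, not the 'query' column, so every comparison is False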
0      False
1      False
2      False
3      False
4      False
       ...  
106    False
107    False
108    False
109    False
110    False
Name: subject, Length: 111, dtype: bool
{% endraw %} {% raw %}
a.loc[[811401,811415],:]
CHR POS REF ALT SNP BETA SE P
811401 5 276395 G A chr5:276395:G:A -1.19077 1.87388 0.525130
811415 5 276938 C G chr5:276938:C:G -1.12324 1.80101 0.532845
{% endraw %} {% raw %}
aa.loc[[811367,811400],:]
CHR POS REF ALT SNP BETA SE P
811367 5 275984 G A chr5:275984:A:G -1.215220 0.968170 0.209418
811400 5 276395 C G chr5:276395:G:C 0.852057 0.586238 0.146104
{% endraw %} {% raw %}
tmp.exact.value_counts()
False    109
True       2
Name: exact, dtype: int64
{% endraw %} {% raw %}
a.POS.value_counts()
306584    2
276395    2
276938    2
275984    2
276973    2
         ..
276216    1
276215    1
276202    1
276191    1
306748    1
Name: POS, Length: 91, dtype: int64
{% endraw %} {% raw %}
a
CHR POS REF ALT SNP BETA SE P
1733 1 976512 G A chr1:976512:G:A -1.043870 3.130310 0.738780
6959 1 1333753 G C chr1:1333753:G:C -1.048260 2.806530 0.708771
9458 1 1495680 T C chr1:1495680:T:C 0.147934 0.850903 0.861979
11995 1 1737811 T C chr1:1737811:T:C -0.046021 1.078830 0.965974
13127 1 1955424 T C chr1:1955424:T:C 0.140340 0.104258 0.178278
... ... ... ... ... ... ... ... ...
2991282 22 37206803 C T chr22:37206803:C:T 2.635060 1.949880 0.176570
2999438 22 39240918 T TGCGC chr22:39240918:T:TGCGC 1.020610 1.452690 0.482324
3013185 22 44786566 G A chr22:44786566:G:A 0.580112 0.696360 0.404809
3024911 22 50315712 A G chr22:50315712:A:G -1.085230 2.524020 0.667223
3028072 22 50603663 G A chr22:50603663:G:A -1.089630 2.563200 0.670760

1000 rows × 8 columns

{% endraw %} {% raw %}
smry = []
query = a[:10].itertuples()
subject = a[100:210].itertuples()
qi,si = next(query,None),next(subject,None)
multi_snps = []
while(qi and si):
    if qi[1]>si[1]:
        si = next(subject,None)
        multi_snps = []
        continue
    elif qi[1]<si[1]:
        qi = next(query,None)
        if len(multi_snps)==0:
            smry.append([False]*5+[-1,-1])
        else:
            for s in multi_snps:
                smry.append(snp_match(qi[3],qi[4],s[3],s[4])+[qi[0],s[0]])
        continue
    else:
        if qi[2]>si[2]:
            si = next(subject,None)
            multi_snps = []
            continue
        elif qi[2]<si[2]:
            qi = next(query,None)
            if len(multi_snps)==0:
                smry.append(np.array([False]*5))
            else:
                for s in multi_snps:
                    smry.append(snp_match(qi[3],qi[4],s[3],s[4])+[qi[0],s[0]])
            continue
        else:
            #same pos has multiple snps
            #query compare with each of them in subject
            multi_snps.append(si)
            smry.append(snp_match(qi[3],qi[4],si[3],si[4])+[qi[0],si[0]])
            si = next(subject,None)
smry = pd.DataFrame(smry)
Pandas(Index=1733, CHR=1, POS=976512, REF='G', ALT='A', SNP='chr1:976512:G:A', BETA=-1.04387, SE=3.13031, P=0.7387797791025819) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=6959, CHR=1, POS=1333753, REF='G', ALT='C', SNP='chr1:1333753:G:C', BETA=-1.04826, SE=2.80653, P=0.7087710984181352) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=9458, CHR=1, POS=1495680, REF='T', ALT='C', SNP='chr1:1495680:T:C', BETA=0.147934, SE=0.850903, P=0.861979425862772) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=11995, CHR=1, POS=1737811, REF='T', ALT='C', SNP='chr1:1737811:T:C', BETA=-0.0460207, SE=1.07883, P=0.9659741397427256) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=13127, CHR=1, POS=1955424, REF='T', ALT='C', SNP='chr1:1955424:T:C', BETA=0.14034, SE=0.104258, P=0.178277690755054) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=14530, CHR=1, POS=2303171, REF='C', ALT='T', SNP='chr1:2303171:C:T', BETA=-1.05115, SE=2.34098, P=0.6534178585573059) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=15435, CHR=1, POS=2406725, REF='G', ALT='A', SNP='chr1:2406725:G:A', BETA=-1.11988, SE=1.3376, P=0.4024602568261106) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=31462, CHR=1, POS=10307349, REF='G', ALT='A', SNP='chr1:10307349:G:A', BETA=-1.10213, SE=1.49971, P=0.4624001858860571) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=33819, CHR=1, POS=11080678, REF='T', ALT='C', SNP='chr1:11080678:T:C', BETA=-1.65391, SE=1.53434, P=0.2810657974090764) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
Pandas(Index=38760, CHR=1, POS=12348810, REF='C', ALT='T', SNP='chr1:12348810:C:T', BETA=2.45887, SE=1.90922, P=0.1977839290228308) Pandas(Index=301220, CHR=1, POS=242107640, REF='T', ALT='C', SNP='chr1:242107640:T:C', BETA=0.331632, SE=0.441286, P=0.452343131230923)
{% endraw %} {% raw %}
smry
[array([False, False, False, False, False]),
 array([False, False, False, False, False]),
 array([False, False, False, False, False]),
 array([False, False, False, False, False]),
 array([False, False, False, False, False]),
 array([False, False, False, False, False]),
 array([False, False, False, False, False]),
 array([False, False, False, False, False]),
 array([False, False, False, False, False]),
 array([False, False, False, False, False])]
{% endraw %} {% raw %}
ai = a[:5].itertuples()
{% endraw %} {% raw %}
for i in ai:
    print(i)
    tmp = next(ai,None)
    print('tmp',tmp)
Pandas(Index=1733, CHR=1, POS=976512, REF='G', ALT='A', SNP='chr1:976512:G:A', BETA=-1.04387, SE=3.13031, P=0.7387797791025819)
tmp Pandas(Index=6959, CHR=1, POS=1333753, REF='G', ALT='C', SNP='chr1:1333753:G:C', BETA=-1.04826, SE=2.80653, P=0.7087710984181352)
Pandas(Index=9458, CHR=1, POS=1495680, REF='T', ALT='C', SNP='chr1:1495680:T:C', BETA=0.147934, SE=0.850903, P=0.861979425862772)
tmp Pandas(Index=11995, CHR=1, POS=1737811, REF='T', ALT='C', SNP='chr1:1737811:T:C', BETA=-0.0460207, SE=1.07883, P=0.9659741397427256)
Pandas(Index=13127, CHR=1, POS=1955424, REF='T', ALT='C', SNP='chr1:1955424:T:C', BETA=0.14034, SE=0.104258, P=0.178277690755054)
tmp None
{% endraw %}
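
The hand-written walk over two position-sorted tables in the cells above can also be expressed as a join on chromosome and position. The sketch below is an alternative under stated assumptions, not the package's `compare_snps`: the function name `match_by_position` is hypothetical, only the `exact` and `reverse` categories are computed, and unmatched query rows get a subject index of -1, as in the `qidx`/`sidx` output shown earlier.

```python
import pandas as pd

def match_by_position(query, subject):
    """Hypothetical sketch: pair each query row with every subject row at the same CHR and POS."""
    q = query.rename_axis("qidx").reset_index()
    s = subject.rename_axis("sidx").reset_index()
    # A left join keeps query rows without a partner and repeats a query row
    # once per subject row when several variants share one position.
    merged = q.merge(s, on=["CHR", "POS"], how="left", suffixes=("_q", "_s"))
    merged["sidx"] = merged["sidx"].fillna(-1).astype(int)
    merged["exact"] = (merged["REF_q"] == merged["REF_s"]) & (merged["ALT_q"] == merged["ALT_s"])
    merged["reverse"] = (merged["REF_q"] == merged["ALT_s"]) & (merged["ALT_q"] == merged["REF_s"])
    return merged[["qidx", "sidx", "exact", "reverse"]]

# Toy tables: position 200 holds two variants in the subject, position 300 is query-only.
query = pd.DataFrame({"CHR": [5, 5, 5], "POS": [100, 200, 300],
                      "REF": ["A", "C", "G"], "ALT": ["G", "T", "A"]}, index=[10, 11, 12])
subject = pd.DataFrame({"CHR": [5, 5, 5], "POS": [100, 200, 200],
                        "REF": ["A", "T", "C"], "ALT": ["G", "C", "G"]}, index=[20, 21, 22])
result = match_by_position(query, subject)
print(result)
```

The join does not require either table to be sorted, which removes the bookkeeping for advancing two iterators and for remembering multiple SNPs at one position.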

3. Matching SNPs between two sumstats

{% raw %}
import pandas as pd

def snps_match(query,subject,keep_ambiguous=True):
    # index both tables by "chr:pos" (note: this changes the index of the inputs in place)
    query.index = query.iloc[:,:2].astype(str).agg(':'.join, axis=1)
    subject.index = subject.iloc[:,:2].astype(str).agg(':'.join, axis=1)
    #overlap snps by chr+pos
    print("Total rows of query: ",query.shape[0],"Total rows of subject: ",subject.shape[0])
    subject = subject[subject.index.isin(query.index)]
    query = query.loc[subject.index]
    print("Overlap chr:pos",query.shape[0])
    if query.index.duplicated().any():
        raise Exception("There are duplicated chr:pos")
    pm = pair_match(query.ALT,query.REF,subject.ALT,subject.REF)
    if keep_ambiguous:
        # count the ambiguous (A/T or C/G) SNPs themselves, not the non-ambiguous ones
        print('Warning: there are',sum(pm.ambiguous),'ambiguous SNPs')
        pm = pm.iloc[:,1:]
    else:
        pm = pm[~pm.ambiguous].iloc[:,1:]
    # align the mask with subject so that rows dropped as ambiguous count as not kept
    keep_idx = pm.any(axis=1).reindex(subject.index, fill_value=False)
    print("Overlap SNPs",sum(keep_idx))
    #overlap snps by chr+pos+alleles.
    new_subject = subject[keep_idx]
    #update beta and snp info
    new_query = pd.concat([new_subject.iloc[:,:5],query[keep_idx].iloc[:,5:]],axis=1)
    # assign through .loc to avoid chained assignment (SettingWithCopyWarning)
    flip = pm.sign_flip.reindex(new_query.index, fill_value=False)
    new_query.loc[flip,'BETA'] = -new_query.loc[flip,'BETA']
    return new_query,new_subject
{% endraw %} {% raw %}
def pair_match(a1,a2,ref1,ref2):
    # a1 and a2 are the alleles of the first dataset
    # ref1 and ref2 are the alleles of the second dataset
    # convert all alleles to upper case (A, T, C, G)
    a1 = a1.str.upper()
    a2 = a2.str.upper()
    ref1 = ref1.str.upper()
    ref2 = ref2.str.upper()
    # strand flip: reverse-complement the alleles of the second dataset
    flip1 = ref1.apply(strand_flip)
    flip2 = ref2.apply(strand_flip)
    result = {}
    # ambiguous SNPs (A/T or C/G) read the same on both strands
    result["ambiguous"] = ((a1=="A") & (a2=="T")) | ((a1=="T") & (a2=="A")) | ((a1=="C") & (a2=="G")) | ((a1=="G") & (a2=="C"))
    # sign_flip is True whenever the two alleles are swapped, with or without a strand flip
    result["sign_flip"] = ((a1==ref2) & (a2==ref1)) | ((a1==flip2) & (a2==flip1))
    # strand_flip is True whenever the alleles match on the opposite strand, swapped or not
    result["strand_flip"] = ((a1==flip1) & (a2==flip2)) | ((a1==flip2) & (a2==flip1))
    # exact_match is True when the alleles agree as given; rows where no column is True
    # (e.g. tri-allelic sites: A/C in one dataset, A/G in the other) are dropped by the caller
    result["exact_match"] = ((a1 == ref1) & (a2 == ref2))
    return pd.DataFrame(result)
{% endraw %} {% raw %}
from Bio.Seq import Seq

def strand_flip(s):
    return str(Seq(s).reverse_complement())  # e.g. 'A' -> 'T', 'TC' -> 'GA'
{% endraw %}
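
To see how the four flags from `pair_match` behave, the same comparisons can be applied to single allele pairs. The sketch below is a scalar illustration under assumptions: `revcomp` and `classify_pair` are hypothetical names, and the reverse complement is computed with `str.translate` instead of Biopython so that the block runs on its own.

```python
# Complement table for the four bases; other characters are left unchanged.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(allele: str) -> str:
    """Reverse complement of an allele string, e.g. 'TC' -> 'GA'."""
    return allele.upper().translate(COMPLEMENT)[::-1]

def classify_pair(a1, a2, ref1, ref2):
    """Hypothetical scalar version of the pair_match comparisons."""
    a1, a2, ref1, ref2 = (x.upper() for x in (a1, a2, ref1, ref2))
    flip1, flip2 = revcomp(ref1), revcomp(ref2)
    return {
        # A/T and C/G pairs look identical on both strands.
        "ambiguous": {a1, a2} in ({"A", "T"}, {"C", "G"}),
        "sign_flip": (a1 == ref2 and a2 == ref1) or (a1 == flip2 and a2 == flip1),
        "strand_flip": (a1 == flip1 and a2 == flip2) or (a1 == flip2 and a2 == flip1),
        "exact_match": a1 == ref1 and a2 == ref2,
    }

for pair in [("A", "G", "A", "G"), ("A", "G", "G", "A"),
             ("A", "G", "T", "C"), ("A", "T", "T", "A")]:
    print(pair, classify_pair(*pair))
```

The last pair shows why ambiguous SNPs need separate handling: an A/T variant compared with T/A satisfies both the sign-flip and the strand-flip test, so the alleles alone cannot tell the two cases apart.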

4. Create a testing MWE

{% raw %}
exome_sumstats = Sumstat(sumstats_path)
#exome_sumstats.extractbyregion(region)
{% endraw %} {% raw %}
a=exome_sumstats.ss.sample(n=1000)
{% endraw %} {% raw %}
a = a.sort_index()
{% endraw %} {% raw %}
a
         CHR       POS  REF  ALT                 SNP       BETA        SE         P
709        1    953305    G    A     chr1:953305:G:A  -0.137062  0.731144  0.851298
2118       1   1014359    G    T    chr1:1014359:G:T  -1.110110  1.724070  0.519648
4067       1   1211543    G    T    chr1:1211543:G:T  -0.220586  0.947774  0.815963
7641       1   1355779   GA    G   chr1:1355779:GA:G   0.096933  0.488808  0.842807
13114      1   1955369    C    G    chr1:1955369:C:G  -0.066739  0.762243  0.930230
...      ...       ...  ...  ...                 ...        ...       ...       ...
3022270   22  50210362    C    T  chr22:50210362:C:T  -1.147250  1.521600  0.450867
3023900   22  50278367    C    T  chr22:50278367:C:T  -1.222950  1.248900  0.327472
3025632   22  50455242    G    A  chr22:50455242:G:A   0.634902  0.433051  0.142618
3026405   22  50488346    A    G  chr22:50488346:A:G  -0.564272  0.900978  0.531125
3026963   22  50517779    C    T  chr22:50517779:C:T   0.989030  0.882422  0.262367

[1000 rows x 8 columns]

{% endraw %} {% raw %}
ss1 = a[20:520].copy()
{% endraw %} {% raw %}
ss1 = ss1[~ss1.POS.duplicated()]
{% endraw %} {% raw %}
def reverse_refalt(ss):
    ss = ss.copy()
    ref = ss.REF.copy()
    ss.REF = ss.ALT
    ss.ALT = ref
    ss.BETA = -ss.BETA
    return ss
{% endraw %} {% raw %}
def flip_snps(ss):
    ss = ss.copy()
    ss.REF = [strand_flip(i) for i in ss.REF]
    ss.ALT = [strand_flip(i) for i in ss.ALT]
    return ss
{% endraw %} {% raw %}
snps_match(flip_snps(ss1),a,keep_ambiguous=True)
Total rows of query:  500 Total rows of subject:  1000
Overlap chr:pos 500
Warning: there are 418 ambiguous SNPs
Overlap SNPs 500
/tmp/1988675.1.plot.q/ipykernel_30609/2132708780.py:23: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_query.BETA[pm.sign_flip] = -new_query.BETA[pm.sign_flip]
(              CHR        POS         REF ALT                         SNP  \
 1:18744859      1   18744859           G   A           chr1:18744859:G:A   
 1:19112524      1   19112524           C   T           chr1:19112524:C:T   
 1:19112744      1   19112744           C   T           chr1:19112744:C:T   
 1:19220870      1   19220870  TTCACACCGA   T  chr1:19220870:TTCACACCGA:T   
 1:19324561      1   19324561           G   A           chr1:19324561:G:A   
 ...           ...        ...         ...  ..                         ...   
 10:93074445    10   93074445           G   A          chr10:93074445:G:A   
 10:94171346    10   94171346           C   T          chr10:94171346:C:T   
 10:97370850    10   97370850           A   T          chr10:97370850:A:T   
 10:100293231   10  100293231           A   G         chr10:100293231:A:G   
 10:101803994   10  101803994           A   G         chr10:101803994:A:G   
 
                   BETA        SE         P  
 1:18744859   -1.280560  1.327880  0.334864  
 1:19112524   -1.063210  2.760900  0.700168  
 1:19112744   -0.475555  0.539997  0.378501  
 1:19220870   -1.101250  2.434530  0.651020  
 1:19324561    0.908711  1.011330  0.368904  
 ...                ...       ...       ...  
 10:93074445  -1.282030  0.989365  0.195040  
 10:94171346  -1.054960  2.390500  0.658984  
 10:97370850   1.069910  2.301690  0.642049  
 10:100293231  0.511226  1.280380  0.689689  
 10:101803994 -1.083250  2.012430  0.590386  
 
 [500 rows x 8 columns],
               CHR        POS         REF ALT                         SNP  \
 1:18744859      1   18744859           G   A           chr1:18744859:G:A   
 1:19112524      1   19112524           C   T           chr1:19112524:C:T   
 1:19112744      1   19112744           C   T           chr1:19112744:C:T   
 1:19220870      1   19220870  TTCACACCGA   T  chr1:19220870:TTCACACCGA:T   
 1:19324561      1   19324561           G   A           chr1:19324561:G:A   
 ...           ...        ...         ...  ..                         ...   
 10:93074445    10   93074445           G   A          chr10:93074445:G:A   
 10:94171346    10   94171346           C   T          chr10:94171346:C:T   
 10:97370850    10   97370850           A   T          chr10:97370850:A:T   
 10:100293231   10  100293231           A   G         chr10:100293231:A:G   
 10:101803994   10  101803994           A   G         chr10:101803994:A:G   
 
                   BETA        SE         P  
 1:18744859   -1.280560  1.327880  0.334864  
 1:19112524   -1.063210  2.760900  0.700168  
 1:19112744   -0.475555  0.539997  0.378501  
 1:19220870   -1.101250  2.434530  0.651020  
 1:19324561    0.908711  1.011330  0.368904  
 ...                ...       ...       ...  
 10:93074445  -1.282030  0.989365  0.195040  
 10:94171346  -1.054960  2.390500  0.658984  
 10:97370850  -1.069910  2.301690  0.642049  
 10:100293231  0.511226  1.280380  0.689689  
 10:101803994 -1.083250  2.012430  0.590386  
 
 [500 rows x 8 columns])
{% endraw %} {% raw %}
snps_match(reverse_refalt(ss1),a)
Total rows of query:  500 Total rows of subject:  1000
Overlap chr:pos 500
Warning: there are 418 ambiguous SNPs
Overlap SNPs 500
/tmp/1988675.1.plot.q/ipykernel_30609/2132708780.py:23: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_query.BETA[pm.sign_flip] = -new_query.BETA[pm.sign_flip]
(              CHR        POS         REF ALT                         SNP  \
 1:18744859      1   18744859           G   A           chr1:18744859:G:A   
 1:19112524      1   19112524           C   T           chr1:19112524:C:T   
 1:19112744      1   19112744           C   T           chr1:19112744:C:T   
 1:19220870      1   19220870  TTCACACCGA   T  chr1:19220870:TTCACACCGA:T   
 1:19324561      1   19324561           G   A           chr1:19324561:G:A   
 ...           ...        ...         ...  ..                         ...   
 10:93074445    10   93074445           G   A          chr10:93074445:G:A   
 10:94171346    10   94171346           C   T          chr10:94171346:C:T   
 10:97370850    10   97370850           A   T          chr10:97370850:A:T   
 10:100293231   10  100293231           A   G         chr10:100293231:A:G   
 10:101803994   10  101803994           A   G         chr10:101803994:A:G   
 
                   BETA        SE         P  
 1:18744859   -1.280560  1.327880  0.334864  
 1:19112524   -1.063210  2.760900  0.700168  
 1:19112744   -0.475555  0.539997  0.378501  
 1:19220870   -1.101250  2.434530  0.651020  
 1:19324561    0.908711  1.011330  0.368904  
 ...                ...       ...       ...  
 10:93074445  -1.282030  0.989365  0.195040  
 10:94171346  -1.054960  2.390500  0.658984  
 10:97370850  -1.069910  2.301690  0.642049  
 10:100293231  0.511226  1.280380  0.689689  
 10:101803994 -1.083250  2.012430  0.590386  
 
 [500 rows x 8 columns],
               CHR        POS         REF ALT                         SNP  \
 1:18744859      1   18744859           G   A           chr1:18744859:G:A   
 1:19112524      1   19112524           C   T           chr1:19112524:C:T   
 1:19112744      1   19112744           C   T           chr1:19112744:C:T   
 1:19220870      1   19220870  TTCACACCGA   T  chr1:19220870:TTCACACCGA:T   
 1:19324561      1   19324561           G   A           chr1:19324561:G:A   
 ...           ...        ...         ...  ..                         ...   
 10:93074445    10   93074445           G   A          chr10:93074445:G:A   
 10:94171346    10   94171346           C   T          chr10:94171346:C:T   
 10:97370850    10   97370850           A   T          chr10:97370850:A:T   
 10:100293231   10  100293231           A   G         chr10:100293231:A:G   
 10:101803994   10  101803994           A   G         chr10:101803994:A:G   
 
                   BETA        SE         P  
 1:18744859   -1.280560  1.327880  0.334864  
 1:19112524   -1.063210  2.760900  0.700168  
 1:19112744   -0.475555  0.539997  0.378501  
 1:19220870   -1.101250  2.434530  0.651020  
 1:19324561    0.908711  1.011330  0.368904  
 ...                ...       ...       ...  
 10:93074445  -1.282030  0.989365  0.195040  
 10:94171346  -1.054960  2.390500  0.658984  
 10:97370850  -1.069910  2.301690  0.642049  
 10:100293231  0.511226  1.280380  0.689689  
 10:101803994 -1.083250  2.012430  0.590386  
 
 [500 rows x 8 columns])
{% endraw %} {% raw %}
snps_match(flip_snps(reverse_refalt(ss1)),a)
Total rows of query:  500 Total rows of subject:  1000
Overlap chr:pos 500
Warning: there are 418 ambiguous SNPs
Overlap SNPs 500
/tmp/1988675.1.plot.q/ipykernel_30609/2132708780.py:23: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_query.BETA[pm.sign_flip] = -new_query.BETA[pm.sign_flip]
(              CHR        POS         REF ALT                         SNP  \
 1:18744859      1   18744859           G   A           chr1:18744859:G:A   
 1:19112524      1   19112524           C   T           chr1:19112524:C:T   
 1:19112744      1   19112744           C   T           chr1:19112744:C:T   
 1:19220870      1   19220870  TTCACACCGA   T  chr1:19220870:TTCACACCGA:T   
 1:19324561      1   19324561           G   A           chr1:19324561:G:A   
 ...           ...        ...         ...  ..                         ...   
 10:93074445    10   93074445           G   A          chr10:93074445:G:A   
 10:94171346    10   94171346           C   T          chr10:94171346:C:T   
 10:97370850    10   97370850           A   T          chr10:97370850:A:T   
 10:100293231   10  100293231           A   G         chr10:100293231:A:G   
 10:101803994   10  101803994           A   G         chr10:101803994:A:G   
 
                   BETA        SE         P  
 1:18744859   -1.280560  1.327880  0.334864  
 1:19112524   -1.063210  2.760900  0.700168  
 1:19112744   -0.475555  0.539997  0.378501  
 1:19220870   -1.101250  2.434530  0.651020  
 1:19324561    0.908711  1.011330  0.368904  
 ...                ...       ...       ...  
 10:93074445  -1.282030  0.989365  0.195040  
 10:94171346  -1.054960  2.390500  0.658984  
 10:97370850  -1.069910  2.301690  0.642049  
 10:100293231  0.511226  1.280380  0.689689  
 10:101803994 -1.083250  2.012430  0.590386  
 
 [500 rows x 8 columns],
               CHR        POS         REF ALT                         SNP  \
 1:18744859      1   18744859           G   A           chr1:18744859:G:A   
 1:19112524      1   19112524           C   T           chr1:19112524:C:T   
 1:19112744      1   19112744           C   T           chr1:19112744:C:T   
 1:19220870      1   19220870  TTCACACCGA   T  chr1:19220870:TTCACACCGA:T   
 1:19324561      1   19324561           G   A           chr1:19324561:G:A   
 ...           ...        ...         ...  ..                         ...   
 10:93074445    10   93074445           G   A          chr10:93074445:G:A   
 10:94171346    10   94171346           C   T          chr10:94171346:C:T   
 10:97370850    10   97370850           A   T          chr10:97370850:A:T   
 10:100293231   10  100293231           A   G         chr10:100293231:A:G   
 10:101803994   10  101803994           A   G         chr10:101803994:A:G   
 
                   BETA        SE         P  
 1:18744859   -1.280560  1.327880  0.334864  
 1:19112524   -1.063210  2.760900  0.700168  
 1:19112744   -0.475555  0.539997  0.378501  
 1:19220870   -1.101250  2.434530  0.651020  
 1:19324561    0.908711  1.011330  0.368904  
 ...                ...       ...       ...  
 10:93074445  -1.282030  0.989365  0.195040  
 10:94171346  -1.054960  2.390500  0.658984  
 10:97370850  -1.069910  2.301690  0.642049  
 10:100293231  0.511226  1.280380  0.689689  
 10:101803994 -1.083250  2.012430  0.590386  
 
 [500 rows x 8 columns])
{% endraw %}
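
The three calls above exercise `snps_match` on queries that were deliberately strand-flipped, allele-swapped, or both. As a rough illustration of what a correct harmonization should recover, the sketch below applies the same three transformations to single records and maps them back onto the subject's alleles. It is a stdlib-only approximation, not the `snps_match` implementation; `revcomp`, `swap_ref_alt`, `flip_strand`, and `harmonize` are hypothetical names, and records are simplified to `(ref, alt, beta)` tuples.

```python
# Hypothetical stdlib-only sketch; each record is a (ref, alt, beta) tuple.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(allele):
    """Reverse-complement an allele string."""
    return allele.translate(_COMPLEMENT)[::-1]

def swap_ref_alt(rec):
    """Mimic reverse_refalt: swap the alleles and negate the effect size."""
    ref, alt, beta = rec
    return (alt, ref, -beta)

def flip_strand(rec):
    """Mimic flip_snps: reverse-complement both alleles, keep beta unchanged."""
    ref, alt, beta = rec
    return (revcomp(ref), revcomp(alt), beta)

def harmonize(query, subject):
    """Re-express query on the subject's alleles; None if the alleles do not match."""
    q_ref, q_alt, q_beta = query
    s_ref, s_alt, _ = subject
    # Try the alleles as given first, then on the opposite strand.
    for ref, alt in ((q_ref, q_alt), (revcomp(q_ref), revcomp(q_alt))):
        if (ref, alt) == (s_ref, s_alt):
            return (s_ref, s_alt, q_beta)
        if (ref, alt) == (s_alt, s_ref):
            return (s_ref, s_alt, -q_beta)
    return None

subject = ("G", "A", -1.28056)  # non-palindromic SNP
results = [
    harmonize(flip_strand(subject), subject),
    harmonize(swap_ref_alt(subject), subject),
    harmonize(flip_strand(swap_ref_alt(subject)), subject),
]

palindromic = ("A", "T", -1.06991)  # A/T SNP
ambiguous_result = harmonize(flip_strand(palindromic), palindromic)
```

For the non-palindromic SNP, all three transformations map back to the subject's BETA. A strand-flipped A/T SNP looks identical to an allele swap, so its sign comes out inverted; the 10:97370850 A/T row in the first output above shows the same behaviour (BETA 1.069910 in the harmonized query versus -1.069910 in the subject).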

4.1 Write out test examples

{% raw %}
a.to_csv('data/testflip/snps1000.regenie.snp_stats.gz', sep = "\t", header = True, index = False,compression='gzip')
{% endraw %} {% raw %}
ss1.columns = ['CHR','POS','REF','ALT','SNP','BETA','SE','P']
ss1.to_csv('data/testflip/snps500.regenie.snp_stats.gz', sep = "\t", header = True, index = False,compression='gzip')
{% endraw %} {% raw %}
fss1 = flip_snps(ss1)
fss1.columns = ['CHR','POS','A0','A1','SNP','STAT','SE','P']
fss1.to_csv('data/testflip/flip/snps500_flip.regenie.snp_stats.gz', sep = "\t", header = True, index = False,compression='gzip')
{% endraw %} {% raw %}
reverse_refalt(ss1).to_csv('data/testflip/snps500_rea0a1.regenie.snp_stats.gz', sep = "\t", header = True, index = False,compression='gzip')
{% endraw %} {% raw %}
flip_snps(reverse_refalt(ss1)).to_csv('data/testflip/snps500_flip_rea0a1.regenie.snp_stats.gz', sep = "\t", header = True, index = False,compression='gzip')
{% endraw %}
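
The strand-flipped test file is written with the alternative headers `A0`, `A1`, and `STAT`, so anything that reads these files back has to normalize the column names first. A minimal sketch of one way to do that with only the standard library follows; the `ALIASES` table and the `read_sumstats` helper are assumptions made for illustration, and the demo writes a single illustrative row to a temporary directory rather than touching `data/testflip`.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical alias table: map alternative headers onto REF/ALT/BETA.
ALIASES = dict(A0="REF", A1="ALT", STAT="BETA")

def read_sumstats(path):
    """Read a gzipped, tab-separated file into dicts with canonical column names."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            dict((ALIASES.get(key, key), value) for key, value in row.items())
            for row in reader
        ]

# Round-trip demo: write one row with the alternative headers, then read it back.
rows = [
    ["CHR", "POS", "A0", "A1", "SNP", "STAT", "SE", "P"],
    ["1", "953305", "C", "T", "chr1:953305:G:A", "-0.137062", "0.731144", "0.851298"],
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_flip.snp_stats.gz")
    with gzip.open(path, "wt", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
    records = read_sumstats(path)
```

All values come back as strings here; a real reader would still need to cast `POS`, `BETA`, `SE`, and `P` to numeric types.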