Using the GEMINI API

The GeminiQuery class

class gemini.GeminiQuery(db, include_gt_cols=False, out_format=<gemini.GeminiQuery.DefaultRowFormat object>, variant_id_getter=None)

An interface to submit queries to an existing Gemini database and iterate over the results of the query.

We create a GeminiQuery object by specifying the database to which to connect:

from gemini import GeminiQuery
gq = GeminiQuery("my.db")

We can then issue a query against the database and iterate through the results by using the run() method:

gq.run("select chrom, start, end from variants")
for row in gq:
    print row

Instead of printing the entire row, one can access and print specific columns:

gq.run("select chrom, start, end from variants")
for row in gq:
    print row['chrom']

Also, all of the underlying numpy genotype arrays are always available:

gq.run("select chrom, start, end from variants")
for row in gq:
    gts = row.gts
    print row['chrom'], gts
    # yields "chr1" ['A/G' 'G/G' ... 'A/G']

The run() method also accepts a genotype filter:

query = "select chrom, start, end from variants"
gt_filter = "gt_types.NA20814 == HET"
gq.run(query, gt_filter)
for row in gq:
    print row
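
Genotype filters are plain strings, so they can be assembled programmatically before being passed to run(). The sketch below is illustrative only: het_filter is a hypothetical helper that is not part of the GEMINI API, and joining clauses with "and" is an assumption about the filter syntax that should be checked against the GEMINI documentation.

```python
def het_filter(sample_names):
    """Build a genotype filter string requiring each listed sample to be HET.

    Hypothetical helper, not part of GEMINI. Combining clauses with "and"
    is an assumption about the filter syntax.
    """
    clauses = ["gt_types.%s == HET" % name for name in sample_names]
    return " and ".join(clauses)


# Reproduces the single-sample filter used in the example above.
gt_filter = het_filter(["NA20814"])
print(gt_filter)
```

The resulting string can then be passed as the second argument, as in gq.run(query, gt_filter).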

Lastly, one can use the sample_to_idx and idx_to_sample dictionaries to gain access to sample-level genotype information either by sample name or by sample index:

# grab dict mapping sample to genotype array indices
smp2idx = gq.sample_to_idx

query  = "select chrom, start, end from variants"
gt_filter  = "gt_types.NA20814 == HET"
gq.run(query, gt_filter)

# print a header listing the selected columns
print gq.header
for row in gq:
    # access a NUMPY array of the sample genotypes.
    gts = row['gts']
    # use the smp2idx dict to access sample genotypes
    idx = smp2idx['NA20814']
    print row, gts[idx]

header

Return a header describing the columns that were selected in the query issued to a GeminiQuery object.

index2sample

Return a dictionary mapping genotype array offsets to sample names:

gq = GeminiQuery("my.db")
i2s = gq.index2sample

print i2s[1088]
# yields "NA20814"

run(query, gt_filter=None, show_variant_samples=False, variant_samples_delim=', ', predicates=None, needs_genotypes=False, needs_genes=False, show_families=False, subjects=None)

Execute a query against a Gemini database. The user may specify:

  1. (reqd.) an SQL query.
  2. (opt.) a genotype filter.

sample2index

Return a dictionary mapping sample names to genotype array offsets:

gq = GeminiQuery("my.db")
s2i = gq.sample2index

print s2i['NA20814']
# yields 1088
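
The two dictionaries are inverses of one another. The sketch below shows that relationship using plain dicts as stand-ins for gq.sample2index and gq.index2sample. Only the NA20814/1088 pair comes from the examples above; the second entry is made up for illustration, and no database is opened.

```python
# Stand-in for gq.sample2index. "SAMPLE_B" and 1089 are hypothetical.
sample2index = {"NA20814": 1088, "SAMPLE_B": 1089}

# Inverting the mapping gives the same shape of dictionary as index2sample.
index2sample = dict((idx, name) for name, idx in sample2index.items())

# Looking up a name and then its offset returns the original name.
idx = sample2index["NA20814"]
print(index2sample[idx])
```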

Extracting the VCF INFO tags with GEMINI API

The GEMINI API is useful for extracting the individual tags within the INFO field of a VCF (stored as a compressed dictionary in the variants table). This is of particular interest to those who want to add custom annotations to their VCF and still be able to access the individual tags programmatically. The following example uses the API to extract the dbNSFP_SIFT_pred tag from the INFO field of each variant, skipping variants that do not carry it.

#!/usr/bin/env python
import sys
from gemini import GeminiQuery

database = sys.argv[1]
gq = GeminiQuery(database)
query = "SELECT variant_id, chrom, start, end, ref, alt, info \
         FROM variants"

gq.run(query)

for row in gq:
    try:
        print "\t".join([str(row['chrom']), str(row['start']), str(row['end']),
                      str(row['ref']), str(row['alt']), str(row.info['dbNSFP_SIFT_pred'])])
    except KeyError:
        pass

# yields
chr1    906272  906273  C       T       P|D|P
chr1    906273  906274  C       A       D|D|D
chr1    906276  906277  T       C       D|D|D
chr1    906297  906298  G       T       B|B|B
chr1    1959074 1959075 A       C       D
chr1    1959698 1959699 G       A       B
chr1    1961452 1961453 C       T       P
chr1    2337953 2337954 C       T       D
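
Some of the values above hold several predictions in one string. A minimal sketch of splitting them, assuming that the "|" character separates the individual predictions (the output suggests this, but it is not confirmed here); split_predictions is a hypothetical helper, not part of GEMINI:

```python
from __future__ import print_function


def split_predictions(value, sep="|"):
    """Split a multi-valued INFO string such as 'P|D|P' into a list.

    Hypothetical helper; the "|" separator is an assumption based on
    the example output shown above.
    """
    return value.split(sep)


# Two values copied from the example output above.
for raw in ["P|D|P", "D"]:
    print(raw, "->", split_predictions(raw))
```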