Using the GEMINI API
The GeminiQuery class
class gemini.GeminiQuery(db, include_gt_cols=False, out_format=<gemini.GeminiQuery.DefaultRowFormat object>, variant_id_getter=None)
An interface for submitting queries to an existing GEMINI database and iterating over the results.
We create a GeminiQuery object by specifying the database to connect to:
from gemini import GeminiQuery
gq = GeminiQuery("my.db")
We can then issue a query against the database with the run() method and iterate through the results:
gq.run("select chrom, start, end from variants")
for row in gq:
    print row
Instead of printing the entire row, one can print specific columns:
gq.run("select chrom, start, end from variants")
for row in gq:
    print row['chrom']
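The query-then-iterate pattern with column access by name can be tried without a GEMINI database. The following is a minimal sketch, assuming an in-memory SQLite table as a stand-in for the variants table; the table contents are invented, and Python's sqlite3 module is used here instead of GeminiQuery.

```python
import sqlite3

# Hypothetical stand-in for a GEMINI database: an in-memory table holding
# only the three columns queried above. The rows are invented.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # enables row['chrom'] style access
conn.execute("create table variants (chrom text, start integer, end integer)")
conn.executemany(
    "insert into variants values (?, ?, ?)",
    [("chr1", 100, 101), ("chr1", 250, 251), ("chr2", 75, 76)],
)

chroms = []
for row in conn.execute("select chrom, start, end from variants"):
    # Pick out individual columns by name rather than using the whole row.
    chroms.append(row["chrom"])
    print("%s\t%d\t%d" % (row["chrom"], row["start"], row["end"]))
```

The sketch only mirrors the access pattern; GeminiQuery rows additionally carry genotype information that a plain SQLite row does not.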
Also, the underlying NumPy genotype arrays are always available:
gq.run("select chrom, start, end from variants")
for row in gq:
    gts = row.gts
    print row['chrom'], gts
    # yields "chr1" ['A/G' 'G/G' ... 'A/G']
The run() method also accepts a genotype filter:
query = "select chrom, start, end from variants"
gt_filter = "gt_types.NA20814 == HET"
gq.run(query, gt_filter)
for row in gq:
    print row
Lastly, one can use the sample_to_idx and idx_to_sample dictionaries to gain access to sample-level genotype information either by sample name or by sample index:
# grab dict mapping sample to genotype array indices
smp2idx = gq.sample_to_idx
query = "select chrom, start, end from variants"
gt_filter = "gt_types.NA20814 == HET"
gq.run(query, gt_filter)
# print a header listing the selected columns
print gq.header
for row in gq:
    # access a NumPy array of the sample genotypes.
    gts = row['gts']
    # use the smp2idx dict to access sample genotypes
    idx = smp2idx['NA20814']
    print row, gts[idx]
header
Return a header describing the columns that were selected in the query issued to a GeminiQuery object.
index2sample
Return a dictionary mapping genotype array offsets to sample names:
gq = GeminiQuery("my.db")
i2s = gq.index2sample
print i2s[1088]
# yields "NA20814"
run(query, gt_filter=None, show_variant_samples=False, variant_samples_delim=', ', predicates=None, needs_genotypes=False, needs_genes=False, show_families=False, subjects=None)
Execute a query against a GEMINI database. The user may specify:
- (reqd.) an SQL query.
- (opt.) a genotype filter.
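To make the effect of a genotype filter concrete, here is a minimal sketch of what a filter such as "gt_types.NA20814 == HET" selects. It is not GEMINI's implementation: the rows, the sample order, the helper passes_filter, and the integer encoding of genotype types are all assumptions made for illustration.

```python
# Assumed integer encoding of genotype types; treat these values as an
# assumption of this sketch, not as constants taken from the text above.
HOM_REF, HET, UNKNOWN, HOM_ALT = 0, 1, 2, 3

# Hypothetical sample order and per-variant genotype types.
sample_to_idx = {"NA20814": 0, "NA19240": 1}
rows = [
    {"chrom": "chr1", "start": 100, "gt_types": [HET, HOM_REF]},
    {"chrom": "chr1", "start": 250, "gt_types": [HOM_ALT, HET]},
    {"chrom": "chr2", "start": 75, "gt_types": [HET, HET]},
]


def passes_filter(row, sample, wanted):
    """Return True when the named sample has the wanted genotype type."""
    return row["gt_types"][sample_to_idx[sample]] == wanted


# Same intent as gt_filter = "gt_types.NA20814 == HET": keep only the
# variants where NA20814 is heterozygous.
kept = [r for r in rows if passes_filter(r, "NA20814", HET)]
for r in kept:
    print("%s\t%d" % (r["chrom"], r["start"]))
```

The SQL query decides which variants and columns are considered; the genotype filter then narrows the result by per-sample genotype.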
sample2index
Return a dictionary mapping sample names to genotype array offsets:
gq = GeminiQuery("my.db")
s2i = gq.sample2index
print s2i['NA20814']
# yields 1088
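The two dictionaries are inverses of each other: one goes from sample name to array offset, the other from offset back to name. A minimal sketch with an invented three-sample order and an invented genotype array (plain Python structures, not objects returned by GeminiQuery):

```python
# Hypothetical sample order; in a real database the order and offsets
# come from the database itself.
samples = ["NA12878", "NA19240", "NA20814"]

# Name -> offset, and the inverse offset -> name.
sample2index = {name: i for i, name in enumerate(samples)}
index2sample = {i: name for name, i in sample2index.items()}

# Invented per-sample genotypes for one variant, in sample order.
gts = ["A/G", "G/G", "A/A"]

idx = sample2index["NA20814"]
print("%s is at offset %d with genotype %s" % (index2sample[idx], idx, gts[idx]))
```

Looking up the offset once and reusing it inside the row loop avoids repeating the dictionary lookup for every variant.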
Extracting the VCF INFO tags with GEMINI API
The GEMINI API is useful for extracting the individual tags within the INFO field of a VCF, which is stored as a compressed dictionary in the variants table. This is of particular interest to those who add custom annotations to their VCF and still want to access the individual tags programmatically. The following example uses the API to extract the dbNSFP_SIFT_pred tag from the INFO field of each variant; variants that lack the tag are skipped.
#!/usr/bin/env python
import sys
from gemini import GeminiQuery
database = sys.argv[1]
gq = GeminiQuery(database)
query = "SELECT variant_id, chrom, start, end, ref, alt, info \
FROM variants"
gq.run(query)
for row in gq:
    try:
        print "\t".join([str(row['chrom']), str(row['start']), str(row['end']),
                         str(row['ref']), str(row['alt']), str(row.info['dbNSFP_SIFT_pred'])])
    except KeyError:
        # this variant has no dbNSFP_SIFT_pred tag; skip it
        pass
# yields
chr1 906272 906273 C T P|D|P
chr1 906273 906274 C A D|D|D
chr1 906276 906277 T C D|D|D
chr1 906297 906298 G T B|B|B
chr1 1959074 1959075 A C D
chr1 1959698 1959699 G A B
chr1 1961452 1961453 C T P
chr1 2337953 2337954 C T D
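For readers unfamiliar with the shape of INFO data, the sketch below parses a raw VCF INFO string into a dictionary and splits a pipe-delimited value like the ones printed above. The helper parse_info and the example string are hypothetical; with GEMINI the dictionary is already available as row.info, so this only illustrates the data, not how GEMINI builds it.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; value-less flags map to True."""
    tags = {}
    for field in info.split(";"):
        if not field:
            continue  # tolerate empty fields such as a trailing ';'
        key, sep, value = field.partition("=")
        tags[key] = value if sep else True
    return tags


# Invented INFO string with one key=value tag, one flag, and one dbNSFP tag.
info = "DP=42;DB;dbNSFP_SIFT_pred=P|D|P"
tags = parse_info(info)

# Split the pipe-delimited predictions into a list.
preds = tags["dbNSFP_SIFT_pred"].split("|")
print(preds)

# A missing tag can be handled with .get() instead of catching KeyError.
print(tags.get("dbNSFP_Polyphen2_pred", "not annotated"))
```

Using .get() with a default is an alternative to the try/except KeyError pattern when a placeholder value is preferable to skipping the variant.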