SeqFindr package

Submodules

SeqFindr.blast module

SeqFindr BLAST methods

SeqFindr.blast.make_BLAST_database(fasta_file)

Given a fasta_file, generate a nucleotide BLAST database

The database is written to DB/ in the working directory, or to OUTPUT/DB if an output directory is given in the arguments

Parameters:fasta_file (string) – full path to a fasta file
Return type:the strain id (must be delimited by ‘_’)
SeqFindr.blast.parse_BLAST(blast_results, tol, careful)

Using NCBIXML parse the BLAST results, storing & returning good hits

Here good hits are:
  • hsp.identities/float(record.query_length) >= tol
Parameters:
  • blast_results (string) – full path to a blast run output file (in XML format)
  • tol (float) – the cutoff threshold (see above for explanation)
Return type:

list of satisfying hit names
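
The good-hit criterion above can be illustrated without any BLAST dependency. This is a minimal sketch, assuming each HSP has already been reduced to a hit name, an identity count and a query length; the Hsp tuple and the good_hits function are hypothetical names used for illustration, not part of SeqFindr.

```python
from collections import namedtuple

# Hypothetical stand-in for the fields read from a parsed BLAST record.
Hsp = namedtuple("Hsp", ["hit_name", "identities", "query_length"])


def good_hits(hsps, tol):
    """Return the names of hits whose identity fraction meets the cutoff."""
    accepted = []
    for hsp in hsps:
        # Same ratio as documented: identities / query length >= tol
        if hsp.identities / float(hsp.query_length) >= tol:
            accepted.append(hsp.hit_name)
    return accepted


hsps = [
    Hsp("geneA", 95, 100),
    Hsp("geneB", 80, 100),
    Hsp("geneC", 100, 100),
]
print(good_hits(hsps, 0.9))
```

Because the denominator is the query length rather than the alignment length, a short but perfect partial alignment still scores low.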

SeqFindr.blast.run_BLAST(query, database, args)

Given an mfa of query sequences of interest & a database, search for them.

Important to note:
  • Turns dust filter off,
  • Only a single target sequence (top hit),
  • Output in XML format as blast.xml.

# TODO: Add evalue filtering ?
# TODO: add task='blastn' to use blastn scoring ?

Warning

default is megablast

Warning

tblastx functionality has not been checked

Parameters:
  • query (string) – the full path to the vf.mfa
  • database (string) – the full path of the database to search for the vf in
  • args (argparse args (dictionary)) – the arguments parsed to argparse
Returns:

the path of the blast.xml file
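
The settings listed above map onto standard BLAST+ command line options. The sketch below only assembles an argument list and does not run anything; it is an assumption about how such a call could look, since SeqFindr may invoke BLAST differently (for example through a wrapper library). The function name build_blastn_command and the example paths are hypothetical.

```python
import os


def build_blastn_command(query, database, out_dir="."):
    """Assemble (but do not run) a blastn call matching the notes above."""
    out_path = os.path.join(out_dir, "blast.xml")
    cmd = [
        "blastn",
        "-query", query,
        "-db", database,
        "-dust", "no",            # dust filter off
        "-max_target_seqs", "1",  # single target sequence (top hit)
        "-outfmt", "5",           # 5 selects XML output
        "-out", out_path,
    ]
    return cmd, out_path


cmd, xml_path = build_blastn_command("vf.mfa", "DB/strain_1")
print(" ".join(cmd))
```

No -task option is passed, so blastn falls back to its default task (megablast), which is what the first warning refers to.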

SeqFindr.config module

SeqFindr configuration class: 100% test coverage, > 9 PyLint score

class SeqFindr.config.SeqFindrConfig(alt_location=None)

Bases: object

A SeqFindr configuration class - subtle manipulation to plots

dump_items()

Prints all set configuration options to STDOUT

SeqFindr.config.read_config(alt_location)

Read a SeqFindr configuration file

Currently only supports category colors in RGB format

category_colors = [(0,0,0),(255,255,255),....,(r,g,b)]
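
A line in this format can be read safely with the standard library. The following is a minimal sketch, not the parser SeqFindr uses; the function name parse_category_colors is hypothetical, and skipping lines that start with # is an assumption about the file format.

```python
import ast


def parse_category_colors(text):
    """Extract the category_colors list from config text, or None."""
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comments are ignored
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "category_colors":
            # literal_eval only accepts Python literals, so no code is run
            colors = ast.literal_eval(value.strip())
            return [tuple(c) for c in colors]
    return None


example = "category_colors = [(0,0,0),(255,255,255),(70,130,180)]"
print(parse_category_colors(example))
```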

SeqFindr.imaging module

SeqFindr.imaging.generate_colors(number_required, seed)

Generate a list of number_required distinct “good” random colors

See: https://github.com/fmder/ghalton

Based on http://martin.ankerl.com/2009/12/09/how-to-create-random-colors-programmatically/
Parameters:
  • number_required (int) – the number of colors to generate
  • seed (int) – the random seed

Return type:

a list of lists in the form: [[243, 137, 121], [232, 121, 243], [216, 121, 243]]
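
The linked article spreads hues by repeatedly adding the golden ratio conjugate to a random starting hue. The sketch below follows that idea using only the standard library; it is an illustration under that assumption and not necessarily what generate_colors does internally (the ghalton link suggests a Halton sequence may be involved). The name golden_ratio_colors and the saturation and value settings (0.5 and 0.95, as in the article) are choices made for this example.

```python
import colorsys
import random

GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def golden_ratio_colors(number_required, seed):
    """Return number_required [r, g, b] lists with well-spaced hues."""
    rng = random.Random(seed)  # seeded so the palette is reproducible
    hue = rng.random()         # random starting hue in [0, 1)
    colors = []
    for _ in range(number_required):
        # Stepping by the golden ratio conjugate keeps hues far apart
        hue = (hue + GOLDEN_RATIO_CONJUGATE) % 1.0
        r, g, b = colorsys.hsv_to_rgb(hue, 0.5, 0.95)
        colors.append([int(r * 255), int(g * 255), int(b * 255)])
    return colors


palette = golden_ratio_colors(3, seed=42)
print(palette)
```

Fixing saturation and value while varying only the hue keeps all colors equally bright, which helps categories look distinct without one dominating the plot.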

SeqFindr.imaging.hsv_to_rgb(h, s, v)

Convert HSV to RGB

Parameters:
  • h – hue
  • s – saturation
  • v – value

SeqFindr.seqfindr module

SeqFindr v0.31.4 - A tool to easily create informative genomic feature plots (http://github.com/mscook/SeqFindr)

SeqFindr.seqfindr.build_matrix_row(all_vfs, accepted_hits, score=None)

Populate row given all possible hits, accepted hits and an optional score

Parameters:
  • all_vfs (list) – a list of all virulence factor ids
  • accepted_hits (list) – a list of hits that passed the cutoff
  • score (float) – the value to fill the matrix with (default = None which implies 0.5)
Return type:

a list of floats
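
The row-building step can be sketched in a few lines. The documented default score of 0.5 is taken from the parameter description above; filling positions without an accepted hit with 0.0 is an assumption, and build_row is a hypothetical name rather than the SeqFindr function itself.

```python
def build_row(all_vfs, accepted_hits, score=None):
    """Return one matrix row: score for accepted hits, 0.0 otherwise."""
    if score is None:
        score = 0.5  # documented default when no score is given
    accepted = set(accepted_hits)  # set gives fast membership tests
    # Assumption: factors without an accepted hit are filled with 0.0
    return [score if vf in accepted else 0.0 for vf in all_vfs]


row = build_row(["vfA", "vfB", "vfC"], ["vfC", "vfA"])
print(row)
```

The row follows the order of all_vfs, not the order of the hits, so every strain's row lines up column by column.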

SeqFindr.seqfindr.cluster_matrix(matrix, y_labels, dpi)

From a matrix, generate a distance matrix & perform hierarchical clustering

Parameters:
  • matrix – a numpy matrix of scores
  • y_labels – the virulence factor ids for all row elements
SeqFindr.seqfindr.core(args)

The ‘core’ SeqFindr method

TODO: Exception handling if do_run fails or produces no results

Parameters:args – the arguments given from argparse
SeqFindr.seqfindr.do_run(args, data_path, match_score, vfs_list)

Perform a SeqFindr run

SeqFindr.seqfindr.match_matrix_rows(ass_mat, cons_mat)

Reorder a second matrix based on the first row element of the 1st matrix

Parameters:
  • ass_mat (list) – a 2D list of scores
  • cons_mat (list) – a 2D list of scores
Return type:

2 matrices (2D lists)
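
One way to picture the reordering: index the second matrix by the ID held in the first element of each row, then emit its rows in the ID order of the first matrix. This is a minimal sketch assuming every ID occurs in both matrices; reorder_by_first_matrix is a hypothetical name.

```python
def reorder_by_first_matrix(ass_mat, cons_mat):
    """Reorder cons_mat so its row IDs follow the order of ass_mat."""
    # Index the second matrix by its ID (first element of each row)
    by_id = {row[0]: row for row in cons_mat}
    # Assumption: every ID in ass_mat also appears in cons_mat
    reordered = [by_id[row[0]] for row in ass_mat]
    return ass_mat, reordered


ass = [["s2", 0.5, 0.0], ["s1", 0.0, 0.5]]
cons = [["s1", 1.0, 0.0], ["s2", 0.0, 1.0]]
ass_out, cons_out = reorder_by_first_matrix(ass, cons)
print([row[0] for row in cons_out])
```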

SeqFindr.seqfindr.plot_matrix(matrix, strain_labels, vfs_classes, gene_labels, show_gene_labels, color_index, config_object, grid, seed, dpi, size, svg, aspect='auto')

Plot the VF hit matrix

Parameters:
  • matrix – the numpy matrix of scores
  • strain_labels – the strain (y labels)
  • vfs_classes – the VFS class (in mfa header [class])
  • gene_labels – the gene labels
  • show_gene_labels – whether to plot the gene labels
  • color_index – for a single class, choose a specific color
SeqFindr.seqfindr.prepare_queries(args)

Given a set of sequences of interest, extract all query & query classes

A sequence of interest file is an mfa file in the format:

>ident, gene id, annotation, organism [class]

query = gene id
query_class = class

Location of sequence of interest file is defined by args.seqs_of_interest

Parameters:args (argparse args) – the argparse args containing args.seqs_of_interest (fullpath) to a sequence of interest DB (mfa file)
Return type:2 lists, 1) of all queries and, 2) corresponding query classes
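
The header format described above can be split with plain string operations. This sketch assumes the class is always in trailing square brackets and that the gene id is the second comma-separated field; parse_header is a hypothetical helper and the example header is made up for illustration.

```python
def parse_header(header):
    """Return (gene_id, query_class) from one sequence-of-interest header."""
    body = header.lstrip(">").strip()
    # The class sits in the trailing square brackets
    open_idx = body.rindex("[")
    close_idx = body.rindex("]")
    query_class = body[open_idx + 1:close_idx].strip()
    # Fields before the brackets are comma separated; gene id is the second
    fields = [f.strip() for f in body[:open_idx].split(",")]
    return fields[1], query_class


header = ">P12345, fimH, type 1 fimbrial adhesin, Escherichia coli [Adhesins]"
print(parse_header(header))
```

Collecting the first element of each result gives the list of queries, and the second gives the corresponding query classes.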
SeqFindr.seqfindr.strip_bases(args)

Strip the 1st and last ‘N’ bases from mapping consensuses

Uses:
  • args.cons
  • args.seqs_of_interest
  • args.strip

To avoid the effects of lead in and lead out coverage resulting in uncalled bases

Parameters:args (argparse args) – the argparse args containing args.strip value
Return type:the updated args to reflect the args.cons & args.seqs_of_interest location
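
Trimming N bases from both ends is a slice, with one pitfall: a naive sequence[n:-n] returns an empty string when n is 0. The sketch below avoids that; strip_ends is a hypothetical name and works on a plain string rather than on the files that strip_bases handles.

```python
def strip_ends(sequence, n):
    """Drop the first and last n bases; n <= 0 leaves the sequence as is."""
    if n <= 0:
        return sequence
    # An explicit end index avoids the sequence[n:-0] pitfall and gives
    # an empty string when the sequence is shorter than 2 * n
    return sequence[n:max(n, len(sequence) - n)]


print(strip_ends("NNACGTNN", 2))
```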
SeqFindr.seqfindr.strip_id_from_matrix(mat)

Remove the ID (1st row element) from a matrix

Parameters:mat – a 2D list
Return type:a 2D list with the 1st row element (ID) removed

SeqFindr.util module

SeqFindr utility methods

SeqFindr.util.check_database(database_file)

Check the database conforms to the SeqFindr format

Note

this is not particularly extensive

Parameters:database_file (string) – full path to a database file
SeqFindr.util.ensure_paths_for_args(args)

Ensure all arguments with paths are absolute & have shortcuts such as ~ expanded

Just apply os.path.abspath & os.path.expanduser

Parameters:args – the arguments given from argparse
Returns:an updated args
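
The two calls named above compose as follows. This is a minimal sketch; normalise_paths is a hypothetical helper, and the list of attribute names to treat as paths is an assumption (only seqs_of_interest appears elsewhere in this reference; output is made up for the example).

```python
import argparse
import os


def normalise_paths(args, path_attrs):
    """Expand '~' and make the named attributes of args absolute."""
    for name in path_attrs:
        value = getattr(args, name, None)
        if value is not None:
            # expanduser first, so '~' is resolved before abspath runs
            setattr(args, name, os.path.abspath(os.path.expanduser(value)))
    return args


args = argparse.Namespace(seqs_of_interest="~/db/vf.mfa", output=None)
args = normalise_paths(args, ["seqs_of_interest", "output"])
print(args.seqs_of_interest)
```

Resolving paths up front matters here because a later step changes the working directory, after which relative paths would point somewhere else.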
SeqFindr.util.get_fasta_files(data_path)

Returns all files ending with .fas/.fa/.fna in a directory

Parameters:data_path – the full path to the directory of interest
Returns:a list of fasta files (valid extensions: .fas, .fna, .fa)
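
Collecting files by extension can be done with glob. The sketch below is an illustration only; find_fasta_files is a hypothetical name, and sorting the result is a choice made for this example rather than documented behaviour.

```python
import glob
import os


def find_fasta_files(data_path, extensions=(".fas", ".fa", ".fna")):
    """Return a sorted list of files in data_path with a FASTA extension."""
    found = []
    for ext in extensions:
        # "*.fa" does not match "x.fas": glob patterns match the whole name
        found.extend(glob.glob(os.path.join(data_path, "*" + ext)))
    return sorted(found)


print(find_fasta_files("."))
```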
SeqFindr.util.init_output_dirs(output_dir)

Create the output base (if needed) and change dir to it

Parameters:output_dir – the output base directory to create (if needed) and change into
SeqFindr.util.is_protein(fasta_file)

Checks if a FASTA file is protein or nucleotide.

Will return -1 if no protein detected

TODO: Ambiguity characters?
TODO: exception if mix of protein/nucleotide?

Parameters:fasta_file (string) – path to input FASTA file
Returns:number of protein sequences in fasta_file (int)
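
The return convention (a count of protein sequences, or -1 when none are found) can be illustrated with a simple heuristic. How SeqFindr decides that a sequence is protein is not stated here, so the rule below, where any letter outside ACGTUN marks a protein record, is purely an assumption; count_protein_records is a hypothetical name and it takes FASTA text rather than a file path. Nucleotide ambiguity codes would be misclassified by this rule, which is the concern raised in the first TODO.

```python
NUCLEOTIDE_CHARS = set("ACGTUN")


def _looks_like_protein(residues):
    # Assumption: any letter outside ACGTUN marks a protein sequence
    return any(c not in NUCLEOTIDE_CHARS for c in residues)


def count_protein_records(fasta_text):
    """Count records that look like protein; return -1 if there are none."""
    count = 0
    residues = []
    in_record = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record
            if in_record and _looks_like_protein(residues):
                count += 1
            residues = []
            in_record = True
        elif line:
            residues.extend(line.upper())
    # Close the final record
    if in_record and _looks_like_protein(residues):
        count += 1
    return count if count > 0 else -1


print(count_protein_records(">p1\nMKVLAT\n>n1\nACGTACGT\n"))
```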
SeqFindr.util.order_inputs(order_index_file, dir_listing)

Given an order index file, maintain this order in the matrix plot

This implies no clustering. Typically used when you already have a phylogenetic tree.

Parameters:
  • order_index_file (string) – full path to an ordered file (1 entry per line)
  • dir_listing (list) – a listing from util.get_fasta_files
Return type:

list of updated glob.glob dir listing to match order specified
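
The reordering can be sketched as a lookup from each entry in the order file to a path in the listing. How SeqFindr matches entries to files is not documented here; matching on the file name without its extension is an assumption, as is silently skipping entries with no matching file. order_listing is a hypothetical name and takes the order file's lines rather than its path.

```python
import os


def order_listing(order_lines, dir_listing):
    """Return dir_listing reordered to follow order_lines."""
    # Assumption: entries are matched on file name without extension
    by_stem = {
        os.path.splitext(os.path.basename(path))[0]: path
        for path in dir_listing
    }
    ordered = []
    for entry in order_lines:
        entry = entry.strip()
        if entry in by_stem:
            ordered.append(by_stem[entry])
    return ordered


listing = ["/data/s1.fa", "/data/s2.fas", "/data/s3.fna"]
print(order_listing(["s3\n", "s1\n", "s2\n"], listing))
```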

SeqFindr.vfdb_to_seqfindr module

vfdb_to_seqfindr

Convert VFDB formatted files (or like) to SeqFindr formatted database files

VFDB: Virulence Factors Database (www.mgc.ac.cn/VFs/), a reference database for bacterial virulence factors.

This is based on a sample file (TOTAL_Strep_VFs.fas) provided by Nouri Ben Zakour.

Examples:

# Default (will set VFDB classification identifiers as the classification)
$ vfdb_to_seqfindr -i TOTAL_Strep_VFs.fas -o TOTAL_Strep_VFs.sqf

# Sets any classification to blank ([ ])
$ vfdb_to_seqfindr -i TOTAL_Strep_VFs.fas -o TOTAL_Strep_VFs.sqf -b

# Reads a user defined classification. 1 per line, in the same order as the
# input sequences
$ python convert_vfdb_to_SeqFindr.py -i TOTAL_Strep_VFs.fas
  -o TOTAL_Strep_VFs.sqf -c blah.dat

About option --class_file

Suppose you want to annotate each VF with a user defined class. Simply create a file containing the scheme (1-1 matching). If you had 6 input sequences where the first 3 are Fe transporters, the next 2 are Toxins and the final sequence is Misc, your class file would look like this:

Fe transporter
Fe transporter
Fe transporter
Toxins
Toxins
Misc
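
The 1-1 matching between input sequences and class entries can be sketched as a checked zip. This is an illustration only; pair_classes and the sequence names are hypothetical, and raising an error on a length mismatch is an assumption about sensible behaviour rather than what the tool does.

```python
def pair_classes(headers, class_lines):
    """Pair each sequence header with its class, enforcing 1-1 matching."""
    classes = [line.strip() for line in class_lines if line.strip()]
    if len(classes) != len(headers):
        raise ValueError("class file must have one entry per input sequence")
    return list(zip(headers, classes))


headers = ["seq1", "seq2", "seq3", "seq4", "seq5", "seq6"]
class_lines = [
    "Fe transporter\n", "Fe transporter\n", "Fe transporter\n",
    "Toxins\n", "Toxins\n", "Misc\n",
]
for header, cls in pair_classes(headers, class_lines):
    print(header, "[" + cls + "]")
```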

SeqFindr.vfdb_to_seqfindr.main(args)
SeqFindr.vfdb_to_seqfindr.order_by_class(args)

Ensure that all particular classes are in the same block

Module contents