Help on Python package (pygtftk)¶
The GTF class¶
Class declaration of GTF object.
When using gtftk, the methods of a GTF object may return:
- a GTF object (a representation of a GTF file).
- a TAB object (a representation of a tabulated file).
- a FASTA object (a representation of a fasta file).
-
class
pygtftk.gtf_interface.
GTF
(input_obj=None, check_ensembl_format=True, new_data=None)¶ An interface to a GTF file. This object returns a GTF, TAB or FASTA object.
-
add_attr_column
(input_file=None, new_key='new_key')¶ Simply add a new column of attribute/value to each line.
Parameters: - input_file – a single column file with nrow(GTF_DATA) lines. File lines should be sorted according to the GTF_DATA object. Lines for which a new key should not be added should contain “?”.
- new_key – name of the novel key.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import NEWLINE
>>> a_path = get_example_file()[0]
>>> a_col = get_example_file(ext="csv")[0]
>>> a_gtf = GTF(a_path)
>>> a_gtf = a_gtf.add_attr_column(open(a_col), new_key='foo')
>>> a_list = a_gtf.extract_data('foo', as_list=True)
>>> assert [int(x) for x in a_list] == list(range(70))
-
add_attr_from_dict
(feat='*', key='transcript_id', a_dict=None, new_key='new_key')¶ Add key/value pairs to the GTF object.
Parameters: - feat – A comma-separated list of target features. If None, all features are targeted.
- key – The name of the key used for joining (e.g. ‘transcript_id’).
- a_dict – A dict whose keys and values will be used to call add_attr_from_file.
- new_key – A name for the novel key.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_dict = a_gtf.nb_exons()
>>> a_gtf = a_gtf.add_attr_from_dict(feat="transcript", a_dict=a_dict, new_key="exon_nb")
>>> b_dict = a_gtf.select_by_key("feature", "transcript").extract_data("transcript_id,exon_nb", as_dict_of_values=True)
>>> assert a_dict['G0006T001'] == int(b_dict['G0006T001'])
>>> assert a_dict['G0008T001'] == int(b_dict['G0008T001'])
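The join described above can be pictured with a small plain-Python sketch. It does not use pygtftk and is not the library's implementation: the record layout (a feature name paired with an attribute dict), the helper name add_attr_from_mapping and the sample values are all assumptions made for illustration.

```python
# Plain-Python illustration of a key-based join; this is NOT pygtftk code.
# Each record is a hypothetical (feature, attributes) pair.
records = [
    ("gene", {"gene_id": "G0001"}),
    ("transcript", {"gene_id": "G0001", "transcript_id": "G0001T001"}),
    ("exon", {"gene_id": "G0001", "transcript_id": "G0001T001"}),
    ("transcript", {"gene_id": "G0001", "transcript_id": "G0001T002"}),
]


def add_attr_from_mapping(records, feat, key, mapping, new_key):
    """Return new records where matching lines gain new_key=mapping[attrs[key]]."""
    targets = None if feat == "*" else set(feat.split(","))
    out = []
    for feature, attrs in records:
        attrs = dict(attrs)  # copy, so the input records are left untouched
        feature_ok = targets is None or feature in targets
        if feature_ok and attrs.get(key) in mapping:
            # Values are stored as strings, as attribute values are text in a GTF.
            attrs[new_key] = str(mapping[attrs[key]])
        out.append((feature, attrs))
    return out


exon_counts = {"G0001T001": 1, "G0001T002": 3}  # hypothetical values
joined = add_attr_from_mapping(
    records, "transcript", "transcript_id", exon_counts, "exon_nb"
)
```

Only the two transcript lines receive the new key here; the gene and exon lines are returned unchanged because they are not in the targeted feature list.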
-
add_attr_from_file
(feat=None, key='transcript_id', new_key='new_key', inputfile=None, has_header=False)¶ Add key/value pairs to the GTF object.
Parameters: - feat – A comma-separated list of target features. If None, all features are targeted.
- key – The name of the key used for joining (i.e. the key corresponding to the values provided in the file).
- new_key – A name for the novel key.
- inputfile – A two-column file (e.g. transcript_id and new value).
- has_header – Whether the file to join contains column names.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import TAB
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> join_file = get_example_file("simple", "join")[0]
>>> b_gtf = a_gtf.add_attr_from_file(feat="gene", key="gene_id", new_key="bla", inputfile=join_file)
>>> b_list = b_gtf.select_by_key("feature", "gene").extract_data("bla", as_list=True, no_na=True, hide_undef=True)
>>> assert b_list == ['0.2322', '0.999', '0.5555']
-
add_attr_from_list
(feat=None, key='transcript_id', key_value=[], new_key='new_key', new_key_value=[])¶ Add key/value pairs to the GTF object.
Parameters: - feat – A comma-separated list of target features. If None, all features are targeted.
- key – The name of the key used for joining (e.g. ‘transcript_id’).
- key_value – The values for the key (e.g [‘tx_1’, ‘tx_2’,…,’tx_n’])
- new_key – A name for the novel key.
- new_key_value – The values for the novel key (e.g [‘val_tx_1’, ‘val_tx_2’,…,’val_tx_n’]).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import TAB
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> b_gtf = a_gtf.add_attr_from_list(feat="gene", key="gene_id", key_value=["G0001", "G0002"], new_key="coding_pot", new_key_value=["0.5", "0.8"])
>>> assert b_gtf.extract_data(keys="coding_pot", as_list=True, no_na=True, hide_undef=True) == ['0.5', '0.8']
>>> b_gtf = a_gtf.add_attr_from_list(feat="gene", key="gene_id", key_value=["G0002", "G0001"], new_key="coding_pot", new_key_value=["0.8", "0.5"])
>>> assert b_gtf.extract_data(keys="coding_pot", as_list=True, no_na=True, hide_undef=True) == ['0.5', '0.8']
>>> key_value = a_gtf.extract_data("transcript_id", no_na=True, as_list=True, nr=True)
>>> b=a_gtf.add_attr_from_list(None, key="transcript_id", key_value=key_value, new_key="bla", new_key_value=[str(x) for x in range(len(key_value))])
-
add_attr_from_matrix_file
(feat=None, key='transcript_id', inputfile=None)¶ Add key/value pairs to the GTF file. Expects a matrix with row names as target keys, column names as novel keys and each cell as a value.
Parameters: - feat – A comma-separated list of target features. If None, all features are targeted.
- key – The name of the key used for joining (i.e. the key corresponding to the row names provided in the file).
- inputfile – A matrix file (row names as values of the joining key, column names as novel keys).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import TAB
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> join_file = get_example_file("simple", "join_mat")[0]
>>> b_gtf = a_gtf.add_attr_from_matrix_file(feat="gene", key="gene_id", inputfile=join_file)
>>> assert b_gtf.extract_data("S1",as_list=True, no_na=True, hide_undef=True) == ['0.2322', '0.999', '0.5555']
>>> assert b_gtf.extract_data("S2",as_list=True, no_na=True, hide_undef=True) == ['0.4', '0.6', '0.7']
-
add_attr_to_pos
(input_file=None, new_key='new_key')¶ Simply add a new column of attribute/value to specific lines.
Parameters: - input_file – a two column file. First column contains the position/line (zero-based numbering). Second, the value to be added for ‘new_key’.
- new_key – name of the novel key.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import NEWLINE
>>> a_path = get_example_file()[0]
>>> a_col = get_example_file(ext="tab")[0]
>>> a_gtf = GTF(a_path)
>>> a_gtf = a_gtf.add_attr_to_pos(open(a_col), new_key='foo')
>>> a_list = a_gtf.extract_data('foo', as_list=True)
>>> assert a_list[0] == 'AAA'
>>> assert a_list[1] == 'BBB'
>>> assert a_list[2] == '?'
>>> assert a_list[3] == 'CCC'
-
add_exon_number
(key='exon_nbr')¶ Add exon number to exons.
Parameters: key – The key name to use for exon number.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_gtf = a_gtf.add_exon_number()
>>> a_list = a_gtf.select_by_key("feature", "exon").select_by_key("transcript_id", "G0004T002").extract_data("exon_nbr", as_list=True)
>>> assert a_list == ['1', '2', '3']
>>> a_list = a_gtf.select_by_key("feature", "exon").select_by_key("transcript_id", "G0003T001").extract_data("exon_nbr", as_list=True)
>>> assert a_list == ['2', '1']
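The ['2', '1'] result in the example is what one would expect if exons are numbered from the 5' end of the transcript and G0003T001 lies on the minus strand. Both points are assumptions, not statements taken from the library. Under those assumptions the numbering can be sketched in plain Python as follows (number_exons is a hypothetical helper, and the coordinates are made up):

```python
# Plain-Python sketch of 5'-to-3' exon numbering; this is NOT pygtftk code.
def number_exons(exons, strand):
    """Number exons from the 5' end of the transcript.

    exons  -- list of (start, end) tuples, in file order
    strand -- '+' or '-'
    Returns the exon numbers (1-based) in the same order as the input.
    """
    # On the minus strand the 5' end is the exon with the highest coordinate.
    order = sorted(
        range(len(exons)),
        key=lambda i: exons[i][0],
        reverse=(strand == "-"),
    )
    numbers = [0] * len(exons)
    for rank, idx in enumerate(order, start=1):
        numbers[idx] = rank
    return numbers


plus_numbers = number_exons([(10, 20), (30, 40), (50, 60)], "+")
minus_numbers = number_exons([(10, 20), (30, 40)], "-")
```

With exons listed by increasing coordinate, a plus-strand transcript is numbered 1, 2, 3 and a two-exon minus-strand transcript is numbered 2, 1.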
-
add_prefix
(feat='*', key=None, txt=None, suffix=False)¶ Add a prefix to an attribute value.
Parameters: - feat – The target features (default to all).
- key – The target key.
- txt – The string to be added as a prefix.
- suffix – If True append txt string as a suffix.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_gtf = a_gtf.add_prefix(feat="transcript", key="gene_id", suffix=False, txt="bla")
>>> a_list = a_gtf.select_by_key("feature", "transcript").extract_data("gene_id", as_list=True, nr=True)
>>> assert all([x.startswith('bla') for x in a_list])
>>> a_gtf = a_gtf.add_prefix(feat="transcript", key="gene_id", suffix=True, txt="bla")
>>> a_list = a_gtf.select_by_key("feature", "transcript").extract_data("gene_id", as_list=True, nr=True)
>>> assert all([x.endswith('bla') for x in a_list])
-
convert_to_ensembl
(check_gene_chr=True)¶ Convert a GTF file to ensembl format. This method will add gene and transcript features if not encountered in the GTF.
Parameters: check_gene_chr – By default the function raises an error if several chromosomes are associated with the same gene_id.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> assert len(a_gtf.select_by_key("feature", "gene,transcript", invert_match=True).convert_to_ensembl()) == len(a_gtf)
-
del_attr
(feat='*', keys=None, force=False)¶ Delete attributes from the GTF file.
Parameters: - feat – The target feature.
- keys – Comma separated list or list of target keys.
- force – If True, transcript_id and gene_id can be deleted.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import TAB
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> assert a_gtf.del_attr(keys="gene_id,exon_id,ccds_id", force=True).get_attr_list() == ['transcript_id']
-
extract_data
(keys=None, as_list=False, zero_based=False, no_na=False, as_dict_of_values=False, as_dict_of_lists=False, as_dict=False, as_dict_of_merged_list=False, nr=False, hide_undef=False, as_list_of_list=False)¶ Extract attribute values into containers of various types. Note that when a dict is requested, no_na is forced for the keys.
Note: no_na, hide_undef, nr and zero_based are not applied when a TAB object is returned (the default).
Parameters: - keys – The name of the basic or extended attributes.
- as_list – if true returns values for the first requested key as a list.
- no_na – Discard ‘.’ (i.e undefined value).
- as_dict – if true returns a dict (where the key corresponds to values for the first requested key and values are set to 1). Note that ‘.’ won’t be set as a key (no need to use no_na).
- as_dict_of_values – if true returns a dict (where the key corresponds to values for the first requested key and the value to value encountered for the second requested key).
- as_dict_of_lists – if true returns a dict (where the key corresponds to values for the first requested key and the value to the last encountered list of other elements).
- as_dict_of_merged_list – if true returns a dict (where the key corresponds to values for the first requested key and the value to the list of other elements encountered throughout any line).
- as_list_of_list – if True returns a list for each line.
- nr – if as_list or as_list_of_list is chosen, return a non-redundant list.
- hide_undef – if True, records for which the key does not exist (value = “?”) are discarded.
- zero_based – If set to True, the start position will be start-1.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_tab = a_gtf.extract_data("gene_id,transcript_id")
>>> assert a_tab.ncols == 2
>>> assert len(a_tab) == 70
>>> assert len(a_gtf.extract_data("transcript_id", nr=False, as_list=True,no_na=False, hide_undef=False)) == 70
>>> assert len(a_gtf.extract_data("transcript_id", nr=False, as_list=True,no_na=False, hide_undef=True)) == 60
>>> assert len(a_gtf.select_by_key("feature", "transcript").extract_data("seqid,start")) == 15
>>> assert len(a_gtf.select_by_key("feature", "transcript").extract_data("seqid,start",as_list=True))
>>> assert a_gtf.select_by_key("feature", "transcript").extract_data("seqid,start", as_list=True, nr=True) == ['chr1']
>>> assert a_gtf.select_by_key("feature", "transcript").extract_data("bla", as_list=True, nr=False, hide_undef=False).count('?') == 15
>>> assert a_gtf.select_by_key("feature", "transcript").extract_data("bla", as_list=True, nr=True, hide_undef=False).count('?') == 1
>>> assert len(a_gtf.select_by_key("feature", "transcript").extract_data("start", as_dict=True)) == 11
>>> assert len(a_gtf.select_by_key("feature", "transcript").extract_data("seqid", as_dict=True)) == 1
>>> assert [len(x) for x in a_gtf.select_by_key("feature", "transcript").extract_data("seqid,start", as_list_of_list=True)].count(2) == 15
>>> assert len(a_gtf.select_by_key("feature", "transcript").extract_data("seqid,start", as_list_of_list=True, nr=True)) == 11
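The interplay of no_na, hide_undef and nr on a list result can be pictured with a plain-Python sketch. It follows the descriptions above ('.' is an undefined value, '?' marks a key that does not exist on the line), but filter_values is a hypothetical helper and the order-preserving de-duplication is an assumption, not a documented guarantee of pygtftk.

```python
# Plain-Python sketch of the three list filters; this is NOT pygtftk code.
def filter_values(values, no_na=False, hide_undef=False, nr=False):
    """Apply no_na ('.' dropped), hide_undef ('?' dropped) and nr (de-duplicate)."""
    out = []
    seen = set()
    for value in values:
        if no_na and value == ".":
            continue
        if hide_undef and value == "?":
            continue
        if nr:
            if value in seen:
                continue
            seen.add(value)
        out.append(value)
    return out


raw = ["T1", "?", "T1", ".", "?", "T2"]  # hypothetical extracted values
only_nr = filter_values(raw, nr=True)
only_defined = filter_values(raw, hide_undef=True)
cleaned = filter_values(raw, no_na=True, hide_undef=True, nr=True)
```

Note how nr alone keeps a single '?' entry, which mirrors the count('?') == 1 assertion in the example above.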
-
extract_data_iter_list
(keys=None, zero_based=False, nr=False)¶ Extract attributes of any type and returns an iterator of lists.
Parameters: - keys – The name of the basic or extended attributes.
- zero_based – If set to True, the start position will be start-1.
- nr – if True, return a list with non redundant items.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> gen = a_gtf.extract_data_iter_list("start,gene_id")
>>> assert next(gen) == ['125', 'G0001']
>>> gen =a_gtf.extract_data_iter_list("start,gene_id", zero_based=True)
>>> assert next(gen) == ['124', 'G0001']
>>> gen = a_gtf.extract_data_iter_list("start,gene_id")
>>> assert len([x for x in gen]) == 70
>>> gen = a_gtf.extract_data_iter_list("start,gene_id", nr=True)
>>> assert len([x for x in gen]) == 26
-
get_3p_end
(feat_type='transcript', name=['transcript_id'], sep='|', as_dict=False, more_name=[], feature_name=None, explicit=False)¶ Returns a Bedtool object containing the 3’ coordinates of selected features (Bed6 format) or a dict (zero-based coordinate).
Parameters: - feat_type – The feature type.
- name – The key that should be used to compute the ‘name’ column.
- more_name – Additional text to add to the name (a list).
- sep – The separator used for the name (e.g. ‘gene_id|transcript_id’).
- as_dict – return the result as a dictionary.
- feature_name – A feature name to be added to the 4th column.
- explicit – Write explicitly the key name in the 4th column (e.g transcript_id=NM123|gene_name=AGENE|…).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_bed = a_gtf.select_by_key("feature", "exon").get_3p_end(feat_type="exon")
>>> assert len(a_bed) == 25
>>> a_bed = a_gtf.select_by_key("feature", "exon").get_3p_end(feat_type="exon", name=["gene_id","exon_id"])
>>> assert len(a_bed) == 25
>>> a_bed = a_gtf.select_by_key("feature", "exon").get_3p_end(feat_type="exon", name=["gene_id","exon_id"], explicit=True)
>>> for i in a_bed: break
>>> assert i.name == 'gene_id=G0001|exon_id=G0001T002E001'
-
get_5p_end
(feat_type='transcript', name=['transcript_id'], sep='|', as_dict=False, more_name=[], one_based=False, feature_name=None, explicit=False)¶ Returns a Bedtool object containing the 5’ coordinates of selected features (Bed6 format) or a dict (zero-based coordinate).
Parameters: - feat_type – The feature type.
- name – The key that should be used to compute the ‘name’ column.
- sep – The separator used for the name (e.g. ‘gene_id|transcript_id’).
- more_name – Additional text to add to the name (a list).
- as_dict – return the result as a dictionary.
- one_based – if as_dict is requested, return coordinates in one-based format.
- feature_name – A feature name to be added to the 4th column.
- explicit – Write explicitly the key name in the 4th column (e.g transcript_id=NM123|gene_name=AGENE|…).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_bed = a_gtf.select_by_key("feature", "exon").get_5p_end(feat_type="exon")
>>> assert len(a_bed) == 25
>>> a_dict = a_gtf.select_by_key("feature", "exon").get_5p_end(feat_type="exon", name=["exon_id"], as_dict=True)
>>> assert len(a_dict) == 25
>>> a_dict = a_gtf.select_by_key("feature", "exon").get_5p_end(feat_type="exon", name=["exon_id", "transcript_id"], as_dict=True)
>>> assert len(a_dict) == 25
>>> a_bed = a_gtf.select_by_key("feature", "exon").get_5p_end(feat_type="exon", name=["transcript_id", "gene_id"], explicit=True)
>>> for i in a_bed: break
>>> assert i.name == 'transcript_id=G0001T002|gene_id=G0001'
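For readers unfamiliar with the coordinate conventions, the sketch below shows how a 5' or 3' end could be derived from a GTF line (1-based, inclusive) and written as a single-base BED interval (zero-based, half-open). It is a plain-Python illustration under those standard format conventions; end_as_bed is a hypothetical helper and not part of pygtftk.

```python
# Plain-Python sketch of 5'/3' end extraction; this is NOT pygtftk code.
def end_as_bed(start, end, strand, which="5p"):
    """Return (bed_start, bed_end) for the 5' or 3' end of a feature.

    start, end -- GTF coordinates (1-based, inclusive)
    strand     -- '+' or '-'
    which      -- '5p' or '3p'
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if which not in ("5p", "3p"):
        raise ValueError("which must be '5p' or '3p'")
    # The 5' end is the start on '+' and the end on '-'; the 3' end is the opposite.
    use_start = (strand == "+") == (which == "5p")
    pos = start if use_start else end
    # A 1-based position p corresponds to the BED interval [p - 1, p).
    return pos - 1, pos


five_plus = end_as_bed(125, 138, "+", "5p")
three_plus = end_as_bed(125, 138, "+", "3p")
five_minus = end_as_bed(125, 138, "-", "5p")
```

The same rule applied to transcript features gives the TSS (5' end) and TTS (3' end).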
-
get_attr_list
(add_basic=False, as_dict=False)¶ Get the list of possible attributes from a GTF file.
Parameters: - add_basic – add basic attributes to the list. Won’t work if as_dict is True.
- as_dict – If True, returns a dict with the attribute names as keys and their number of occurrences as values.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import TAB
>>> a_file = get_example_file()[0]
>>> assert "".join([x[0] for x in GTF(a_file).get_attr_list()]) == 'gtec'
-
get_attr_value_list
(key=None, count=False)¶ Get the list of possible values taken by an attribute from a GTF file.
Parameters: - key – The key/attribute name to use.
- count – whether the number of occurrences should also be returned (returns a list of lists).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import TAB
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> assert len(a_gtf.get_attr_value_list(key="transcript_id")) == 15
>>> assert a_gtf.get_attr_value_list(key="transcript_id", count=True)[0] == ['G0001T001', '3']
>>> assert a_gtf.get_attr_value_list(key="transcript_id", count=True)[-1] == ['G0010T001', '3']
-
get_chroms
(nr=False, as_dict=False)¶ Returns the chromosome/sequence IDs from the GTF.
Parameters: - nr – True to get a non-redundant list.
- as_dict – True to get info as a dict.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_chr = GTF(a_file).get_chroms(nr=True)
>>> assert len(a_chr) == 1
>>> a_chr = GTF(a_file).get_chroms()
>>> assert len(a_chr) == 70
-
get_feature_list
(nr=False)¶ Returns the list of features.
Parameters: nr – True to get a non-redundant list.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_feat = a_gtf.get_feature_list(nr=True)
>>> assert len(a_feat) == 4
>>> a_feat = a_gtf.get_feature_list(nr=False)
>>> assert len(a_feat) == 70
-
get_feature_size
(ft_type='gene', feat_id=None)¶ Returns feature size as a dict.
Parameters: - ft_type – The feature type (3rd column). Could be gene, transcript, exon, CDS…
- feat_id – the string used as feature identifier in the GTF. Could be gene_id, transcript_id, exon_id, CDS_id… By default it is computed as ft_type + ‘_id’
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> assert len(a_gtf.get_feature_size("exon", "exon_id")) == 25
-
get_gn_ids
(nr=False)¶ Returns all gene ids from the GTF file.
Parameters: nr – True to get a non-redundant list.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> gn_ids = a_gtf.get_gn_ids()
>>> assert len(gn_ids) == 70
>>> gn_ids = a_gtf.get_gn_ids(nr=True)
>>> assert len(gn_ids) == 10
>>> a_gtf = GTF(a_file).select_by_key("gene_id","G0001,G0002")
>>> gn_ids = a_gtf.get_gn_ids(nr=True)
>>> assert len(gn_ids) == 2
-
get_gn_strand
()¶ Returns a dict with gene IDs as keys and strands as values.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> strands = a_gtf.get_gn_strand()
>>> assert strands['G0001'] == '+'
>>> assert strands['G0001'] == '+'
>>> assert strands['G0005'] == '-'
>>> assert strands['G0006'] == '-'
-
get_gn_to_tx
(ordered_5p=False, as_dict_of_dict=False)¶ Returns a dict with genes as keys and the list of their associated transcripts as values.
Parameters: - ordered_5p – If True, returns the transcript list ordered by TSS coordinates (5’ to 3’). For ties, values are randomly picked.
- as_dict_of_dict – If True, returns a dict of dicts that maps gene_id to transcript_id and transcript_id to TSS numbering (1 for the most 5’, then 2…).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file(datasetname="mini_real",ext="gtf.gz")[0]
>>> a_gtf = GTF(a_file)
>>> gn_2_tx = a_gtf.get_gn_to_tx(ordered_5p=True)
>>> assert gn_2_tx['ENSG00000148337'][0:3] == ['ENST00000372948', 'ENST00000634901', 'ENST00000634501']
>>> assert gn_2_tx['ENSG00000148339'][0:3] == ['ENST00000373068', 'ENST00000373069', 'ENST00000373066']
>>> assert gn_2_tx['ENSG00000148339'][5:] == ['ENST00000472769', 'ENST00000445012', 'ENST00000466983']
>>> gn_2_tx = a_gtf.get_gn_to_tx(as_dict_of_dict=True)
>>> assert gn_2_tx['ENSG00000153885']['ENST00000587658'] == 1
>>> assert gn_2_tx['ENSG00000153885']['ENST00000587559'] == 2
>>> assert gn_2_tx['ENSG00000153885']['ENST00000430256'] == 8
>>> assert gn_2_tx['ENSG00000153885']['ENST00000284006'] == gn_2_tx['ENSG00000153885']['ENST00000589786']
>>> assert gn_2_tx['ENSG00000164587']['ENST00000407193'] == 1
>>> assert gn_2_tx['ENSG00000164587']['ENST00000401695'] == 2
>>> assert gn_2_tx['ENSG00000164587']['ENST00000519690'] == 3
>>> assert gn_2_tx['ENSG00000105483']['ENST00000517510'] == gn_2_tx['ENSG00000105483']['ENST00000520007']
-
get_gname_to_tx
()¶ Returns a dict with gene names as keys and the list of their associated transcripts as values.
-
get_intergenic
(chrom_file=None, upstream=0, downstream=0, chr_list=None, feature_name=None)¶ Returns a bedtools object containing the intergenic regions.
Parameters: - chrom_file – file object pointing to a tabulated two-columns file. Chromosomes as column 1 and their sizes as column 2 (default: None)
- upstream – Extend transcript coordinates in 5’ direction.
- downstream – Extend transcript coordinates in 3’ direction.
- chr_list – a list of chromosomes to which intergenic regions should be restricted.
- feature_name – A feature name to be added to the 4th column.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> chr_info_path = get_example_file(ext="chromInfo")[0]
>>> chr_info_file = open(chr_info_path, "r")
>>> a_gtf = GTF(a_file)
>>> a_bed = a_gtf.get_intergenic(chrom_file=chr_info_file)
>>> assert len(a_bed) == 10
-
get_introns
(by_transcript=False, name=['transcript_id', 'gene_id'], sep='|', intron_nb_in_name=False, feat_name=False, feat_name_last=False)¶ Returns a bedtools object containing the intronic regions.
Parameters: - by_transcript – If false (default), merged genic regions with no exonic overlap are returned. Otherwise, the intronic regions corresponding to each transcript are returned (may contain exonic overlap).
- name – The list of ids that should be part of the 4th column of the bed file (if by_transcript is chosen).
- sep – The separator to be used for the name (if by_transcript is chosen).
- intron_nb_in_name – by default the intron number is added to the score column. If selected, write this information in the ‘name’ column of the bed file.
- feat_name – add the feature name (‘intron’) in the name column.
- feat_name_last – put feat_name in last position (but before intron_nb_in_name).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_bed = a_gtf.get_introns()
>>> assert len(a_bed) == 7
-
get_midpoints
(name=['transcript_id', 'gene_id'], sep='|')¶ Returns a bedtools object containing the midpoints of features.
Parameters: - name – The key that should be used to compute the ‘name’ column.
- sep – The separator used for the name (e.g. ‘gene_id|transcript_id’).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_bed = a_gtf.select_by_key("feature", "exon").get_midpoints()
>>> assert len(a_bed) == 25
-
get_sequences
(genome=None, intron=False, rev_comp=True)¶ Returns a FASTA object from which various sequences can be retrieved.
Parameters: - genome – The genome as a fasta file.
- intron – Set to True to also get intronic regions. Required for some methods of the FASTA object.
- rev_comp – By default, sequences for genes on the minus strand are reverse-complemented. Set to False to avoid this default behaviour.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file("simple", "gtf")[0]
>>> a_gtf = GTF(a_file)
>>> genome_fa = get_example_file("simple", "fa")[0]
>>> # No intron, reverse-complement
>>> a_fa = a_gtf.get_sequences(genome_fa)
>>> assert len(a_fa) == 15
>>> a_dict = dict()
>>> for i in a_fa: a_dict[i.header] = i.sequence
>>> assert a_dict['>G0001_NA_G0001T001_chr1:125-138_+_mRNA'] == 'cccccgttacgtag'
>>> assert a_dict['>G0008_NA_G0008T001_chr1:210-222_-_RC_mRNA'] == 'catgcgct'
>>> assert a_dict['>G0004_NA_G0004T001_chr1:65-76_+_mRNA'] == 'atctggcg'
>>> # No intron, no reverse-complement
>>> a_fa = a_gtf.get_sequences(genome_fa, rev_comp=False)
>>> a_dict = dict()
>>> for i in a_fa: a_dict[i.header] = i.sequence
>>> assert a_dict['>G0001_NA_G0001T001_chr1:125-138_+_mRNA'] == 'cccccgttacgtag'
>>> assert a_dict['>G0008_NA_G0008T001_chr1:210-222_-_mRNA'] == 'agcgcatg'
>>> assert a_dict['>G0004_NA_G0004T001_chr1:65-76_+_mRNA'] == 'atctggcg'
>>> # Intron and Reverse-complement
>>> a_fa = a_gtf.get_sequences(genome_fa, intron=True)
>>> a_dict = dict()
>>> for i in a_fa: a_dict[i.header] = i.sequence
>>> assert a_dict['>G0008_NA_G0008T001_chr1:210-222_-_RC'] == 'catatggtgcgct'
>>> assert a_dict['>G0004_NA_G0004T001_chr1:65-76_+'] == 'atctcaggggcg'
>>> # Intron and no Reverse-complement
>>> a_fa = a_gtf.get_sequences(genome_fa, intron=True, rev_comp=False)
>>> a_dict = dict()
>>> for i in a_fa: a_dict[i.header] = i.sequence
>>> assert a_dict['>G0008_NA_G0008T001_chr1:210-222_-'] == 'agcgcaccatatg'
>>> assert a_dict['>G0004_NA_G0004T001_chr1:65-76_+'] == 'atctcaggggcg'
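The effect of rev_comp can be checked by hand on the sequences of G0008T001 quoted in the example. The plain-Python sketch below is a generic reverse-complement routine, not the one used by pygtftk; reverse_complement and its translation table are illustrative names.

```python
# Plain-Python reverse complement; this is NOT pygtftk code.
# The table covers a, c, g, t and n in both cases.
_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Sequences taken from the example above (minus-strand transcript G0008T001).
mature_rc = reverse_complement("agcgcatg")
with_intron_rc = reverse_complement("agcgcaccatatg")
```

Applied to the rev_comp=False sequences, this yields 'catgcgct' and 'catatggtgcgct', the values asserted for the reverse-complemented headers (suffix _RC) in the example.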
-
get_transcript_size
(with_intron=False)¶ Returns a dict with transcript_id as keys and the mature RNA length (i.e. exons only) of each transcript as values.
Parameters: with_intron – If True, add intronic regions.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_path = get_example_file()[0]
>>> gtf = GTF(a_path)
>>> size_dict = gtf.get_transcript_size()
>>> assert(size_dict["G0008T001"] == 8)
>>> size_dict = gtf.get_transcript_size(with_intron=True)
>>> assert(size_dict["G0008T001"] == 13)
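The two sizes differ only in whether the gaps between exons are counted. The plain-Python sketch below makes the arithmetic explicit with 1-based, inclusive coordinates. The helper transcript_size is hypothetical, and the exon coordinates are invented so that the totals match the 8 and 13 of the example; they are not read from the example file.

```python
# Plain-Python sketch of transcript length; this is NOT pygtftk code.
def transcript_size(exons, with_intron=False):
    """Length of a transcript from its exons (1-based, inclusive coordinates).

    Assumes the exons of a single transcript do not overlap.
    """
    if with_intron:
        # Genomic span from the first exon start to the last exon end.
        return max(end for _, end in exons) - min(start for start, _ in exons) + 1
    # Mature RNA: sum of exon lengths only.
    return sum(end - start + 1 for start, end in exons)


hypothetical_exons = [(210, 213), (219, 222)]  # invented coordinates
mature_size = transcript_size(hypothetical_exons)
genomic_size = transcript_size(hypothetical_exons, with_intron=True)
```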
-
get_tss
(name=['transcript_id'], sep='|', as_dict=False, more_name=[], one_based=False, feature_name=None, explicit=False)¶ Returns a Bedtool object containing the TSSs (Bed6 format) or a dict (zero-based coordinate).
Parameters: - name – The key that should be used to compute the ‘name’ column.
- more_name – Additional text to add to the name (a list).
- sep – The separator used for the name (e.g. ‘gene_id|transcript_id’).
- as_dict – return the result as a dictionary.
- one_based – if as_dict is requested, return coordinates in one-based format.
- feature_name – A feature name to be added to the 4th column.
- explicit – Write explicitly the key name in the 4th column (e.g transcript_id=NM123|gene_name=AGENE|…).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_bed = a_gtf.get_tss()
>>> assert len(a_bed) == 15
-
get_tts
(name=['transcript_id'], sep='|', as_dict=False, more_name=[], feature_name=None, explicit=False)¶ Returns a Bedtool object containing the TTSs (Bed6 format) or a dict (zero-based coordinate).
Parameters: - name – The key that should be used to compute the ‘name’ column.
- sep – The separator used for the name (e.g. ‘gene_id|transcript_id’).
- as_dict – return the result as a dictionary.
- more_name – Additional text to add to the name (a list).
- feature_name – A feature name to be added to the 4th column.
- explicit – Write explicitly the key name in the 4th column (e.g transcript_id=NM123|gene_name=AGENE|…).
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_bed = a_gtf.get_tts()
>>> assert len(a_bed) == 15
-
get_tx_ids
(nr=False)¶ Returns all transcript ids from the GTF file.
Parameters: nr – True to get a non-redundant list.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> tx_ids = a_gtf.get_tx_ids()
>>> assert len(tx_ids) == 60
>>> tx_ids = a_gtf.get_tx_ids(nr=True)
>>> assert len(tx_ids) == 15
>>> a_gtf = GTF(a_file).select_by_key("transcript_id","G0001T002,G0001T001,G0003T001")
>>> tx_ids = a_gtf.get_tx_ids(nr=True)
>>> assert len(tx_ids) == 3
-
get_tx_strand
()¶ Returns a dict with transcript IDs as keys and strands as values.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> strands = a_gtf.get_tx_strand()
>>> assert strands['G0007T001'] == '+'
>>> assert strands['G0001T002'] == '+'
>>> assert strands['G0009T001'] == '-'
>>> assert strands['G0008T001'] == '-'
-
get_tx_to_gn
()¶ Returns a dict with transcripts IDs as keys and gene IDs as values.
-
get_tx_to_gname
()¶ Returns a dict with transcript IDs as keys and gene names as values.
-
head
(nb=6, returned=False)¶ Print the nb first lines of the GTF object.
Parameters: - nb – The number of lines to display.
- returned – If True, don’t print but returns the message.
-
merge_attr
(feat='*', keys='gene_id, transcript_id', new_key='gn_tx_id', sep='|')¶ Merge a set of source keys (e.g gene_id and transcript_id) into a destination attribute.
Parameters: - feat – The target features.
- keys – The source keys.
- new_key – The destination key.
- sep – The separator.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> a_list = a_gtf.merge_attr(feat="exon,transcript,CDS", keys="gene_id,transcript_id", new_key="merge").extract_data("merge", hide_undef=True, as_list=True, nr=True)
>>> assert a_list[0] == 'G0001|G0001T002'
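The merge itself amounts to joining the values of the source keys with the separator. The plain-Python sketch below shows this on a single attribute dict. It is not pygtftk code: merge_attr_values is a hypothetical helper, and filling a missing source key with '?' is an assumption borrowed from the way undefined keys are reported elsewhere on this page.

```python
# Plain-Python sketch of an attribute merge; this is NOT pygtftk code.
def merge_attr_values(attrs, keys, new_key, sep="|"):
    """Return a copy of `attrs` with `new_key` set to the joined source values.

    attrs -- dict of attribute name -> value for one GTF line
    keys  -- comma-separated source keys (spaces around names are ignored)
    """
    names = [name.strip() for name in keys.split(",")]
    merged = dict(attrs)
    # '?' for a missing key is an assumption made for this sketch.
    merged[new_key] = sep.join(attrs.get(name, "?") for name in names)
    return merged


line_attrs = {"gene_id": "G0001", "transcript_id": "G0001T002"}
merged_line = merge_attr_values(line_attrs, "gene_id, transcript_id", "merge")
```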
-
nb_exons
()¶ Returns a dict with transcript ids as keys and numbers of exons as values.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> from pygtftk.utils import TAB
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> nb_ex = a_gtf.nb_exons()
>>> assert len(nb_ex) == 15
>>> assert nb_ex['G0006T001'] == 3
>>> assert nb_ex['G0004T002'] == 3
>>> assert nb_ex['G0004T002'] == 3
>>> assert nb_ex['G0005T001'] == 2
>>> a_file = get_example_file("simple_04")[0]
>>> a_gtf = GTF(a_file)
>>> nb_ex = a_gtf.nb_exons()
>>> assert nb_ex['G0004T001'] == 4
-
nrow
()¶ Get the number of lines enclosed in the GTF object.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
>>> assert a_gtf.nrow() == 70
>>> assert a_gtf.select_by_key("feature","gene").nrow() == 10
>>> assert a_gtf.select_by_key("feature","transcript").nrow() == 15
>>> assert a_gtf.select_by_key("feature","exon").nrow() == 25
>>> assert a_gtf.select_by_key("feature","CDS").nrow() == 20
-
select_5p_transcript
()¶ Select the most 5’ transcript of each gene. Note that the genomic boundaries for each gene could differ from that displayed by its associated transcripts. The gene features may be subsequently deleted using select_by_key.
Example:
>>> from pygtftk.utils import get_example_file
>>> from pygtftk.gtf_interface import GTF
>>> a_file = get_example_file()[0]
>>> a_gtf = GTF(a_file)
-
select_by_key
(key=None, value=None, invert_match=False, file_with_values=None, no_na=True, col=1)¶ Select lines from a GTF based on i) attributes and values or ii) feature type.
Parameters: - key – The key/attribute name to use for selection.
- value – Value for that key/attribute.
- invert_match – Boolean. Select lines that do not match that key and value.
- file_with_values – A file containing values (one column).
- col – The column number (one-based) that contains the values in the file.
- no_na – If invert_match is True, return only the records for which the value is defined.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> b_gtf = a_gtf.select_by_key("gene_id", "G0004") >>> assert len(b_gtf) == 13 >>> b_gtf = a_gtf.select_by_key("transcript_id", "G0004T002") >>> assert len(b_gtf) == 7 >>> b_gtf = a_gtf.select_by_key("feature", "transcript") >>> assert len(b_gtf) == 15 >>> b_gtf = a_gtf.select_by_key("feature", "gene") >>> assert len(b_gtf) == 10 >>> assert len(a_gtf.select_by_key("feature", "exon")) == 25
-
select_by_loc
(chr_str, start_str, end_str)¶ Select lines from a GTF based on genomic location.
Parameters: - chr_str – A comma separated list of chromosomes.
- start_str – A comma separated list of starts.
- end_str – A comma separated list of ends.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> from pygtftk.utils import TAB >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> b = a_gtf.select_by_loc("chr1", "125", "138") >>> assert len(b) == 7
-
select_by_max_exon_nb
()¶ For each gene, select the transcript with the highest number of exons. In case of a tie, the first transcript encountered is selected.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file("simple_04")[0] >>> a_gtf = GTF(a_file) >>> b = a_gtf.select_by_max_exon_nb() >>> l = b.extract_data("transcript_id", as_list=True, nr=True, no_na=True, hide_undef=True) >>> assert "G0005T001" not in l >>> assert "G0004T002" not in l >>> assert "G0006T002" not in l >>> assert len(l) == 10
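The tie-breaking rule can be illustrated in plain Python. This is a sketch under stated assumptions, not the library's code: the function `pick_max_exon_transcripts` and the (gene_id, transcript_id, exon_count) tuples are hypothetical.

```python
def pick_max_exon_transcripts(records):
    """Return {gene_id: transcript_id} keeping, per gene, the transcript
    with the most exons. `records` is an iterable of
    (gene_id, transcript_id, exon_count) tuples in file order."""
    best = {}
    for gene_id, tx_id, nb_exons in records:
        # Strict '>' keeps the first transcript encountered when counts tie.
        if gene_id not in best or nb_exons > best[gene_id][1]:
            best[gene_id] = (tx_id, nb_exons)
    return {gene: tx for gene, (tx, _) in best.items()}


toy = [
    ("g1", "g1t1", 2),
    ("g1", "g1t2", 3),
    ("g2", "g2t1", 4),
    ("g2", "g2t2", 4),  # tie with g2t1: the earlier one is kept
]
selected = pick_max_exon_transcripts(toy)
```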
-
select_by_number_of_exons
(min=0, max=None)¶ Select transcripts from a GTF based on the number of exons.
Parameters: - min – Minimum number of exons.
- max – Maximum number of exons.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> b = a_gtf.select_by_number_of_exons(3,3).select_by_key("feature","transcript") >>> assert len(b) == 3
-
select_by_numeric_value
(bool_exp=None, na_omit=['.', '?'])¶ Test a numeric value. Select lines using a boolean operation on attributes. The boolean expression must contain attributes enclosed with braces. Example: “cDNA_length > 200 and tx_genomic_length > 2000”. Note that one can refer to the names of the first columns of the GTF, e.g. start, end, score or frame.
Parameters: - bool_exp – A simple boolean expression. Tests may be combined with and/or, as in the examples below.
- na_omit – Any line for which one of the tested values is in this list won’t be evaluated (e.g. [“.”, “?”]).
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> from pygtftk.utils import TAB >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> b_gtf = a_gtf.add_attr_from_list(feat="transcript", key="transcript_id", key_value=["G0001T001","G0002T001","G0003T001","G0004T001"], new_key="test", new_key_value=["10","11","20","40"]) >>> c_list = b_gtf.select_by_key("feature","transcript").select_by_numeric_value("test > 2").extract_data("test", as_list=True) >>> assert c_list == ['10', '11', '20', '40'] >>> a_file = get_example_file(datasetname="mini_real", ext="gtf.gz")[0] >>> a_gtf = GTF(a_file) >>> tx = a_gtf.select_by_key('feature','transcript') >>> assert len(tx.select_by_key("score", "0")) == len(tx.select_by_numeric_value("score == 0")) >>> assert len(tx.select_by_key("start", "1001145")) == len(tx.select_by_numeric_value("start == 1001145")) >>> assert len(tx.select_by_key("start", "1001145", invert_match=True)) == len(tx.select_by_numeric_value("start != 1001145")) >>> assert len(tx.select_by_key("end", "54801291")) == len(tx.select_by_numeric_value("end == 54801291")) >>> assert len(tx.select_by_numeric_value("end == 54801291 and start == 53802771")) == 3 >>> assert len(tx.select_by_numeric_value("end == 54801291 and (start == 53802771 or start == 53806156)")) == 4 >>> assert len(a_gtf.select_by_numeric_value("phase > 0 and phase < 2", na_omit=".").extract_data('phase', as_list=True, nr=True)) == 1
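The role of na_omit can be sketched in plain Python. This is an illustration only, not how pygtftk evaluates expressions; `select_numeric` and its `predicate` argument are hypothetical names.

```python
def select_numeric(values, predicate, na_omit=(".", "?")):
    """Keep the string values whose numeric form satisfies `predicate`.
    Values listed in `na_omit` are skipped without being evaluated."""
    kept = []
    for value in values:
        if value in na_omit:
            continue  # undefined value: never converted nor tested
        if predicate(float(value)):
            kept.append(value)
    return kept


kept = select_numeric(["10", ".", "1", "?", "40"], lambda x: x > 2)
```

As in the doctest above, the selected values stay strings; only the test itself is numeric.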
-
select_by_positions
(pos=None)¶ Select a set of lines by position. Numbering is zero-based.
Parameters: pos – A list of zero-based line positions. Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> assert a_gtf.select_by_positions([0]).extract_data("feature", as_list=True) == ['gene'] >>> assert a_gtf.select_by_positions(list(range(3))).extract_data("feature", as_list=True) == ['gene', 'transcript', 'exon'] >>> assert a_gtf.select_by_positions([3,4,1]).select_by_positions([0,1]).extract_data("feature", as_list=True) == ['CDS', 'transcript']
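When calls are chained, positions refer to the intermediate result rather than to the original object. The following sketch mimics this with a plain list; `select_rows` and the toy `rows` are hypothetical and only stand in for a GTF object.

```python
def select_rows(rows, positions):
    """Return the rows found at the given zero-based positions, in that order."""
    return [rows[i] for i in positions]


rows = ["gene", "transcript", "exon", "CDS", "transcript"]
first = select_rows(rows, [3, 4, 1])   # rows 3, 4 and 1 of the original
second = select_rows(first, [0, 1])    # rows 0 and 1 of 'first'
```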
-
select_by_regexp
(key=None, regexp=None, invert_match=False)¶ Return lines for which the value of key matches a regular expression.
Parameters: - key – The key/attribute name to use for selection.
- regexp – the regular expression
- invert_match – Select the lines that do not match.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> b_list = a_gtf.select_by_regexp("transcript_id", "^G00[012]+T00[12]$").select_by_key("feature","transcript").extract_data("transcript_id", as_list=True) >>> assert b_list == ['G0001T002', 'G0001T001', 'G0002T001', 'G0010T001'] >>> assert len(a_gtf.select_by_regexp("transcript_id", "^G00\d+T00\d+$"))==60 >>> assert len(a_gtf.select_by_regexp("transcript_id", "^G00\d+T00.*2$").get_tx_ids(nr=True))==5
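The first pattern of the example can be tried on a handful of identifiers with Python's re module. This is only an illustration: the regular expression dialect used internally by pygtftk may differ from Python's, and the `tx_ids` list is a hand-picked subset.

```python
import re

tx_ids = ["G0001T002", "G0001T001", "G0002T001", "G0003T001", "G0010T001"]
pattern = re.compile(r"^G00[012]+T00[12]$")

# Equivalent of invert_match=False and invert_match=True on this toy list.
matching = [tx for tx in tx_ids if pattern.search(tx)]
non_matching = [tx for tx in tx_ids if not pattern.search(tx)]
```

G0003T001 is rejected because '3' is not in the character class [012].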
-
select_by_transcript_size
(min=0, max=1000000000)¶ Select ‘mature’/spliced transcripts by size.
Parameters: - min – Minimum size for the feature.
- max – Maximum size for the feature.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> tx_kept = a_gtf.select_by_transcript_size(min=5, max=10) >>> assert len(tx_kept) == 45
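The size of a mature transcript is commonly taken as the sum of its exon lengths. The sketch below assumes one-based inclusive coordinates and inclusive min/max bounds; both are assumptions, and `spliced_size` and `keep_by_size` are hypothetical helpers rather than pygtftk functions.

```python
def spliced_size(exons):
    """Sum of exon lengths for one-based, inclusive (start, end) pairs."""
    return sum(end - start + 1 for start, end in exons)


def keep_by_size(transcripts, min_size=0, max_size=1000000000):
    """Return the transcript ids whose spliced size lies in [min_size, max_size]."""
    return [
        tx_id
        for tx_id, exons in transcripts.items()
        if min_size <= spliced_size(exons) <= max_size
    ]


toy = {
    "t1": [(100, 104), (110, 112)],  # 5 + 3 = 8 nucleotides
    "t2": [(10, 30)],                # 21 nucleotides
}
kept = keep_by_size(toy, min_size=5, max_size=10)
```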
-
select_longuest_transcripts
()¶ Select the longest transcript of each gene. The “gene” rows are returned in the results. Note that the genomic boundaries for each gene could differ from those displayed by its associated transcripts. The gene features may be subsequently deleted using select_by_key.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file)
-
select_shortest_transcripts
()¶ Select the shortest transcript of each gene. Note that the genomic boundaries for each gene could differ from those displayed by its associated transcripts. The gene features may be subsequently deleted using select_by_key.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file)
-
tail
(nb=6)¶ Print the last nb lines of the GTF object.
Parameters: nb – The number of lines to display.
-
to_bed
(name=['gene_id'], sep='|', add_feature_type=False, more_name=[])¶ Returns a BedTool object (Bed6 format).
Parameters: - name – The keys that should be used to compute the ‘name’ column.
- more_name – Additional text to add to the name (a list).
- sep – The separator used for the name (e.g. ‘gene_id|transcript_id’).
- add_feature_type – Add the feature type to the name.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> a_bo = a_gtf.select_by_key("feature", "transcript").to_bed() >>> assert a_bo.field_count() == 6 >>> assert len(a_bo) == 15 >>> for i in a_bo: pass >>> a_bo = a_gtf.select_by_key("feature", "transcript").to_bed(name=['gene_id', 'transcript_id']) >>> for i in a_bo: pass >>> a_bo = a_gtf.select_by_key("feature", "transcript").to_bed(name=['gene_id', 'transcript_id'], sep="--") >>> for i in a_bo: pass
-
write
(output, add_chr=0, gc_off=False)¶ Write the GTF to a file.
Parameters: - output – A file object where the GTF has to be written.
- add_chr – whether to add ‘chr’ prefix to seqid before printing.
- gc_off – If True, shut down the garbage collector before printing to avoid passing through free_gtf_data, which can be time-consuming.
Example: >>> from pygtftk.utils import simple_line_count >>> from pygtftk.utils import get_example_file >>> from pygtftk.utils import make_tmp_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file) >>> b_file = make_tmp_file() >>> a_gtf.write(b_file) >>> assert simple_line_count(b_file) == 70
-
The FASTA class¶
Class declaration of the FASTA object (may be returned by GTF object methods).
-
class
pygtftk.fasta_interface.
FASTA
(fn, ptr, intron, rev_comp)¶ A class representation of a fasta file connection. An end product from a GTF object.
-
as_dict
(feat='exon')¶ Iterate feature-wise.
Parameters: feat – The feature type (exon, CDS). Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file("simple", "gtf")[0] >>> a_gtf = GTF(a_file) >>> genome_fa = get_example_file("simple", "fa")[0] >>> a_fa = a_gtf.get_sequences(genome_fa, intron=True) >>> a_dict = a_fa.as_dict(feat="exon") >>> assert a_dict[('G0001', 'G0001T002', 'chr1', 125, 138, '+', 'exon')] == 'cccccgttacgtag' >>> assert a_dict[('G0003', 'G0003T001', 'chr1', 50, 54, '-', 'exon')] == 'gcttg' >>> assert a_dict[('G0003', 'G0003T001', 'chr1', 57, 61, '-', 'exon')] == 'aatta' >>> assert a_dict[('G0004', 'G0004T001', 'chr1', 71, 71, '+', 'exon')] == 'g' >>> assert a_dict[('G0004', 'G0004T001', 'chr1', 65, 68, '+', 'exon')] == 'atct' >>> a_dict = a_fa.as_dict(feat="CDS") >>> assert a_dict[('G0001', 'G0001T002', 'chr1', 125, 130, '+', 'CDS')] == 'cccccg' >>> assert a_dict[('G0003', 'G0003T001', 'chr1', 50, 52, '-', 'CDS')] == 'ttg' >>> assert a_dict[('G0004', 'G0004T001', 'chr1', 65, 67, '+', 'CDS')] == 'atc' >>> assert a_dict[('G0004', 'G0004T002', 'chr1', 71, 71, '+', 'CDS')] == 'g' >>> assert a_dict[('G0004', 'G0004T002', 'chr1', 74, 75, '+', 'CDS')] == 'gc' >>> assert a_dict[('G0006', 'G0006T001', 'chr1', 22, 25, '-', 'CDS')] == 'acat' >>> assert a_dict[('G0006', 'G0006T001', 'chr1', 28, 30, '-', 'CDS')] == 'att'
-
enumerate
(type='any', add_feature=False)¶ Iterate transcript-wise and return tx_id and a list of exons.
-
iter_features
(feat='exon', no_feat_name=False)¶ Iterate over features.
Parameters: - feat – The feature type (exon, CDS).
- no_feat_name – Don’t write the feature type in the header.
Example: >>> # Load sequence >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file("simple", "gtf")[0] >>> a_gtf = GTF(a_file) >>> genome_fa = get_example_file("simple", "fa")[0] >>> a_fa = a_gtf.get_sequences(genome_fa, intron=True) >>> # Get exons >>> a_list = [] >>> for i in a_fa.iter_features("exon"): a_list += [i] >>> assert len(a_list) == 25 >>> assert [rec.sequence for rec in a_list if rec.transcript_id == 'G0008T001'] == ['cat', 'gcgct'] >>> assert [rec.sequence for rec in a_list if rec.transcript_id == 'G0004T001'] == ['atct', 'g', 'gcg'] >>> # Get exons not reverse-complemented >>> a_fa = a_gtf.get_sequences(genome_fa, intron=True, rev_comp=False) >>> a_list = [] >>> for i in a_fa.iter_features("exon"): a_list += [i] >>> assert len(a_list) == 25 >>> assert [rec.sequence for rec in a_list if rec.transcript_id == 'G0008T001'] == ['agcgc', 'atg'] >>> assert [rec.sequence for rec in a_list if rec.transcript_id == 'G0004T001'] == ['atct', 'g', 'gcg']
-
transcript_as_bioseq_records
()¶ Iterate over transcript sequences with or without intron depending on arguments provided to GTF.get_sequences(). Returns objects of class Bio.SeqRecord.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file("simple", "gtf")[0] >>> a_gtf = GTF(a_file) >>> genome_fa = get_example_file("simple", "fa")[0] >>> a_fa = a_gtf.get_sequences(genome_fa) >>> a_list = [] >>> for i in a_fa.transcript_as_bioseq_records(): a_list += [i] >>> assert len(a_list) == 15 >>> rec = a_list[0]
-
write
(outputfile)¶ Write the sequences to a fasta file.
Parameters: outputfile – Output file (file object). Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> from pygtftk.utils import make_tmp_file >>> from pygtftk.utils import simple_line_count >>> a_file = get_example_file("simple", "gtf")[0] >>> a_gtf = GTF(a_file) >>> genome_fa = get_example_file("simple", "fa")[0] >>> a_fa = a_gtf.get_sequences(genome_fa, intron=True) >>> tmp_file = make_tmp_file() >>> a_fa.write(tmp_file) >>> tmp_file.close() >>> assert simple_line_count(open(tmp_file.name)) == 30
-
The TAB class¶
Class declaration of the TAB object (may be returned by a GTF instance).
-
class
pygtftk.tab_interface.
TAB
(fn, ptr, dll=None)¶ A class representation of a tabulated matrix. An end product from a GTF object.
-
as_data_frame
()¶ Convert the TAB object into a dataframe.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file() >>> a_gtf = GTF(a_file) >>> a_tab = a_gtf.extract_data("gene_id,start") >>> assert a_tab.as_data_frame()["gene_id"].nunique() == 10
-
as_simple_list
(which_col=0)¶ Convert the selected column of a TAB object into a list.
Parameters: which_col – The column number. Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file)[("feature", "gene")] >>> a_tab = a_gtf.extract_data("seqid") >>> assert list(set(a_tab.as_simple_list(0))) == ['chr1']
-
iter_as_list
()¶ Iterate over the TAB object and return a list of self.nb_columns elements.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file() >>> a_gtf = GTF(a_file) >>> a_tab = a_gtf.extract_data("gene_id,start") >>> assert next(a_tab.iter_as_list()) == list(a_tab[0])
-
iterate_with_header
()¶ Iterate over FieldSets. The first line contains the field names.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> a_file = get_example_file()[0] >>> a_gtf = GTF(a_file).extract_data("all") >>> assert next(a_gtf.iterate_with_header())[0] == 'seqid' >>> for i in a_gtf.iterate_with_header(): pass >>> assert 'CDS_G0010T001' in list(i)
-
write
(outfile=None, sep='\t')¶ Write a tab object to a file.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.gtf_interface import GTF >>> from pygtftk.utils import make_tmp_file >>> from pygtftk.utils import simple_line_count >>> from pygtftk.utils import simple_nb_column >>> a_file = get_example_file()[0] >>> a_tab = GTF(a_file).extract_data("transcript_id,gene_id") >>> out_file = make_tmp_file() >>> a_tab.write(out_file) >>> out_file.close() >>> assert simple_line_count(out_file) == 70 >>> assert simple_nb_column(out_file) == 2
-
The Line module¶
This section contains representations of GTF, TAB and FASTA records (i.e., ‘lines’). When iterating over GTF, TAB or FASTA instances, the following objects will be returned respectively:
- Feature
- FieldSet
- FastaSequence
-
class
pygtftk.Line.
FastaSequence
(ptr=None, alist=None, feat='transcript', rev_comp=False)¶ A class representing a fasta file line.
-
format
()¶ Return a formatted fasta sequence.
Example: >>> from pygtftk.Line import FastaSequence >>> a = FastaSequence(alist=['>bla', 'AATACAGAGAT','chr21','+', 'BLA', 'NM123', 123, 456, 'transcript']) >>> assert a.sequence == 'AATACAGAGAT' >>> assert a.header == '>bla'
-
write
(file_out=None)¶ Write FastaSequence to a file or stdout (if file is None).
Example: >>> from pygtftk.Line import FastaSequence >>> from pygtftk.utils import make_tmp_file >>> a = FastaSequence(alist=['>bla', 'AATACAGAGAT','chr21','+', 'BLA', 'NM123', 123, 456, 'transcript']) >>> f = make_tmp_file() >>> a.write(f) >>> f.close() >>> from pygtftk.utils import simple_line_count >>> assert simple_line_count(f) == 2
-
-
class
pygtftk.Line.
Feature
(ptr=None, alist=None)¶ Class representation of a genomic feature. Corresponds to a GTF line.
-
add_attr
(key, val)¶ Add an attribute to the Feature instance.
Example: >>> from pygtftk.utils import get_example_feature >>> feat = get_example_feature() >>> feat.add_attr("foo", "bar") >>> assert feat.get_attr_value('foo') == ['bar']
-
add_attr_and_write
(key, val, outputfile)¶ Add a key/attribute record and write to a file.
Example: >>> from pygtftk.utils import get_example_feature >>> from pygtftk.utils import make_tmp_file >>> feat = get_example_feature() >>> tmp_file = make_tmp_file() >>> feat.add_attr_and_write("foo", "bar", tmp_file) >>> tmp_file.close() >>> from pygtftk.gtf_interface import GTF >>> gtf = GTF(tmp_file.name, check_ensembl_format=False) >>> assert gtf.extract_data("foo", as_list=True) == ['bar']
-
format
()¶ Returns Feature as a string in gtf file format. Typically to write to a gtf file.
Example: >>> from pygtftk.utils import get_example_feature >>> from pygtftk.utils import TAB >>> feat = get_example_feature() >>> assert len(feat.format().split(TAB)) == 9
-
format_tab
(attr_list, sep='\t')¶ Returns Feature as a string in tabulated format.
Parameters: - attr_list – the attributes names.
- sep – separator.
Example: >>> from pygtftk.utils import get_example_feature >>> from pygtftk.utils import TAB >>> feat = get_example_feature() >>> a = TAB.join(['chr1', 'Unknown', 'transcript', '100', '200', '.', '+', '.', 'g1t1']) >>> assert feat.format_tab('transcript_id') == a
-
classmethod
from_list
(a_list)¶ Create a Feature object from a list.
Parameters: a_list – a list. Example: >>> from pygtftk.Line import Feature >>> alist = ['chr1','Unknown','transcript', 100, 200] >>> from collections import OrderedDict >>> d = OrderedDict() >>> d['transcript_id'] = 'g1t1' >>> d['gene_id'] = 'g1' >>> alist +=['.','+','.',d] >>> a = Feature.from_list(alist) >>> assert a.attr['transcript_id'] == 'g1t1' >>> assert a.attr['gene_id'] == 'g1'
-
get_3p_end
()¶ Get the 3’ end of the feature. Returns ‘end’ if on the ‘+’ strand, ‘start’ otherwise (one-based).
Example: >>> from pygtftk.utils import get_example_feature >>> feat = get_example_feature() >>> assert feat.get_3p_end() == 200
-
get_5p_end
()¶ Get the 5’ end of the feature. Returns ‘start’ if on the ‘+’ strand, ‘end’ otherwise (one-based).
Example: >>> from pygtftk.utils import get_example_feature >>> feat = get_example_feature() >>> assert feat.get_5p_end() == 100 >>> from pygtftk.Line import Feature >>> from collections import OrderedDict >>> a_feat = Feature.from_list(['1','pygtftk', 'exon', '100', '200', '0', '-', '1', OrderedDict()]) >>> assert a_feat.get_5p_end() == 200 >>> a_feat = Feature.from_list(['1','pygtftk', 'exon', '100', '200', '0', '+', '1', OrderedDict()]) >>> assert a_feat.get_5p_end() == 100
-
get_attr_names
()¶ Returns the attribute names from the Feature.
Example: >>> from pygtftk.utils import get_example_feature >>> feat = get_example_feature() >>> assert feat.get_attr_names() == ['transcript_id', 'gene_id']
-
get_attr_value
(attr_name, upon_none='continue')¶ Get the value of a basic or extended attribute.
Parameters: - attr_name – Name of the attribute/key.
- upon_none – Whether we should ‘continue’, ‘raise’ an error or ‘set_na’.
Example: >>> from pygtftk.utils import get_example_feature >>> feat = get_example_feature() >>> assert feat.get_attr_value('transcript_id') == ['g1t1'] >>> assert feat.get_attr_value('chrom') == ['chr1'] >>> assert feat.get_attr_value('end') == [200] >>> assert feat.get_attr_value('bla', upon_none='continue') == [None] >>> assert feat.get_attr_value('bla', upon_none='set_na') == ['.']
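The three upon_none modes can be summarised with a small stand-alone function. This is a sketch inferred from the doctest above, not the Feature implementation; `lookup` is a hypothetical name, and raising KeyError in 'raise' mode is an assumption about the error type.

```python
def lookup(attrs, name, upon_none="continue"):
    """Return the value of `name` wrapped in a list, as in the doctest above."""
    if name in attrs:
        return [attrs[name]]
    if upon_none == "continue":
        return [None]   # silently report a missing value
    if upon_none == "set_na":
        return ["."]    # replace the missing value by the NA symbol
    if upon_none == "raise":
        raise KeyError(name)  # assumption: the real error type may differ
    raise ValueError("upon_none must be 'continue', 'raise' or 'set_na'")


attrs = {"transcript_id": "g1t1", "gene_id": "g1"}
found = lookup(attrs, "transcript_id")
missing = lookup(attrs, "bla")
missing_na = lookup(attrs, "bla", upon_none="set_na")
```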
-
get_gn_id
()¶ Get value for ‘gene_id’ attribute.
Example: >>> from pygtftk.utils import get_example_feature >>> feat = get_example_feature() >>> assert feat.get_gn_id() == 'g1'
-
get_tx_id
()¶ Get value for ‘transcript_id’ attribute.
Example: >>> from pygtftk.utils import get_example_feature >>> feat = get_example_feature() >>> assert feat.get_tx_id() == 'g1t1'
-
set_attr
(key, val, upon_none='continue')¶ Set an attribute value of the Feature instance.
Parameters: - key – The key.
- val – The value.
- upon_none – Whether we should ‘continue’, ‘raise’ an error or ‘set_na’.
Example: >>> from pygtftk.utils import get_example_feature >>> feat = get_example_feature() >>> feat.set_attr("gene_id", "bla") >>> assert feat.get_gn_id() == 'bla'
-
write
(file_out='-')¶ Write Feature to a file or stdout (if file is ‘-‘).
Example: >>> from pygtftk.utils import get_example_feature >>> from pygtftk.utils import make_tmp_file >>> feat = get_example_feature() >>> tmp_file = make_tmp_file() >>> feat.write(tmp_file) >>> tmp_file.close() >>> from pygtftk.utils import simple_line_count >>> assert simple_line_count(tmp_file) == 1
-
write_bed
(name=None, format='bed6', outputfile=None)¶ Write the Feature instance in bed format (Zero-based).
Parameters: - name – A string to use as name (Column 4).
- format – Format should be one of ‘bed/bed6’ or ‘bed3’. Default to bed6.
- outputfile – the file object where data should be printed.
Example: >>> from pygtftk.utils import get_example_feature >>> from pygtftk.utils import make_tmp_file >>> from pygtftk.utils import TAB >>> from pygtftk.utils import simple_line_count >>> tmp_file = make_tmp_file() >>> feat = get_example_feature() >>> feat.write_bed(name="foo", outputfile=tmp_file) >>> tmp_file.close() >>> for line in open(tmp_file.name): line= line.split(TAB) >>> assert line[3] == 'foo' >>> assert simple_line_count(tmp_file) == 1
-
write_bed_3p_end
(name=None, format='bed6', outputfile=None)¶ Write the 3p end coordinates of feature in bed format (Zero-based).
Parameters: - outputfile – The output file name.
- name – The string to be used as fourth column.
- format – Format should be one of ‘bed/bed6’ or ‘bed3’. Default to bed6.
Example: >>> from pygtftk.utils import get_example_feature >>> from pygtftk.utils import make_tmp_file >>> from pygtftk.utils import TAB >>> from pygtftk.utils import simple_line_count >>> tmp_file = make_tmp_file() >>> feat = get_example_feature() >>> feat.write_bed_3p_end(name='foo', outputfile=tmp_file) >>> tmp_file.close() >>> for line in open(tmp_file.name): line= line.split(TAB) >>> assert line[1] == '199' >>> assert line[2] == '200' >>> assert simple_line_count(tmp_file) == 1
-
write_bed_5p_end
(name=None, format='bed6', outputfile=None)¶ Write the 5p end coordinates of feature in bed format (Zero-based).
Parameters: - outputfile – the output file.
- name – The keys that should be used to compute the ‘name’ column.
- format – Format should be one of ‘bed/bed6’ or ‘bed3’. Default to bed6.
Example: >>> from pygtftk.utils import get_example_feature >>> from pygtftk.utils import make_tmp_file >>> from pygtftk.utils import TAB >>> from pygtftk.utils import simple_line_count >>> tmp_file = make_tmp_file() >>> feat = get_example_feature() >>> feat.write_bed_5p_end(name='foo', outputfile=tmp_file) >>> tmp_file.close() >>> for line in open(tmp_file.name): line= line.split(TAB) >>> assert line[1] == '99' >>> assert line[2] == '100' >>> assert simple_line_count(tmp_file) == 1
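The coordinates checked in the write_bed_5p_end and write_bed_3p_end doctests follow from two rules: the end is chosen according to the strand, and a one-based position p becomes the zero-based, half-open BED interval (p - 1, p). A minimal sketch, where `end_as_bed` is a hypothetical helper and not part of pygtftk:

```python
def end_as_bed(start, end, strand, which="5p"):
    """Return the BED interval (zero-based, half-open) of one end of a feature.
    `start` and `end` are one-based, inclusive GTF coordinates."""
    if which not in ("5p", "3p"):
        raise ValueError("which must be '5p' or '3p'")
    # 5' end: 'start' on '+', 'end' otherwise. The 3' end is the opposite.
    if (which == "5p") == (strand == "+"):
        pos = start
    else:
        pos = end
    return pos - 1, pos


five_plus = end_as_bed(100, 200, "+", "5p")
three_plus = end_as_bed(100, 200, "+", "3p")
five_minus = end_as_bed(100, 200, "-", "5p")
```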
-
write_gtf_to_bed6
(name=['transcript_id', 'gene_id'], sep='|', add_feature_type=False, outputfile=None)¶ Write the Feature instance in bed format (Zero-based). Select the key/values to add as a name.
Parameters: - name – The keys that should be used to compute the ‘name’ column.
- sep – The separator used for the name (e.g. ‘gene_id|transcript_id’).
- add_feature_type – Add the feature type to the name.
- outputfile – the file object where data should be printed.
Example: >>> from pygtftk.utils import get_example_feature >>> from pygtftk.utils import make_tmp_file >>> from pygtftk.utils import TAB >>> from pygtftk.utils import simple_line_count >>> tmp_file = make_tmp_file() >>> feat = get_example_feature() >>> feat.write_gtf_to_bed6(outputfile=tmp_file) >>> tmp_file.close() >>> for line in open(tmp_file.name): line= line.split(TAB) >>> assert len(line) == 6 >>> assert line[3] == 'g1t1|g1' >>> assert simple_line_count(tmp_file) == 1
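How the ‘name’ column is assembled can be sketched as follows. This is an illustration under assumptions, not the method's code: `feature_to_bed6` is hypothetical, and the score is simply copied from the GTF line.

```python
def feature_to_bed6(chrom, start, end, score, strand, attrs,
                    name=("transcript_id", "gene_id"), sep="|"):
    """Build a BED6 line from one-based, inclusive GTF coordinates."""
    # The name column joins the requested attribute values with 'sep'.
    label = sep.join(attrs[key] for key in name)
    # BED starts are zero-based; BED ends are exclusive, so 'end' is unchanged.
    fields = [chrom, str(start - 1), str(end), label, score, strand]
    return "\t".join(fields)


line = feature_to_bed6(
    "chr1", 100, 200, ".", "+",
    {"transcript_id": "g1t1", "gene_id": "g1"},
)
```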
-
-
class
pygtftk.Line.
FieldSet
(ptr=None, size=None, alist=None, ft_type=None)¶ A class representing a set of fields obtained from a split line.
-
format
(separator='\t')¶ Return a tabulated string.
Parameters: separator – output field separator. Example: >>> from pygtftk.Line import FieldSet >>> from pygtftk.utils import TAB >>> a = FieldSet(alist=['chr1', '123', '456']) >>> assert len(a.format().split(TAB)) == 3
-
write
(file_out=None, separator='\t')¶ Write FieldSet to a file or stdout (if file is None).
Parameters: - separator – output field separator.
- file_out – output file object.
Example: >>> from pygtftk.Line import FieldSet >>> from pygtftk.utils import make_tmp_file >>> from pygtftk.utils import simple_line_count >>> from pygtftk.utils import simple_nb_column >>> a = FieldSet(alist=['chr1', '123', '456']) >>> f = make_tmp_file() >>> a.write(f) >>> f.close() >>> assert simple_line_count(f) == 1 >>> assert simple_nb_column(f) == 3
-
The gtftk.utils module¶
A set of useful functions.
-
exception
pygtftk.utils.
GTFtkError
(value)¶
-
exception
pygtftk.utils.
GTFtkInteractiveError
¶
-
pygtftk.utils.
add_prefix_to_file
(infile, prefix=None)¶ Returns a file object or path (as string) in which the base name is prefixed with ‘prefix’. Returns a file object (write mode) or path depending on input type.
Parameters: - infile – A file object or path.
- prefix – the prefix to add.
Example: >>> from pygtftk.utils import add_prefix_to_file >>> result = 'data/simple/bla_simple.gtf' >>> assert add_prefix_to_file('data/simple/simple.gtf', "bla_") == result
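For the path (string) case, prefixing the base name amounts to splitting the path and rejoining it. The sketch below uses posixpath so that the result does not depend on the platform; `prefix_basename` is a hypothetical stand-in, and the file object case is not covered.

```python
import posixpath


def prefix_basename(path, prefix):
    """Return `path` with `prefix` prepended to its base name only."""
    directory, base = posixpath.split(path)
    return posixpath.join(directory, prefix + base)


result = prefix_basename("data/simple/simple.gtf", "bla_")
```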
-
pygtftk.utils.
add_r_lib
(cmd=None, libs=None)¶ Declare a new set of required R libraries.
Parameters: - libs – Comma separated list of R libraries.
- cmd – The target command.
-
pygtftk.utils.
check_file_or_dir_exists
(file_or_dir=None)¶ Check if a file/directory or a list of files/directories exist. Raise error if a file is not found.
Parameters: - file_or_dir – file object or a list of file object.
- is_list – Is file_or_dir a list ?
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.utils import check_file_or_dir_exists >>> assert check_file_or_dir_exists(get_example_file()[0]) >>> assert check_file_or_dir_exists(get_example_file())
-
pygtftk.utils.
check_r_installed
()¶ Check R is installed.
Example: >>> from pygtftk.utils import check_r_installed >>> # check_r_installed()
-
pygtftk.utils.
check_r_packages
(r_pkg_list=None, no_error=True)¶ Return True if R packages are installed. Return False otherwise.
Parameters: - no_error – don’t raise an error if some packages are not found.
- r_pkg_list – the list of R packages.
Example: >>> from pygtftk.utils import check_r_packages
-
pygtftk.utils.
chomp
(string)¶ Delete carriage return and line feed from end of string.
Example: >>> from pygtftk.utils import chomp >>> from pygtftk.utils import NEWLINE >>> assert NEWLINE not in chomp("blabla\r\n")
-
pygtftk.utils.
chrom_info_as_dict
(chrom_info_file)¶ Check the format of the chromosome info file and return a dict.
Parameters: chrom_info_file – A file object. Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.utils import chrom_info_as_dict >>> a = get_example_file(ext='chromInfo')[0] >>> b = chrom_info_as_dict(open(a, "r")) >>> assert b['chr1'] == 300 >>> assert b['all_chrom'] == 900
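The returned dict can be pictured with the sketch below. It assumes a tab-delimited file with the chromosome name and its size in the first two columns; `parse_chrom_info` is hypothetical and does not reproduce the format checks of the real function.

```python
import io


def parse_chrom_info(handle):
    """Read 'name<TAB>size' lines and return {name: size}, plus an
    'all_chrom' entry holding the total size."""
    sizes = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected at least two tab-separated columns")
        sizes[fields[0]] = int(fields[1])
    sizes["all_chrom"] = sum(sizes.values())
    return sizes


chrom_sizes = parse_chrom_info(io.StringIO("chr1\t300\nchr2\t600\n"))
```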
-
pygtftk.utils.
chrom_info_to_bed_file
(chrom_file, chr_list=None)¶ Return a bed file (file object) from a chrom info file.
Parameters: - chrom_file – A file object.
- chr_list – A list of chromosomes to be printed to the bed file.
>>> from pygtftk.utils import chrom_info_to_bed_file >>> from pygtftk.utils import get_example_file >>> from pygtftk.utils import PY3 >>> from pygtftk.utils import PY2 >>> from pybedtools import BedTool >>> a = get_example_file(ext='chromInfo') >>> b = chrom_info_to_bed_file(open(a[0], 'r')) >>> c = BedTool(b.name) >>> d = c.__iter__() >>> i = next(d) >>> if PY2: assert str(i.chrom) == 'chr2' >>> if PY2: assert i.start == 0 >>> if PY2: assert i.end == 600 >>> if PY3: assert str(i.chrom) == 'chr1' >>> if PY3: assert i.start == 0 >>> if PY3: assert i.end == 300 >>> i = next(d) >>> if PY3: assert str(i.chrom) == 'chr2' >>> if PY3: assert i.start == 0 >>> if PY3: assert i.end == 600 >>> if PY2: assert str(i.chrom) == 'chr1' >>> if PY2: assert i.start == 0 >>> if PY2: assert i.end == 300
-
pygtftk.utils.
close_properly
(*args)¶ Close a set of files if they are not None.
Example: >>> from pygtftk.utils import close_properly
-
pygtftk.utils.
flatten_list
(x, outlist=[])¶ Flatten a list of lists.
Parameters: x – a list or list of list. Example: >>> from pygtftk.utils import flatten_list >>> b = ["A", "B", "C"] >>> a = ["a", "b", "c"] >>> c = ['a', 'b', 'c', 'A', 'B', 'C', 'string'] >>> # Call it with an empty list (outlist) >>> # otherwise, element will be added to the existing >>> # outlist variable >>> assert flatten_list([a,b, "string"], outlist=[]) == c >>> assert flatten_list("string", outlist=[]) == ['string']
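The comment in the example about passing an empty outlist reflects a general Python behaviour: a mutable default argument is created once and shared between calls. The sketch below reproduces the effect with a hypothetical `flatten` function that has the same kind of signature; it is not the pygtftk source.

```python
def flatten(x, outlist=[]):  # the default list is shared between calls
    """Recursively append the leaves of `x` to `outlist`."""
    if isinstance(x, (list, tuple)):
        for item in x:
            flatten(item, outlist)
    else:
        outlist.append(x)
    return outlist


first = list(flatten(["a", ["b"]]))   # copy of the shared default list
second = list(flatten(["c"]))         # the shared list still holds 'a' and 'b'
safe = flatten(["c"], outlist=[])     # a fresh list avoids the carry-over
```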
-
pygtftk.utils.
flatten_list_recur
(a_list, sep=' ')¶ Join a list of lists.
Parameters: - a_list – a list.
- sep – the separator.
-
pygtftk.utils.
get_example_feature
()¶ Returns an example Feature (i.e GTF line).
Example: >>> from pygtftk.utils import get_example_feature >>> a = get_example_feature()
-
pygtftk.utils.
get_example_file
(datasetname='simple', ext='gtf')¶ Get the path to a gtf file example located in pygtftk library path.
Parameters: - datasetname – The name of the dataset. One of ‘simple’, ‘mini_real’, ‘simple_02’, ‘simple_03’.
- ext – Extension. For ‘simple’ dataset, can be one of ‘bw’, ‘fa’, ‘fa.fai’, ‘chromInfo’, ‘bt*’, ‘fq’, ‘gtf’ or ‘.*’.
Example: >>> from pygtftk.utils import get_example_file >>> a= get_example_file() >>> assert a[0].endswith('gtf') >>> a= get_example_file(ext="bam") >>> assert a[0].endswith('bam') >>> a= get_example_file(ext="bw") >>> assert a[0].endswith('bw')
-
pygtftk.utils.
head_file
(afile=None, nb_line=6)¶ Display the first lines of a file (debug).
Parameters: - afile – an input file.
- nb_line – the number of lines to print.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.utils import head_file >>> my_file = open(get_example_file()[0], "r") >>> #head_file(my_file)
-
pygtftk.utils.
intervals
(l, n, silent=False)¶ Returns a list of tuples that correspond to n approximately equally spaced intervals between 0 and len(l). Note that n must be lower than len(l).
Parameters: - silent – use a silent mode.
- l – a range
- n – number of equally spaced intervals.
Example: >>> from pygtftk.utils import intervals >>> l = range(10) >>> a = intervals(l, 3) >>> assert a == [(0, 3), (3, 6), (6, 10)]
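One way to obtain the intervals of the example is to use a fixed step and let the last interval absorb the remainder. This is only a possible reading of the behaviour, not the pygtftk implementation; `split_intervals` is a hypothetical name.

```python
def split_intervals(l, n):
    """Split the index range of `l` into n half-open (start, end) intervals."""
    if not 0 < n < len(l):
        raise ValueError("n must be positive and lower than len(l)")
    step = len(l) // n
    bounds = [(i * step, (i + 1) * step) for i in range(n)]
    # The last interval is extended to cover the remaining positions.
    bounds[-1] = (bounds[-1][0], len(l))
    return bounds


result = split_intervals(range(10), 3)
```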
-
pygtftk.utils.
is_comment
(string)¶ Check whether a string is a comment (starts with #…).
Parameters: string – The string to be tested. Example: >>> from pygtftk.utils import is_comment >>> assert is_comment("#bla")
-
pygtftk.utils.
is_empty
(string)¶ Is the string empty.
Parameters: string – The string to be tested. Example: >>> from pygtftk.utils import is_empty >>> assert is_empty("")
-
pygtftk.utils.
is_exon
(string)¶ Does the string contain ‘exon’, preceded or not by spaces?
Parameters: string – The string to be tested. Example: >>> from pygtftk.utils import is_exon >>> assert is_exon("Exon") and is_exon("exon")
-
pygtftk.utils.
is_fasta_header
(string)¶ Check if the line is a fasta header.
Parameters: string – a character string. Example: >>> from pygtftk.utils import is_fasta_header >>> assert is_fasta_header(">DFTDFTD")
-
pygtftk.utils.
make_outdir_and_file
(out_dir=None, alist=None, force=False)¶ Create an output directory and a set of associated files (alist). Return a list of file handlers (write mode) for the requested files.
Parameters: - out_dir – the directory where the files will be located.
- alist – the list of requested file names.
- force – if true, don’t abort if directory already exists.
Example: >>> from pygtftk.utils import make_outdir_and_file
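The documented example only imports the function, so here is a minimal standalone sketch of the behaviour described above. It is an illustration, not the library code: make_outdir_and_files_sketch is a hypothetical name, and raising FileExistsError when the directory exists and force is false is an assumption about what "abort" means.

```python
import os
import tempfile


def make_outdir_and_files_sketch(out_dir, names, force=False):
    """Create out_dir and return one write-mode handle per requested name."""
    if os.path.exists(out_dir) and not force:
        # Assumption: "abort" is modelled here as an exception.
        raise FileExistsError(f"{out_dir} already exists (use force=True)")
    os.makedirs(out_dir, exist_ok=True)
    return [open(os.path.join(out_dir, name), "w") for name in names]


# Usage inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    handles = make_outdir_and_files_sketch(
        os.path.join(base, "results"), ["a.txt", "b.txt"]
    )
    for handle in handles:
        handle.write("ok\n")
        handle.close()
```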
-
pygtftk.utils.
make_tmp_file
(prefix='tmp', suffix='', store=True, dir=None)¶ This function should be called to create a temporary file, as all files declared in TMP_FILE_LIST will be removed upon command exit.
Parameters: - prefix – a prefix for the temporary file (all of them will contain ‘pygtftk’).
- suffix – a suffix (default to ‘’).
- store – declare the temporary file in utils.TMP_FILE_LIST. In pygtftk, the deletion of these files upon exit is controlled through -k.
- dir – a target directory.
Example: >>> import os >>> from pygtftk.utils import make_tmp_file >>> tmp_file = make_tmp_file() >>> assert os.path.exists(tmp_file.name) >>> tmp_file = make_tmp_file(prefix="pref") >>> assert os.path.exists(tmp_file.name) >>> tmp_file = make_tmp_file(suffix="suf") >>> assert os.path.exists(tmp_file.name)
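The register-then-clean-up pattern described above can be approximated with the standard library. The sketch below is an assumption-laden illustration rather than pygtftk's code: TMP_FILES stands in for utils.TMP_FILE_LIST, the '_pygtftk_' infix is a guess at how the name is built, and the -k switch is not modelled.

```python
import atexit
import os
import tempfile

TMP_FILES = []  # hypothetical registry, standing in for utils.TMP_FILE_LIST


def make_tmp_file_sketch(prefix="tmp", suffix="", store=True, dir=None):
    """Create a named temporary file that survives close() until cleanup."""
    tmp = tempfile.NamedTemporaryFile(
        mode="w",
        prefix=prefix + "_pygtftk_",  # assumed naming scheme
        suffix=suffix,
        dir=dir,
        delete=False,  # deletion is deferred to the exit hook below
    )
    if store:
        TMP_FILES.append(tmp.name)
    return tmp


@atexit.register
def _remove_registered_files():
    """Delete every registered temporary file when the interpreter exits."""
    for path in TMP_FILES:
        if os.path.exists(path):
            os.remove(path)
```

Keeping deletion in a single exit hook means a file created with store=False is simply left on disk, which mirrors the role of the store parameter.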
-
pygtftk.utils.
median_comp
(alist)¶ Compute the median from a list.
Parameters: alist – a list. Example: >>> from pygtftk.utils import median_comp >>> a = [10,20,30,40,50] >>> assert median_comp(a) == 30 >>> a = [10,20,40,50] >>> assert median_comp(a) == 30
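For reference, the median used in the example (middle value for an odd length, mean of the two middle values for an even length) can be written as below. The name median_sketch is hypothetical; the library's own implementation may differ in details such as the return type.

```python
def median_sketch(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]  # odd length: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even length: mean of the two


print(median_sketch([10, 20, 30, 40, 50]))  # 30
print(median_sketch([10, 20, 40, 50]))      # 30.0
```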
-
pygtftk.utils.
message
(msg, nl=True, type='INFO', force=False)¶ Send a formatted message to STDERR.
Parameters: - msg – the message to be sent.
- nl – if True, add a newline.
- type – Could be INFO, ERROR, WARNING.
- force – Force message, whatever the verbosity level.
Example: >>> from pygtftk.utils import message
-
pygtftk.utils.
mkdir_p
(path)¶ Create a directory, silently doing nothing if it already exists.
Example: >>> from pygtftk.utils import check_file_or_dir_exists >>> from pygtftk.utils import mkdir_p >>> mkdir_p('/tmp/test_gtftk_mkdir_p') >>> assert check_file_or_dir_exists('/tmp/test_gtftk_mkdir_p')
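In Python 3 the same "create unless present" behaviour is available from the standard library. The sketch below (hypothetical name mkdir_p_sketch) shows the idea and makes no claim about how pygtftk implements it.

```python
import os
import tempfile


def mkdir_p_sketch(path):
    """Create path, including missing parents; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)


# Calling it twice on the same path is harmless.
with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "a", "b")
    mkdir_p_sketch(target)
    mkdir_p_sketch(target)
    print(os.path.isdir(target))  # True
```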
-
pygtftk.utils.
random_string
(n)¶ Returns a random alphanumeric string of length n.
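No example is given for this function, so here is a minimal standalone sketch of what such a helper might look like. The name random_string_sketch and the alphabet (ASCII letters plus digits) are assumptions.

```python
import random
import string


def random_string_sketch(n):
    """Return n characters drawn uniformly from ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=n))


token = random_string_sketch(8)
print(len(token), token.isalnum())  # 8 True
```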
-
pygtftk.utils.
silentremove
(filename)¶ Remove silently (without error and warning) a file.
Parameters: filename – a file name (or file object). Example: >>> import os >>> from pygtftk.utils import silentremove >>> from pygtftk.utils import make_tmp_file >>> a = make_tmp_file() >>> assert os.path.exists(a.name) >>> silentremove(a.name) >>> assert not os.path.exists(a.name)
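A silent removal of this kind can be sketched with contextlib.suppress. The function below is illustrative only: silent_remove_sketch is a hypothetical name, and resolving a file object through its name attribute is an assumption about how the "file name (or file object)" parameter is handled.

```python
import contextlib
import os
import tempfile


def silent_remove_sketch(filename):
    """Delete a file, ignoring the case where it does not exist."""
    # Assumption: a file object is resolved through its .name attribute.
    path = getattr(filename, "name", filename)
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)


tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
silent_remove_sketch(tmp.name)
silent_remove_sketch(tmp.name)  # second call is a no-op, not an error
print(os.path.exists(tmp.name))  # False
```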
-
pygtftk.utils.
simple_line_count
(afile)¶ Count the number of lines in a file.
Parameters: afile – A file object. Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.utils import simple_line_count >>> my_file = get_example_file(datasetname="simple", ext="gtf") >>> my_file_h = open(my_file[0], "r") >>> assert simple_line_count(my_file_h) == 70
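Counting lines from an open handle needs no more than one pass over the file. The sketch below uses an in-memory io.StringIO in place of the example GTF; line_count_sketch is a hypothetical name, not the library function.

```python
import io


def line_count_sketch(handle):
    """Count lines by iterating the handle once, without loading it whole."""
    return sum(1 for _ in handle)


print(line_count_sketch(io.StringIO("chr1\nchr2\nchr3\n")))  # 3
```

Note that iterating consumes the handle, so it must be rewound with seek(0) before it is read again.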
-
pygtftk.utils.
simple_nb_column
(afile, separator='\t')¶ Count the maximum number of columns in a file.
Parameters: - separator – the separator (default ‘\t’).
- afile – file name.
Example: >>> from pygtftk.utils import get_example_file >>> from pygtftk.utils import simple_nb_column >>> my_file = get_example_file(datasetname="simple", ext="gtf") >>> my_file_h = open(my_file[0], "r") >>> assert simple_nb_column(my_file_h) == 9
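The "maximum number of columns" can be sketched as the largest field count over all lines. This is an illustration under assumptions: max_columns_sketch is a hypothetical name, and whether the library skips comment or empty lines is not stated here, so the sketch counts every line.

```python
import io


def max_columns_sketch(handle, separator="\t"):
    """Return the largest number of separator-delimited fields on any line."""
    return max(
        (len(line.rstrip("\n").split(separator)) for line in handle),
        default=0,  # an empty file has no columns
    )


data = io.StringIO("a\tb\nc\td\te\n")
print(max_columns_sketch(data))  # 3
```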
-
pygtftk.utils.
sort_2_lists
(list1, list2)¶ Sort list1 in increasing order and list2 accordingly.
Parameters: - list1 – the first list.
- list2 – the second list.
Example: >>> from pygtftk.utils import sort_2_lists >>> a = ["c", "a", "b"] >>> b = ["A", "B", "C"] >>> c = sort_2_lists(a, b) >>> assert c == [('a', 'b', 'c'), ('B', 'C', 'A')]
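The return shape in the example (a list of two tuples) is what the common zip/sorted/zip idiom produces. The sketch below reproduces the documented output; sort_2_lists_sketch is a hypothetical name and the library may be written differently.

```python
def sort_2_lists_sketch(list1, list2):
    """Sort list1 in increasing order and reorder list2 to match."""
    # Pair the items, sort the pairs (by the list1 item first), then unzip.
    return list(zip(*sorted(zip(list1, list2))))


print(sort_2_lists_sketch(["c", "a", "b"], ["A", "B", "C"]))
# [('a', 'b', 'c'), ('B', 'C', 'A')]
```

One caveat of this idiom: when list1 contains equal items, the tie is broken by comparing the paired list2 items.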
-
pygtftk.utils.
tab_line
(token=None, newline=False)¶ Returns a tab-separated string built from an input list, with an optional trailing newline.
Parameters: - token – an input list.
- newline – if True a newline character is added to the string.
Example: >>> from pygtftk.utils import tab_line >>> from pygtftk.utils import TAB >>> assert tab_line(['a','b', 'c']) == 'a' + TAB + 'b' + TAB + 'c'
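The example above does not exercise the newline flag. A minimal standalone sketch covering both cases follows; tab_line_sketch is a hypothetical name, and converting each item with str() is an assumption about how non-string items would be handled.

```python
def tab_line_sketch(token, newline=False):
    """Join the items of token with tabs, optionally appending a newline."""
    line = "\t".join(str(item) for item in token)  # assumption: items are str()-ed
    return line + "\n" if newline else line


print(repr(tab_line_sketch(["a", "b", "c"])))                # 'a\tb\tc'
print(repr(tab_line_sketch(["a", "b", "c"], newline=True)))  # 'a\tb\tc\n'
```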
-
pygtftk.utils.
to_alphanum
(string)¶ Returns a new string in which non-alphanumeric characters have been replaced by ‘_’. Non-alphanumeric characters found at the beginning or end of the string are deleted.
Parameters: - string – A character string in which non-alphanumeric characters are to be replaced.
- replacement_list –
Example: >>> from pygtftk.utils import to_alphanum >>> assert to_alphanum("%gtf.bla") == 'gtf_bla'
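The two-step rule (trim the ends, then substitute) can be sketched with two regular expressions. This is not the library's code: to_alphanum_sketch is a hypothetical name, "alphanumeric" is assumed to mean ASCII letters and digits, and each offending character is replaced individually rather than collapsing runs.

```python
import re


def to_alphanum_sketch(text):
    """Trim non-alphanumeric ends, then replace the remaining ones by '_'."""
    trimmed = re.sub(r"^[^0-9A-Za-z]+|[^0-9A-Za-z]+$", "", text)
    # Assumption: one '_' per replaced character (runs are not collapsed).
    return re.sub(r"[^0-9A-Za-z]", "_", trimmed)


print(to_alphanum_sketch("%gtf.bla"))  # gtf_bla
```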
-
pygtftk.utils.
write_properly
(a_string, afile)¶ Write a string to a file. If file is None, write string to stdout.
Parameters: - a_string – a character string.
- afile – a file object.
Example: >>> from pygtftk.utils import write_properly
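The fallback to stdout described above can be sketched as follows. The name write_sketch is hypothetical, and the text does not say whether a newline is appended, so this sketch writes the string unchanged.

```python
import io
import sys


def write_sketch(a_string, afile=None):
    """Write a_string to afile, or to standard output when afile is None."""
    target = sys.stdout if afile is None else afile
    target.write(a_string)  # assumption: no newline is added


buffer = io.StringIO()
write_sketch("chr1\t100\n", buffer)
print(repr(buffer.getvalue()))  # 'chr1\t100\n'
```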