Function Documentation¶
Physcraper module
The core blasting and new sequence integration module
-
class
physcraper.scrape.
PhyscraperScrape
(data_obj, ids_obj=None, search_taxon=None)[source]¶ This is the class that does the perpetual updating
To build the class the following is needed:
- data_obj: Object of class ATT (see above)
- ids_obj: Object of class IdDict (see above)
During the initializing process the following self.objects are generated:
- self.workdir: path to working directory retrieved from ATT object = data_obj.workdir
- self.logfile: path of logfile
- self.data: ATT object
- self.ids: IdDict object
- self.config: Config object
- self.new_seqs: dictionary that contains the newly found seq using blast:
- key: gi id
- value: corresponding seq
- self.new_seqs_otu_id: dictionary that contains
the new sequences that passed the remove_identical_seq() step:
- key: otu_id
- value: see otu_dict, is a subset of the otu_dict, all sequences that will be newly added to aln and tre
- self.mrca_ncbi: int ncbi identifier of mrca
- self.blast_subdir: path to folder that contains the files written during blast
- self.newseqs_file: filename of files that contains the sequences from self.new_seqs_otu_id
- self.date: Date of the run - may lag behind real date!
- self.repeat: either 1 or 0; used to determine whether we continue updating the tree (no new seqs found = 0)
- self.newseqs_acc: list of all gi_ids that were passed into remove_identical_seq(). Used to speed up the adding process
- self.blocklist: list of gi_ids of sequences that shall not be added or need to be removed. Supplied by the user.
- self.seq_filter: list of words that may occur in otu_dict.status and which shall not be used in building FilterBlast.sp_d (that's the main function); also used in assert statements to make sure unwanted seqs are not added.
- self.unpublished: True/False. If True, looks for local unpublished seqs that shall be added.
- self.path_to_local_seq: usually False; contains the path to unpublished sequences if that option is used.
Following functions are called during the init-process:
- self.reset_markers(): adds status flags to self; they make sure certain functions are re-run if the program crashed and a pickle file is read back in.
- self._blasted: 0/1, if run_blast_wrapper() was called, it is set to 1 for the round.
- self._blast_read: 0/1, if read_blast_wrapper() was called, it is set to 1 for the round.
- self._identical_removed: 0
- self._query_seqs_written: 0/1, if write_query_seqs() was called, it is set to 1 for the round.
- self._query_seqs_aligned: 0
- self._query_seqs_placed: 0/1, if place_query_seqs() was called, it is set to 1 for the round.
- self._reconciled: 0
- self._full_tree_est: 0/1, if est_full_tree() was called, it is set to 1 for the round.
-
calculate_bootstrap
(alignment='default', num_reps='100')[source]¶ Calculates bootstrap and consensus trees.
- -p: random seed
- -s: aln file
- -n: output fn
- -t: starting tree
- -b: bootstrap random seed
- -#: bootstrap stopping criteria
- -z: specifies file with multiple trees
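The flags above assemble into a RAxML command line roughly as in this sketch (the executable name, file names, and seed handling are illustrative assumptions, not taken from the source):

```python
import random

def build_bootstrap_cmd(aln_file, starting_tree, out_name, num_reps=100):
    """Assemble a RAxML bootstrap command line from the documented flags.

    The executable name and file names are illustrative assumptions.
    """
    seed = random.randint(1, 100000)
    return [
        "raxmlHPC",
        "-p", str(seed),        # random seed for parsimony starting trees
        "-s", aln_file,         # alignment file
        "-n", out_name,         # output file name suffix
        "-t", starting_tree,    # starting tree
        "-b", str(seed),        # bootstrap random seed
        "-#", str(num_reps),    # number of bootstrap replicates
    ]
```

The resulting list can be handed to subprocess; the consensus step would then pass the bootstrap trees back in via -z.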
-
calculate_final_tree
(boot_reps=100)[source]¶ Calculates the final tree using a trimmed alignment.
Returns: final PS data
-
check_complement
(match, seq, gb_id)[source]¶ Double check if blast match is to sequence, complement or reverse complement, and return correct seq
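The orientation check can be sketched as follows (a simplified stand-in working on plain strings; the real method also carries the gb_id bookkeeping):

```python
def orient_match(match, seq):
    """Return seq in whichever orientation contains the blast match.

    Simplified sketch: checks the sequence itself, its complement, and
    its reverse complement; returns None if the match is in none of them.
    """
    comp_table = str.maketrans("ACGT", "TGCA")
    complement = seq.translate(comp_table)
    rev_comp = complement[::-1]
    if match in seq:
        return seq
    if match in complement:
        return complement
    if match in rev_comp:
        return rev_comp
    return None
```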
-
est_full_tree
(alignment='default', startingtree=None)[source]¶ Full RAxML run using the placement tree as the starting tree. The PTHREADS version is faster; if it is not installed, the run falls back to the standard RAxML.
-
filter_seqs
(tmp_dict, selection='random', threshold=None)[source]¶ Subselect from sequences to a threshold of number of seqs per species
-
get_full_seq
(gb_id, blast_seq)[source]¶ Get the full sequence for gb_acc that was retrieved via blast.
Currently only used for local searches; GenBank database sequences are retrieved in batch mode, which is hopefully faster.
Parameters: - gb_acc – unique sequence identifier (often genbank accession number)
- blast_seq – sequence retrieved by blast
Returns: full sequence, the whole submitted sequence, not only the part that matched the blast query sequence
-
read_blast_wrapper
(blast_dir=None)[source]¶ Reads in and processes the blast XML files.
Parameters: blast_dir – path to directory which contains blast files Returns: fills different dictionaries with information from blast files
-
read_local_blast_query
(fn_path)[source]¶ Implementation to read in results of local blast searches.
Parameters: fn_path – path to file containing the local blast searches Returns: updated self.new_seqs and self.data.gb_dict dictionaries
-
read_webbased_blast_query
(fn_path)[source]¶ Implementation to read in results of web blast searches.
Parameters: fn_path – path to file containing the web blast searches Returns: updated self.new_seqs and self.data.gb_dict dictionaries
-
remove_blocklistitem
()[source]¶ This removes items from aln and tree if the corresponding GenBank identifiers were added to the blocklist.
Note that seqs which were not added earlier because they were similar to one being removed here are lost (that should not be a major issue though, as in a new blast run, new seqs from the taxon can be added.)
-
remove_identical_seqs
()[source]¶ Goes through the new seqs pulled down and removes the ones that are shorter than LENGTH_THRESH percent of the original seq lengths, chooses the longer of two that are otherwise identical, and puts them in a dict with the new name gi_ott_id.
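The length criterion can be sketched like this (the threshold value and data layout are simplified assumptions; the identical-sequence comparison itself is handled in seq_dict_build()):

```python
def filter_by_length(new_seqs, orig_seqlens, length_thresh=0.8):
    """Drop new sequences shorter than length_thresh of the average
    original sequence length. A simplified sketch of the length check;
    the 0.8 default is an assumption, not the shipped LENGTH_THRESH.
    """
    avg_len = sum(orig_seqlens) / len(orig_seqlens)
    min_len = length_thresh * avg_len
    return {gi: seq for gi, seq in new_seqs.items() if len(seq) >= min_len}
```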
-
replace_aln
(filename, schema='fasta')[source]¶ Replace the alignment in the data object with the new alignment
-
replace_tre
(filename, schema='newick')[source]¶ Replace the tree in the data object with the new tree
-
run_blast_wrapper
()[source]¶ Generates the blast queries and saves them, depending on the blasting method, to different file formats.
Blast is only run if a sequence has not been blasted within the user-defined threshold (delay) in the config file.
Returns: writes blast queries to file
-
run_local_blast_cmd
(query, taxon_label, fn_path)[source]¶ Contains the cmds used to run a local blast query, which is different from the web-queries.
Parameters: - query – query sequence
- taxon_label – corresponding taxon name for query sequence
- fn_path – path to output file for blast query result
Returns: runs local blast query and writes it to file
-
run_muscle
(input_aln_path=None, new_seqs_path=None, outname='all_align')[source]¶ Aligns the new sequences and profile-aligns them to the existing alignment.
-
run_web_blast_query
(query, equery, fn_path)[source]¶ Equivalent to run_local_blast_cmd() but for webqueries, that need to be implemented differently.
Parameters: - query – query sequence
- equery – method to limit blast query to mrca
- fn_path – path to output file for blast query result
Returns: runs web blast query and writes it to file
-
select_seq_at_random
(otu_list, count)[source]¶ Selects sequences at random if there are more than the threshold.
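A minimal sketch of the random subselection (the seed parameter is added here only to make the example reproducible, it is not part of the documented signature):

```python
import random

def select_at_random(otu_list, count, seed=None):
    """Keep at most `count` OTUs, chosen uniformly at random.
    If the list is already at or below the threshold, keep it all.
    """
    rng = random.Random(seed)
    if len(otu_list) <= count:
        return list(otu_list)
    return rng.sample(otu_list, count)
```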
-
seq_dict_build
(seq, new_otu_label, seq_dict)[source]¶ Takes a sequence, a label (the otu_id) and a dictionary, and adds the sequence to the dict only if it is not a subsequence of a sequence already in the dict. If the new sequence is a supersequence of one in the dict, that sequence is removed and replaced by the new one.
Parameters: - seq – sequence as string, which shall be compared to existing sequences
- new_otu_label – otu_label of corresponding seq
- seq_dict – the tmp_dict generated in add_otu()
Returns: updated seq_dict
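The subsequence/supersequence rule can be sketched as follows (a simplified stand-in that treats dict values as plain sequence strings, whereas the real seq_dict_build() stores otu_dict-style entries):

```python
def seq_dict_build_sketch(seq, new_label, seq_dict):
    """Add seq under new_label unless it is a subsequence of an existing
    entry; if it is a supersequence of an existing entry, replace that
    entry with the new, longer sequence.
    """
    for label, existing in list(seq_dict.items()):
        if seq in existing:          # new seq adds nothing, skip it
            return seq_dict
        if existing in seq:          # new seq supersedes an old entry
            del seq_dict[label]
            seq_dict[new_label] = seq
            return seq_dict
    seq_dict[new_label] = seq        # genuinely new sequence
    return seq_dict
```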
AlignTreeTax: The core data object for Physcraper. Holds and links name spaces for a tree, an alignment, the taxa and their metadata.
-
class
physcraper.aligntreetax.
AlignTreeTax
(tree, otu_dict, alignment, search_taxon, workdir, configfile=None, tree_schema='newick', aln_schema='fasta', tag=None)[source]¶ Wrap up the key parts together, requires OTT_id, and names must already match. Hypothetically, all the keys in the otu_dict should be clean.
To build the class the following is needed:
- newick: dendropy tre.as_string(schema=schema_trf) object
- otu_dict: json file including the otu_dict information generated earlier
- alignment: dendropy DnaCharacterMatrix object
- search_taxon: OToL identifier of the group of interest, either a subclade as defined by the user or of all tip labels in the phylogeny
- workdir: the path to the corresponding working directory
- config_obj: Config class
- schema: optional argument to define the tre file schema, if different from “newick”
- During the initializing process the following self objects are generated:
self.aln: contains the alignment and which will be updated during the run
self.tre: contains the phylogeny, which will be updated during the run
- self.otu_dict: dictionary with taxon information and physcraper relevant stuff
key: otu_id, a unique identifier
- value: dictionary with the following key:values:
- ‘^ncbi:gi’: GenBank identifier - deprecated by Genbank - only older sequences will have it
- ‘^ncbi:accession’: Genbanks accession number
- ‘^ncbi:title’: title of Genbank sequence submission
- ‘^ncbi:taxon’: ncbi taxon identifier
- ‘^ot:ottId’: OToL taxon identifier
- ‘^physcraper:status’: contains information whether it was ‘original’, ‘queried’, ‘removed’, or ‘added during filtering process’
- ‘^ot:ottTaxonName’: OToL taxon name
- ‘^physcraper:last_blasted’: contains the date when the sequence was blasted
- ‘^user:TaxonName’: optional, user given label from OtuJsonDict
- ‘^ot:originalLabel’: optional, user given tip label of phylogeny
self.ps_otu: iterator for new otu IDs, is used as key for self.otu_dict
self.workdir: contains the path to the working directory; if the folder does not exist it is generated.
self.mrca_ott: OToL taxon Id for the most recent common ancestor of the ingroup
self.orig_seqlen: list of the original sequence length of the input data
self.gi_dict: dictionary that has all information from sequences found during blasting.
- key: GenBank sequence identifier
- value: dictionary; content depends on blast option, differs between web queries and local blast queries
- keys - value pairs for local blast:
- ‘^ncbi:gi’: GenBank sequence identifier
- ‘accession’: GenBank accession number
- ‘staxids’: Taxon identifier
- ‘sscinames’: Taxon species name
- ‘pident’: Blast percentage of identical matches
- ‘evalue’: Blast e-value
- ‘bitscore’: Blast bitscore, used for FilterBlast
- ‘sseq’: corresponding sequence
- ‘title’: title of Genbank sequence submission
- key - values for web-query:
- ‘accession’: GenBank accession number
- ‘length’: length of sequence
- ‘title’: string combination of hit_id and hit_def
- ‘hit_id’: string combination of gi id and accession number
- ‘hsps’: Bio.Blast.Record.HSP object
- ‘hit_def’: title from GenBank sequence
- optional key - value pairs for unpublished option:
- ‘localID’: local sequence identifier
self._reconciled: True/False,
self.unpubl_otu_json: optional, will contain the OTU-dict for unpublished data, if that option is used
- Following functions are called during the init-process:
- self._reconcile(): removes taxa that are not found in both the phylogeny and the aln
- self._reconcile_names(): used for the own-file stuff; removes the character ‘n’ from tip names that start with a number
- The physcraper class is then updating:
- self.aln, self.tre and self.otu_dict, self.ps_otu, self.gi_dict
-
add_otu
(gb_id, ids_obj)[source]¶ Generates an otu_id for new sequences and adds them into self.otu_dict. Needs to be passed an IdDict to do the mapping.
Parameters: - gb_id – the Genbank identifier, or local unpublished identifier
- ids_obj – an IdDict object, needed to access the taxonomic information
Returns: the unique otu_id - the key from self.otu_dict of the corresponding sequence
-
check_tre_in_aln
()[source]¶ Makes sure that everything which is in tre is also found in aln.
Extracted method from trim. Not sure we actually need it there.
-
get_otu_for_acc
(gb_id)[source]¶ A reverse search to find the unique OTU id for a given accession number.
Parameters: gb_id – the Genbank identifier
-
prune_short
()[source]¶ Prunes sequences from the alignment if they are shorter than specified in the config file, or if the tip is only present in the tre.
Sometimes the de-concatenating of the original alignment generates taxa with no sequence, or certain sequences are simply very short. This removes those from both the tre and the alignment.
has test: test_prune_short.py
Returns: prunes aln and tre
-
read_in_tree
(tree, tree_schema=None)[source]¶ Imports a tree either from a file or a dendropy data object. Adds records in OTU dictionary if not already present.
-
remove_taxa_aln_tre
(taxon_label)[source]¶ Removes taxa from aln and tre and updates otu_dict, takes a single taxon_label as input.
note: has test, test_remove_taxa_aln_tre.py
Parameters: taxon_label – taxon_label from dendropy object - aln or phy Returns: removes information/data from taxon_label
-
trim
(min_taxon_perc)[source]¶ Removes bases at the start and end of the alignment if they are represented by less than the value specified, e.g. 0.75 means that 75% of the sequences need to have a base present.
This ensures that whole chromosomes do not get dragged in; it cuts the ends of long sequences.
has test: test_trim.py
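The column-trimming rule can be sketched as follows (the gap characters and the equal-length assumption are simplifications of the real alignment objects):

```python
def trim_ends(seqs, min_taxon_perc=0.75, gap_chars="-?"):
    """Cut alignment columns from both ends until a column is occupied
    (non-gap) in at least min_taxon_perc of the sequences. Internal
    columns are never removed.
    """
    ncols = len(seqs[0])

    def occupied(col):
        present = sum(1 for s in seqs if s[col] not in gap_chars)
        return present / len(seqs) >= min_taxon_perc

    start = 0
    while start < ncols and not occupied(start):
        start += 1
    end = ncols
    while end > start and not occupied(end - 1):
        end -= 1
    return [s[start:end] for s in seqs]
```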
-
write_aln
(filename=None, alnschema='fasta', direc='workdir')[source]¶ Output alignment with unique otu ids as labels.
-
write_files
(treefilename=None, treeschema='newick', alnfilename=None, alnschema='fasta', direc='workdir')[source]¶ Outputs both the streaming files, labeled with OTU ids. Can be mapped to original labels using otu_dict.json or otu_seq_info.csv
-
write_labelled
(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]¶ Output tree and alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.
NOT MEMORY EFFICIENT AT ALL
Has different options available for different desired outputs
Parameters: - label – which information shall be displayed in labelled files: possible options: ‘^ot:ottTaxonName’, ‘^user:TaxonName’, “^ot:originalLabel”, “^ot:ottId”, “^ncbi:taxon”
- treepath – optional: full file name (including path) for phylogeny
- alnpath – optional: full file name (including path) for alignment
- norepeats – optional: if there shall be no duplicate names in the labelled output files
- add_gb_id – optional, to supplement tiplabel with corresponding GenBank sequence identifier
Returns: writes out labelled phylogeny and alignment to file
-
write_labelled_aln
(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]¶ A wrapper for the write_labelled aln function to maintain older functionalities
-
write_labelled_tree
(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]¶ A wrapper for the write_labelled tree function to maintain older functionalities
-
write_otus
(filename='otu_info', schema='table', direc='workdir')[source]¶ Output all of the OTU information as either json or csv
-
write_papara_files
(treefilename='random_resolve.tre', alnfilename='aln_ott.phy')[source]¶ This writes out needed files for papara (except query sequences). Papara is finicky about trees and needs phylip format for the alignment.
NOTE: names for tree and aln files should not be changed, as they are hardcoded in align_query_seqs().
Is only used within func align_query_seqs.
-
physcraper.aligntreetax.
generate_ATT_from_files
(workdir, configfile, alnfile, aln_schema, treefile, otu_json, tree_schema, search_taxon=None)[source]¶ Build an ATT object without phylesystem, use your own files instead.
Spaces vs underscores kept being an issue, so all spaces are coerced to underscores when data are read in.
Note: has test -> test_owndata.py
Parameters: - alnfile – path to sequence alignment
- aln_schema – string containing format of sequence alignment
- workdir – path to working directory
- config_obj – config class including the settings
- treefile – path to phylogeny
- otu_json – path to json file containing the translation of tip names to taxon names, or to an otu_dictionary
- tree_schema – a string defining the format of the input phylogeny
- search_taxon – optional - OToL ID of the mrca of the clade of interest. If no search mrca ott_id is provided, will use all taxa in tree to calc mrca.
Returns: object of class ATT
-
physcraper.aligntreetax.
generate_ATT_from_run
(workdir, start_files='output', tag=None, configfile=None, run=True)[source]¶ Build an ATT object without phylesystem, use your own files instead.
Returns: object of class ATT
-
physcraper.aligntreetax.
write_labelled_aln
(aligntreetax, label, filepath, schema='fasta', norepeats=True, add_gb_id=False)[source]¶ Output tree and alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.
NOT MEMORY EFFICIENT AT ALL
Has different options available for different desired outputs.
Parameters: - label – which information shall be displayed in labelled files: possible options: ‘^ot:ottTaxonName’, ‘^user:TaxonName’, “^ot:originalLabel”, “^ot:ottId”, “^ncbi:taxon”
- treepath – optional: full file name (including path) for phylogeny
- alnpath – optional: full file name (including path) for alignment
- norepeats – optional: if there shall be no duplicate names in the labelled output files
- add_gb_id – optional, to supplement tiplabel with corresponding GenBank sequence identifier
Returns: writes out labelled phylogeny and alignment to file
-
physcraper.aligntreetax.
write_labelled_tree
(treetax, label, filepath, schema='newick', norepeats=True, add_gb_id=False)[source]¶ Output tree and alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.
NOT MEMORY EFFICIENT AT ALL
Has different options available for different desired outputs.
Parameters: - label – which information shall be displayed in labelled files: possible options: ‘^ot:ottTaxonName’, ‘^user:TaxonName’, “^ot:originalLabel”, “^ot:ottId”, “^ncbi:taxon”
- treepath – optional: full file name (including path) for phylogeny
- alnpath – optional: full file name (including path) for alignment
- norepeats – optional: if there shall be no duplicate names in the labelled output files
- add_gb_id – optional, to supplement tiplabel with corresponding GenBank sequence identifier
Returns: writes out labelled phylogeny and alignment to file
-
physcraper.aligntreetax.
write_otu_file
(treetax, filepath, schema='table')[source]¶ Writes out the OTU dict as json or table.
Parameters: - treetax – either a TreeTax object or an AlignTreeTax object
- filepath – output file path
- schema – either ‘table’ or ‘json’ format
Returns: writes out otu_dict to file
Linker Functions to get data from OpenTree
-
physcraper.opentree_helpers.
OtuJsonDict
(id_to_spn, id_dict)[source]¶ Makes an OTU json dictionary, which is also produced within the openTreeLife-query.
This function is used, if files that shall be updated are not part of the OpenTreeofLife project. It reads in the file that contains the tip names and the corresponding species names. It then tries to get the unique identifier from the OpenTree project or from NCBI.
Reads input file into the var sp_info_dict, translates using an IdDict object using web to call OpenTree, then NCBI if not found.
Parameters: - id_to_spn – User file, that contains tip name and corresponding sp name for input files.
- id_dict – Uses the id_dict generated earlier
Returns: dictionary with key: “otu_tiplabel” and value is another dict with the keys ‘^ncbi:taxon’, ‘^ot:ottTaxonName’, ‘^ot:ottId’, ‘^ot:originalLabel’, ‘^user:TaxonName’, ‘^physcraper:status’, ‘^physcraper:last_blasted’
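An entry in the returned dictionary looks roughly like this (all identifiers and names below are invented for illustration):

```python
# Hypothetical OTU entry; the ids and names are made up.
otu_json = {
    "otu_tiplabel1": {
        "^ncbi:taxon": 123456,
        "^ot:ottTaxonName": "Genus species",
        "^ot:ottId": 654321,
        "^ot:originalLabel": "tiplabel1",
        "^user:TaxonName": "Genus_species",
        "^physcraper:status": "original",
        "^physcraper:last_blasted": None,
    }
}
```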
-
physcraper.opentree_helpers.
bulk_tnrs_load
(filename)[source]¶ Reads in output from the OpenTree bulk TNRS and translates it to a Physcraper OTU dictionary.
Parameters: filename – input json file
-
physcraper.opentree_helpers.
check_if_ottid_in_synth
(ottid)[source]¶ Web call to check if OTT id in synthetic tree. NOT USED.
-
physcraper.opentree_helpers.
conflict_tree
(inputtree, otu_dict)[source]¶ Write out a tree with labels that work for the OpenTree Conflict API
-
physcraper.opentree_helpers.
count_match_tree_to_aln
(tree, dataset)[source]¶ Assess how many taxa match between multiple genes in an alignment data set and input tree.
-
physcraper.opentree_helpers.
deconcatenate_aln
(aln_obj, filename, direc)[source]¶ Split out separate concatenated alignments. NOT TESTED
-
physcraper.opentree_helpers.
generate_ATT_from_phylesystem
(alnfile, aln_schema, workdir, configfile, study_id, tree_id, search_taxon=None, tip_label='^ot:originalLabel')[source]¶ Gathers together tree, alignment, and study info; forces names to OTT ids.
Study and tree ID’s can be obtained by using python ./scripts/find_trees.py LINEAGE_NAME
Spaces vs underscores kept being an issue, so all spaces are coerced to underscores when data are read in.
Parameters: - aln – dendropy DnaCharacterMatrix alignment object
- workdir – path to working directory
- config_obj – Config class containing the settings
- study_id – OpenTree study id of the phylogeny to update
- tree_id – OpenTree tree id of the phylogeny to update; some studies have several phylogenies
- phylesystem_loc – access the GitHub version of the OpenTree data store, or a local clone
- search_taxon – optional; OTT id of the MRCA of the clade that shall be updated
Returns: Object of class ATT
-
physcraper.opentree_helpers.
get_citations_from_json
(synth_response, citations_file)[source]¶ Get citations for studies in an induced synthetic tree response.
Parameters: - synth_response – web service call record
- citations_file – output file
-
physcraper.opentree_helpers.
get_dataset_from_treebase
(study_id)[source]¶ Given a phylogeny in OpenTree with mapped tip labels, this function gets an alignment from the corresponding study on TreeBASE, if available. By default, it first tries getting the alignment from the supertreebase repository at https://github.com/TreeBASE/supertreebase. If that fails, it tries getting the alignment directly from TreeBASE at https://treebase.org. If both fail, it exits with a message.
-
physcraper.opentree_helpers.
get_max_match_aln
(tree, dataset, min_match=3)[source]¶ Select an alignment from a DNA dataset
-
physcraper.opentree_helpers.
get_mrca_ott
(ott_ids)[source]¶ Finds the MRCA of taxa in the ingroup of the original tree. The BLAST search later is limited to descendants of this MRCA according to the NCBI taxonomy.
Only used in the functions that generate the ATT object.
Parameters: ott_ids – List of all OTT ids for tip labels in phylogeny Returns: OTT id of most recent common ancestor
-
physcraper.opentree_helpers.
get_ott_taxon_info
(spp_name)[source]¶ Get OTT id, taxon name, and NCBI id (if present) from the OpenTree Taxonomy. Only works with version 3 of OpenTree APIs
Parameters: spp_name – Species name Returns:
-
physcraper.opentree_helpers.
get_ottid_from_gbifid
(gbif_id)[source]¶ Returns a dictionary mapping GBIF ids to OTT ids. ott_id is set to ‘None’ if the GBIF id is not found in the Open Tree Taxonomy.
-
physcraper.opentree_helpers.
get_tree_from_study
(study_id, tree_id, label_format='ot:originallabel')[source]¶ Create a dendropy Tree object from OpenTree data.
Parameters: - study_id – OpenTree study id
- tree_id – OpenTree tree id
- label_format – one of ‘id’, ‘name’, “ot:originallabel”, “ot:ottid”, “ot:otttaxonname”; defaults to “ot:originallabel”
-
physcraper.opentree_helpers.
get_tree_from_synth
(ott_ids, label_format='name', citation='cites.txt')[source]¶ Wrapper for OT.synth_induced_tree that also pulls citations
-
physcraper.opentree_helpers.
ottids_in_synth
(synthfile=None)[source]¶ Checks if OTT ids are present in the current synthetic tree, using a file listing all current OTT ids in synth (v12.3).
Parameters: synthfile – defaults to taxonomy/ottids_in_synth.txt
-
physcraper.opentree_helpers.
root_tree_from_synth
(tree, otu_dict, base='ott')[source]¶ Uses information from OpenTree of Life to suggest a root.
Parameters: - tree – dendropy Tree
- otu_dict – a dictionary of tip label metadata, including an ‘^ot:ottId’ attribute
- base – either ‘synth’ or ‘ott’. If ‘synth’, OpenTree synthetic tree relationships are used to root the input tree; if ‘ott’, the OpenTree taxonomy is used.
-
physcraper.opentree_helpers.
scraper_from_opentree
(study_id, tree_id, alnfile, workdir, aln_schema, configfile=None)[source]¶ Pull tree from OpenTree to create a physcraper object.
Physcraper run Configuration object generator
-
class
physcraper.configobj.
ConfigObj
(configfile=None, run=True)[source]¶ To build the class the following is needed:
- configfile: a configuration file in a specific format, e.g. to read in self.e_value_thresh.
During the initializing process the following self objects are generated:
- self.e_value_thresh: the defined threshold for the e-value during Blast searches,
check out: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ
self.hitlist_size: the maximum number of sequences retrieved by a single blast search
self.minlen: value from 0 to 1. Defines how much shorter new seq can be compared to input
- self.trim_perc: value that determines how many seq need to be present before the beginning
and end of alignment will be trimmed
self.maxlen: max length for values to add to aln
self.get_ncbi_taxonomy: Path to sh file doing something…
self.ott_ncbi: file containing OTT id, ncbi and taxon name (??)
self.email: email address used for blast queries
self.blast_loc: defines which blasting method to use:
- either web-query (=remote)
- from a local blast database (=local)
self.num_threads: number of cores to be used during a run
self.url_base:
- if blastloc == remote: it defines the url for the blast queries.
- if blastloc == local: url_base = None
self.delay: defines when to reblast sequences in days
optional self.objects:
if blastloc == local:
- self.blastdb: this defines the path to the local blast database
- self.ncbi_nodes: path to ‘nodes.dmp’ file, that contains the hierarchical information
- self.ncbi_names: path to ‘names.dmp’ file, that contains the different ID’s
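A configuration file covering the attributes above might look like this minimal sketch (the section and key names here are illustrative assumptions; consult a shipped example config for the exact spelling):

```ini
[blast]
e_value_thresh = 0.00001
hitlist_size = 10
location = local
localblastdb = /path/to/blastdb/
num_threads = 4
delay = 90

[physcraper]
min_length = 0.8
trim_perc = 0.75
max_length = 1.2
```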
-
check_taxonomy
()[source]¶ Locates a taxonomy directory in the physcraper repo, or, if not available (often because the module was pip installed), generates one.
-
config_str
()[source]¶ Write out the current config values. DOES NOT INCLUDE SOME HIDDEN CONFIGURABLE ATTRIBUTES
Link together NCBI and Open Tree identifiers and names, with GenBank information for updated sequences
-
class
physcraper.ids.
IdDicts
(configfile=None)[source]¶ Class contains different taxonomic identifiers and helps to find the corresponding ids between ncbi and OToL
To build the class the following is needed:
- config_obj: Object of class config (see above)
- workdir: the path to the assigned working directory
During the initializing process the following self objects are generated:
self.workdir: contains path of working directory
self.config: contains the Config class object
self.ott_to_ncbi: dictionary
- key: OToL taxon identifier
- value: ncbi taxon identifier
self.ncbi_to_ott: dictionary
- key: ncbi taxon identifier
- value: OToL taxon identifier
self.ott_to_name: dictionary
- key: OToL taxon identifier
- value: OToL taxon name
self.acc_ncbi_dict: dictionary
- key: Genbank identifier
- value: ncbi taxon identifier
self.spn_to_ncbiid: dictionary
- key: OToL taxon name
- value: ncbi taxon identifier
self.ncbiid_to_spn: dictionary
- key: ncbi taxon identifier
- value: ncbi taxon name
user defined list of mrca OTT ids #TODO this is flipped from the data obj .ott_mrca. On purpose?
#removed mrcas from ids, and put them into the scrape object
- Optional:
- depending on blasting method:
- self.ncbi_parser: for local blast,
- initializes the ncbi_parser class, that contains information about rank and identifiers
-
entrez_efetch
(gb_id)[source]¶ Wrapper function around efetch from ncbi, to get taxonomic information if everything else fails.
Also used when the local blast files have redundant information, to access the taxon info of those sequences.
It adds information to various id_dicts.
Parameters: gb_id – Genbank identifier Returns: read_handle
Uses ncbi databases to easily retrieve taxonomic information.
Parts are altered from https://github.com/zyxue/ncbitax2lin/blob/master/ncbitax2lin.py
-
class
physcraper.ncbi_data_parser.
Parser
(names_file, nodes_file)[source]¶ Reads in databases from ncbi to connect species names with the taxonomic identifier and the corresponding hierarchical information. It provides a much faster way to get that information than using web queries, and makes us independent of web requests (the implementation in BioPython was not really reliable). Nodes includes the hierarchical information, names the scientific names and ids. The files need to be updated regularly; the best way is to do it whenever a new blast database is loaded.
-
get_downtorank_id
(tax_id, downtorank='species')[source]¶ Recursive function to find the parent id of a taxon as defined by downtorank.
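The recursion can be sketched with plain dicts standing in for the parsed nodes.dmp table (the dict-based lookup is an assumption of this example):

```python
def get_downtorank_id_sketch(tax_id, parents, ranks, downtorank="species"):
    """Walk up the taxonomy until the requested rank is reached.

    parents maps tax_id -> parent tax_id, ranks maps tax_id -> rank name.
    Returns None if the root is reached without finding the rank.
    """
    while ranks.get(tax_id) != downtorank:
        parent = parents.get(tax_id)
        if parent is None or parent == tax_id:   # hit the taxonomy root
            return None
        tax_id = parent
    return tax_id
```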
-
physcraper.ncbi_data_parser.
get_acc_from_blast
(query_string)[source]¶ Get the accession number from a blast query.
Parameters: query_string – string that contains acc and gi from local blast query result
Returns: gb_acc
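Local blast hit ids often follow the classic 'gi|&lt;gi&gt;|gb|&lt;accession&gt;|' layout; extracting the accession can be sketched as below (the exact field layout depends on the configured blast output format, so this is an assumption):

```python
def get_acc_from_blast_sketch(query_string):
    """Pull the accession number out of a pipe-delimited blast hit id.

    Assumes the classic 'gi|<gi>|gb|<accession>|' layout; newer hit ids
    may simply be the bare accession, which is returned as-is.
    """
    fields = query_string.split("|")
    if "gb" in fields:
        return fields[fields.index("gb") + 1]
    return query_string.strip() or None
```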
-
physcraper.ncbi_data_parser.
get_gi_from_blast
(query_string)[source]¶ Get the gi number from a blast query. Getting the acc is more difficult now, as new seqs do not always have a gi number; the query then changes.
If not available returns None.
Parameters: query_string – string that contains acc and gi from local blast query result Returns: gb_id if available
-
physcraper.ncbi_data_parser.
get_ncbi_tax_id
(handle)[source]¶ Get the taxon ID from ncbi. Only used for web queries.
Parameters: handle – NCBI read.handle Returns: ncbi_id
-
physcraper.ncbi_data_parser.
get_ncbi_tax_name
(handle)[source]¶ Get the sp name from ncbi. Could be replaced by direct lookup to ott_ncbi.
Parameters: handle – NCBI read.handle Returns: ncbi_spn
-
physcraper.ncbi_data_parser.
get_tax_info_from_acc
(gb_id, ids_obj)[source]¶ Takes an accession number and returns the ncbi_id and the taxon name.
-
physcraper.ncbi_data_parser.
load_names
(names_file)[source]¶ Loads names.dmp and converts it into a pandas.DataFrame. Includes only names which are accepted as scientific name by ncbi.
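names.dmp rows are delimited by '\t|\t' and terminated by '\t|'; the filtering for scientific names can be sketched in pure Python (the real load_names() builds a pandas.DataFrame instead):

```python
def parse_names_dmp(lines):
    """Map tax_id -> scientific name from names.dmp-style rows.

    Keeps only rows whose name class is 'scientific name', mirroring
    the filtering described for load_names().
    """
    names = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\t|\n").split("\t|\t")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names
```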
-
physcraper.ncbi_data_parser.
load_nodes
(nodes_file)[source]¶ Loads nodes.dmp and converts it into a pandas.DataFrame. Contains the information about the taxonomic hierarchy of names.
-
physcraper.ncbi_data_parser.
load_synonyms
(names_file)[source]¶ Loads names.dmp and converts it into a pandas.DataFrame. Includes only names which are viewed as synonym by ncbi.
-
physcraper.ncbi_data_parser.
strip
(inputstr)[source]¶ Strips blank characters from strings in a pd dataframe.
Work in progress to pull apart the linked tree and taxon objects from the alignment based ATT object
-
class
physcraper.treetaxon.
TreeTax
(otu_json, treefrom, schema='newick')[source]¶ Wrap up the key parts together; requires OTT_id, and names must already match.
-
write_labelled
(label, path, norepeats=True, add_gb_id=False)[source]¶ Output tree and alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.
NOT MEMORY EFFICIENT AT ALL
Has different options available for different desired outputs
Parameters: - label – which information shall be displayed in labelled files: possible options: ‘^ot:ottTaxonName’, ‘^user:TaxonName’, “^ot:originalLabel”, “^ot:ottId”, “^ncbi:taxon”
- treepath – optional: full file name (including path) for phylogeny
- alnpath – optional: full file name (including path) for alignment
- norepeats – optional: if there shall be no duplicate names in the labelled output files
- add_gb_id – optional, to supplement tiplabel with corresponding GenBank sequence identifier
Returns: writes out labelled phylogeny and alignment to file
-