Function Documentation

Physcraper module

The core blasting and new sequence integration module

class physcraper.scrape.PhyscraperScrape(data_obj, ids_obj=None, search_taxon=None)[source]

This is the class that does the perpetual (iterative) updating of the tree and alignment

To build the class the following is needed:

  • data_obj: Object of class ATT (see below)
  • ids_obj: Object of class IdDicts (see below)

During the initializing process the following self objects are generated:

  • self.workdir: path to working directory retrieved from ATT object = data_obj.workdir
  • self.logfile: path of logfile
  • self.data: ATT object
  • self.ids: IdDict object
  • self.config: Config object
  • self.new_seqs: dictionary that contains the newly found seq using blast:
    • key: gi id
    • value: corresponding seq
  • self.new_seqs_otu_id: dictionary that contains the new sequences that passed the remove_identical_seqs() step:
    • key: otu_id
    • value: see otu_dict; a subset of the otu_dict covering all sequences that will be newly added to aln and tre
  • self.mrca_ncbi: int, NCBI identifier of the MRCA
  • self.blast_subdir: path to the folder that contains the files written during blast
  • self.newseqs_file: filename of the file that contains the sequences from self.new_seqs_otu_id
  • self.date: date of the run - may lag behind the real date!
  • self.repeat: either 1 or 0; used to determine whether to continue updating the tree (no new seqs found = 0)
  • self.newseqs_acc: list of all gi_ids that were passed into remove_identical_seqs(). Used to speed up the adding process
  • self.blocklist: list of gi_ids of sequences that shall not be added or need to be removed. Supplied by the user.
  • self.seq_filter: list of words that may occur in otu_dict.status and which shall not be used in the building of FilterBlast.sp_d (that's the main function), but it is also used in assert statements to make sure unwanted seqs are not added.
  • self.unpublished: True/False. Used to look for local unpublished sequences that shall be added if True.
  • self.path_to_local_seq: usually False; contains the path to unpublished sequences if that option is used.

The following functions are called during the init process (a usage sketch follows the list of markers below):

  • self.reset_markers(): adds completion markers to self; they are used to make sure certain functions run again if the program crashed and the pickle file is read back in.
  • self._blasted: 0/1, if run_blast_wrapper() was called, it is set to 1 for the round.
  • self._blast_read: 0/1, if read_blast_wrapper() was called, it is set to 1 for the round.
  • self._identical_removed: 0
  • self._query_seqs_written: 0/1, if write_query_seqs() was called, it is set to 1 for the round.
  • self._query_seqs_aligned: 0
  • self._query_seqs_placed: 0/1, if place_query_seqs() was called, it is set to 1 for the round.
  • self._reconciled: 0
  • self._full_tree_est: 0/1, if est_full_tree() was called, it is set to 1 for the round.
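
As referenced above, a minimal construction sketch, assuming data_obj (an ATT object, see AlignTreeTax below) and ids_obj (an IdDicts object, see below) have already been built; the method calls shown are documented further down:

    from physcraper.scrape import PhyscraperScrape

    # data_obj: an ATT object (physcraper.aligntreetax.AlignTreeTax)
    # ids_obj: an IdDicts object (physcraper.ids.IdDicts)
    scraper = PhyscraperScrape(data_obj, ids_obj)
    scraper.run_blast_wrapper()     # generate and run the blast queries
    scraper.read_blast_wrapper()    # read the results into self.new_seqs
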
align_new_seqs(aligner='muscle')[source]

Align the new sequences against each other

calculate_bootstrap(alignment='default', num_reps='100')[source]

Calculates bootstrap and consensus trees.

RAxML options used:

  • -p: random seed
  • -s: alignment file
  • -n: output file name
  • -t: starting tree
  • -b: bootstrap random seed
  • -#: bootstrap stopping criterion
  • -z: file with multiple trees

calculate_final_tree(boot_reps=100)[source]

Calculates the final tree using a trimmed alignment.

Returns:final PS data
check_complement(match, seq, gb_id)[source]

Double-checks whether the blast match is to the sequence, its complement, or its reverse complement, and returns the correct sequence.

est_full_tree(alignment='default', startingtree=None)[source]

Full RAxML run with the placement tree as the starting tree. The PTHREADS version is the faster one; hopefully people install it, but if not this falls back to the normal RAxML.

filter_seqs(tmp_dict, selection='random', threshold=None)[source]

Subselects from the sequences down to a threshold number of sequences per species.

get_full_seq(gb_id, blast_seq)[source]

Get the full sequence for a gb_id that was retrieved via blast.

Currently only used for local searches; GenBank database sequences are retrieved in batch mode, which is hopefully faster.

Parameters:
  • gb_id – unique sequence identifier (often a GenBank accession number)
  • blast_seq – sequence retrieved by blast
Returns:

full sequence, the whole submitted sequence, not only the part that matched the blast query sequence

make_sp_dict(otu_list=None)[source]

Makes a dict of otu_ids grouped by species.

map_taxa_to_ncbi()[source]

Find NCBI ids for taxa from OpenTree

read_blast_wrapper(blast_dir=None)[source]

Reads in and processes the blast XML files.

Parameters:blast_dir – path to directory which contains blast files
Returns:fills different dictionaries with information from blast files
read_local_blast_query(fn_path)[source]

Implementation to read in results of local blast searches.

Parameters:fn_path – path to file containing the local blast searches
Returns:updated self.new_seqs and self.data.gb_dict dictionaries
read_webbased_blast_query(fn_path)[source]

Implementation to read in results of web blast searches.

Parameters:fn_path – path to file containing the web blast results
Returns:updated self.new_seqs and self.data.gb_dict dictionaries
remove_blocklistitem()[source]

This removes items from the aln and tree if the corresponding GenBank identifiers were added to the blocklist.

Note that sequences that were not added earlier because they were similar to the one being removed here are lost (that should not be a major issue though, as in a new blast run new sequences from the taxon can be added).

remove_identical_seqs()[source]

Goes through the newly downloaded sequences, removes those that are shorter than LENGTH_THRESH percent of the original sequence lengths, chooses the longer of two that are otherwise identical, and puts them in a dict with the new name as gi_ott_id.

replace_aln(filename, schema='fasta')[source]

Replace the alignment in the data object with the new alignment

replace_tre(filename, schema='newick')[source]

Replace the tree in the data object with the new tree

reset_markers()[source]

set completion markers back to 0 for a re-run

run_blast_wrapper()[source]

Generates the blast queries and saves them, depending on the blasting method, to different file formats.

It runs blast only if a sequence has not been blasted within the user-defined threshold in the config file (delay).

Returns:writes blast queries to file
run_local_blast_cmd(query, taxon_label, fn_path)[source]

Contains the commands used to run a local blast query, which differs from the web queries.

Parameters:
  • query – query sequence
  • taxon_label – corresponding taxon name for query sequence
  • fn_path – path to output file for blast query result
Returns:

runs local blast query and writes it to file

run_muscle(input_aln_path=None, new_seqs_path=None, outname='all_align')[source]

Aligns the new sequences against each other and then profile-aligns them to the existing alignment.

run_web_blast_query(query, equery, fn_path)[source]

Equivalent to run_local_blast_cmd(), but for web queries, which need to be implemented differently.

Parameters:
  • query – query sequence
  • equery – method to limit blast query to mrca
  • fn_path – path to output file for blast query result
Returns:

runs web blast query and writes it to file

select_seq_at_random(otu_list, count)[source]

Selects sequences at random if there are more than the threshold.

seq_dict_build(seq, new_otu_label, seq_dict)[source]

Takes a sequence, a label (the otu_id) and a dictionary, and adds the sequence to the dict only if it is not a subsequence of a sequence already in the dict. If the new sequence is a supersequence of one in the dict, it removes that sequence and replaces it with the new one.

Parameters:
  • seq – sequence as string, which shall be compared to existing sequences
  • new_otu_label – otu_label of the corresponding seq
  • seq_dict – the tmp_dict generated in add_otu()
Returns:

updated seq_dict
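
The subsequence check can be illustrated with a minimal sketch (an illustrative stand-in, not the actual implementation; the helper name is hypothetical):

    def add_if_not_subsequence(seq, new_otu_label, seq_dict):
        """Add seq to seq_dict unless it is a subsequence of an existing entry;
        if it is a supersequence of an existing entry, replace that entry."""
        new = seq.replace("-", "").lower()
        for label, existing in list(seq_dict.items()):
            old = existing.replace("-", "").lower()
            if new in old:      # new seq is contained in an existing one: keep the old entry
                return seq_dict
            if old in new:      # new seq extends an existing one: drop the shorter entry
                del seq_dict[label]
        seq_dict[new_otu_label] = seq
        return seq_dict
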

summarize_boot(besttreepath, bootpath, min_clade_freq=0.2)[source]

Summarize the bootstrap proportions onto the ML tree

write_mrca()[source]

Write out search info to file

write_new_seqs(filename='date')[source]

writes out the query sequence file

physcraper.scrape.debug(msg)[source]

short debugging command

physcraper.scrape.set_verbose()[source]

Set output to verbose

AlignTreeTax: The core data object for Physcraper. Holds and links namespaces for a tree, an alignment, the taxa, and their metadata.

class physcraper.aligntreetax.AlignTreeTax(tree, otu_dict, alignment, search_taxon, workdir, configfile=None, tree_schema='newick', aln_schema='fasta', tag=None)[source]

Wraps up the key parts together; requires an OTT id, and names must already match. Hypothetically, all the keys in the otu_dict should be clean.

To build the class the following is needed:
  • newick: dendropy.tre.as_string(schema=schema_trf) object
  • otu_dict: json file including the otu_dict information generated earlier
  • alignment: dendropy DnaCharacterMatrix (<dendropy.datamodel.charmatrixmodel.DnaCharacterMatrix>) object
  • search_taxon: OToL identifier of the group of interest, either a subclade as defined by the user or of all tip labels in the phylogeny
  • workdir: the path to the corresponding working directory
  • config_obj: Config class object
  • schema: optional argument to define the tre file schema, if different from “newick”

During the initializing process the following self objects are generated:
  • self.aln: contains the alignment and which will be updated during the run

  • self.tre: contains the phylogeny, which will be updated during the run

  • self.otu_dict: dictionary with taxon information and physcraper relevant stuff
    • key: otu_id, a unique identifier

    • value: dictionary with the following key:values:
      • ‘^ncbi:gi’: GenBank identifier - deprecated by Genbank - only older sequences will have it
      • ‘^ncbi:accession’: Genbanks accession number
      • ‘^ncbi:title’: title of Genbank sequence submission
      • ‘^ncbi:taxon’: ncbi taxon identifier
      • ‘^ot:ottId’: OToL taxon identifier
      • ‘^physcraper:status’: contains information on whether it was ‘original’, ‘queried’, ‘removed’, or ‘added during filtering process’
      • ‘^ot:ottTaxonName’: OToL taxon name
      • ‘^physcraper:last_blasted’: contains the date when the sequence was last blasted
      • ‘^user:TaxonName’: optional, user-given label from OtuJsonDict
      • ‘^ot:originalLabel’: optional, user-given tip label of the phylogeny

  • self.ps_otu: iterator for new otu IDs, is used as key for self.otu_dict

  • self.workdir: contains the path to the working directory; if the folder does not exist, it is generated.

  • self.mrca_ott: OToL taxon Id for the most recent common ancestor of the ingroup

  • self.orig_seqlen: list of the original sequence length of the input data

  • self.gi_dict: dictionary that has all information from sequences found during the blasting
    • key: GenBank sequence identifier
    • value: dictionary; the content depends on the blast option and differs between web queries and local blast queries

    • keys - value pairs for local blast:
      • ‘^ncbi:gi’: GenBank sequence identifier
      • ‘accession’: GenBank accession number
      • ‘staxids’: Taxon identifier
      • ‘sscinames’: Taxon species name
      • ‘pident’: Blast percentage of identical matches
      • ‘evalue’: Blast e-value
      • ‘bitscore’: Blast bitscore, used for FilterBlast
      • ‘sseq’: corresponding sequence
      • ‘title’: title of Genbank sequence submission
    • key - values for web-query:
      • ‘accession’: Genbank accession number
      • ‘length’: length of sequence
      • ‘title’: string combination of hit_id and hit_def
      • ‘hit_id’: string combination of gi id and accession number
      • ‘hsps’: Bio.Blast.Record.HSP object
      • ‘hit_def’: title from GenBank sequence
    • optional key - value pairs for unpublished option:
      • ‘localID’: local sequence identifier
  • self._reconciled: True/False,

  • self.unpubl_otu_json: optional, will contain the OTU-dict for unpublished data, if that option is used

The following functions are called during the init process:
  • self._reconcile():
    removes taxa that are not found in both the phylogeny and the aln
  • self._reconcile_names():
    is used for user-supplied files; it removes the character ‘n’ from tip names that start with a number
The Physcraper class then updates:
  • self.aln, self.tre, self.otu_dict, self.ps_otu, and self.gi_dict
add_otu(gb_id, ids_obj)[source]

Generates an otu_id for new sequences and adds them into self.otu_dict. Needs to be passed an IdDict to do the mapping.

Parameters:
  • gb_id – the GenBank identifier, or a local identifier for unpublished sequences
  • ids_obj – an IdDicts object, needed to access the taxonomic information
Returns:

the unique otu_id - the key from self.otu_dict of the corresponding sequence

check_tre_in_aln()[source]

Makes sure that everything which is in tre is also found in aln.

Extracted method from trim. Not sure we actually need it there.

get_otu_for_acc(gb_id)[source]

A reverse search to find the unique otu_id for a given accession number.

Parameters:gb_id – the Genbank identifier

prune_short()[source]

Prunes sequences from alignment if they are shorter than specified in the config file, or if tip is only present in tre.

Sometimes, in the de-concatenating of the original alignment, taxa with no sequence are generated, or certain sequences are simply very short. This removes those from both the tre and the alignment.

has test: test_prune_short.py

Returns:prunes aln and tre
read_in_aln(alignment, aln_schema)[source]

Reads in an alignment to the object taxon namespace.

read_in_tree(tree, tree_schema=None)[source]

Imports a tree either from a file or a dendropy data object. Adds records in OTU dictionary if not already present.

remove_taxa_aln_tre(taxon_label)[source]

Removes taxa from aln and tre and updates otu_dict, takes a single taxon_label as input.

note: has test, test_remove_taxa_aln_tre.py

Parameters:taxon_label – taxon_label from dendropy object - aln or phy
Returns:removes information/data from taxon_label
trim(min_taxon_perc)[source]

Removes bases at the start and end of the alignment if they are present in fewer sequences than the specified fraction, e.g. 0.75 means that 75% of the sequences need to have a base present.

This ensures that whole chromosomes do not get dragged in; it cuts the ends of long sequences.

has test: test_trim.py
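
The trimming rule can be sketched as follows (an illustrative stand-in, not the actual implementation; it assumes the alignment is a dict of equal-length gapped strings):

    def trim_ends(aln, min_taxon_perc=0.75):
        """Drop leading and trailing columns in which fewer than min_taxon_perc
        of the sequences have a base (a non-gap, non-missing character)."""
        seqs = list(aln.values())
        ncol = len(seqs[0])

        def dense(col):
            present = sum(1 for s in seqs if s[col] not in "-?")
            return present / len(seqs) >= min_taxon_perc

        start = next((i for i in range(ncol) if dense(i)), ncol)
        stop = next((i for i in reversed(range(ncol)) if dense(i)), -1)
        return {label: s[start:stop + 1] for label, s in aln.items()}
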

write_aln(filename=None, alnschema='fasta', direc='workdir')[source]

Output alignment with unique otu ids as labels.

write_files(treefilename=None, treeschema='newick', alnfilename=None, alnschema='fasta', direc='workdir')[source]

Outputs both the streaming files, labeled with OTU ids. Can be mapped to original labels using otu_dict.json or otu_seq_info.csv

write_labelled(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]

Output tree and alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.

NOT MEMORY EFFICIENT AT ALL

Has different options available for different desired outputs

Parameters:
  • label – which information shall be displayed in labelled files: possible options: ‘^ot:ottTaxonName’, ‘^user:TaxonName’, “^ot:originalLabel”, “^ot:ottId”, “^ncbi:taxon”
  • treepath – optional: full file name (including path) for phylogeny
  • alnpath – optional: full file name (including path) for alignment
  • norepeats – optional: if there shall be no duplicate names in the labelled output files
  • add_gb_id – optional, to supplement tiplabel with corresponding GenBank sequence identifier
Returns:

writes out labelled phylogeny and alignment to file

write_labelled_aln(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]

A wrapper for the write_labelled aln function to maintain older functionalities

write_labelled_tree(label, filename='labelled', direc='workdir', norepeats=True, add_gb_id=False)[source]

A wrapper for the write_labelled tree function to maintain older functionalities

write_otus(filename='otu_info', schema='table', direc='workdir')[source]

Output all of the OTU information as either json or csv

write_papara_files(treefilename='random_resolve.tre', alnfilename='aln_ott.phy')[source]

This writes out needed files for papara (except query sequences). Papara is finicky about trees and needs phylip format for the alignment.

NOTE: names for tree and aln files should not be changed, as they are hardcoded in align_query_seqs().

It is only used within align_query_seqs().

write_random_resolve_tre(treefilename='random_resolve.tre', direc='workdir')[source]

Randomly resolve polytomies, because some downstream approaches require that, e.g. Papara.

physcraper.aligntreetax.generate_ATT_from_files(workdir, configfile, alnfile, aln_schema, treefile, otu_json, tree_schema, search_taxon=None)[source]

Build an ATT object without phylesystem, use your own files instead.

Spaces vs underscores kept being an issue, so all spaces are coerced to underscores when data are read in.

Note: has test -> test_owndata.py

Parameters:
  • alnfile – path to sequence alignment
  • aln_schema – string containing format of sequence alignment
  • workdir – path to working directory
  • config_obj – config class including the settings
  • treefile – path to phylogeny
  • otu_json – path to json file containing the translation of tip names to taxon names, or to an otu_dictionary
  • tree_schema – a string defining the format of the input phylogeny
  • search_taxon – optional - OToL ID of the mrca of the clade of interest. If no search mrca ott_id is provided, will use all taxa in tree to calc mrca.
Returns:

object of class ATT
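
A minimal usage sketch (all file paths are hypothetical placeholders):

    from physcraper.aligntreetax import generate_ATT_from_files

    data_obj = generate_ATT_from_files(workdir="tmp/my_run",
                                       configfile="my_run.config",
                                       alnfile="my_aln.fas",
                                       aln_schema="fasta",
                                       treefile="my_tree.tre",
                                       otu_json="otu_info.json",
                                       tree_schema="newick")
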

physcraper.aligntreetax.generate_ATT_from_run(workdir, start_files='output', tag=None, configfile=None, run=True)[source]

Build an ATT object without phylesystem, using the files written during a previous run.

Returns:object of class ATT

physcraper.aligntreetax.set_verbose()[source]

Set verbosity of outputs

physcraper.aligntreetax.write_labelled_aln(aligntreetax, label, filepath, schema='fasta', norepeats=True, add_gb_id=False)[source]

Output the alignment with human readable labels. Jumps through a bunch of hoops to make labels unique.

NOT MEMORY EFFICIENT AT ALL

Has different options available for different desired outputs.

Parameters:
  • label – which information shall be displayed in labelled files: possible options: ‘^ot:ottTaxonName’, ‘^user:TaxonName’, “^ot:originalLabel”, “^ot:ottId”, “^ncbi:taxon”
  • treepath – optional: full file name (including path) for phylogeny
  • alnpath – optional: full file name (including path) for alignment
  • norepeats – optional: if there shall be no duplicate names in the labelled output files
  • add_gb_id – optional, to supplement tiplabel with corresponding GenBank sequence identifier
Returns:

writes out labelled phylogeny and alignment to file

physcraper.aligntreetax.write_labelled_tree(treetax, label, filepath, schema='newick', norepeats=True, add_gb_id=False)[source]

Output the tree with human readable labels. Jumps through a bunch of hoops to make labels unique.

NOT MEMORY EFFICIENT AT ALL

Has different options available for different desired outputs.

Parameters:
  • label – which information shall be displayed in labelled files: possible options: ‘^ot:ottTaxonName’, ‘^user:TaxonName’, “^ot:originalLabel”, “^ot:ottId”, “^ncbi:taxon”
  • treepath – optional: full file name (including path) for phylogeny
  • alnpath – optional: full file name (including path) for alignment
  • norepeats – optional: if there shall be no duplicate names in the labelled output files
  • add_gb_id – optional, to supplement tiplabel with corresponding GenBank sequence identifier
Returns:

writes out labelled phylogeny and alignment to file

physcraper.aligntreetax.write_otu_file(treetax, filepath, schema='table')[source]

Writes out the OTU dict as json or table.

Parameters:
  • treetax – either a TreeTax object or an AlignTreeTax object
  • filepath – output file path
  • schema – either table or json format
Returns:writes out otu_dict to file

Linker Functions to get data from OpenTree

physcraper.opentree_helpers.OtuJsonDict(id_to_spn, id_dict)[source]

Makes an OTU json dictionary, which is also produced within the openTreeLife-query.

This function is used if the files that shall be updated are not part of the OpenTree of Life project. It reads in the file that contains the tip names and the corresponding species names, and then tries to get the unique identifiers from the OpenTree project or from NCBI.

Reads the input file into the var sp_info_dict, then translates it using an IdDicts object, calling OpenTree via the web first and NCBI if nothing is found.

Parameters:
  • id_to_spn – user file that contains the tip names and corresponding species names for the input files
  • id_dict – Uses the id_dict generated earlier
Returns:

dictionary with key: “otu_tiplabel” and value is another dict with the keys ‘^ncbi:taxon’, ‘^ot:ottTaxonName’, ‘^ot:ottId’, ‘^ot:originalLabel’, ‘^user:TaxonName’, ‘^physcraper:status’, ‘^physcraper:last_blasted’
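
A minimal usage sketch (file names are hypothetical placeholders), assuming an IdDicts object has been built as described below:

    from physcraper.ids import IdDicts
    from physcraper.opentree_helpers import OtuJsonDict

    ids = IdDicts(configfile="my_run.config")
    otu_json = OtuJsonDict("tip_to_species.csv", ids)  # tip label -> taxon metadata dict
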

physcraper.opentree_helpers.bulk_tnrs_load(filename)[source]

Reads in output from the OpenTree bulk TNRS and translates it to a Physcraper OTU dictionary.

Parameters:filename – input json file

physcraper.opentree_helpers.check_if_ottid_in_synth(ottid)[source]

Web call to check if OTT id in synthetic tree. NOT USED.

physcraper.opentree_helpers.conflict_tree(inputtree, otu_dict)[source]

Write out a tree with labels that work for the OpenTree Conflict API

physcraper.opentree_helpers.count_match_tree_to_aln(tree, dataset)[source]

Assess how many taxa match between multiple genes in an alignment data set and input tree.

physcraper.opentree_helpers.debug(msg)[source]

short debugging command

physcraper.opentree_helpers.deconcatenate_aln(aln_obj, filename, direc)[source]

Splits out separate concatenated alignments. NOT TESTED

physcraper.opentree_helpers.generate_ATT_from_phylesystem(alnfile, aln_schema, workdir, configfile, study_id, tree_id, search_taxon=None, tip_label='^ot:originalLabel')[source]

Gathers together tree, alignment, and study info; forces names to OTT ids.

Study and tree IDs can be obtained by running python ./scripts/find_trees.py LINEAGE_NAME

Spaces vs underscores kept being an issue, so all spaces are coerced to underscores when data are read in.

Parameters:
  • aln – dendropy DnaCharacterMatrix (<dendropy.datamodel.charmatrixmodel.DnaCharacterMatrix>) alignment object
  • workdir – path to working directory
  • config_obj – Config class containing the settings
  • study_id – OpenTree study id of the phylogeny to update
  • tree_id – OpenTree tree id of the phylogeny to update; some studies have several phylogenies
  • phylesystem_loc – access the GitHub version of the OpenTree data store, or a local clone
  • search_taxon – optional; OTT id of the MRCA of the clade that shall be updated
Returns:Object of class ATT

physcraper.opentree_helpers.get_citations_from_json(synth_response, citations_file)[source]

Get citations for studies in an induced synthetic tree response.

Parameters:
  • synth_response – web service call record
  • citations_file – output file

physcraper.opentree_helpers.get_dataset_from_treebase(study_id)[source]

Given a phylogeny in OpenTree with mapped tip labels, this function gets an alignment from the corresponding study on TreeBASE, if available. By default, it first tries getting the alignment from the supertreebase repository at https://github.com/TreeBASE/supertreebase. If that fails, it tries getting the alignment directly from TreeBASE at https://treebase.org. If both fail, it exits with a message.

physcraper.opentree_helpers.get_max_match_aln(tree, dataset, min_match=3)[source]

Select an alignment from a DNA dataset

physcraper.opentree_helpers.get_mrca_ott(ott_ids)[source]

Finds the MRCA of taxa in the ingroup of the original tree. The BLAST search later is limited to descendants of this MRCA according to the NCBI taxonomy.

Only used in the functions that generate the ATT object.

Parameters:ott_ids – List of all OTT ids for tip labels in phylogeny
Returns:OTT id of most recent common ancestor
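
A minimal usage sketch (the OTT ids are hypothetical example values):

    from physcraper.opentree_helpers import get_mrca_ott

    ott_ids = [770315, 490099]           # OTT ids of the ingroup tip labels (example values)
    mrca_ott_id = get_mrca_ott(ott_ids)  # OTT id of their most recent common ancestor
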
physcraper.opentree_helpers.get_nexson(study_id)[source]

Grabs nexson from phylesystem.

physcraper.opentree_helpers.get_ott_taxon_info(spp_name)[source]

Get OTT id, taxon name, and NCBI id (if present) from the OpenTree Taxonomy. Only works with version 3 of OpenTree APIs

Parameters:spp_name – Species name
Returns:
physcraper.opentree_helpers.get_ottid_from_gbifid(gbif_id)[source]

Returns a dictionary mapping GBIF ids to OTT ids. ott_id is set to ‘None’ if the GBIF id is not found in the Open Tree Taxonomy.

physcraper.opentree_helpers.get_tree_from_study(study_id, tree_id, label_format='ot:originallabel')[source]

Create a dendropy Tree object from OpenTree data.

Parameters:
  • study_id – OpenTree study id
  • tree_id – OpenTree tree id
  • label_format – one of ‘id’, ‘name’, “ot:originallabel”, “ot:ottid”, “ot:otttaxonname”; defaults to “ot:originallabel”
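
A minimal usage sketch (the study and tree ids are example values):

    from physcraper.opentree_helpers import get_tree_from_study

    tree = get_tree_from_study(study_id="pg_55", tree_id="tree5864")  # returns a dendropy Tree
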

physcraper.opentree_helpers.get_tree_from_synth(ott_ids, label_format='name', citation='cites.txt')[source]

Wrapper for OT.synth_induced_tree that also pulls citations

physcraper.opentree_helpers.ottids_in_synth(synthfile=None)[source]

Checks if OTT ids are present in the current synthetic tree, using a file listing all current OTT ids in synth (v12.3).

Parameters:synthfile – defaults to taxonomy/ottids_in_synth.txt

physcraper.opentree_helpers.root_tree_from_synth(tree, otu_dict, base='ott')[source]

Uses information from OpenTree of Life to suggest a root.

Parameters:
  • tree – dendropy Tree
  • otu_dict – a dictionary of tip label metadata, including an ‘^ot:ottId’ attribute
  • base – either ‘synth’ or ‘ott’. If ‘synth’, OpenTree synthetic tree relationships are used to root the input tree; if ‘ott’, the OpenTree taxonomy is used.

physcraper.opentree_helpers.scraper_from_opentree(study_id, tree_id, alnfile, workdir, aln_schema, configfile=None)[source]

Pull tree from OpenTree to create a physcraper object.
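
A minimal end-to-end sketch (study/tree ids and file paths are hypothetical placeholders; the update steps shown use the methods documented above, and the exact workflow may differ):

    from physcraper.opentree_helpers import scraper_from_opentree

    scraper = scraper_from_opentree(study_id="pg_55",
                                    tree_id="tree5864",
                                    alnfile="my_aln.fas",
                                    workdir="tmp/pg_55_run",
                                    aln_schema="fasta",
                                    configfile="my_run.config")
    scraper.run_blast_wrapper()        # generate and run the blast queries
    scraper.read_blast_wrapper()       # read the results into self.new_seqs
    scraper.remove_identical_seqs()    # fill self.new_seqs_otu_id
    scraper.calculate_final_tree()     # estimate the final tree
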

physcraper.opentree_helpers.set_verbose()[source]

Set output verbosity

Physcraper run Configuration object generator

class physcraper.configobj.ConfigObj(configfile=None, run=True)[source]

To build the class the following is needed (a usage sketch follows the attribute list below):

  • configfile: a configuration file in a specific format, e.g. to read in self.e_value_thresh.

During the initializing process the following self objects are generated:

  • self.e_value_thresh: the defined threshold for the e-value during Blast searches,

    check out: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ

  • self.hitlist_size: the maximum number of sequences retrieved by a single blast search

  • self.minlen: value from 0 to 1. Defines how much shorter new seq can be compared to input

  • self.trim_perc: value that determines how many seq need to be present before the beginning

    and end of alignment will be trimmed

  • self.maxlen: max length for values to add to aln

  • self.get_ncbi_taxonomy: Path to sh file doing something…

  • self.ott_ncbi: file containing OTT id, ncbi and taxon name (??)

  • self.email: email address used for blast queries

  • self.blast_loc: defines which blasting method to use:

    • either web-query (=remote)
    • from a local blast database (=local)
  • self.num_threads: number of cores to be used during a run

  • self.url_base:

    • if blastloc == remote: it defines the url for the blast queries.
    • if blastloc == local: url_base = None
  • self.delay: defines when to reblast sequences in days

  • optional self.objects:

    • if blastloc == local:

      • self.blastdb: this defines the path to the local blast database
      • self.ncbi_nodes: path to ‘nodes.dmp’ file, that contains the hierarchical information
      • self.ncbi_names: path to ‘names.dmp’ file, that contains the different ID’s
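
As referenced above, a minimal usage sketch (the paths are hypothetical placeholders):

    from physcraper.configobj import ConfigObj

    config = ConfigObj()                            # use the default settings
    # or read settings from a file:
    config = ConfigObj(configfile="my_run.config")
    config.write_file("tmp/my_run")                 # writes tmp/my_run/run.config
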
check_taxonomy()[source]

Locates a taxonomy directory in the physcraper repo or, if not available (often because the module was pip installed), generates one.

config_str()[source]

Write out the current config values. DOES NOT INCLUDE SOME HIDDEN CONFIGURABLE ATTRIBUTES

read_config(configfi)[source]

Reads the config file and sets the configuration params; any params not listed will be set to default values in set_defaults().

Parameters:configfi – path to input file

set_defaults()[source]

In the absence of an input configuration file, sets default values.

set_local()[source]

Checks that all appropriate files etc are in place for local blast db.

write_file(direc, filename='run.config')[source]

Writes the config params to file.

Parameters:
  • direc – path to write the file to
  • filename – filename to use. Default = run.config

physcraper.configobj.is_number(inputstr)[source]

Test if string can be coerced to float

Link together NCBI and Open Tree identifiers and names, with GenBank information for updated sequences

class physcraper.ids.IdDicts(configfile=None)[source]

Class contains different taxonomic identifiers and helps to find the corresponding ids between ncbi and OToL

To build the class the following is needed:

  • config_obj: Object of class config (see above)
  • workdir: the path to the assigned working directory

During the initializing process the following self objects are generated:

  • self.workdir: contains path of working directory

  • self.config: contains the Config class object

  • self.ott_to_ncbi: dictionary

    • key: OToL taxon identifier
    • value: ncbi taxon identifier
  • self.ncbi_to_ott: dictionary

    • key: ncbi taxon identifier
    • value: OToL taxon identifier
  • self.ott_to_name: dictionary

    • key: OToL taxon identifier
    • value: OToL taxon name
  • self.acc_ncbi_dict: dictionary

    • key: Genbank identifier
    • value: ncbi taxon identifier
  • self.spn_to_ncbiid: dictionary

    • key: OToL taxon name
    • value: ncbi taxon identifier
  • self.ncbiid_to_spn: dictionary

    • key: ncbi taxon identifier
    • value: ncbi taxon name

user defined list of MRCA OTT ids #TODO this is flipped from the data obj .ott_mrca. On purpose?

#removed mrcas from ids, and put them into the scrape object

  • Optional (depending on the blasting method):
    • self.ncbi_parser: for local blast; initializes the ncbi_parser class, which contains information about ranks and identifiers

A usage sketch follows below.
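
A minimal usage sketch (the config file path and the NCBI taxon id are hypothetical examples):

    from physcraper.ids import IdDicts

    ids = IdDicts(configfile="my_run.config")  # builds the translation dictionaries listed above
    ott_id = ids.ncbi_to_ott.get(9606)         # look up the OToL id for an NCBI taxon id
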
entrez_efetch(gb_id)[source]
Wrapper function around efetch from NCBI to get taxonomic information if everything else fails.
Also used when the local blast files have redundant information, to access the taxon info of those sequences.

It adds information to various id_dicts.

Parameters:gb_id – Genbank identifier
Returns:read_handle
get_ncbiid_from_acc(acc)[source]

Checks local dicts, and then runs efetch to get the ncbi id for an accession.

get_tax_seq_acc(acc)[source]

Pulls the taxon ID and the full sequences from NCBI

Uses NCBI databases to easily retrieve taxonomic information.

Parts are altered from https://github.com/zyxue/ncbitax2lin/blob/master/ncbitax2lin.py

class physcraper.ncbi_data_parser.Parser(names_file, nodes_file)[source]

Reads in databases from NCBI to connect species names with the taxonomic identifier and the corresponding hierarchical information. It provides a much faster way to get that information than using web queries. We use those files to become independent of web requests (the implementation in BioPython was not really reliable). Nodes includes the hierarchical information; names, the scientific names and IDs. The files need to be updated regularly, ideally whenever a new blast database is loaded.
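
A minimal usage sketch (the paths to the NCBI taxonomy dump files and the taxon name are hypothetical examples):

    from physcraper.ncbi_data_parser import Parser

    parser = Parser(names_file="taxonomy/names.dmp",
                    nodes_file="taxonomy/nodes.dmp")
    tax_id = parser.get_id_from_name("Senecio vulgaris")
    rank = parser.get_rank(tax_id)          # e.g. 'species'
    name = parser.get_name_from_id(tax_id)  # scientific name for the id
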

get_downtorank_id(tax_id, downtorank='species')[source]

Recursive function to find the parent id of a taxon as defined by downtorank.

get_id_from_name(tax_name)[source]

Find the ID for a given taxonomic name.

get_id_from_synonym(tax_name)[source]

Find the ID for a given taxonomic name, which is not an accepted name.

get_name_from_id(tax_id)[source]

Find the scientific name for a given ID.

get_rank(tax_id)[source]

Get rank for given ncbi tax id.

match_id_to_mrca(tax_id, mrca_id)[source]

Recursive function to find out if tax_id is part of mrca_id.

physcraper.ncbi_data_parser.get_acc_from_blast(query_string)[source]

Get the accession number from a blast query.

Parameters:query_string – string that contains acc and gi from local blast query result
Returns:gb_acc

physcraper.ncbi_data_parser.get_gi_from_blast(query_string)[source]

Get the gi number from a blast query. Getting the accession is more difficult now, as new sequences do not always have a gi number; in that case the query changes.

If not available return None.

Parameters:query_string – string that contains acc and gi from local blast query result
Returns:gb_id if available
physcraper.ncbi_data_parser.get_ncbi_tax_id(handle)[source]

Get the taxon ID from NCBI. Only used for web queries.

Parameters:handle – NCBI read.handle
Returns:ncbi_id
physcraper.ncbi_data_parser.get_ncbi_tax_name(handle)[source]

Get the sp name from ncbi. Could be replaced by direct lookup to ott_ncbi.

Parameters:handle – NCBI read.handle
Returns:ncbi_spn
physcraper.ncbi_data_parser.get_tax_info_from_acc(gb_id, ids_obj)[source]

takes an accession number and returns the ncbi_id and the taxon name

physcraper.ncbi_data_parser.load_names(names_file)[source]

Loads names.dmp and converts it into a pandas.DataFrame. Includes only names which are accepted as scientific name by ncbi.

physcraper.ncbi_data_parser.load_nodes(nodes_file)[source]

Loads nodes.dmp and converts it into a pandas.DataFrame. Contains the information about the taxonomic hierarchy of names.

physcraper.ncbi_data_parser.load_synonyms(names_file)[source]

Loads names.dmp and converts it into a pandas.DataFrame. Includes only names which are viewed as synonym by ncbi.

physcraper.ncbi_data_parser.strip(inputstr)[source]

Strips blank characters from strings in the pd dataframe.

Work in progress to pull apart the linked tree and taxon objects from the alignment-based ATT object

class physcraper.treetaxon.TreeTax(otu_json, treefrom, schema='newick')[source]

Wraps up the key parts together; requires an OTT id, and names must already match.

write_labelled(label, path, norepeats=True, add_gb_id=False)[source]

Output the tree with human readable labels. Jumps through a bunch of hoops to make labels unique.

NOT MEMORY EFFICIENT AT ALL

Has different options available for different desired outputs

Parameters:
  • label – which information shall be displayed in labelled files: possible options: ‘^ot:ottTaxonName’, ‘^user:TaxonName’, “^ot:originalLabel”, “^ot:ottId”, “^ncbi:taxon”
  • treepath – optional: full file name (including path) for phylogeny
  • alnpath – optional: full file name (including path) for alignment
  • norepeats – optional: if there shall be no duplicate names in the labelled output files
  • add_gb_id – optional, to supplement tiplabel with corresponding GenBank sequence identifier
Returns:

writes out labelled phylogeny and alignment to file

physcraper.treetaxon.generate_TreeTax_from_run(workdir, start_files='output', tag=None)[source]

Build a Tree + Taxon object from the outputs of a run.

Returns:object of class TreeTax