How to run Physcraper

The easiest way to run Physcraper is with the command line tools, which let you specify arguments directly. A configuration file is written out for the sake of reproducibility.

Example Physcraper runs from the command line

Starting with only an OpenTree study and tree id

As input, you need at minimum a study id and a tree id for a tree uploaded to the OpenTree website (https://tree.opentreeoflife.org/curator). The --treebase flag (or -tb) will automatically download an alignment for that tree from TreeBASE.

physcraper_run.py [-s OPENTREE_STUDY_ID] [-t OPENTREE_TREE_ID] [-tb] [-o OUTPUT]

e.g.,

physcraper_run.py -s pg_55 -t tree5864 -tb -o pg_55_web

The output files generated by this example run are stored in “docs/examples/pg_55_web”.

Starting with an OpenTree study and tree id and an alignment

Alternatively, you can provide the gene alignment that you want to update using the -a flag:

physcraper_run.py [-s OPENTREE_STUDY_ID] [-t OPENTREE_TREE_ID] [-o OUTPUT] [-a ALIGNMENT] [-as ALIGNMENT_SCHEMA]

For example, to update a tree from Crous et al. 2012 using an alignment already downloaded from TreeBASE, you can do:

physcraper_run.py -s ot_350 -t Tr53297 -a docs/examples/inputdata/ot_350Tr53297.aln -as "nexus" -o ot_350

Starting with your own tree

If the tree you want to update is not posted to the OpenTree website, you need to match the labels on your tree to taxa using the OpenTree Bulk Taxonomic Name Resolution Service (TNRS). Download your matched names, unzip the folder, and pass the “json” file output by the OpenTree Bulk TNRS tool as the --taxon_info (or -ti) argument:

physcraper_run.py [-tf TREE_FILE] [-tfs TREEFILE_SCHEMA] [-a ALIGNMENT] [-as ALIGNMENT_SCHEMA] [-ti TAXON_INFO_JSONFILE] [-o OUTPUT]

e.g.,

physcraper_run.py -tf tests/data/tiny_test_example/test.tre -tfs newick -a tests/data/tiny_test_example/test.fas -as fasta --taxon_info tests/data/tiny_test_example/main.json -o owndata

Checking the inputs before a full run

Use the flag -no_est to simply download a tree from OpenTree and the corresponding alignment from TreeBASE. This will not run the BLAST and tree estimation steps:

physcraper_run.py -s pg_55 -t tree5864 -tb -no_est -o pg_55_C

To initiate a full Physcraper run from that tree and alignment, simply remove the -no_est flag. Physcraper will reload the inputs from the specified output directory and reuse the configuration settings that were automatically written out to “OUTPUT_run.config”.

The -re flag will re-run a Physcraper cycle on a given output directory. If the initial or previous run completed, it will use the final output tree and alignment as input. If the run was not completed, it will reload the original input files.

physcraper_run.py -re pg_55_C -o pg_55_C

You can also re-run with a different configuration file:

physcraper_run.py -re pg_55_C/ -c alt_config -o pg_55_D

Configuration parameters

To see all the configuration parameters, use physcraper_run.py -h.

The configuration parameters may be set in a configuration file, and then passed into the analysis run. See file “example.config” for an example.

-c CONFIGFILE, --configfile CONFIGFILE
 Gives the path to the configuration file

If a config file input is combined with command line configuration parameters, the command line values will override those in the configuration file.
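The precedence rule above can be pictured as a dictionary merge in which command line values win. This is only an illustration of the rule, not Physcraper's actual implementation; the function name merge_settings and the use of None to mean "flag not given" are assumptions made for the sketch.

```python
def merge_settings(config_file_values, command_line_values):
    """Return settings where command line values override config file values."""
    merged = dict(config_file_values)
    for key, value in command_line_values.items():
        # None stands for "flag not given", so the config file value is kept.
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical values: hitlist_size is given on the command line, num_threads is not.
from_file = {"hitlist_size": 20, "num_threads": 8}
from_cli = {"hitlist_size": 50, "num_threads": None}
print(merge_settings(from_file, from_cli))
```

In this toy case the merged settings take hitlist_size from the command line and num_threads from the configuration file.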

The configuration settings for the current run are written to standard output and saved in the output directory as “run.config”, e.g.,

[blast]
Entrez.email = None
e_value_thresh = 1e-05
hitlist_size = 20
location = local
localblastdb = /home/projects/ncbi/localblastdb/
url_base = None
num_threads = 8
delay = 90
[physcraper]
spp_threshold = 3
seq_len_perc = 0.8
trim_perc = 0.8
min_len = 0.8
max_len = 1.2
taxonomy_path = /home/projects/physcraper/taxonomy
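A saved configuration in this layout can be inspected with Python's standard configparser module, for instance to check which settings a previous run used. The sketch below assumes the file is plain INI as shown above and parses an abbreviated copy held in a string; to read a real file you would call parser.read() with its path instead.

```python
import configparser

# Abbreviated copy of the example configuration shown above.
EXAMPLE = """\
[blast]
Entrez.email = None
e_value_thresh = 1e-05
hitlist_size = 20
location = local
[physcraper]
spp_threshold = 3
min_len = 0.8
max_len = 1.2
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep key case, e.g. "Entrez.email"
parser.read_string(EXAMPLE)

e_value = parser.getfloat("blast", "e_value_thresh")
hitlist_size = parser.getint("blast", "hitlist_size")
email = parser.get("blast", "Entrez.email")  # the literal string "None"
max_len = parser.getfloat("physcraper", "max_len")

print(e_value, hitlist_size, email, max_len)
```

Note that configparser returns every value as text unless a typed getter is used, so the unset email comes back as the string "None" rather than Python's None.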

Input Data

Tree information (required):

-s STUDY_ID, --study_id STUDY_ID
 OpenTree study id
-t TREE_ID, --tree_id TREE_ID
 OpenTree tree id

OR

-tf TREE_FILE, --tree_file TREE_FILE

A name (and path) to a tree file.

-tfs {newick,nexus}, --tree_schema {newick,nexus}

Tree file format schema.

-ti FILE_NAME, --taxon_info FILE_NAME

Name (and path) of a taxon info file from an OpenTree TNRS run.

Alignment information (required):

-a ALIGNMENT, --alignment ALIGNMENT
 Gives the path to alignment file
-as ALN_SCHEMA, --aln_schema ALN_SCHEMA
 Specifies the alignment schema, one of nexus or fasta

OR

-tb , --treebase

Downloads alignment from TreeBASE.


Tree and alignment information are required. After an analysis has been run, they can be reloaded from the output directory of a previous run.

-re RELOAD_FILES, --reload_files RELOAD_FILES

Reloads files and configuration from the output directory specified in -o, --output.

REQUIRED:

-o OUTPUT, --output OUTPUT
 Specifies the path to output directory
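The input rules above (one tree source, one alignment source, and an output directory) can be sketched as a small Python helper that assembles the argument list for physcraper_run.py. The function build_physcraper_args and its keyword names are hypothetical and not part of Physcraper; only the flags themselves come from this page.

```python
def build_physcraper_args(output, study_id=None, tree_id=None,
                          tree_file=None, tree_schema=None, taxon_info=None,
                          alignment=None, aln_schema=None, treebase=False):
    """Assemble an argument list for physcraper_run.py (hypothetical helper)."""
    args = ["physcraper_run.py"]

    # Tree information: an OpenTree study/tree id pair OR a local tree file.
    if study_id and tree_id:
        args += ["-s", study_id, "-t", tree_id]
    elif tree_file and tree_schema and taxon_info:
        args += ["-tf", tree_file, "-tfs", tree_schema, "-ti", taxon_info]
    else:
        raise ValueError("give -s and -t, or -tf, -tfs and -ti")

    # Alignment information: a local alignment file OR a TreeBASE download.
    if alignment and aln_schema:
        args += ["-a", alignment, "-as", aln_schema]
    elif treebase:
        args.append("-tb")
    else:
        raise ValueError("give -a and -as, or -tb")

    # The output directory is always required.
    args += ["-o", output]
    return args


print(" ".join(build_physcraper_args("pg_55_web", study_id="pg_55",
                                     tree_id="tree5864", treebase=True)))
```

The returned list can be passed to subprocess.run() once Physcraper is installed and physcraper_run.py is on the PATH.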

Optional:

-st SEARCH_TAXON, --search_taxon SEARCH_TAXON

Specifies the taxonomic id to constrain the BLAST search. Format ott:123 or ncbi:123. By default, it will use the ingroup of the tree from OpenTree, or the MRCA of all tips, if the former is not specified.
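Because the search taxon must follow the ott:123 or ncbi:123 pattern, it can be worth checking the value before launching a long run. The helper below is a hypothetical pre-flight check written for this page, not a Physcraper function, and it assumes the numeric part consists of digits only.

```python
import re

# Accepts "ott:<digits>" or "ncbi:<digits>" and nothing else.
TAXON_ID = re.compile(r"(ott|ncbi):(\d+)")


def parse_search_taxon(value):
    """Split a search taxon string into (source, numeric id)."""
    match = TAXON_ID.fullmatch(value)
    if match is None:
        raise ValueError(f"expected ott:123 or ncbi:123, got {value!r}")
    return match.group(1), int(match.group(2))


print(parse_search_taxon("ott:123"))
print(parse_search_taxon("ncbi:123"))
```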

BLAST search parameters

-e EMAIL, --email EMAIL

An email address for BLAST searches.

-r , --repeat

Repeats a BLAST search until no more sequences are found.
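The idea behind --repeat is a loop that stops once a search round returns nothing new. The sketch below shows that stopping rule with a toy search function over made-up sequence ids; it does not call BLAST, and the names repeat_search, toy_search, and TOY_HITS are hypothetical.

```python
def repeat_search(start_ids, search):
    """Call search() until it finds no ids that are not already known."""
    known = set(start_ids)
    rounds = 0
    while True:
        found = search(known) - known
        if not found:
            return known, rounds
        known |= found
        rounds += 1


# Toy stand-in for a BLAST search: each id "hits" a fixed set of other ids.
TOY_HITS = {"seqA": {"seqB"}, "seqB": {"seqC"}, "seqC": set()}


def toy_search(ids):
    return set().union(*(TOY_HITS.get(i, set()) for i in ids))


all_ids, rounds = repeat_search({"seqA"}, toy_search)
print(sorted(all_ids), rounds)
```

Here the loop adds seqB in the first round and seqC in the second, then stops because the third search finds nothing new.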

-ev E-VALUE, --eval E-VALUE

Specifies a BLAST e-value cutoff.

-hl HITLIST_LENGTH, --hitlist_len HITLIST_LENGTH

Specifies the number of BLAST hits to save per taxon.


You can use a local BLAST database. To set one up, see the Local Databases section of this documentation.

-db BLAST_DB, --blast_db BLAST_DB

Specifies the path to a local download of a BLAST database.

-nt NUM_THREADS, --num_threads NUM_THREADS

Specifies the number of threads to use in processing.


You can also use your own BLAST database, for example one set up on an AWS server.

Sequence filtering parameters

-tp TRIM_PERC, --trim_perc TRIM_PERC

Minimum proportion of sequences that must be present at the ends of the alignment; positions at either end covered by fewer sequences are trimmed.

-rlmax RELATIVE_LENGTH_MAX, --relative_length_max RELATIVE_LENGTH_MAX

Maximum relative length of added sequences, compared to input alignment length (BLAST matches not within length cutoffs are stored in "outputs/seqlen_mismatch.txt").

-rlmin RELATIVE_LENGTH_MIN, --relative_length_min RELATIVE_LENGTH_MIN

Minimum relative length of added sequences, compared to input alignment length (BLAST matches not within length cutoffs are stored in "outputs/seqlen_mismatch.txt").
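Together, the two relative length options define a window around the input alignment length. As a rough sketch, assuming the bounds are inclusive and that the reference is a single length value (how Physcraper measures that length internally is not specified here), the check looks like this. The function name is hypothetical; the defaults mirror the min_len and max_len values in the example configuration.

```python
def within_relative_length(seq_len, ref_len, rel_min=0.8, rel_max=1.2):
    """True if seq_len falls inside [rel_min, rel_max] times ref_len."""
    return ref_len * rel_min <= seq_len <= ref_len * rel_max


ref_len = 1000  # hypothetical input alignment length
for seq_len in (700, 900, 1300):
    status = "kept" if within_relative_length(seq_len, ref_len) else "rejected"
    print(seq_len, status)
```

With a reference length of 1000, a 900-base match is kept, while 700-base and 1300-base matches fall outside the window and would be the kind of hit reported in the length mismatch file.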

-spn SPECIES_NUMBER, --species_number SPECIES_NUMBER

Maximum number of sequences to include per species.
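The per-species cap can be illustrated with a short filter over (species, sequence id) pairs. This sketch simply keeps the first sequences it encounters for each species; how Physcraper actually chooses which sequences to keep is not described here, and the function name and data are made up for the example.

```python
from collections import defaultdict


def cap_per_species(sequences, species_number):
    """Keep at most species_number sequences for each species, in input order."""
    kept = []
    counts = defaultdict(int)
    for species, seq_id in sequences:
        if counts[species] < species_number:
            counts[species] += 1
            kept.append((species, seq_id))
    return kept


# Hypothetical BLAST results: three hits for one species, one for another.
hits = [("sp_one", "id1"), ("sp_one", "id2"), ("sp_two", "id3"), ("sp_one", "id4")]
print(cap_per_species(hits, 2))
```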

-de DELAY, --delay DELAY

How much time to wait before blasting the same sequence again.


Tree search parameters

-no_est , --no_estimate_tree

Does not estimate the tree, just gathers the sequences and aligns them.

-bs BOOTSTRAP_REPS, --bootstrap_reps BOOTSTRAP_REPS

Number of bootstrap repetitions.


Internal arguments

-tx TAXONOMY, --taxonomy TAXONOMY

A path to the OpenTree Taxonomy (OTT) database.