How to run Physcraper
The easiest way to run Physcraper is with the command line tools, which let you specify arguments directly. A configuration file is written out for each run for the sake of reproducibility.
Example Physcraper runs from the command line
Starting with only an OpenTree study and tree id
As input, you need at minimum the study id and tree id of a tree uploaded to the OpenTree website (https://tree.opentreeoflife.org/curator). The --treebase flag (or -tb) automatically downloads an alignment for that tree from TreeBASE.
physcraper_run.py [-s OPENTREE_STUDY_ID] [-t OPENTREE_TREE_ID] [-tb] [-o OUTPUT]
e.g.,
physcraper_run.py -s pg_55 -t tree5864 -tb -o pg_55_web
The output files generated by this example run are stored in “docs/examples/pg_55_web”.
Starting with an OpenTree study and tree id and an alignment
Alternatively, you can provide the gene alignment that you want to update using the -a flag:
physcraper_run.py [-s OPENTREE_STUDY_ID] [-t OPENTREE_TREE_ID] [-o OUTPUT] [-a ALIGNMENT] [-as ALIGNMENT_SCHEMA]
For example, to update a tree from Crous et al. 2012 using an alignment already downloaded from TreeBASE, you can do:
physcraper_run.py -s ot_350 -t Tr53297 -a docs/examples/inputdata/ot_350Tr53297.aln -as "nexus" -o ot_350
Starting with your own tree
If the tree you want to update is not posted to the OpenTree website, you need to match the labels on your tree to taxa using the OpenTree Bulk Taxonomic Name Resolution Service (TNRS). Download your matched names, unzip the folder, and pass the “json” file output by the OpenTree Bulk TNRS tool as the --taxon_info (or -ti) argument:
physcraper_run.py [-tf TREE_FILE] [-tfs TREEFILE_SCHEMA] [-a ALIGNMENT] [-as ALIGNMENT_SCHEMA] [-ti TAXON_INFO_JSONFILE] [-o OUTPUT]
e.g.,
physcraper_run.py -tf tests/data/tiny_test_example/test.tre -tfs newick -a tests/data/tiny_test_example/test.fas -as fasta --taxon_info tests/data/tiny_test_example/main.json -o owndata
Checking the inputs before a full run
Use the -no_est flag to download a tree from OpenTree and the corresponding alignment from TreeBASE without running the BLAST and tree estimation steps:
physcraper_run.py -s pg_55 -t tree5864 -tb -no_est -o pg_55_C
To initiate a full Physcraper run from that tree and alignment, remove the -no_est flag.
The run re-loads the inputs from the specified output directory and uses the same configuration settings, which are automatically written out to “OUTPUT_run.config”.
The -re flag re-runs a Physcraper cycle on a given output directory.
If the initial or previous run completed, it uses the final output tree and alignment as input.
If the run did not complete, it reloads the original input files.
physcraper_run.py -re pg_55_C -o pg_55_C
You can also re-run with a different configuration file:
physcraper_run.py -re pg_55_C/ -c alt_config -o pg_55_D
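The check-then-run workflow above can also be scripted. The following is a minimal sketch that only builds the argument lists, using the flags shown in the examples above; the helper name `physcraper_cmd` is hypothetical and not part of Physcraper.

```python
import shlex


def physcraper_cmd(study_id, tree_id, output, check_only=False):
    """Build the argument list for a run that takes its alignment from TreeBASE.

    Hypothetical helper for illustration; it only assembles flags that
    appear in the examples above.
    """
    cmd = ["physcraper_run.py", "-s", study_id, "-t", tree_id, "-tb", "-o", output]
    if check_only:
        # -no_est downloads the inputs without BLAST or tree estimation
        cmd.append("-no_est")
    return cmd


check = physcraper_cmd("pg_55", "tree5864", "pg_55_C", check_only=True)
full = physcraper_cmd("pg_55", "tree5864", "pg_55_C")

# Print the commands so they can be inspected before anything is executed
print(shlex.join(check))
print(shlex.join(full))
```

To actually execute a command, each list can be passed to `subprocess.run(cmd, check=True)`, assuming `physcraper_run.py` is on your PATH.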
Configuration parameters
To see all the configuration parameters, use physcraper_run.py -h.
The configuration parameters may be set in a configuration file, and then passed into the analysis run. See file “example.config” for an example.
- -c CONFIGFILE, --configfile CONFIGFILE
Gives the path to the configuration file.
If a config file input is combined with command line configuration parameters, the command line values will override those in the configuration file.
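The precedence rule can be pictured as a simple merge. This is only an illustrative sketch of the rule as stated, not Physcraper's own code; the function name and the use of `None` to mean "flag not given" are assumptions.

```python
def effective_settings(config_values, cli_values):
    """Return the config file values with explicitly given CLI values on top.

    Illustrative only: a CLI value of None is assumed to mean the flag
    was not passed on the command line.
    """
    merged = dict(config_values)
    merged.update({key: val for key, val in cli_values.items() if val is not None})
    return merged


config_values = {"hitlist_size": 20, "num_threads": 8}
cli_values = {"hitlist_size": 50, "num_threads": None}  # only the hitlist size was given

settings = effective_settings(config_values, cli_values)
print(settings)
```

Here the command line value replaces `hitlist_size`, while `num_threads` keeps the value from the configuration file.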
The configuration settings for the current run are written to standard out, and saved in the output directory as “run.config”, e.g.,
[blast]
Entrez.email = None
e_value_thresh = 1e-05
hitlist_size = 20
location = local
localblastdb = /home/projects/ncbi/localblastdb/
url_base = None
num_threads = 8
delay = 90
[physcraper]
spp_threshold = 3
seq_len_perc = 0.8
trim_perc = 0.8
min_len = 0.8
max_len = 1.2
taxonomy_path = /home/projects/physcraper/taxonomy
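Because the saved settings use an INI-style layout of sections and key = value pairs, they can be inspected with Python's standard `configparser` module. The sketch below parses an abbreviated copy of the example above from a string; reading a real file this way assumes it follows the same layout.

```python
import configparser

# Abbreviated copy of the example configuration shown above
EXAMPLE = """\
[blast]
e_value_thresh = 1e-05
hitlist_size = 20
location = local
num_threads = 8
delay = 90

[physcraper]
spp_threshold = 3
min_len = 0.8
max_len = 1.2
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)  # for a file on disk, use parser.read(path) instead

# Values are stored as strings; the typed getters convert them
e_value = parser.getfloat("blast", "e_value_thresh")
hitlist = parser.getint("blast", "hitlist_size")
min_len = parser.getfloat("physcraper", "min_len")
max_len = parser.getfloat("physcraper", "max_len")

print(e_value, hitlist, min_len, max_len)
```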
Input Data
Tree information (required):
- -s STUDY_ID, --study_id STUDY_ID
OpenTree study id.
- -t TREE_ID, --tree_id TREE_ID
OpenTree tree id.
OR
- -tf TREE_FILE, --tree_file TREE_FILE
A name (and path) to a tree file.
- -tfs {newick,nexus}, --tree_schema {newick,nexus}
Tree file format schema.
- -ti FILE_NAME, --taxon_info FILE_NAME
Name (and path) of a taxon info file from an OpenTree TNRS run.
Alignment information (required):
- -a ALIGNMENT, --alignment ALIGNMENT
Gives the path to the alignment file.
- -as ALN_SCHEMA, --aln_schema ALN_SCHEMA
Specifies the alignment schema, one of nexus or fasta.
OR
- -tb , --treebase
Downloads alignment from TreeBASE.
Tree and alignment information are required. After an analysis has been run, they can instead be reloaded from the output directory of a previous run.
- -re RELOAD_FILES, --reload_files RELOAD_FILES
Reloads files and configuration from the output directory specified in -o, --output.
REQUIRED:
- -o OUTPUT, --output OUTPUT
Specifies the path to the output directory.
Optional:
- -st SEARCH_TAXON, --search_taxon SEARCH_TAXON
Specifies the taxonomic id used to constrain the BLAST search, in the format ott:123 or ncbi:123. By default, it will use the ingroup of the tree from OpenTree, or the MRCA of all tips if no ingroup is specified.
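As an illustration of the two accepted id formats, a value such as ott:123 or ncbi:123 could be checked before it is passed to -st. The function below is a hypothetical sketch and is not how Physcraper itself parses the argument.

```python
def parse_search_taxon(value):
    """Split 'ott:123' or 'ncbi:123' into a (source, numeric id) pair.

    Hypothetical validation helper for illustration only.
    """
    source, sep, number = value.partition(":")
    if not sep or source not in ("ott", "ncbi") or not number.isdigit():
        raise ValueError(f"expected ott:<id> or ncbi:<id>, got {value!r}")
    return source, int(number)


print(parse_search_taxon("ott:123"))
print(parse_search_taxon("ncbi:123"))
```

A value without a recognized prefix, such as a bare number, raises `ValueError` in this sketch.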
BLAST search parameters
- -e EMAIL, --email EMAIL
An email address for BLAST searches.
- -r , --repeat
Repeats a BLAST search until no more sequences are found.
- -ev E-VALUE, --eval E-VALUE
Specifies a blast e-value cutoff.
- -hl HITLIST_LENGTH, --hitlist_len HITLIST_LENGTH
Specifies the number of BLAST searches to save per taxon.
You can use a local BLAST database. To set one up, see the Local Databases section of this documentation.
- -db BLAST_DB, --blast_db BLAST_DB
Specifies the local download of a BLAST database.
- -nt NUM_THREADS, --num_threads NUM_THREADS
Specifies the number of threads to use in processing.
You can use your own BLAST database, for example set up on an AWS server.
Sequence filtering parameters
- -tp TRIM_PERC, --trim_perc TRIM_PERC
Minimum percentage of sequences that must be present at the ends of the alignment; end columns covered by fewer sequences are trimmed.
- -rlmax RELATIVE_LENGTH_MAX, --relative_length_max RELATIVE_LENGTH_MAX
Maximum relative length of added sequences, compared to input alignment length (BLAST matches not within length cutoffs are stored in "outputs/seqlen_mismatch.txt").
- -rlmin RELATIVE_LENGTH_MIN, --relative_length_min RELATIVE_LENGTH_MIN
Minimum relative length of added sequences, compared to input alignment length (BLAST matches not within length cutoffs are stored in "outputs/seqlen_mismatch.txt").
- -spn SPECIES_NUMBER, --species_number SPECIES_NUMBER
Maximum number of sequences to include per species.
- -de DELAY, --delay DELAY
How much time to wait before blasting the same sequence again.
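The relative length cutoffs above amount to a simple ratio test. The sketch below is an illustration under stated assumptions, not Physcraper's implementation: it treats the input alignment length as a single number, makes both bounds inclusive, and borrows the defaults 0.8 and 1.2 from the min_len and max_len values in the example configuration.

```python
def within_length_cutoffs(seq_len, aln_len, rel_min=0.8, rel_max=1.2):
    """Return True if a sequence's length relative to the alignment is in range.

    Assumptions (not taken from Physcraper's code): aln_len is one
    reference length and both bounds are inclusive.
    """
    ratio = seq_len / aln_len
    return rel_min <= ratio <= rel_max


aln_len = 1000
for seq_len in (750, 900, 1300):
    status = "kept" if within_length_cutoffs(seq_len, aln_len) else "length mismatch"
    print(seq_len, status)
```

With these values, only the 900-base sequence falls inside the cutoffs; the other two are the kind of BLAST matches that would be recorded in "outputs/seqlen_mismatch.txt".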
Tree search parameters
- -no_est , --no_estimate_tree
Does not estimate the tree, just gathers the sequences and aligns them.
- -bs BOOTSTRAP_REPS, --bootstrap_reps BOOTSTRAP_REPS
Number of bootstrap repetitions.
Internal arguments
- -tx TAXONOMY, --taxonomy TAXONOMY
A path to the OpenTree Taxonomy (OTT) database.