Workspace

Note

Important: This section assumes the project name is ‘example’, while actual projects will likely use different names.

This section describes how phyddle organizes files and directories in its workspace. Visit Formats to learn more about file formats. Visit Configuration to learn more about managing directories and projects within a workspace.

By default, phyddle saves work from its pipeline steps to the workspace directory. Briefly, the workspace directory contains six subdirectories: one for each of the five pipeline steps, plus one directory for logs:

  • simulate contains raw data generated by simulation

  • format contains data formatted into tensors for training networks

  • train contains trained networks and diagnostics

  • estimate contains new test datasets and their estimates

  • plot contains figures of training and validation procedures

  • log contains runtime logs for a phyddle project

This section assumes all steps use the example project bundled with phyddle, which was generated using the command

./scripts/run_phyddle.sh --cfg config --proj example --end_idx 25000

This corresponds to a 3-region equal-rates GeoSSE model. All directories contain the complete file set, except ./workspace/simulate/example, which contains only 20 of the original examples.

A standard configuration for a project named example would store pipeline work into these directories:

workspace/simulate/example       # output of Simulate step
workspace/format/example         # output of Format step
workspace/train/example          # output of Train step
workspace/estimate/example       # output of Estimate step
workspace/plot/example           # output of Plot step
workspace/log/example            # analysis logs

Next, we give an overview of the standard files and formats corresponding to each pipeline directory.

simulate

The Simulate step generates raw data from a simulation model; these raw data cannot yet be fed to the neural network for training. A typical simulation produces the following files:

workspace/simulate/example/sim.0.tre              # Newick string
workspace/simulate/example/sim.0.dat.nex          # Nexus file
workspace/simulate/example/sim.0.param_row.csv    # data-generating params
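As a sketch of how the data-generating parameters might be read back from a param_row.csv file, the snippet below assumes one header row of parameter names followed by one row of values (the exact layout and parameter names here are illustrative assumptions, not phyddle's specification):

```python
import csv
import io

# Hypothetical contents of a sim.0.param_row.csv file: one header row of
# parameter names, one row of data-generating values (layout assumed).
example = "w_0,e_0,d_0_1,b_0_1\n0.29,0.02,0.45,0.06\n"

reader = csv.DictReader(io.StringIO(example))
params = {name: float(value) for name, value in next(reader).items()}
print(params)
```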

format

Applying Format to a directory of simulated datasets outputs tensors containing the entire set of training examples, stored to, e.g., workspace/format/example. Which formatted files are created depends on the values of tensor_format and tree_width.

When tree_width is set to 200, Format will yield two simulated dataset tensors: one for the training examples and another for the test examples.

If the tensor_format setting is 'csv' (Comma-Separated Value, or CSV format), the formatted files are:

test.nt200.phy_data.csv
test.nt200.aux_data.csv
test.nt200.labels.csv
train.nt200.phy_data.csv
train.nt200.aux_data.csv
train.nt200.labels.csv

where the phy_data.csv files contain one flattened Compact Phylogenetic Vector + States (CPV+S) entry per row, the aux_data.csv files contain one vector of auxiliary data (summary statistics and known parameters) per row, and labels.csv contains one vector of labels (estimated parameters) per row. Each row of each CSV file corresponds to a single, matched simulated training example. All files are stored in standard comma-separated value format, making them easily read by standard CSV-reading functions.
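The row-matching described above can be illustrated with toy stand-ins for the three CSV files (the column names and whether phy_data.csv carries a header are assumptions for this sketch; real files have one row per simulated example and far wider rows):

```python
import csv
import io

# Toy stand-ins for the three matched tensors: two examples, tiny widths.
phy_data = "1.0,0.5,0.0,1.0\n0.8,0.3,1.0,0.0\n"         # no header assumed
aux_data = "ntaxa,tree_height\n10,1.2\n8,0.9\n"
labels   = "w_0,e_0\n0.29,0.02\n0.31,0.05\n"

phy_rows   = list(csv.reader(io.StringIO(phy_data)))
aux_rows   = list(csv.DictReader(io.StringIO(aux_data)))
label_rows = list(csv.DictReader(io.StringIO(labels)))

# Row i of each file describes the same simulated example.
for phy, aux, lab in zip(phy_rows, aux_rows, label_rows):
    print(len(phy), aux["ntaxa"], lab["w_0"])
```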

If the tensor_format setting is 'hdf5', the resulting files are:

test.nt200.hdf5
train.nt200.hdf5

where each HDF5 file contains all phylogenetic-state (CPV+S) data, auxiliary data, and label data. Individual simulated training examples share the same ordering across the three internal datasets stored in the file. HDF5 format is not as easily readable as CSV format. However, phyddle uses gzip to automatically (de)compress records, which often yields files over twenty times smaller than the equivalent uncompressed CSV tensors.
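The size savings come from gzip compressing the highly repetitive numeric text well. A minimal illustration with synthetic CSV-like data (the mock tensor here is an assumption; real compression ratios depend on the actual data):

```python
import gzip

# Mock a 1000-row flattened tensor of repetitive CSV text, then gzip it to
# illustrate why compressed tensors can be far smaller than plain CSV.
csv_bytes = ("0.0," * 199 + "0.0\n").encode() * 1000
compressed = gzip.compress(csv_bytes)

ratio = len(csv_bytes) / len(compressed)
print(f"{len(csv_bytes)} -> {len(compressed)} bytes ({ratio:.0f}x smaller)")
```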

train

Training a network creates the following files in the workspace/train/example directory:

network_nt200.cpi_adjustments.csv
network_nt200.hdf5
network_nt200.train_aux_data_norm.csv
network_nt200.train_est.csv
network_nt200.train_est.labels.csv
network_nt200.train_history.json
network_nt200.train_label_est_nocalib.csv
network_nt200.train_label_norm.csv
network_nt200.train_true.labels.csv

For example, the network prefix sim_batchsize128_numepoch20_nt500 indicates a network trained with a batch size of 128 samples for 20 epochs on the tree-width size category of max. 500 taxa.

Descriptions of the files are as follows, with train_prefix omitted for brevity:

  • network.hdf5 contains a saved copy of the trained neural network that can be loaded by TensorFlow

  • train_label_norm.csv and train_aux_data_norm.csv contain the location-scale values from the training dataset used to (de)normalize the labels and auxiliary data of any dataset

  • train_true.labels.csv contains the true label values for the training and test datasets, where columns correspond to estimated labels (e.g. model parameters)

  • train_est.labels.csv contains the trained network's label estimates for the training and test datasets, with calibrated prediction intervals, where columns correspond to point estimates and lower/upper CPI bounds for each named label (e.g. model parameter)

  • train_label_est_nocalib.csv contains the trained network's label estimates for the training and test datasets, with uncalibrated prediction intervals

  • train_history.json contains the metrics monitored across training epochs

  • cpi_adjustments.csv contains calibrated prediction interval adjustments, where columns correspond to parameters, the first row contains lower-bound adjustments, and the second row contains upper-bound adjustments
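To make the roles of the normalization and CPI files concrete, the sketch below combines hypothetical location-scale values and CPI adjustments; the (location, scale) convention, the column layout, and all numbers are assumptions for illustration, not phyddle's exact implementation:

```python
# Hypothetical location-scale values per label, as in train_label_norm.csv.
label_norm = {"w_0": (0.25, 0.10)}  # (location, scale), assumed convention

# Hypothetical CPI adjustments per label: (lower, upper), as the two rows
# of cpi_adjustments.csv.
cpi_adjust = {"w_0": (-0.02, 0.03)}

def denormalize(label, z):
    """Map a normalized network output back to the original label scale."""
    loc, scale = label_norm[label]
    return z * scale + loc

def calibrated_interval(label, z_lo, z_hi):
    """Denormalize interval bounds, then apply the CPI adjustments."""
    adj_lo, adj_hi = cpi_adjust[label]
    return denormalize(label, z_lo) + adj_lo, denormalize(label, z_hi) + adj_hi

# Prints the calibrated (lower, upper) bounds for one label.
print(calibrated_interval("w_0", -1.0, 1.0))
```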

estimate

The Estimate step reads new (biological) datasets from the project directory, saves new intermediate files, and stores output estimates in the same directory, located at e.g. workspace/estimate/example:

new.1.tre               # input:             initial tree
new.1.dat.nex           # input:             character data
new.1.known_params.csv  # input:             params for aux. data (optional)
new.1.extant.tre        # intermediate:      pruned tree
new.1.phy_data.csv      # intermediate:      CPV+S tensor data
new.1.aux_data.csv      # intermediate:      aux. data tensor data
new.1.info.csv          # intermediate:      formatting info
new.1.network_nt200.est_labels.csv  # output: estimates

All files have previously been explained in the simulate, format, or train workspace sections, except for two.

The known_params.csv file is optional, and is used to provide “known” data-generating parameter values to the network for training, as part of the auxiliary dataset. If provided, it contains a row of names for known parameters followed by a row of respective values.
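A hypothetical known_params.csv might therefore look like the fragment below (the parameter names here are invented for illustration; actual names depend on the model):

```
sample_frac,root_age
0.5,25.0
```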

The est_labels.csv file reports the point estimates and lower and upper CPI estimates for all targeted parameters. Estimates for parameters appear across columns, where columns are grouped first by label (e.g. parameter) and then by statistic (e.g. value, lower-bound, upper-bound). For example:

$ cat new.1.sim_batchsize128_numepoch20_nt500.est_labels.csv
w_0_value,w_0_lower,w_0_upper,e_0_value,e_0_lower,e_0_upper,d_0_1_value,d_0_1_lower,d_0_1_upper,b_0_1_value,b_0_1_lower,b_0_1_upper
0.2867125345651129,0.1937433853918723,0.45733220552078013,0.02445545359384659,0.002880695707341881,0.10404499205878459,0.4502031713887769,0.1966340488593367,0.5147956690178682,0.06199703190510973,0.0015074254823161301,0.27544015163806645
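One way to parse this grouped column layout into a per-parameter structure (a sketch using a shortened version of the header and values above; the rsplit-based name parsing is an assumption that relies on statistic names not containing underscores):

```python
# Abbreviated header and row from an est_labels.csv file: columns come in
# (value, lower, upper) triples per parameter; values truncated for brevity.
header = "w_0_value,w_0_lower,w_0_upper,e_0_value,e_0_lower,e_0_upper"
row    = "0.2867,0.1937,0.4573,0.0245,0.0029,0.1040"

estimates = {}
for name, text in zip(header.split(","), row.split(",")):
    label, stat = name.rsplit("_", 1)   # e.g. "w_0", "value"
    estimates.setdefault(label, {})[stat] = float(text)

print(estimates["w_0"])  # {'value': 0.2867, 'lower': 0.1937, 'upper': 0.4573}
```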

plot

The Plot step generates visualizations for results previously generated by Format, Train, and (when available) Estimate.

est_CPI.pdf                       # results from Estimate step
density_{label,aux_data}.pdf      # densities from Simulate/Format steps
pca_contour_{label,aux_data}.pdf  # PCA of Simulate/Format steps
estimate_{test,train}_{param}.pdf # estimation accuracy from Train step
history.pdf                       # training history for entire network
history_param_{statistic}.pdf     # training history for each estimation target
network_architecture.pdf          # neural network architecture
summary.pdf                       # compiled report of all figures

Visit Pipeline to learn more about the files.