BioPhysCode
is a collection of three open-source biophysical simulation analysis tools developed jointly by Ryan Bradley, Joe Jordan, and David Slochower during their time at the University of Pennsylvania under the supervision of Ravi Radhakrishnan. It is currently under active development by Ryan and Joe.
Our goal is to provide all of our simulation and analysis tools to the larger research community. Each code works independently; however, they are designed as a single, coherent framework, so you can use them together to create novel, scalable biophysics experiments based on the popular GROMACS integrator. Contact the authors if you have questions, or open an issue on the GitHub pages listed below.
We recommend starting with either the quickstart guide or the list of validated codes.
FACTORY
a flexible framework for building and analyzing molecular simulation data
◾ source from github
◾ contact the authors for instructions
AUTOMACS
reproducible, extensible, shareable simulation protocols
◾ source from github
◾ online documentation
OMNICALC
transparent analysis pipeline for interpreting biophysical simulations
◾ source from github
◾ online documentation
Quickstart guide
Use cases
The following is the quickstart guide for BioPhysCode, which consists of a simulation package called automacs, a calculation package called omnicalc, and an interface called factory. This guide will cover the following use-cases.
- Install a factory. The factory is primarily used to organize different calculation “projects” by cloning omnicalc and managing the paths to your data. It also manages a virtual environment which is required for running calculations, creating complex simulations with automacs, and serving the web-based interface. We recommend starting with the factory, which takes less than three minutes to install on most machines.
- Make a standalone simulation with automacs according to a stock “experiment”. The authors provide many different simulation experiments, but the most common experiments will make an atomistic protein system (Joe’s favorite use-case) or a coarse-grained protein-membrane system (Ryan’s favorite use-case). This is best for users who are comfortable using the terminal and want to quickly generate new simulations.
- Analyze preexisting simulation data from GROMACS, NAMD, or even custom mesoscale membrane simulations. This use-case is ideal for users who wish to use our analysis codes on simulations they have already created.
- Run and analyze simulations in a single pipeline which can be served over a web-based interface. This is the most scalable use-case, and is best for users who wish to run many simulations and calculations that are already available in automacs and omnicalc.
- Set up a factory inside a docker to either test the code, reproduce a specific use-case from another user, or securely serve a factory over the internet. Docker is effectively a sandbox which provides more control over the software environment and a secure way to serve many web interfaces from one machine. Additionally, the dockerfiles we provide can be used as a blueprint for building the software environment for a workstation you wish to use to run simulations outside of the docker.
1. Install a factory
Install a factory to supply software for advanced automacs simulations, or to organize your analysis of pre-made simulations.
The factory code does three things.
- It supplies a python virtual environment for analysis and more complicated model-building. A simulation of a protein in water only requires GROMACS, but bilayer construction also requires Python libraries like SciPy.
- It organizes instances of omnicalc into “projects” each of which can be used to analyze a set of simulations.
- It creates a graphical interface to run and analyze simulations over the web.
There are two ways to install the factory. Users with a healthy Linux environment can perform a “simple” installation by downloading a copy of Miniconda and installing it with the commands below. Users who wish to run the factory in Docker should use the instructions in the docker section below. We recommend the simple installation if at all possible; Docker is useful primarily as a web server and as a method for ensuring that you know exactly how to set up a new Linux box in order to run the simple installation yourself. Note that in the following example, I have saved a copy of the Miniconda installer in the factory folder after I cloned it.
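The exact installation commands appear in the original guide and are echoed by make help; the following is only a rough sketch of the sequence, in which the repository URL and the Miniconda installer name are assumptions.

```bash
# hedged sketch of the simple installation; `make help` in your clone lists the authoritative steps
git clone https://github.com/biophyscode/factory
cd factory
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
make help   # prints the remaining commands for building the environment
```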
Setup should take about three minutes. The commands listed above were also given by the make help
command so you don’t have to remember all of these steps next time. Note that make help
will always be up to date, in the unlikely event that we change parts of the code. Once you build the factory, you can source the included environment from anywhere by running source factory/env/bin/activate py2
as long as you use the right path to the factory.
Anytime you set up an omnicalc project in the factory (see the analysis guide), all of its make
commands will automatically source the factory environment before they run, so you don’t have to remember this step for analysis.
Automacs simulations can be run from anywhere, but you have to remember to source the python environment for complicated use cases, for example bilayer construction. I prefer to make a shortcut to the source command so I can call it from other locations easily by adding alias py2="source /home/username/factory/env/bin/activate py2"
to my ~/.bashrc
file so that I can run py2
before starting simulations.
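For reference, the shortcut described above amounts to one line in your shell profile (the path is the example path used in this guide; substitute your own factory location).

```bash
# add to ~/.bashrc so that `py2` activates the factory environment
alias py2="source /home/username/factory/env/bin/activate py2"
```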
2. Standalone AUTOMACS quickstart guide
Run an “automatic GROMACS” simulation of a coarse-grained protein on a bilayer.
Requirements. To complete these commands you will need a working copy of GROMACS available at the terminal. The bilayer builder depends on SciPy as well, which you can install yourself or install with the factory according to the factory guide above.
You can clone automacs and the associated modules with the following commands.
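The original command block belongs here; a hedged outline, assuming the repository lives at github.com/biophyscode/automacs, looks roughly like the following.

```bash
# sketch of the clone-and-list sequence (the URL and the gromacs_config argument are guesses)
git clone https://github.com/biophyscode/automacs
cd automacs
make gromacs_config local   # write a machine configuration here ("local") or to your home directory
make prep v                 # the final command: list the available experiments, verbosely
```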
The final command lists all of the available simulation “experiments”. The make gromacs_config
command can write an automacs configuration file to either your home directory or the automacs root folder. This configuration file will help you load the necessary software on the cluster, or set the number of processors for the integrator. The v
flag uses a verbose mode that includes tags and test dates. The colorful tags show the last time the tests were run, and serve as notes for the developers. The make prep v
command lists three types of “experiments”. We’ve highlighted the multi-step experiments (called “metaruns”) in the image below.
The make setup all
command tells automacs to download independent repositories containing different experiments. You can also make your own modular repositories and load them into automacs. The default modules are loaded by make setup
which can request one of several “kickstarters” that pull down several modules from github. The all
kickstarter is a good place to start. The instructions to run these experiments are found in the grey paths listed beside them. Because automacs uses “kickstarters” to load extra modules from github, it automatically detects any file that matches *_expts.py
and parses it for experiments, so your list of experiments might be different.
The easiest way to run an experiment is to use the following command.
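The guide's own command appears here in the original; judging from the flags described below, it plausibly takes a form like this (the go target and flag ordering are assumptions).

```bash
# hedged example: run the stock protein experiment, wiping old data and running in the background
make go protein clean back
```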
The optional clean
flag erases the previous simulation data in this copy of automacs, so be careful! This option is useful when adjusting parameters and restarting the simulation. The optional back
flag runs the simulation in the background. Automacs will prepare the necessary python scripts and run them for you. Note that the experiments listed above are composed of other modular experiments (which we call “runs”), hence the sequence is called a “metarun”. Automacs is designed to be modular so that different runs can be reused in other contexts. For example, you can build a bilayer, duplicate it in several directions, and then attach a protein to it by calling on these individual steps inside of the experiment file. This minimizes repetition in our codes.
Most users prefer to start with one of the following examples.
- protein will simulate a small protein in a water box. This is available with a smaller kickstarter invoked via make setup protein after you clone automacs.
- testset_protein_bilayer_flat will create a coarse-grained bilayer and attach a small helix to it. This requires the make setup all kickstarter after cloning, and also requires SciPy, provided by the factory or available at your terminal.
Both experiments obviously require a working copy of GROMACS. Other validated experiments are listed at the end of this guide. Users who wish to run coarse-grained simulations should check out the detailed coarse-grained bilayer guide for a more thorough walkthrough. The protein experiment is also the subject of a walkthrough that explains the experiment file, which includes all of the parameters necessary to design the simulation.
3. Analyzing simulations
Automacs makes it easy to replicate a simulation, but you may wish to repeat or modify a calculation on a pre-existing dataset. In this section we will outline this use-case, in which we “reproduce a calculation.” The core calculation code (called “omnicalc”) can be used directly, but it is more convenient to manage your analysis pipeline with an instance of the factory. It will ensure that you have the right python libraries and it will tell omnicalc where to find your data.
This use-case assumes that you have already run a simulation and created a structure and trajectory file for analysis. The pipeline documentation below will describe how you can use the factory to simulate and analyze the data more seamlessly, but oftentimes users have already generated data that they wish to analyze, hence this section represents the simplest way to run a calculation.
Start by making a factory. Once it’s ready, you should write a “connection” file to a subfolder called connections
. Note that whenever we refer to the location of a code, e.g. “make a subfolder in the factory,” we are referring to the place on disk where you cloned it. Choose a name for your project. This name will be used in several locations, so choose carefully! We are calling this one “proteins”, since we might include lots of simulations of different proteins.
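The connection file itself is supplied by the original guide; the block below is an illustrative stand-in that matches the description given later in this section, and every key name in it is an assumption about the omnicalc format rather than a verified schema.

```yaml
# hypothetical connection file: one spot named "sims" pointing at pre-made GROMACS data
proteins:
  repo: calc/PROJECT_NAME/calcs          # where the calculation codes will live
  post_spot: /home/username/post         # assumed key: where slices and results are written
  plot_spot: /home/username/plots        # assumed key: where figures are written
  spots:
    sims:
      route_to_data: /home/username/external_disk
      spot_directory: protein_activation_project
      regexes:
        top: '(.+)'                      # match any simulation folder name
        step: '(production)'             # trajectory parts live in a "production" subfolder
        part:
          xtc: 'md\.part([0-9]{4})\.xtc'
          trr: 'md\.part([0-9]{4})\.trr'
          edr: 'md\.part([0-9]{4})\.edr'
          tpr: 'md\.part([0-9]{4})\.tpr'
          structure: '(system|md\.part([0-9]{4}))\.(gro|pdb)'
```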
Save the above text to factory/connections/protein_connections.yaml
and then run the following command from the root directory.
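As the sharing section at the end of this guide makes explicit, the connection command is built from the project name and is run from the factory root.

```bash
make connect proteins
```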
When it’s complete, your omnicalc project will be set up at factory/calc/proteins
. We will run all of our commands from this folder. The repo
flag points to the folder where we will store our calculation codes. This folder will hold an independent git repository which you may later share with other users, who can use the repo
flag to clone your calculations over the web or secure shell. The calculations will be stored at factory/calc/proteins/calcs
. At the end of this guide we will explain how to share your calculations repository.
Importing simulation data
The “omnicalc” code is designed to read simulations created in the corresponding “automacs” code; however, it is easy to use the connection file above to point to GROMACS data that you’ve already generated. There is only one requirement: each simulation must have its own folder, and the trajectory “parts” must be in a subfolder. GROMACS generates several different output files, typically prefixed according to the deffnm
flag to the mdrun
program. The spots
section of the connection file can be used to point to multiple folders that contain source data. Each of these is a “spot” with its own path. In the example above, we have a single spot called sims
which points to simulation folders at /home/username/external_disk/protein_activation_project
. The remainder of the connection file tells omnicalc how to find the various trajectory (xtc
and trr
), energy (edr
), and structure (gro
or pdb
) files using regular expressions. The top
key has a generic regular expression '(.+)'
that will match any simulation name; however, you can change it to match only certain folder names. If you used deffnm
correctly, then the connection file above will help you locate trajectory files with paths like /home/username/external_disk/protein_activation_project/wild_type_rep1/production/md.part0009.xtc
. The default GROMACS naming scheme used in the connection file will tell omnicalc to look for an arbitrary number of sequential files. We require energy (edr
) files, run input files (tpr
), and trajectory files at a minimum. We derive the timestamps from the energy files (because this is much faster than reading the trajectory data) in order to figure out how to stitch together many trajectory files for analysis.
Once you have prepared a connection file, run the following commands.
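A hedged sketch of this pair of commands follows; the connect command was shown above, while the name of the serve target is a guess at the factory's interface.

```bash
make connect proteins   # set up the omnicalc project and its calculation repository
make run proteins       # hypothetical target name: serve the GUI at http://localhost:8000
```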
The first command sets up an omnicalc instance and creates a calculation repository. The second command runs the graphical user interface (GUI) in the background at http://localhost:8000. Check out the single pipeline quickstart guide if you want to run simulations with this interface.
Running some simple calculations
In this step we will compute the root mean-squared deviation (RMSD) of a protein. This requires two new files: a “metadata” file to tell omnicalc which simulations to analyze, and a calculation file to analyze the resulting trajectory “slice”. We choose to store metadata in the YAML format because it’s easy to read. Make sure you start from the omnicalc folder at factory/calc/proteins
. Save the file to calcs/specs/meta.yaml
. Note that all of the files in the specs
folder should be YAML. Later, you can use these files to fine-tune your analyses.
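The metadata file itself is provided in the original guide; the version below is an illustrative stand-in that carries the same content, with key names that are assumptions about the omnicalc metadata format.

```yaml
# illustrative meta.yaml: one collection, one 10-110 ns slice sampled every 200 ps, one calculation
collections:
  all: [wild_type_rep1]
slices:
  wild_type_rep1:
    groups:
      protein_selection: protein
    slices:
      long: {start: 10000, end: 110000, skip: 200, groups: [protein_selection]}
calculations:
  protein_rmsd:
    slice_name: long
    group: protein_selection
    collections: all
```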
The metadata above tells omnicalc to do several things. First, we define a “collection” of simulations to analyze. In the example above, we only have one simulation called wild_type_rep1
. According to the connection file, omnicalc will expect to find trajectory files for this simulation at e.g. /home/username/external_disk/protein_activation_project/wild_type_rep1/production/md.part0009.xtc
. Note that the simulation name matches the top
regex in the connection file, while its subfolder matches the step
regex and hence must be called production
. This folder should contain all of the associated energy, run input, and trajectory files.
The second section of the metadata is called slices
and tells omnicalc which part of the trajectory to use. In this case, we have requested a “slice” (i.e. a sample of our trajectory) from 10-110 ns with an interval of 200 ps (note that GROMACS uses picoseconds as the default time interval). The group named protein_selection
uses the MDAnalysis package to select parts of your simulation, in this case the atoms returned by a CHARMM-style selection command 'protein'.
The next section, calculations, tells omnicalc how to analyze this trajectory. In a moment we will write the calculation code to calcs/protein_rmsd.py
, but for now, you should note that the calculation dictionary asks for the slice called long
by its name, referring to the corresponding entry in the slices
dictionary for this simulation. When we execute the calculation, omnicalc will try to find a trajectory slice that matches the one requested in the slices
dictionary. If it doesn’t exist, GROMACS will make it automatically and save it to the post_spot
directory set in the connection file for future use. It will also save the result of the analysis to this folder. Once all of the slices are ready, they will be individually sent to the calculation function.
To run the protein_rmsd
calculation specified in the metadata, omnicalc will look for calcs/protein_rmsd.py
. Save the following code to that location.
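The script distributed with the guide belongs here; as a stand-in, the following is a minimal sketch that follows the conventions described below, assuming MDAnalysis and numpy are available in the factory environment.

```python
# calcs/protein_rmsd.py (illustrative sketch, not the original script)
import numpy as np
import MDAnalysis
from MDAnalysis.analysis.align import rotation_matrix

def protein_rmsd(structure, trajectory, **kwargs):
    """Compute the RMSD of the protein alpha-carbons over the trajectory slice."""
    uni = MDAnalysis.Universe(structure, trajectory)
    sel = uni.select_atoms('name CA')
    # use the first frame, centered at the origin, as the reference
    ref = (sel.positions - sel.positions.mean(axis=0)).astype(np.float64)
    rmsds, times = [], []
    for ts in uni.trajectory:
        pos = (sel.positions - sel.positions.mean(axis=0)).astype(np.float64)
        # optimal rotation onto the reference before measuring the deviation
        rotation, rmsd_value = rotation_matrix(pos, ref)
        rmsds.append(rmsd_value)
        times.append(ts.time)
    # the result must be a dictionary of h5py-compatible numpy arrays
    result = {'rmsds': np.array(rmsds), 'timeseries': np.array(times)}
    # attrs holds unstructured metadata saved alongside the binary result
    attrs = {'selection': 'name CA'}
    return result, attrs
```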
Calculations must be written according to the omnicalc style. First, the calculation script protein_rmsd.py must contain a function of the exact same name; however, you can call other functions and libraries from the script. Second, the function must accept two arguments, corresponding to the structure and trajectory files required by packages like MDAnalysis to read the trajectory into memory. In the example above, these arguments are structure
and trajectory
but you may use other names. Additional details about the calculations, including additional settings, can be sent from the metadata to the calculation via kwargs. This is extremely useful when you inevitably realize that a constant is actually a variable!
The final requirement is that the calculation function must return two variables, result and attrs above. The result must be a dictionary of h5py-compatible numpy arrays. This imposes a minor restriction on the format of the result in order to ensure that it is written in a space- and speed-efficient binary format which is hardware-agnostic. The attrs
can describe metadata in any unstructured dictionary format. These attributes will be combined with the calculation metadata and written to disk alongside the binary result in order to uniquely identify the data after it has been written. These identifiers become important when you want to run the same calculation with different parameters and later plot the results.
Now that we have prepared the metadata and the calculation script, we can run the calculation using the following command.
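The command itself is given in the original guide; it follows the same make-driven interface used throughout, and is presumably along these lines.

```bash
# run every calculation requested by the metadata (target name assumed from the omnicalc interface)
make compute
```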
You can ignore calculations by adding ignore: True
to the metadata; otherwise, this command will run all analyses requested by the metadata.
To visualize the results, write the following code to a plot script located at calcs/plot-protein_rmsd.py.
You can generate the RMSD plot by running the following command.
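The exact command is part of the original guide; by analogy with the compute step it presumably names the plot script, along the lines of the following guess.

```bash
# hedged guess at the plotting command
make plot protein_rmsd
```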
Omnicalc will deposit you in an interactive terminal in case you wish to inspect the data. Otherwise the following plot will be written to the plot_spot
directory set in the connection file.
Note that this quickstart guide was tested and added to the validation list. The calculations repository has been shared according to the directions below, and can be found on github.
Sharing calculations
At the beginning of the quickstart guide we made a connection file which set the location of our calculations repository repo: calc/PROJECT_NAME/calcs
. Note that PROJECT_NAME
is replaced with the project name proteins
. When you run make connect proteins
, the factory makes a new, empty folder at factory/calc/proteins/calcs/ and runs git init there to create a new repository. You can use git commit
to save changes to the calculation codes, for example protein_rmsd.py
in this folder. To share the codes via github, you can make a new repository and then send your code there via git remote add origin https://github.com/username/my_repo
and git push -u origin master
. Other users who wish to use your calculations should set repo: https://github.com/username/my_repo
in their connection files. When they run make connect new_project
, the factory will clone the calculations repository from github to factory/calc/new_project/calcs
so they can try your code. The calculations for the quickstart guide were shared on github using this method.
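Consolidating the steps above (substitute your own github repository for username/my_repo):

```bash
# publish the calculation repository created by `make connect proteins`
cd factory/calc/proteins/calcs
git remote add origin https://github.com/username/my_repo
git push -u origin master
```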
4. A web-based simulation-analysis pipeline
Pipeline documentation is available on a separate walkthrough.
5. Running the factory in Docker
The authors have tested the factory inside of Docker in order to (1) make sure that the code works, (2) serve the factory securely over a network, and (3) provide new users with the tools necessary to deploy the software on different platforms. Both dockerfiles and unit tests for the most important factory use-cases are available on github and provide a lingua franca for setting up these codes. They are particularly useful when troubleshooting software dependency problems.
Users who wish to test the code or serve a factory over the network should install docker first. See the docker installation guide for instructions. Docker will require root access to your machine. Users who only want a guide for setting up their software environment can simply read ahead for links to the dockerfiles and unit tests.
To test the factory, we will first clone it and connect it to extra tools designed to interact with docker. Note that this is somewhat counterintuitive: we create a factory for testing (hereafter factory-docker
) which will later create a factory for deployment inside of a special root folder on your host. The testing factory will be small, and is only used to supervise the code. The deployed factory will be installed specifically for a docker container, and will be unusable on the host because the virtual environment sets a single root path when it is installed, and cannot be moved after installation.
The following commands can be used to create a testing factory.
This procedure downloads two critical files. First, tests/docker.py
includes all of the dockerfiles. When we run a unit test, the code will automatically build the docker images described in this file. There are multiple images depending on whether we need GROMACS, VMD, or ffmpeg
for encoding videos. Second, tests/testset.py
contains all of the unit tests. Each test has instructions for writing files, creating a script composed of terminal commands, and even mounting a folder on your host inside of the docker. The unit tests are designed to be self-contained and somewhat readable. Users who are interested in viewing either the dockerfiles or unit tests can find them at the following direct links.
- Dockerfiles via docker.py
- Unit tests via testset.py
To serve a factory from docker, you must first change the DOCKER_SPOT
variable inside tests/testset.py
. This is the root for all of the files that docker writes, and it will contain the deployed copy of the factory. Once you have prepared a location, run the following command.
Note that the arguments factory test
are a sequence of descriptors that can be arranged in any order. The unit tests in testset.py
are named by these sequences. If you wish to serve a factory, you must first find the demo serve
unit test in testset.py
. This test contains a public
dictionary for the network settings. Add your hostname to the hostname
list, and choose the port
and notebook_port
. You must open these ports; this requires root access and depends on your operating system. Lastly, you should set a username and password in credentials
. Once you have done this, you can serve the factory from inside the docker using the following command.
You can shut down the factory by running docker kill
(see image above). The authors will provide more detailed documentation for the unit test syntax; however, this guide has described the simplest way to test or deploy the factory. This quickstart guide is part of the validation list, and is necessary for running other unit tests. Contact the authors with questions!