OMNICALC is a framework for analyzing biophysical simulations with minimal, compact, human readable analysis functions, parameter sweeps, and elegant plotting functionality. The codes are designed to cleave the scientific analysis components from the bookkeeping and data structures required to run hundreds of different simulations. This separation is literal: omnicalc manages the bookkeeping while a separate repository holds the analysis codes and specifications. These resulting analysis codes are easily shared between datasets and published in tandem with journal articles. In this way, the authors hope that omnicalc can contribute to ensuring that simulation analyses are both scalabe and reproducible.
1 Raw Data¶
Omnicalc is designed to perform what most people refer to as “post-processing” for biophysics simulations, and in particular, molecular dynamcis data generated by GROMACS. It has been specifically designed to work with the Automacs codes which generate large batches of GROMACS simulations, however it is designed to accept any GROMACS simulation data as long as it is organized according to the rules laid out below.
1.1 Incoming data¶
Instructions for importing your data must be written to a file called paths.yaml
. The paths section below will tell you exactly how to write this file. In this part, we will first describe how to organize your data so that omnicalc can identify it.
1.1.1 Data from AUTOMACS¶
Simulations created with automacs follow a consistent directory structure. Each simulation has a root folder which contains the automacs code, along with several sub-folders, each of which correspond to a discrete simulation “step”. Each step can have many parts, but the steps are designed to be the coherent unit of analysis. Complicated simulations may have a few steps, but most of the relevant simulation data are contained in a production run in the final step. See the documentation accompanying automacs for more details on why we organize the data this way.
1.1.2 Data from elsewhere¶
Users who have already generated large simulation data sets, or choose to use other methods to generate their data can prepare the data for use by omnicalc by mimicking its directory structure. The rule of thumb is that omnicalc must be able to find the files using a regular expression, using integers for any necessary ordering. As per the GROMACS custom and the automacs specification, we prefer to have the simulations organized into groups of trajectory files named e.g. md.part0001.xtc
which can easily be detected and ordered by a regular expression (“^md\.part([0-9]{4})\.xtc$
”).
Note
We would never encourage users to rename any files in a primary data set. If your files are not compatible with omnicalc, we suggest that you write a short program that uses symbolic links to name them in a systematic way.
While the naming scheme is relatively open-ended, the directory constraints are not. Users must have one folder per simulation, and the target data must be contained in a sub-folder of this.
1.2 What you get¶
Omnicalc produces three tangible types of data for you. First, it “slices” GROMACS trajectories into more manageable parts, isolating only the system components that you are interested in. This is particularly useful when you wish to load an entire simulation at a particular sampling rate into memory. If you only wish to analyze a single component in a large simulation, this data-reduction step can make it possible to load the entire trajectory into memory, which can often dramatically speed-up your analysis and visualization.
Second, omnicalc can perform detailed mathematical operations on your data using numpy and scipy libraries which can be saved to a portable binary format using the HDF5 in Python. Calculation steps can be linked together in an arbitrary sequence to make the calculation efficient and modular.
Lastly, we have included a number of standardized plotting and animation routines that use matplotlib and VMD to create elegant images of the resulting data or trajectories.
We group the simulation slices and the postprocessing binaries in a single output directory and reserve a separate directory for plots. The paths to these data are specified in paths.yaml
described next.
1.3 Setting the paths¶
As long as your data are correctly organized and mounted on your machine, you can set the paths by first running make default_paths
in the omnicalc root directory. This creates a default paths.yaml
file for you to edit. We have reproduced the essential features below. In the remainder of this section we will fully specify the omnicalc path scheme. For a quick start, read the comments in the default paths.yaml
file.
post_data_spot: post
post_plot_spot: plot
workspace_spot: workspace
timekeeper: false
spots:
sims:
namer: "lambda spot,top : 'v'+re.findall('simulation-v([0-9]+)',top)[0]"
route_to_data: PARENT_DIRECTORY_FOR_MY_DATA
spot_directory: DATA_DIRECTORY_IN_ROUTE_TO_DATA
regexes:
top: '(simulation-v[0-9]+)'
step: '([stuv])([0-9]+)-([^\/]+)'
part:
xtc: 'md\.part([0-9]{4})\.xtc'
trr: 'md\.part([0-9]{4})\.trr'
edr: 'md\.part([0-9]{4})\.edr'
tpr: 'md\.part([0-9]{4})\.tpr'
structure: '(system|system-input|structure)\.(gro|pdb)'
1.3.1 Necessary paths¶
Users must replace any capitalized text in the default paths.yaml
file. The first path you must specify is the route_to_data
variable, which sets the root directory for your incoming dataset. It is best to use this variable to set the system path for a data-specific harddrive. It’s easy to change this later if you move your data to a new system. The remainder of the path to your data is stored in spot_directory
. It is best to set this directory with a more permanent name to avoid confusion. The route_to_data
directory is entirely flexible, and can be changed whenever you move or re-mount the harddrive.
The spot_directory
variable should point from th route_to_data
to a single, specific dataset. Omnicalc is designed to use multiple parallel datasets at the same time. We refer to the paths of these datasets as “spots”, each of which is defined by a separate sub-dictionary underneath spots
in the paths.yaml
file. The name of the subdirectory for each spot (e.g. sims
in the example above) is typically hidden from the user, and is only used by omnicalc to keep track of the parallel datasets.
Warning
There is only one limitation (but many potential flaws) that come with the total flexibility described above. Multiple spots cannot contain sub-folders with the same name otherwise omnicalc won’t know which one to use later, when it looks them up. As long as you follow this rule, the paths are entirely arbitrary. You can have multiple spots (and hence distinct datasets) within a single directory as long as you can distinguish them by regular expressions.
Using multiple spots it useful if you generate your data in separate batches but wish to analyze them together. Organizing large datasets into these “spots” gives you the first opportunity to divde your data into smaller groups. There will be many more opportunities for organizing the data within the omnicalc framework when you begin to write analysis routines. The restrictions described here only tell omnicalc how to read your data.
1.3.2 New data paths¶
All new data generated by omnicalc will default to the post
and plot
directories which will be created inside the omnicalc root folder. Users who wish to store post-processing data or plots elsewhere can either set them in paths.yaml
or use symbolic links to new directories. It is important to note that the post
folder will become large because it will contain slices of the simulation trajectories (as well as any calculated data, although those tend to be much smaller). Omnicalc will also write its current state to a workspace
file which is explained elsewhere.
Warning
link to the workspace description
1.3.3 Further customization¶
The default paths are set to import data from automacs but can be modified to recognize many different naming schemes. The regexes
subdictionary for each spot will tell omnicalc how to find your data. We will describe this dictionary in detail because it has consequences for naming and distinguishing your simulations.
In all subsequent analysis, your simulations must be identified by their name, specifically the name of the folder that contains the simulation (recall that one of the only hard contraints is that the GROMACS trajectories must be in a sub-folder). In the default paths.yaml
file, the simulation names are specified by top
insides the regexes
dictionary. In this case they use the automacs default in which all simulations are named with numbers e.g. simulation-v123
.
It is often clusmy to refer to simulations this way. Omnicalc allows you to group them with colloquial names using collections
during the analysis phase. You can use an arbitrary regular expression (e.g. "(\w+)"
) if you wish to use entirely unconstrained names. Make sure to use parentheses to extract a group from the regex. The contents of the only group in this regex will be the formal “name” of your simulation used throughout the analysis.
Warning
link to collections above
The folders found inside of the spot_directory
that match the top
regex will constitute the “steps” of the simulation. Omnicalc tracks all trajectory files internally using a tuple containing the simulation name (the top directory), the step folder (the sub-directory), and the file name. The step names can be arbitrary, but it is often helpful to order them. This ordering can be useful in the event that you reset the simulation time using GROMACS. Most users wish to collect the most recent portions of the trajectory because they are typically the most relevant, especially if you use a complicated simulation construction procedure.
The default paths.yaml
specifies that step folders should be named with a letter and number followed by a word. This follows the automacs convention in which simulation steps are named e.g. s02-bilayer-protein.
. The step regex can use an arbitrary number of groups which may be useful to the user later, if they wish to sort the steps based on those groups. This could be useful if your simulation steps consist of replicates (however it’s important to note that this can also be achieved by using spots or collections).
Warning
check (1) whether the regex groups are used for step-directory ordering and (2) whether that actually matters and (3) when and why
The final regex is the “part” used to detect files that are part of a simulation trajectory. The example above uses the GROMACS convention in which files are named e.g. md.part0001.xtc
or md.part0001.edr
in sequence. However, the only requirement is that your files have numbers which can be used to sort their constituent parts into the proper order.
You may notice that we have separate regex expressions to identify the common GROMACS file types, namely xtc
, trr
for trajectories, edr
for energy files, structure
for coordinate files, and tpr
for binary input files. Each of these is used by omnicalc. In particular, the energy files provide a fast way to identify the simulation clock for each trajectory, and the input files are essential for unwrapping periodic boundary conditions (or any calculation that requires the topology of your molecules).
1.3.4 Naming new slices¶
In the paragraphs above, we have described how omnicalc reads your dataset. One final component of paths.yaml
specifies the format by which omnicalc will write “slices” of your simulation. Since these slices allow for an arbirary sampling frequency and subsets of the chemical components of the simulation, the slicing functions can be incredibly useful for isolating a particular portion of your data for further analysis. Some users may wish to use omnicalc for this function alone (without the awesome calculation features described next).
The namer
tells omnicalc how to create file names for the trajectory slices. Every time it slices a simulation, it creates a structure file and a trajectory file (the latter can be a full-precision trr
or a compressed xtc
file). The namer
must be a pythonic lambda function that takes two arguments: the spot
and the top
and returns a string. The string will be prefixed to all slice files, which will also be suffixed with the time range for the slice and the contents. For example, if you use the default naming scheme, you might produce a file called v531.100000-200000-1000.protein.xtc
which would contain a slice from 100-200ns
of a simulation named simulation-v531
with a group called “protein”. We’ll describe the groups later.
Warning
link to groups
Even though the namer
function must accept the spot name (the name of its parent dictionary), you do not have to use the spot in the string that it returns. You must only ensure that incoming simulation names (given by top
) will write unique strings to ensure that simulations coming from different spots do not overwrite others in the post
directory. Since we require that no simulation names are repeated across spots (otherwise an error will occur), the namer
must only retain the uniqueness of the top
(simulation name). This is a best practice, so that you can identify your simulation slices.
2 Metadata¶
The design of omnicalc is motivated by a common problem experienced by the authors. In the course of answering a scientific question, you may write an analysis program that answers exactly that question. Once you learn something useful, you might ask follow-up questions that typically require you to add to these scripts, or sweep a particular parameter in some way. After a few iterations, your original calculation code becomes littered with parameter choices that are important to your calculation. The design philosophy behind the omnicalc code is to centralize all of these parameter choices in one place, in a readable format that is easy to expand. We store parameters in one (or many) yaml files. These files constitute the “metadata” for our calculations, and we try our best to separate these metadata from the more generic functions which analyze our simulations. This ensures that your calculations are as modular as possible.
2.1 Paths¶
As a rule, all metadata files must be stored in calcs/specs/*.yaml
. You can use any naming scheme you wish, as long as the files are suffixed with .yaml
. For the tutorial, we will be using meta.yaml
. The yaml files have a particular top-level structure which allows omnicalc to merge them if it finds more than one. This merging is described later. It’s useful for the factory.
Warning
link to the merging rules later on
2.2 Structure¶
The yaml files are automatically merged into a single dictionary accessible throughout the omnicalc codes. The final result must contain at least the following list of top-level keys. Users with highly complicated settings may wish to put each top-level key in a separate file. If you use the factory, it will generate additional yaml files to reflect user input to the front-end GUI, however users may always add additional metadata manually.
- slices: tells omnicalc how to create seamless trajectory slices
- variables: avoid repeating yourself by defining common terms here
- collections: organize simulations into groups which can be analyzed en masse
- meta: alias your possibly clunky simulation names to human-readable ones (particularly useful when making plots)
- calculations: specify a single calculation (or a series) over an arbitrary set of parameters
- plots: specify formats and parameters for making plots
Each of these items will be described in detail in the remainder of the documentation. In section below, we will describe our custom syntax for avoiding repetition in these files. Note that the list above contains protected top-level dictionary keys. You can add extra data to the meta files, but omnicalc will typically only use information in the above-named dictionaries. These dictionaries are intelligently merged in case they are found in separate specification files (that is, any file which matches calcs/specs/*yaml
).
2.3 Variables¶
All of the metadata (or “specification”) files found in calcs/specs/*yaml
should be written in the standard yaml format. We have added a single useful tool for avoiding repetition. The entire set of yaml files is read into a single dictionary. Any “leaf” or “child node” of this dictionary which is (1) a string and (2) begins with a plus sign (e.g. +lipids
) will trigger a variable lookup. If omnicalc fails to map the string to a variable, it will not throw an error.
Omnicalc will replace the variable name with the value it finds in the top-level variables
dictionary. Since this dictionary can be arbitrarily nested, it will traverse the dictionary using a series of keys separated by a forward slash.
variables:
selectors:
ions: (name NA or name CL or name MG or name Cal or name K)
cations: (name NA or name MG or name Cal or name K)
resnames_lipid: ['POPC','DOPC','DOPS','DOPE']
In the example above, whenever we set a child node to “+selectors/ions” it will be automaticall replaced with a string: (name NA or name CL or name MG or name Cal or name K)
. Note that the yaml specification allows pythonic lists, so the resnames_lipid
value would be returned as a proper list. The selectors
in the example above consist of CHARMM-style selection strings which are later sent to the MDAnalysis package for selecting a subset of our simulation. Storing these strings as variables means that you don’t have to repeat them elsewhere, and you can change them at a single location which might affect many downstream calculations.
The authors have found the variables
dictionary to be useful not only for storing selection commands, but also timing information for trajectory slices, common mathematical definitions, and extraneous settings. This feature keeps the metadata simple and provides an additional layer of abstraction.
Warning
please confirm no error on variable lookup
3 Slices¶
The top-level slices
dictionary in your metadata contain instructions for “slicing” a long trajectory into a single file. This has a number of advantages for users who have completed a production run which spans many different parts. The slicing procedure also (crucically) can be used to reassemble molecules which are broken across periodic boundaries, which always happens when you restart your GROMACS simulation.
Each entry in the slices
dictionary should be a formal simulation name. That is, it must be the folder name for a simulation that can be identified via paths.yaml. Recall that you are free to map these simulation names to more convenient, human readable names later on, using the meta
dictionary or the internal variables feature.
Each entry in the slices
dictionary must contain two children. The slices
entry is a dictionary of slice definitions. The keys correspond to slice names which are only used internally to keep track of the slices (note that the calculations retrieve slices by using these names, which do not have to be unique across simulation entries). The slice dictionary must contain a few keys. The groups
key must provide a list of group names which we will describe in a moment. You must also include start
, end
, and skip
which correspond to simulation times sent to the GROMACS trjconv utility. These should all be floats or integers specified in picoseconds.
The parallel groups
entry should be a dictionary of selection commands which are fed first to the GROMACS make_ndx
command to specify the contents of a particular trajectory slice. In the simplest case, you should include all:all
if you want to retain all molecules in your slice. You may find it economical to make some slices which only contain a part of your system. These names should be added to the groups
key in the slice definition. The following example should make this clear.
slices:
membrane-v123:
groups:
all: all
ions: +selectors/ions
slices:
current: {'pbc':'mol','groups':['all'],'start':10000,'end':110000,'skip':100}
current_ions: {'pbc':'mol','groups':['ions'],'start':10000,'end':110000,'skip':1}
The name of each slice is only used by omnicalc to keep track of that slice later on. For example, when you design calculations, you can run a single calculation over all slices of a particular name. Remember that the names do not have to be unique for each simulation. In the example above, we create two kinds of slices (current
and current_ions
) which can be accessed by different calculations. The current_ions
slice is sampled at a much higher rate (in this case skip=1
so the frames are written every picosecond) and could be sent to a calculation which only requires the positions of the ions.
4 Run time¶
In the first two chapters of the documentation, we have described the formulation of an incoming dataset (raw data) and how to write variables (metadata). Understanding how to prepare the data and construct the metadata are necessary to use omnicalc, particularly since its execution is exceedingly simple. Executing omnicalc only requires one command to perform post-processing (there are other commands for plotting and debugging described at the end of this section).
make compute
4.1 The main loop¶
The make compute
command triggers omnicalc’s main loop, found in the workspace.py
module, which performs the following functions in order. Note that each of these actions takes its marching orders from the specifications files described in the metadata section.
- Read and merge all of the specifications files found in
calcs/meta/*.yaml
. Some users may prefer to put the protected top-level dictionaries described in the metadata section in separateyaml
files. These files are merged and loaded into the workspace. Internal variable substitutions are performed at this step. - Create slices specified by the top-level
slices
dictionary compiled from the metadata. Recall that the creation of slices will generate both groups (corresponding to GROMACS-style index files) and trajectory files. - Run the calculations set in the
calculations
dictionary in an order which is inferred by their internal dependencies. This means that a calculation which depends on another will occur later in the loop. Calculation details are interpreted by the wordspace to identify any specialloop
settings, which will cause the calculation to be executed many times, across an arbitrary number of parameter sweeps. Each distinct calculation is sent to thecomputer
function, which runs the calculation over all simulations in the collections list.
The main loop is entirely contained in the action
function and calls many of the member functions of the Workspace
class. In the third step described above, the computer
function will be used to repeatedly send a simulation to a calculation function.
The main loop is designed to be hidden from the user, who is only expected to write the metadata and the most important component of the loop: the calculation functions. Calculation functions should be stored in calcs/function_name.py
and should contain a single python function with the same name as the file. This function can call external libraries or local libraries stored in calcs
(typically calcs/codes
), but must be named carefully so that the compute
function can find it. If the calculation’s uptype
flag is set to simulation
then this function will receive a two arguments, namely the grofile
and trajfile
which will point to the structure and trajectory of the slice created in the second step. If the uptype
is post
, the the function will receive a copy of the upstream data. It will also pass other kwargs
that specify the features of the calculation found in the specs
sub-dictionary. A typical calculation block from calcs/specs/meta.yaml
is pictured below.
calculations:
lipid_abstractor:
uptype: simulation
slice_name: current
group: all
collections: all
specs:
selector:
loop:
lipid_com:
monolayer_cutoff: 1.85
resnames: +selectors/resnames_lipid
type: com
lipid_chol_com:
monolayer_cutoff: 1.4
resnames: +selectors/resnames_lipid_chol
type: com
The calculation is named lipid_abstractor
hence the user must create calcs/lipid_abstractor.py
which contains a function which is also called lipid_abstractor
. The calculation dictionary specifies a few key parameters.
- Users can request the original simulation trajectory (or “slice”) by setting
uptype: simulation
. This sends the structure and trajectory to the analysis function ingrofile
andtrajfile
. Simulations which only depend on another “upstream” calculation should setuptype: post
and also specify anupstream
variable which lists the names of the previous calculations. See the parameter sweeps section for an example of how the parameters are specified in a calculation with upstream dependencies. - Users must identify a
slice_name
and agroup
, both of which are necessary to uniquely identify a slice specified in the top-level slices dictionary. - Users must also identify a list of
collections
of simulations to apply the calculation. Collections are specified in a top-level dictionary calledcollections
which is found the metadata file. Multiple collections should be compiled into a list. Note that each collection requested by a calculation must have corresponding slices specified byslice_name
. If omnicalc cannot find the corresponding slice or group, it will throw an error. The collections list is necessary to apply the calculations to your simulations. Even if you analyze a single simulation, it needs to be in a collection. - Specs are optional, but allow the user to set attributes which are passed all the way to the final data output. These attributes make it easy to perform arbitrary parameter sweeps. In the example above, the loop over the
selector
parameter sends different distance cutoffs and lipid selections to the calculation function in order to generate a lipid trajectory either with or without cholesterol.
4.2 A few, strict rules¶
The omnicalc design philosophy expects more from the user than a typical software package. The incoming data, metadata, and calculation functions must be written according to the framework specified here and in the other chapters of the documentation. In this way, the authors have selected convention over configuration. This means that omnicalc works with a few, very strict rules. The upshot is that users can prepare metadata that make calculations highly customizable and scalable. New parameter sweeps can be instantiated simply by editing a calcs/specs/meta.yaml
file and running make compute
. Note that omnicalc will not perform downstream functions (namely, rendering plots) if you update the metadata without running make compute
. You can always use the respec
function to update the workspace with your metadata when making adjustments to your plots.
Calculation functions can be written in a highly modular format so that they can be shared between different data sets. For example, the authors have used the exact same calculation codes on both atomistic and coarse-grained simulations despite their radically different naming conventions. This scheme also ensures that the codes are easily extensible to slightly novel use-cases.
4.3 When things go wrong¶
Given that omnicalc operates as a framework described above, errors should be interpreted in terms of the position inside the main loop. Whenever you encounter an error, you can find more details about what caused the error by checking the source code. Oftentimes the position within the main loop will tell you what went wrong. Users may also use the make look
utility function to inspect the workspace variable to make sure everything is in order.
Warning
better description of error handling. perhaps an example would be useful.
4.4 Utility functions¶
Warning
controller functions are coming soon
4.5 Plotting¶
Plotting functions can be executed with make plot
or preferably make plot <my_plot_script>
, since this function always re-makes the plots, in contrast to the make compute
function which will only generate post-processing data once.
Note
The make compute
loop is lazy. If it finds the post-processing binaries for a calculation, it won’t re-run that calculation. This design has the advantage that users may add new calculations or extend parameter sweeps in the metadata without recalculating anything. The downside is that changing any hard-coded calculation parameters typically requires that the user manually delete the deprecated binaries. These are usually clearly named, so this isn’t difficult, but in general the authors recommend adding data rather than deleting it and rerunning the calculation. This preserves the calculation history in case something goes wrong. Once you are ready to plot your data, you can single out a particular set of parameters, even if you swept over many. Omnicalc keeps track of the calculation details (typically given in the specs
subdictionary for a particular calculation), which makes it easy to look up the results of a specific calculation. Since plots are both fast and endlessly customizable, the make plot
command will always regenerate the plot.
Warning
Plots have attributes too, so add a link to the note above when they are documented.
5 Calculations¶
In the previous section (run time) we outlined the main execution loop and the logic by which omnicalc links calculation parameters (i.e. metadata) with simulation trajectories and calculation functions. In this section we will describe the structure of a calculation function. By the end of this section, users should have all of the information they need to perform simple calculations, however, a wide variety of extensions and features are described in the remaining chapters.
5.1 An example calculation¶
The following code describes a simple example function for the “undulations” calculation found in calcs/undulations.py
. This function is designed to read the results of a previous calculation called lipid_abstractor
which reduces a bilayer simulation to a set of lipid centers-of-mass for further calculation. This example covers almost all of the features of the omnicalc workflow, which we will address in sequence.
#!/usr/bin/python
import time
from numpy import *
from joblib import Parallel,delayed
from joblib.pool import has_shareable_memory
from codes.mesh import *
from base.timer import checktime
from base.tools import status,framelooper
def undulations(**kwargs):
"""
Compute bilayer midplane structures for studying undulations.
"""
#---parameters
sn = kwargs['sn']
work = kwargs['workspace']
calc = kwargs['calc']
grid_spacing = calc['specs']['grid_spacing']
dat = kwargs['upstream']['lipid_abstractor']
nframes = dat['nframes']
#---choose grid dimensions
grid = array([round(i) for i in mean(dat['vecs'],axis=0)/grid_spacing])[:2]
monolayer_indices = dat['monolayer_indices']
#---parallel
start = time.time()
mesh = [[],[]]
for mn in range(2):
mesh[mn] = Parallel(n_jobs=work.nprocs,verbose=0)(
delayed(makemesh_regular)(
dat['points'][fr][where(monolayer_indices==mn)],dat['vecs'][fr],grid)
for fr in framelooper(nframes,start=start,text='monolayer %d, frame'%mn))
checktime()
#---pack
attrs,result = {},{}
result['mesh'] = array(mesh)
result['grid'] = array(grid)
result['nframes'] = array(nframes)
result['vecs'] = dat['vecs']
result['timeseries'] = work.slice(sn)[kwargs['slice_name']][
'all' if not kwargs['group'] else kwargs['group']]['timeseries']
attrs['grid_spacing'] = grid_spacing
return result,attrs
For starters, you might notice that there is almost no formal computation visible in this function. Almost all of the “work” is performed by the makemesh_regular
function imported from codes.mesh
. Users may wish to embed the computation directly in this function, but they are free to import any modules they wish. Besides allowing local imports from e.g. calcs/codes
, users may also import global packages. In this case, we use the joblib package to parrallelize this code using shared memory. We also use omnicalc’s built-in framelooper
generator to iterate over the number of frames in our simulation using a status bar and a timer.
Since this function depends on an “upstream” lipid_abstractor
computation, omnicalc automatically sends the data in kwargs['upstream']['lipid_abstractor']
. It is possible to draw from multiple upstream calculations. Users specify the upstream dependencies inside of the calculations dictionary in the metadata. To import the lipid_abstractor
data, the user uses the upstream
keyword according to the following example. (Recall that group
, slice_name
, and collections
are required for all calculation dictionaries).
undulations:
uptype: post
group: lipids
slice_name: +slices/steady16
collections:
- all
specs:
grid_spacing: 0.5
upstream: lipid_abstractor
The upstream
key above can point to either an item or a list (see parameter sweeps), but these items must be the names of other calculations. Omnicalc will figure out the correct execution order for you. The uptype: post
flag tells omnicalc not to load the simulation trajectory directly. If you use uptype: simulation
, then omnicalc will send along the structure and trajectory files as arguments named grofile
and trajfile
. These arguments can be passsed directly to the excellent MDAnalysis package, which is equipped to read the GROMACS binary trajectory files. Note that you can request upstream
calculations even when you set uptype: simulation
, in the event that you want to refer back to the original simulation trajectory on a downstream step.
All of the upstream data, file names, and specs are passed via kwargs
. It is the users’s job to unpack them for the calculation.
After the body of the calculation function, users will be ready to “save” the data. All calculation functions must return data in a specific format so that omnicalc can save it for downstream calculations or plotting functions. Data should be stored in either a results dictionary or an attributes dictionary.
The results dictionary can only contain numpy
array data. Most data — even multidimensional lists with different lengths, dictionaries, etc, can be saved as a numpy array. This restriction allows omnicalc to use the excellent h5py package to save the data in a hardware agnostic, efficient, binary format.
Note
You can save highly heterogeneous data (e.g. nested dictionaries) in the numpy format by using packages like JSON, or YAML, to turn them into a string, which can be saved. This can be done with numpy as follows: numpy.string_(json.dumps(ugly_data))
.
The attributes dictionary is called attrs
and has a few strict requirements which are designed to make it easy for omnicalc to retrieve data. In short, the attrs
dictionary should contain any parameters which describe the calculation and distinguish it from others, particularly those in a parameter sweep. Specifications are stored in the calculations dictionary (in the metadata) under the specs
key. Since these parameters are essential to identify the calculation after it is complete, omnicalc will throw an error of the user fails to pass the specs
on to the attrs
dictionary. In the example above, you can see that we have passed along the grid_spacing
parameter. You can also add other parameters to attrs
to further label the data.
One of the most distinctive features of omnicalc is that the software is designed to collect parameters in the metadata files (in calcs/specs/*.yaml
) so that you don’t need to “hard code” them in your analysis functions. Inevitably, you will hard code some of these parameters, and later realize that they are in fact, paramters which you want to vary. If you export the hard-coded parameters in attrs
, then you can later add them to the metadata files (and sweep over them, for example), without causing a naming conflict or deleting the original calculation.
Each calculation produces two files: a dat
file written in hdf5
as described above, and a specs file containing a text-formatted python dictionary given by attrs
. These files are stored in the path given by the work.paths['post_data_spot']
variable and specified in the configuration. The file names are nearly identical to the slice names (see: naming slices) with two small changes. As with the slice names, they begin with the prefixed simulation name defined by prefixer
in the configuration. This is followed by the calculation name defined in the metadata. The only other difference between a slice file name and a calculation file name is that the calculations have a suffix which contains an index number. This index distinguishes distinct calculations from each other. These differences are encoded in the corresponding spec
file, which contains the attrs
defined by the user.
This naming scheme allows the user to produce an arbitrary number of calculations with different parameters without using bloated file names. The parameters are stored in the spec
file, which is studied by omnicalc to figure out which dat
file to open, when you make plots or access the data later on. The index on the spec file is equivalent to a foreign key in a database. The example in the upcoming section uses the following names, where the index is n0
.
v532.50000-100000-100.lipid_mesh.n0.dat
v532.50000-100000-100.lipid_mesh.n0.spec
The example undulations
function above refers to the lipid_abstractor
data without further specification. In the event that your upstream data contains a parameter sweep, you can also perform the sweep over the downstream data. The following example describes a calculation called lipid_mesh
which uses two different lipid selectors (one which contains cholesterol, and one which doesn’t). Using the loop
keyword in the specs
will trigger a parameter sweep.
lipid_mesh:
uptype: post
slice_name: current
collections:
- all
specs:
upstream:
lipid_abstractor:
selector:
loop:
- lipid_com
- lipid_chol_com
Any downstream steps must either perform the same parameter sweep, or select uniquely-identifying parameters for the upstream step in order to import that data. In both cases, the selection is made inside the upstream
dictionary in specs
. If there are no parameters, then the upstream
item can be a list (or a single) item. If you need to select parameters, or perform the sweep above, then upstream
should be a list of dictionaries, each of which contains the specs
section from the upstream calculation.
The example above mimics a parameter sweep that must have also happened in the lipid_abstractor
calculation. If users only wish to use one parameter for the lipid_mesh
calculation, they would still have to select it, using the following notation. In the following example, we choose to include the cholesterol molecules via the selector
spec.
lipid_mesh:
uptype: post
slice_name: current
collections:
- all
specs:
upstream:
lipid_abstractor:
selector: lipid_chol_com
By using loop
, upstream
, and specs
, users can develop highly efficient calculation pipelines.
Note
If you trigger a parameter sweep by using the keyword loop
as per the example above, then the calculation will loop over all of the subsequent lists. You can specify the same parameter sweep in the plots section, or you can omit the specs entirely. In both cases, the plotloader
function will load all of the data you require. You can whittle this down by using a specs
sub-dictionary to select exactly which data goes to the plot functions.
6 Customization¶
Even though the authors have sought to make omnicalc as generic as possible, your particular use case will inevitably require a large degree of customization. The usefulness of our framework depends on these customizations, otherwise omnicalc would not be any more useful than the abundance of similar tools with larger user bases.
Users who have read the documentation and have further questions about deployment are encouraged to submit a tickat at the BioPhysCode project on github, where you can also find contact information for the authors.