Raw Data

Omnicalc is designed to perform what most people refer to as “post-processing” for biophysics simulations, in particular molecular dynamics data generated by GROMACS. It was built to work with the Automacs codes, which generate large batches of GROMACS simulations, but it accepts any GROMACS simulation data as long as the data are organized according to the rules laid out below.

Incoming data

Instructions for importing your data must be written to a file called paths.yaml. The paths section below will tell you exactly how to write this file. In this part, we will first describe how to organize your data so that omnicalc can identify it.

Data from AUTOMACS

Simulations created with automacs follow a consistent directory structure. Each simulation has a root folder which contains the automacs code, along with several sub-folders, each of which corresponds to a discrete simulation “step”. Each step can have many parts, but the steps are designed to be the coherent unit of analysis. Complicated simulations may have a few steps, but most of the relevant simulation data are contained in a production run in the final step. See the documentation accompanying automacs for more details on why we organize the data this way.

Data from elsewhere

Users who have already generated large simulation data sets, or who choose to generate their data by other methods, can prepare the data for use by omnicalc by mimicking the automacs directory structure. The rule of thumb is that omnicalc must be able to find the files using a regular expression, using integers for any necessary ordering. As per the GROMACS custom and the automacs specification, we prefer to have the simulations organized into groups of trajectory files named e.g. md.part0001.xtc which can easily be detected and ordered by a regular expression (“^md\.part([0-9]{4})\.xtc$”).
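The detection-and-ordering rule can be sketched in a few lines of Python. This is only an illustration of the regular expression quoted above, not omnicalc's own code; the function name order_parts and the file listing are hypothetical.

```python
import re

# Pattern quoted in the text: a four-digit part number in an xtc file name.
PART_REGEX = re.compile(r'^md\.part([0-9]{4})\.xtc$')

def order_parts(filenames):
    """Return the matching file names sorted by their integer part number."""
    matches = []
    for name in filenames:
        found = PART_REGEX.match(name)
        if found:
            # The captured group is the integer used for ordering.
            matches.append((int(found.group(1)), name))
    return [name for _, name in sorted(matches)]

# Hypothetical directory listing, deliberately out of order.
listing = ['md.part0010.xtc', 'md.part0002.xtc', 'md.log', 'md.part0001.xtc']
ordered = order_parts(listing)
print(ordered)
```

Files that do not match the pattern (md.log here) are simply ignored, which is why the naming scheme matters more than the exact contents of the folder.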

Note

We would never encourage users to rename any files in a primary data set. If your files are not compatible with omnicalc, we suggest that you write a short program that uses symbolic links to name them in a systematic way.
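A short linking program of the kind suggested in this note might look like the following sketch. The function link_parts, the source file names, and the folder names simulation-v1 and s01-production are all hypothetical; the only things taken from this page are the md.partNNNN.xtc naming convention and the one-folder-per-simulation layout.

```python
import os
import tempfile

def link_parts(source_files, target_dir):
    """Create md.partNNNN.xtc symlinks in target_dir, numbered in the order given."""
    os.makedirs(target_dir, exist_ok=True)
    links = []
    for index, source in enumerate(source_files, start=1):
        link = os.path.join(target_dir, 'md.part%04d.xtc' % index)
        # Leave existing links alone so the script can be re-run safely.
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(source), link)
        links.append(link)
    return links

# Demonstration in a scratch directory with empty stand-in files.
scratch = tempfile.mkdtemp()
originals = []
for name in ['run_a.xtc', 'run_b.xtc']:
    path = os.path.join(scratch, name)
    open(path, 'w').close()
    originals.append(path)

links = link_parts(originals, os.path.join(scratch, 'simulation-v1', 's01-production'))
print([os.path.basename(link) for link in links])
```

The primary files are never renamed or moved; only the links carry the systematic names.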

While the naming scheme is relatively open-ended, the directory constraints are not. Users must have one folder per simulation, and the target data must be contained in a sub-folder of this.

What you get

Omnicalc produces three tangible types of data for you. First, it “slices” GROMACS trajectories into more manageable parts, isolating only the system components that you are interested in. This is particularly useful when you wish to load an entire simulation at a particular sampling rate into memory. If you only wish to analyze a single component in a large simulation, this data-reduction step can make it possible to load the entire trajectory into memory, which can often dramatically speed up your analysis and visualization.

Second, omnicalc can perform detailed mathematical operations on your data using the numpy and scipy libraries, and can save the results to a portable binary format using HDF5 in Python. Calculation steps can be linked together in an arbitrary sequence to make the calculation efficient and modular.

Lastly, we have included a number of standardized plotting and animation routines that use matplotlib and VMD to create elegant images of the resulting data or trajectories.

We group the simulation slices and the post-processing binaries in a single output directory and reserve a separate directory for plots. The paths to these data are specified in paths.yaml, which is described next.

Setting the paths

As long as your data are correctly organized and mounted on your machine, you can set the paths by first running make default_paths in the omnicalc root directory. This creates a default paths.yaml file for you to edit. We have reproduced the essential features below. In the remainder of this section we will fully specify the omnicalc path scheme. For a quick start, read the comments in the default paths.yaml file.

post_data_spot: post
post_plot_spot: plot
workspace_spot: workspace
timekeeper: false
spots:
  sims:
    namer: "lambda spot,top : 'v'+re.findall('simulation-v([0-9]+)',top)[0]"
    route_to_data: PARENT_DIRECTORY_FOR_MY_DATA
    spot_directory: DATA_DIRECTORY_IN_ROUTE_TO_DATA
    regexes:
      top: '(simulation-v[0-9]+)'
      step: '([stuv])([0-9]+)-([^\/]+)'
      part:
        xtc: 'md\.part([0-9]{4})\.xtc'
        trr: 'md\.part([0-9]{4})\.trr'
        edr: 'md\.part([0-9]{4})\.edr'
        tpr: 'md\.part([0-9]{4})\.tpr'
        structure: '(system|system-input|structure)\.(gro|pdb)'

Necessary paths

Users must replace any capitalized text in the default paths.yaml file. The first path you must specify is the route_to_data variable, which sets the root directory for your incoming dataset. It is best to use this variable for the system path of a data-specific hard drive. It is entirely flexible, and is easy to change whenever you move or re-mount the drive or move your data to a new system. The remainder of the path to your data is stored in spot_directory. It is best to give this directory a more permanent name to avoid confusion.

The spot_directory variable should point from the route_to_data to a single, specific dataset. Omnicalc is designed to use multiple parallel datasets at the same time. We refer to the paths of these datasets as “spots”, each of which is defined by a separate sub-dictionary underneath spots in the paths.yaml file. The name of the sub-dictionary for each spot (e.g. sims in the example above) is typically hidden from the user, and is only used by omnicalc to keep track of the parallel datasets.
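As an illustration of how the two variables combine, the sketch below joins route_to_data and spot_directory for two spots. The spot names, drive paths, and project names are hypothetical stand-ins for the capitalized placeholders, and we assume the two values are simply joined into one path.

```python
import os

# Hypothetical values standing in for the capitalized placeholders in paths.yaml.
spots = {
    'sims': {'route_to_data': '/mnt/drive-a', 'spot_directory': 'membrane-project'},
    'extra': {'route_to_data': '/mnt/drive-b', 'spot_directory': 'protein-project'},
}

def spot_path(spot):
    """Join the flexible mount point with the more permanent dataset folder."""
    return os.path.join(spot['route_to_data'], spot['spot_directory'])

dataset_paths = {name: spot_path(spot) for name, spot in spots.items()}
for name, path in dataset_paths.items():
    print(name, path)
```

If the drive is later mounted somewhere else, only route_to_data changes; spot_directory stays the same.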

Warning

There is only one limitation (but many potential flaws) that comes with the total flexibility described above. Multiple spots cannot contain sub-folders with the same name, otherwise omnicalc won’t know which one to use when it looks them up later. As long as you follow this rule, the paths are entirely arbitrary. You can have multiple spots (and hence distinct datasets) within a single directory as long as you can distinguish them by regular expressions.
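Before adding a new spot, it can be worth checking for this kind of clash yourself. The following is a minimal sketch, not part of omnicalc; the function find_name_clashes and the folder listings are hypothetical.

```python
from collections import defaultdict

def find_name_clashes(spot_listings):
    """Map each folder name found in more than one spot to the spots that contain it."""
    owners = defaultdict(list)
    for spot_name, folders in spot_listings.items():
        for folder in folders:
            owners[folder].append(spot_name)
    return {folder: names for folder, names in owners.items() if len(names) > 1}

# Hypothetical folder listings for two spots.
listings = {
    'sims': ['simulation-v1', 'simulation-v2'],
    'extra': ['simulation-v2', 'simulation-v3'],
}
clashes = find_name_clashes(listings)
print(clashes)
```

An empty result means that every simulation folder name is unique across the spots.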

Using multiple spots is useful if you generate your data in separate batches but wish to analyze them together. Organizing large datasets into these “spots” gives you the first opportunity to divide your data into smaller groups. There will be many more opportunities for organizing the data within the omnicalc framework when you begin to write analysis routines. The restrictions described here only tell omnicalc how to read your data.

New data paths

All new data generated by omnicalc will default to the post and plot directories which will be created inside the omnicalc root folder. Users who wish to store post-processing data or plots elsewhere can either set them in paths.yaml or use symbolic links to new directories. It is important to note that the post folder will become large because it will contain slices of the simulation trajectories (as well as any calculated data, although those tend to be much smaller). Omnicalc will also write its current state to a workspace file which is explained elsewhere.

Warning

link to the workspace description

Further customization

The default paths are set to import data from automacs but can be modified to recognize many different naming schemes. The regexes subdictionary for each spot will tell omnicalc how to find your data. We will describe this dictionary in detail because it has consequences for naming and distinguishing your simulations.

In all subsequent analysis, your simulations must be identified by their name, specifically the name of the folder that contains the simulation (recall that one of the only hard constraints is that the GROMACS trajectories must be in a sub-folder). In the default paths.yaml file, the simulation names are specified by top inside the regexes dictionary. In this case they use the automacs default in which all simulations are named with numbers e.g. simulation-v123.

It is often clumsy to refer to simulations this way. Omnicalc allows you to group them with colloquial names using collections during the analysis phase. You can use an arbitrary regular expression (e.g. "(\w+)") if you wish to use entirely unconstrained names. Make sure to use parentheses to extract a group from the regex. The contents of the only group in this regex will be the formal “name” of your simulation used throughout the analysis.
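To see what a given top regex would accept, you can try it on a folder listing. The sketch below is an illustration only: the function simulation_names and the folder names are hypothetical, and requiring the whole folder name to match (re.fullmatch) is our assumption rather than documented omnicalc behavior.

```python
import re

# Default "top" regex from paths.yaml.
TOP_REGEX = r'(simulation-v[0-9]+)'

def simulation_names(folders, top_regex=TOP_REGEX):
    """Return the captured group for every folder whose full name matches top_regex."""
    names = []
    for folder in folders:
        found = re.fullmatch(top_regex, folder)
        if found:
            names.append(found.group(1))
    return names

# Hypothetical contents of a spot directory.
folders = ['simulation-v123', 'simulation-v7', 'README', 'bilayer_a']
default_names = simulation_names(folders)
loose_names = simulation_names(folders, r'(\w+)')
print(default_names)
print(loose_names)
```

Note that \w matches letters, digits, and underscores but not hyphens, so the looser pattern accepts README and bilayer_a while rejecting the hyphenated automacs names under this full-match assumption.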

Warning

link to collections above

The folders found inside of the spot_directory that match the top regex are the simulations, and the sub-folders of each simulation that match the step regex will constitute the “steps” of the simulation. Omnicalc tracks all trajectory files internally using a tuple containing the simulation name (the top directory), the step folder (the sub-directory), and the file name. The step names can be arbitrary, but it is often helpful to order them. This ordering can be useful in the event that you reset the simulation time using GROMACS. Most users wish to collect the most recent portions of the trajectory because they are typically the most relevant, especially if you use a complicated simulation construction procedure.

The default paths.yaml specifies that step folders should be named with a letter and number followed by a word. This follows the automacs convention in which simulation steps are named e.g. s02-bilayer-protein. The step regex can use an arbitrary number of groups which may be useful to the user later, if they wish to sort the steps based on those groups. This could be useful if your simulation steps consist of replicates (however it’s important to note that this can also be achieved by using spots or collections).
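The groups in the default step regex can be pulled apart and used as a sort key, as in the sketch below. This only demonstrates what the regex captures; whether omnicalc itself orders steps by these groups is not established here. The function parse_step and the folder names other than s02-bilayer-protein are hypothetical.

```python
import re

# Default "step" regex from paths.yaml: letter, number, then a descriptive word.
STEP_REGEX = r'([stuv])([0-9]+)-([^\/]+)'

def parse_step(folder):
    """Split a step folder name into (letter, number, label), or None if it does not match."""
    found = re.match(STEP_REGEX, folder)
    if not found:
        return None
    letter, number, label = found.groups()
    # Convert the number so that s10 sorts after s02.
    return letter, int(number), label

# Hypothetical sub-folders of one simulation.
steps = ['s02-bilayer-protein', 's01-bilayer', 'notes', 's10-production']
parsed = [item for item in map(parse_step, steps) if item is not None]
ordered = sorted(parsed)
print(ordered)
```

Converting the second group to an integer avoids the usual string-sorting surprise in which a two-digit step lands in the wrong place.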

Warning

check (1) whether the regex groups are used for step-directory ordering and (2) whether that actually matters and (3) when and why

The final regex is the “part” used to detect files that are part of a simulation trajectory. The example above uses the GROMACS convention in which files are named e.g. md.part0001.xtc or md.part0001.edr in sequence. However, the only requirement is that your files have numbers which can be used to sort their constituent parts into the proper order.

You may notice that we have separate regular expressions to identify the common GROMACS file types, namely xtc and trr for trajectories, edr for energy files, structure for coordinate files, and tpr for binary input files. Each of these is used by omnicalc. In particular, the energy files provide a fast way to identify the simulation clock for each trajectory, and the input files are essential for unwrapping periodic boundary conditions (or any calculation that requires the topology of your molecules).
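The following sketch shows how the part regexes from the default paths.yaml sort a folder's contents by file type. It is an illustration, not omnicalc's implementation; the function classify, the file listing, and the choice of a full-name match are assumptions.

```python
import re

# "part" regexes copied from the default paths.yaml.
PART_REGEXES = {
    'xtc': r'md\.part([0-9]{4})\.xtc',
    'trr': r'md\.part([0-9]{4})\.trr',
    'edr': r'md\.part([0-9]{4})\.edr',
    'tpr': r'md\.part([0-9]{4})\.tpr',
    'structure': r'(system|system-input|structure)\.(gro|pdb)',
}

def classify(filenames):
    """Group file names by the first-level key of the regex that matches them."""
    found = {kind: [] for kind in PART_REGEXES}
    for name in sorted(filenames):
        for kind, pattern in PART_REGEXES.items():
            if re.fullmatch(pattern, name):
                found[kind].append(name)
    return found

# Hypothetical contents of one step folder.
listing = ['md.part0002.xtc', 'md.part0001.xtc', 'md.part0001.edr',
           'md.part0001.tpr', 'system.gro', 'md.log']
by_type = classify(listing)
print(by_type)
```

Because the part numbers are zero-padded to four digits, a plain alphabetical sort already puts the files of each type in the proper order.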

Naming new slices

In the paragraphs above, we have described how omnicalc reads your dataset. One final component of paths.yaml specifies the format by which omnicalc will write “slices” of your simulation. Since these slices allow for an arbitrary sampling frequency and subsets of the chemical components of the simulation, the slicing functions can be incredibly useful for isolating a particular portion of your data for further analysis. Some users may wish to use omnicalc for this function alone (without the awesome calculation features described next).

The namer tells omnicalc how to create file names for the trajectory slices. Every time it slices a simulation, it creates a structure file and a trajectory file (the latter can be a full-precision trr or a compressed xtc file). The namer must be a Python lambda function that takes two arguments, the spot and the top, and returns a string. The string will be prefixed to all slice files, which will also be suffixed with the time range for the slice and the contents. For example, if you use the default naming scheme, you might produce a file called v531.100000-200000-1000.protein.xtc which would contain a slice from 100-200 ns of a simulation named simulation-v531 with a group called “protein”. We’ll describe the groups later.
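The default namer can be tried out directly in Python. In the sketch below the lambda string is copied from the default paths.yaml, but everything else is an assumption made for illustration: that the string is turned into a function with eval, the helper slice_name, and the reading of the three numbers in the file name as start, end, and sampling interval.

```python
import re

# Namer string copied from the default paths.yaml.
NAMER_SOURCE = "lambda spot,top : 'v'+re.findall('simulation-v([0-9]+)',top)[0]"

# Assumption: the string is evaluated with the re module in scope.
# Only evaluate strings from a configuration file that you trust.
namer = eval(NAMER_SOURCE, {'re': re})

def slice_name(spot, top, start, end, skip, group, extension):
    """Assemble a slice file name as prefix.start-end-skip.group.extension (assumed layout)."""
    prefix = namer(spot, top)
    return '%s.%d-%d-%d.%s.%s' % (prefix, start, end, skip, group, extension)

name = slice_name('sims', 'simulation-v531', 100000, 200000, 1000, 'protein', 'xtc')
print(name)
```

The default namer ignores its spot argument entirely and keeps only the number from the simulation name.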

Warning

link to groups

Even though the namer function must accept the spot name (the name of its parent dictionary), you do not have to use the spot in the string that it returns. You need only ensure that each incoming simulation name (given by top) produces a unique string, so that simulations coming from different spots do not overwrite one another in the post directory. Since we require that no simulation names are repeated across spots (otherwise an error will occur), the namer only has to preserve the uniqueness of the top (simulation name). Keeping the name recognizable in the returned string is also a best practice, so that you can identify your simulation slices.
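One way to check the uniqueness requirement is to run a candidate namer over all of your simulation names and look for repeated prefixes. This is a sketch under stated assumptions: the function namer_collisions, the deliberately careless namer, and the list of simulations are hypothetical, and only the default namer comes from paths.yaml.

```python
import re
from collections import defaultdict

def namer_collisions(namer, simulations):
    """Return every prefix that the namer produces for more than one (spot, top) pair."""
    seen = defaultdict(list)
    for spot, top in simulations:
        seen[namer(spot, top)].append((spot, top))
    return {prefix: sources for prefix, sources in seen.items() if len(sources) > 1}

# Default namer from paths.yaml, written here as an ordinary lambda.
default_namer = lambda spot, top: 'v' + re.findall('simulation-v([0-9]+)', top)[0]
# Hypothetical careless namer that keeps only the first digit of the name.
careless_namer = lambda spot, top: 'v' + re.findall('([0-9])', top)[0]

# Hypothetical simulations from two spots.
sims = [('sims', 'simulation-v531'), ('extra', 'simulation-v532')]
safe = namer_collisions(default_namer, sims)
unsafe = namer_collisions(careless_namer, sims)
print(safe)
print(unsafe)
```

With the careless namer both simulations map to the prefix v5, so their slices would overwrite each other in the post directory.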