.. _sec-paths:

Raw Data
========

Omnicalc is designed to perform what most people refer to as "post-processing" for biophysics simulations, in particular molecular dynamics data generated by `GROMACS <http://www.gromacs.org/>`_. It was built to work with the `Automacs <http://github.com/bradleyrp/automacs>`_ codes, which generate large batches of GROMACS simulations, but it accepts any GROMACS simulation data that is organized according to the rules laid out below.

Incoming data
-------------

Instructions for importing your data must be written to a file called ``paths.yaml``. The :ref:`paths <sec-path-setting>` section below will tell you exactly how to write this file. In this part, we will first describe how to organize your data so that omnicalc can identify it.

Data from AUTOMACS
^^^^^^^^^^^^^^^^^^

Simulations created with automacs follow a consistent directory structure. Each simulation has a root folder which contains the automacs code, along with several sub-folders, each of which corresponds to a discrete simulation "step". Each step can have many parts, but the steps are designed to be the coherent unit of analysis. Complicated simulations may have a few steps, but most of the relevant simulation data are contained in a production run in the final step. See the documentation accompanying automacs for more details on why we organize the data this way.

Data from elsewhere
^^^^^^^^^^^^^^^^^^^

Users who have already generated large simulation data sets, or who choose to use other methods to generate their data, can prepare the data for use by omnicalc by mimicking the automacs directory structure. The rule of thumb is that omnicalc must be able to find the files using a regular expression, using integers for any necessary ordering. As per the GROMACS custom and the automacs specification, we prefer to have the simulations organized into groups of trajectory files named e.g. ``md.part0001.xtc``, which can easily be detected and ordered by a regular expression (``^md\.part([0-9]{4})\.xtc$``).
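
The detection and ordering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code omnicalc uses internally. The function name ``ordered_parts`` and the directory listing are hypothetical.

```python
import re

# Pattern quoted in the text above: four-digit part numbers in GROMACS-style names.
PART_REGEX = re.compile(r'^md\.part([0-9]{4})\.xtc$')

def ordered_parts(filenames):
    """Return the files that match PART_REGEX, sorted by their integer part number."""
    matches = []
    for name in filenames:
        found = PART_REGEX.match(name)
        if found:
            # The captured group is the part number used for ordering.
            matches.append((int(found.group(1)), name))
    return [name for _, name in sorted(matches)]

# Hypothetical directory listing, for illustration only.
listing = ['md.part0010.xtc', 'md.part0002.xtc', 'md.part0001.xtc',
           'md.log', 'md.part0001.edr']
print(ordered_parts(listing))
```

Files that do not match the pattern (here the log and energy files) are simply ignored.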

.. note ::

	We would never encourage users to rename any files in a primary data set. If your files are not compatible with omnicalc, we suggest that you write a short program that uses symbolic links to name them in a systematic way.

While the naming scheme is relatively open-ended, the directory constraints are not. Users must have one folder per simulation, and the target data must be contained in a *sub*-folder of this. 
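
As a rough sketch of the symbolic-link approach suggested in the note above, the following function builds one simulation folder with a single sub-folder and links existing trajectory files into it under systematic names. The function name, the default step name ``s01-production``, and the assumption that the source files are already in chronological order are all ours. Adapt them to your own data.

```python
import os

def link_trajectories(source_files, sim_root, step_name='s01-production'):
    """Create sim_root/step_name/md.partNNNN.xtc links pointing at source_files.

    source_files must already be in chronological order. The originals are
    never renamed or moved; only symbolic links are created.
    """
    step_dir = os.path.join(sim_root, step_name)
    os.makedirs(step_dir, exist_ok=True)
    links = []
    for number, source in enumerate(source_files, start=1):
        link = os.path.join(step_dir, 'md.part%04d.xtc' % number)
        # Skip links that already exist so the function can be re-run safely.
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(source), link)
        links.append(link)
    return links

# Hypothetical usage:
# link_trajectories(['/data/run-a.xtc', '/data/run-b.xtc'], 'simulation-v001')
```

The resulting layout has one folder per simulation with the trajectory links inside a sub-folder, which is the directory constraint described above.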

What you get
------------

Omnicalc produces three tangible types of data for you. First, it "slices" GROMACS trajectories into more manageable parts, isolating only the system components that you are interested in. If you only wish to analyze a single component of a large simulation, this data-reduction step can make it possible to load the entire trajectory into memory at a particular sampling rate, which can often dramatically speed up your analysis and visualization.

Second, omnicalc can perform detailed mathematical operations on your data using the `numpy and scipy <http://docs.scipy.org/doc/>`_ libraries. The results can be saved to a portable binary format, HDF5, using `h5py <http://www.h5py.org/>`_ in Python. Calculation steps can be linked together in an arbitrary sequence to make the calculation efficient and modular.

Lastly, we have included a number of standardized plotting and animation routines that use `matplotlib <http://matplotlib.org/>`_ and `VMD <http://www.ks.uiuc.edu/Research/vmd/>`_ to create elegant images of the resulting data or trajectories.

We group the simulation slices and the postprocessing binaries in a single output directory and reserve a separate directory for plots. The paths to these data are specified in ``paths.yaml``, described next.

.. _sec-path-setting:

Setting the paths
-----------------

As long as your data are correctly organized and mounted on your machine, you can set the paths by first running ``make default_paths`` in the omnicalc root directory. This creates a default ``paths.yaml`` file for you to edit. We have reproduced the essential features below. In the remainder of this section we will fully specify the omnicalc path scheme. For a quick start, read the comments in the default ``paths.yaml`` file.

.. code-block :: yaml

	post_data_spot: post
	post_plot_spot: plot
	workspace_spot: workspace 
	timekeeper: false
	spots:
	  sims:
	    namer: "lambda spot,top : 'v'+re.findall('simulation-v([0-9]+)',top)[0]"
	    route_to_data: PARENT_DIRECTORY_FOR_MY_DATA
	    spot_directory: DATA_DIRECTORY_IN_ROUTE_TO_DATA
	    regexes:
	      top: '(simulation-v[0-9]+)'
	      step: '([stuv])([0-9]+)-([^\/]+)'
	      part: 
	        xtc: 'md\.part([0-9]{4})\.xtc'
	        trr: 'md\.part([0-9]{4})\.trr'
	        edr: 'md\.part([0-9]{4})\.edr'
	        tpr: 'md\.part([0-9]{4})\.tpr'
	        structure: '(system|system-input|structure)\.(gro|pdb)'

Necessary paths
^^^^^^^^^^^^^^^

Users must replace any capitalized text in the default ``paths.yaml`` file. The first path you must specify is the ``route_to_data`` variable, which sets the root directory for your incoming dataset. It is best to use this variable to set the system path for a data-specific hard drive. It is entirely flexible and can be changed whenever you move your data to a new system or re-mount the drive. The remainder of the path to your data is stored in ``spot_directory``. It is best to give this directory a more permanent name to avoid confusion.
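
The split between the two variables can be illustrated as follows. We assume here that the full path to a dataset is simply ``route_to_data`` joined with ``spot_directory``. The helper name ``dataset_path`` and the example paths are hypothetical.

```python
import os

# Hypothetical values standing in for the capitalized placeholders.
spot = {
    'route_to_data': '/mnt/external-drive',
    'spot_directory': 'membrane-project/batch-1',
}

def dataset_path(spot):
    """Join the flexible mount point with the permanent dataset directory."""
    return os.path.join(spot['route_to_data'], spot['spot_directory'])

print(dataset_path(spot))

# After re-mounting the drive, only route_to_data needs to change.
spot['route_to_data'] = '/media/backup'
print(dataset_path(spot))
```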

The ``spot_directory`` variable should point from the ``route_to_data`` to a single, specific dataset. Omnicalc is designed to use *multiple* parallel datasets at the same time. We refer to the paths of these datasets as "spots", each of which is defined by a separate sub-dictionary underneath ``spots`` in the ``paths.yaml`` file. The name of the sub-dictionary for each spot (e.g. ``sims`` in the example above) is typically hidden from the user, and is only used by omnicalc to keep track of the parallel datasets.

.. warning ::

	There is only one limitation (but many potential flaws) that comes with the total flexibility described above. Multiple spots cannot contain sub-folders with the same name; otherwise omnicalc won't know which one to use when it looks them up later. As long as you follow this rule, the paths are entirely arbitrary. You can have multiple spots (and hence distinct datasets) within a single directory as long as you can distinguish them by regular expressions.

Using multiple spots is useful if you generate your data in separate batches but wish to analyze them together. Organizing large datasets into these "spots" gives you the first opportunity to divide your data into smaller groups. There will be many more opportunities for organizing the data *within* the omnicalc framework when you begin to write analysis routines. The restrictions described here only tell omnicalc how to read your data.
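
Before adding a new spot, it may help to confirm that no simulation folder name appears in more than one spot. The following is a minimal sketch of such a check, not part of omnicalc. The function name and the folder listings are hypothetical.

```python
from collections import defaultdict

def find_name_collisions(spot_contents):
    """Map each folder name to the spots that contain it, keeping only
    the names that appear in more than one spot."""
    seen = defaultdict(list)
    for spot_name, folders in spot_contents.items():
        for folder in folders:
            seen[folder].append(spot_name)
    return {name: spots for name, spots in seen.items() if len(spots) > 1}

# Hypothetical folder listings for two spots.
spot_contents = {
    'sims': ['simulation-v001', 'simulation-v002'],
    'replicates': ['simulation-v002', 'simulation-v003'],
}
print(find_name_collisions(spot_contents))
```

An empty result means the spots satisfy the rule given in the warning above.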

New data paths
^^^^^^^^^^^^^^

All new data generated by omnicalc will default to the ``post`` and ``plot`` directories which will be created inside the omnicalc root folder. Users who wish to store post-processing data or plots elsewhere can either set them in ``paths.yaml`` or use symbolic links to new directories. It is important to note that the ``post`` folder will become large because it will contain slices of the simulation trajectories (as well as any calculated data, although those tend to be much smaller). Omnicalc will also write its current state to a ``workspace`` file which is explained elsewhere.

.. warning ::

	link to the workspace description

Further customization
^^^^^^^^^^^^^^^^^^^^^

The default paths are set to import data from automacs but can be modified to recognize many different naming schemes. The ``regexes`` subdictionary for each spot will tell omnicalc how to find your data. We will describe this dictionary in detail because it has consequences for naming and distinguishing your simulations.

In all subsequent analysis, your simulations must be identified by their name, specifically the name of the folder that contains the simulation (recall that one of the only hard constraints is that the GROMACS trajectories must be in a *sub*-folder). In the default ``paths.yaml`` file, the simulation names are specified by ``top`` inside the ``regexes`` dictionary. In this case they use the automacs default in which all simulations are named with numbers e.g. ``simulation-v123``.

It is often clumsy to refer to simulations this way. Omnicalc allows you to group them with colloquial names using ``collections`` during the analysis phase. You can use an arbitrary regular expression (e.g. ``"(\w+)"``) if you wish to use entirely unconstrained names. Make sure to use parentheses to extract a group from the regex. The contents of the only group in this regex will be the formal "name" of your simulation used throughout the analysis.
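
To see how the single group in the ``top`` regex yields simulation names, consider the sketch below. It assumes that the whole folder name must match the pattern (``re.fullmatch``). How omnicalc applies the regex internally may differ. The function name and folder names are hypothetical.

```python
import re

# Default "top" regex from the paths.yaml example above.
TOP_REGEX = r'(simulation-v[0-9]+)'

def simulation_names(folders, top_regex=TOP_REGEX):
    """Return the contents of the first regex group for each matching folder."""
    names = []
    for folder in folders:
        found = re.fullmatch(top_regex, folder)
        if found:
            names.append(found.group(1))
    return names

# Hypothetical folder listings, for illustration only.
print(simulation_names(['simulation-v123', 'notes', 'old-simulation-v7']))
print(simulation_names(['bilayer_large', 'scratch-space'], top_regex=r'(\w+)'))
```

Note that ``\w`` matches letters, digits, and underscores but not hyphens, so under this assumption a folder such as ``scratch-space`` would not be picked up by ``(\w+)``.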

.. warning ::

	link to collections above

The sub-folders of each simulation folder that match the ``step`` regex constitute the "steps" of the simulation. Omnicalc tracks all trajectory files internally using a tuple containing the simulation name (the top directory), the step folder (the sub-directory), and the file name. The step names can be arbitrary, but it is often helpful to order them. This ordering can be useful in the event that you reset the simulation time using GROMACS. Most users wish to collect the most recent portions of the trajectory because they are typically the most relevant, especially if you use a complicated simulation construction procedure.

The default ``paths.yaml`` specifies that step folders should be named with a letter and number followed by a word. This follows the automacs convention in which simulation steps are named e.g. ``s02-bilayer-protein``. The step regex can use an arbitrary number of groups which may be useful to the user later, if they wish to sort the steps based on those groups. This could be useful if your simulation steps consist of replicates (however it's important to note that this can also be achieved by using spots or collections).
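
The groups in the default ``step`` regex can be turned into a sort key as in the sketch below. This only shows one way a user might order step folders with those groups. We make no claim that omnicalc sorts steps this way. The function name and step names are hypothetical.

```python
import re

# Default "step" regex from the paths.yaml example above.
STEP_REGEX = r'([stuv])([0-9]+)-([^\/]+)'

def step_sort_key(step_name):
    """Split a step folder name into (letter, number, label) for sorting."""
    found = re.fullmatch(STEP_REGEX, step_name)
    if found is None:
        raise ValueError('not a step folder: %r' % step_name)
    letter, number, label = found.groups()
    # Convert the number to an integer so that 10 sorts after 2.
    return (letter, int(number), label)

# Hypothetical step folders, for illustration only.
steps = ['s02-bilayer-protein', 's10-production', 's01-bilayer']
print(sorted(steps, key=step_sort_key))
```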

.. warning ::

	check (1) whether the regex groups are used for step-directory ordering and (2) whether that actually matters and (3) when and why

The final regex is the "part" used to detect files that are part of a simulation trajectory. The example above uses the GROMACS convention in which files are named e.g. ``md.part0001.xtc`` or ``md.part0001.edr`` in sequence. However, the only requirement is that your files have numbers which can be used to sort their constituent parts into the proper order. 

You may notice that we have separate regular expressions to identify the common GROMACS file types, namely ``xtc`` and ``trr`` for trajectories, ``edr`` for energy files, ``structure`` for coordinate files, and ``tpr`` for binary input files. Each of these is used by omnicalc. In particular, the energy files provide a fast way to identify the simulation clock for each trajectory, and the input files are essential for unwrapping periodic boundary conditions (or any calculation that requires the topology of your molecules).
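
As an illustration of how the ``part`` regexes separate the files in a step folder by type, the sketch below applies each pattern from the default ``paths.yaml`` to a hypothetical directory listing. The function name ``classify`` and the use of ``re.fullmatch`` are our assumptions, not omnicalc internals.

```python
import re

# The "part" regexes from the paths.yaml example above.
PART_REGEXES = {
    'xtc': r'md\.part([0-9]{4})\.xtc',
    'trr': r'md\.part([0-9]{4})\.trr',
    'edr': r'md\.part([0-9]{4})\.edr',
    'tpr': r'md\.part([0-9]{4})\.tpr',
    'structure': r'(system|system-input|structure)\.(gro|pdb)',
}

def classify(filenames):
    """Group file names by the first-level key whose regex they match."""
    found = {kind: [] for kind in PART_REGEXES}
    # Sorting by name orders the parts because the numbers are zero-padded.
    for name in sorted(filenames):
        for kind, pattern in PART_REGEXES.items():
            if re.fullmatch(pattern, name):
                found[kind].append(name)
    return found

# Hypothetical directory listing, for illustration only.
listing = ['md.part0002.xtc', 'md.part0001.xtc', 'md.part0001.edr',
           'md.part0001.tpr', 'system.gro', 'md.log']
for kind, names in classify(listing).items():
    print(kind, names)
```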

.. _sec-slice-names:

Naming new slices
^^^^^^^^^^^^^^^^^

In the paragraphs above, we have described how omnicalc reads your dataset. One final component of ``paths.yaml`` specifies the format by which omnicalc will write "slices" of your simulation. Since these slices allow for an arbitrary sampling frequency and subsets of the chemical components of the simulation, the slicing functions can be incredibly useful for isolating a particular portion of your data for further analysis. Some users may wish to use omnicalc for this function alone (without the awesome calculation features described next).

The ``namer`` tells omnicalc how to create file names for the trajectory slices. Every time it slices a simulation, it creates a structure file and a trajectory file (the latter can be a full-precision ``trr`` or a compressed ``xtc`` file). The ``namer`` must be a Python lambda function that takes two arguments, the ``spot`` and the ``top``, and returns a string. The string will be prefixed to all slice files, which will also be suffixed with the time range for the slice and the contents. For example, if you use the default naming scheme, you might produce a file called ``v531.100000-200000-1000.protein.xtc`` which would contain a slice from ``100-200ns`` of a simulation named ``simulation-v531`` with a group called "protein". We'll describe the groups later.
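
The default ``namer`` can be tried out directly in Python. In the sketch below we assume that the string from ``paths.yaml`` is evaluated with the ``re`` module available. The ``slice_filename`` helper and its parameter names (``start``, ``end``, ``skip``) are hypothetical and merely reproduce the example file name given above.

```python
import re

# The default namer, stored as a string in paths.yaml.
NAMER_TEXT = "lambda spot,top : 'v'+re.findall('simulation-v([0-9]+)',top)[0]"

# Assumption: the string is evaluated with "re" in scope. Only use eval on
# configuration files that you trust.
namer = eval(NAMER_TEXT, {'re': re})

def slice_filename(spot, top, start, end, skip, group, suffix='xtc'):
    """Build a slice name in the style of the example in the text above."""
    prefix = namer(spot, top)
    return '%s.%d-%d-%d.%s.%s' % (prefix, start, end, skip, group, suffix)

print(namer('sims', 'simulation-v531'))
print(slice_filename('sims', 'simulation-v531', 100000, 200000, 1000, 'protein'))
```

The default ``namer`` ignores its ``spot`` argument and keeps only the number from the simulation name.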

.. warning ::

	link to groups

Even though the ``namer`` function must accept the spot name (the name of its parent dictionary), you do not have to use the spot in the string that it returns. You need only ensure that distinct incoming simulation names (given by ``top``) produce distinct strings, so that simulations coming from different spots do not overwrite others in the ``post`` directory. Since we require that no simulation names are repeated across spots (otherwise an error will occur), the ``namer`` need only retain the uniqueness of the ``top`` (simulation name). This is a best practice, so that you can identify your simulation slices.
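
One way to check that a ``namer`` preserves uniqueness is to apply it to every simulation and look for repeated prefixes. The sketch below is not part of omnicalc. The function ``find_prefix_clashes``, the deliberately bad namer, and the simulation list are hypothetical.

```python
import re

def default_namer(spot, top):
    """Same behavior as the default namer in the paths.yaml example."""
    return 'v' + re.findall('simulation-v([0-9]+)', top)[0]

def find_prefix_clashes(namer, simulations):
    """Return the prefixes that more than one (spot, top) pair would produce."""
    produced = {}
    for spot, top in simulations:
        produced.setdefault(namer(spot, top), []).append((spot, top))
    return {prefix: sources for prefix, sources in produced.items()
            if len(sources) > 1}

# Hypothetical simulations drawn from two spots.
simulations = [('sims', 'simulation-v001'), ('sims', 'simulation-v002'),
               ('extra', 'simulation-v010')]

print(find_prefix_clashes(default_namer, simulations))

# A namer that discards the simulation name makes every slice collide.
print(find_prefix_clashes(lambda spot, top: 'sim', simulations))
```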