Connecting Sub-databases

From Docswiki
Revision as of 10:40, 17 May 2019 by Dw34

Definitions

For the purposes of this tutorial, I am defining sub-databases to be sets of connected minima and transition states within a larger database.

Context and Motivation

In databases containing many thousands of minima and TSs, it is unlikely that these will all be connected to one another. This is particularly the case when the database has been grown using methods such as ADDPATH and MERGEDB. Instead, the database is more likely to consist of many sub-databases of varying size. A disconnectivity graph cannot plot more than one set of connected minima (i.e. more than one sub-database) at a time, so when one is constructed a lot of the data present in the min.data, points.min, points.ts and ts.data files is ignored. The sub-database that the disconnectivity graph plots depends on the numerical argument to the keyword CONNECTMIN in the dinfo file. This argument identifies a minimum by its line number in the min.data file: an argument of 12, for example, corresponds to line 12 of min.data. Only this minimum, plus any others it is connected to, is plotted in the disconnectivity graph.

The question, therefore, is how to efficiently connect minima already present in the min.data file. It would be particularly important to connect sub-databases with a lot of minima in them (it would probably be a waste of time to connect all those sub-databases with only 2 minima in them, for example, as by doing so you’re not collecting much more information).

Another consideration is that we want the connection attempts between sub-databases to be efficient. We want to try to connect sub-databases that are closer to one another (or, more specifically, sub-databases which have at least one minimum which is close in chemical space to a minimum in another sub-database). This consideration is especially important for large systems (such as large proteins with cofactors) as trying to connect minima far apart in space can be very slow or even break down due to memory issues.

Systems for which this approach might be particularly useful

This methodology might be particularly useful for cases where you have a protein with a cofactor and various sites within a pocket that you think the cofactor can attach to. It provides an efficient method to connect these sites within the pocket, having already sampled each.

Step 1: Using disconnectionDPS to determine the breakdown of sub-databases within your database

Requirements

A folder containing the files min.data, points.min, points.ts, ts.data, dinfo, the script find_connections.sh (to be found in the svn at ~/svn/SCRIPTS/DISCONNECT) and the binary disconnectionDPS, plus any other auxiliary files you may need.

Method

In dinfo, you need to use the keyword PRINTCONNECTED. An example dinfo file that I've used is:

! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED

PRINTCONNECTED ensures that a file called connected is written, which lists all of the minima plotted in the disconnectivity graph (i.e. all of the minima present in the sub-database considered). In the example above, because the argument to CONNECTMIN is 1, minimum 1 and all those minima to which it is connected are plotted.

This gives us information on only one sub-database present in the file. To find out about all of them, the script find_connections.sh is used.

This script cycles through the min.data file, executing disconnectionDPS once for each value of the argument to CONNECTMIN. The connected file produced is then renamed connected_*, where * is the argument to CONNECTMIN used for that run. For a min.data file with 17603 lines (and therefore 17603 minima), for example, the argument ranges from CONNECTMIN 1 to CONNECTMIN 17603. If a minimum is already present in a previous connected_* file then that argument is skipped. For example, if the run with CONNECTMIN 1 gave a sub-database containing minima 1 and 2 (i.e. the minima on lines 1 and 2 of min.data), then no run with CONNECTMIN 2 is attempted, as minimum 2 is already assigned to the sub-database described in connected_1. The next iteration uses CONNECTMIN 3.

This script cycles until all the minima in min.data have been considered.
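The grouping that find_connections.sh builds up through repeated disconnectionDPS runs amounts to finding the connected sets of a graph in which minima are nodes and transition states are edges. The Python sketch below illustrates that idea only; it is not the script itself. The function name and the input format (a list of pairs of minima, one pair per transition state) are assumptions made for illustration, and reading those pairs out of ts.data is left to the user.

```python
from collections import defaultdict


def find_sub_databases(n_minima, ts_pairs):
    """Group minima 1..n_minima into connected sets.

    ts_pairs is an iterable of (min_a, min_b) index pairs, one per
    transition state (hypothetical input format). Returns a dict that
    maps the lowest minimum index in each set to the sorted list of
    minima in that set, mirroring the connected_* naming.
    """
    neighbours = defaultdict(set)
    for a, b in ts_pairs:
        if a != b:  # a TS linking a minimum to itself adds no connection
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    sub_databases = {}
    for start in range(1, n_minima + 1):
        if start in seen:
            continue  # already assigned to an earlier set, so skip it
        stack, members = [start], []
        seen.add(start)
        while stack:
            current = stack.pop()
            members.append(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        sub_databases[start] = sorted(members)
    return sub_databases
```

For five minima with transition states joining 1 to 2 and 3 to 5, find_sub_databases(5, [(1, 2), (3, 5)]) gives three sets keyed by 1, 3 and 4, with minimum 4 on its own.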

Another feature of find_connections.sh is that, when a connected_* file exceeds a set number of minima (I think 10 is sensible), it gets copied to a corresponding relevant_connected_* file. For example, if connected_3 has 50 minima then it exceeds 10, and so the information in this file is copied to another one called relevant_connected_3. This piece of book-keeping allows the user to identify the larger sub-databases more easily (and so the ones that s/he is more likely to want to connect to one another).
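The size filter just described could be sketched in Python as follows. This is an illustration under assumptions, not the logic of find_connections.sh: the function name is made up, and it assumes that each connected_* file holds one minimum per non-empty line, which you should check against your own files.

```python
from pathlib import Path


def large_sub_databases(folder, threshold=10):
    """Return names of connected_* files listing more than `threshold` minima.

    Assumption: one minimum per non-empty line in each connected_* file.
    """
    selected = []
    for path in sorted(Path(folder).glob("connected_*")):
        with path.open() as handle:
            n_minima = sum(1 for line in handle if line.strip())
        if n_minima > threshold:
            selected.append(path.name)
    return selected
```

The default threshold of 10 follows the value suggested above; files already named relevant_connected_* are not matched by the pattern, so they are not counted twice.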

A few notes on use: to use this script, it is sensible to copy the min.data, points.min, points.ts and ts.data files of the database you are interested in to another folder. The only other files you need are the script itself, the relevant binary and dinfo (plus perhaps some case-specific auxiliary files). It should be ensured that before executing the binary, the argument to CONNECTMIN in dinfo is 1. Also, PRINTCONNECTED must be included as a keyword.
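The two dinfo requirements in these notes are easy to forget, so a quick check before launching may be worthwhile. The Python sketch below is a hypothetical helper (the name check_dinfo is mine); it assumes only what the example dinfo above shows, namely one keyword per line and comment lines beginning with "!".

```python
def check_dinfo(text):
    """Return a list of problems found in the text of a dinfo file.

    Lines starting with '!' are treated as comments, as in the example
    dinfo file above.
    """
    keywords = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("!"):
            continue
        keywords[fields[0]] = fields[1:]

    problems = []
    if "PRINTCONNECTED" not in keywords:
        problems.append("PRINTCONNECTED is missing")
    if keywords.get("CONNECTMIN") != ["1"]:
        problems.append("CONNECTMIN should be set to 1 before starting")
    return problems
```

An empty list means both requirements are met; otherwise each entry names what needs fixing.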

Step 2: RETAINSP and CONNECTUNC LOWESTTEST

So, we now have a list of files relevant_connected_* corresponding to sub-databases which we would like to connect.

Remember, though, we want to connect them efficiently!

Before attempting any connections, then, it is advisable to get a flavour of the distances separating these sub-databases from one another (or, at least, the shortest distance possible between any two minima of all of the sub-databases).

To do this, we need to limit the min.data file (and ts.data) in a sub-folder so that only those minima corresponding to the two sub-databases we are interested in are considered. We can use the keyword found in PATHSAMPLE, RETAINSP, for this purpose. By using an adapted version of CONNECTUNC with a new argument called LOWESTTEST, we can identify sensible connections to make, without actually attempting the connection.

This approach works as long as min.A and min.B both correspond to minima in the AB set (this is accounted for in the script I’ve written, connect_sub_databases.sh). What this does is find the unconnected minima (i.e. those in the set which is not the AB set) of lowest energy. It then loops through all the minima in the AB set, printing the distance between each pair of minima without actually attempting the connection. A further loop operates so that all unconnected minima are considered too.

Once all minima are considered, the loop is abruptly exited by a STOP statement.
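The double loop described above, which reports a distance for every pair made up of one minimum from each set without attempting any connection, can be pictured with the Python sketch below. It is a stand-in for illustration and not what PATHSAMPLE does internally: it uses a plain Euclidean distance between coordinate lists, whereas the distances PATHSAMPLE reports are likely to differ (for example when PERMDIST is set in pathdata). The function name and the dictionary input format are assumptions.

```python
import math


def cross_set_distances(set_a, set_b):
    """Return (distance, index_a, index_b) for every cross-set pair, sorted.

    set_a and set_b map a minimum's index to a flat list of Cartesian
    coordinates (hypothetical input format). Distances are plain
    Euclidean; no connection is attempted.
    """
    results = []
    for index_a, coords_a in set_a.items():
        for index_b, coords_b in set_b.items():
            results.append((math.dist(coords_a, coords_b), index_a, index_b))
    return sorted(results)
```

The first entry of the returned list is the closest pair, which is the natural first candidate for a connection attempt.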

Using grep (don't worry about executing these commands yourself as they are all contained in the script connect_sub_databases.sh):

grep "connectlowest> Distance: " pathsample_connectunc_test.out > distances
sed -e "s/^/$dirname  /g" distances > distances_tmp

we are able to build up a list of all the proposed connections made by CONNECTUNC LOWESTTEST between the two chosen sub-databases. This information is then appended to an overall file called distances_tot in the folder from which the script was originally launched. Eventually, once all pairs of sub-databases have been considered, we should have a massive file listing all of the potential connections between all of the minima in all of the sub-databases, along with the distances separating them. A few example lines from such a file are as follows:

00003_00303       359    4550 connectlowest> Distance:    42.42963734
00003_00303       341     147 connectlowest> Distance:    39.39663225
00003_02150      2280    1932 connectlowest> Distance:    75.54181654

The first column lists the two sub-databases which were considered. Another nice feature of the CONNECTUNC LOWESTTEST keyword and argument is that, alongside the distance, the specific minima (DMIN1 and DMIN2) from the two sub-databases being considered are listed in the second and third columns (359 and 4550 in the line below):

00003_00303       359    4550  connectlowest> Distance:    42.42963734

359, therefore, is a minimum which belongs to sub-database 00003 (i.e. the sub-database described by the file relevant_connected_00003) and 4550 a minimum which belongs to sub-database 00303.
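If you want to process distances_tot yourself, the column layout just described can be read with a few lines of Python. This is a minimal sketch assuming every line has exactly the layout shown above; the class and function names are hypothetical.

```python
from typing import NamedTuple


class ProposedConnection(NamedTuple):
    sub_a: str       # label of the first sub-database, e.g. "00003"
    sub_b: str       # label of the second sub-database, e.g. "00303"
    min_a: int       # minimum belonging to sub_a
    min_b: int       # minimum belonging to sub_b
    distance: float  # distance reported after "Distance:"


def parse_distance_line(line):
    """Parse one line of distances_tot in the layout shown above."""
    fields = line.split()
    sub_a, sub_b = fields[0].split("_")
    return ProposedConnection(sub_a, sub_b, int(fields[1]), int(fields[2]),
                              float(fields[-1]))
```

The sub-database labels are kept as strings so that the leading zeros, which match the relevant_connected_* file names, are not lost.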

Step 3: Organising Calculations to Attempt

Clearly, a pair of minima separated by 39.397 is a more feasible calculation to attempt than one separated by 42.430 or 75.542. The script we have (connect_sub_databases.sh) therefore reorganises distances_tot to list the proposed connections between pairs from shortest distance to longest. This new reorganised file we give the rather unimaginative name of lowest_to_highest_distances_tot.
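The reordering step amounts to a numerical sort on the final column. A Python sketch of the same operation is given below for illustration; the function name is hypothetical and the script itself may do this differently.

```python
def sort_by_distance(lines):
    """Return the non-empty lines ordered by their final field, ascending."""
    kept = [line for line in lines if line.strip()]
    return sorted(kept, key=lambda line: float(line.split()[-1]))
```

Sorting on a float rather than on the text avoids the usual pitfall of string ordering, where "10.0" would sort ahead of "9.0".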

The rest of the script is concerned with connecting all of the sub-databases in as efficient a way, using as few steps, as possible. This is probably best illustrated by an example:

I have 15 sub-databases I wish to connect. The minima comprising each can be found in:

relevant_connected_00003
relevant_connected_00164
relevant_connected_00303
relevant_connected_02150
relevant_connected_06061
relevant_connected_06274
relevant_connected_06610
relevant_connected_06913
relevant_connected_07339
relevant_connected_09000
relevant_connected_09969
relevant_connected_10040
relevant_connected_12405
relevant_connected_14191
relevant_connected_14775

Here are the first ten lines of lowest_to_highest_distances_tot. Lines 1 to 4 and line 7 are connections attempted, whilst lines 5, 6, 8, 9 and 10 were skipped over because they turn out to be superfluous (why attempt line 5, for example, when line 4 is attempting to connect the same two sub-databases?):

00003_02150      2657    2663 connectlowest> Distance:     0.39352080 
09000_12405      3003    3033 connectlowest> Distance:     0.84958725 
09000_10040      1228    1251 connectlowest> Distance:     1.01130262 
09000_14191      3176    3209 connectlowest> Distance:     1.07183817 
09000_14191      3194    3209 connectlowest> Distance:     1.81036433 
09000_14191      3193    3187 connectlowest> Distance:     1.88481550  
09000_14775      3450    3457 connectlowest> Distance:     2.41249957 
09000_14191      3203    3187 connectlowest> Distance:     2.42913148 
09000_14191      3177    3209 connectlowest> Distance:     2.45715932 
00003_02150      2572    2663 connectlowest> Distance:     2.82747537 

Using these principles, the sub-databases were therefore connected as follows, starting from the first line of lowest_to_highest_distances_tot and working down (each step was illustrated by a diagram, which is not available here):

00003_02150      2657    2663 connectlowest> Distance:     0.39352080
09000_12405      3003    3033 connectlowest> Distance:     0.84958725
09000_10040      1228    1251 connectlowest> Distance:     1.01130262
09000_14191      3176    3209 connectlowest> Distance:     1.07183817
09000_14775      3450    3457 connectlowest> Distance:     2.41249957
00003_00164       134     159 connectlowest> Distance:     5.00815137
06061_06913       402     296 connectlowest> Distance:     5.01723232
00003_07339      3893    3899 connectlowest> Distance:     5.68186344
06610_07339       135     137 connectlowest> Distance:     7.04874883
00003_00303       670     811 connectlowest> Distance:    24.67395896
06061_06274       257     459 connectlowest> Distance:    31.59473639
09000_09969      1317    1946 connectlowest> Distance:    40.39286979
00003_06061      4149    4481 connectlowest> Distance:    44.36489142
00003_09000      7369    2632 connectlowest> Distance:    71.98718590

Connection attempts are therefore chosen using distances that are as small as possible. It should also be possible to connect these 15 sub-databases using 14 connections, since each successful connection joins two previously separate groups into one. Some OPTIM runs may, for whatever reason, not work (although this is unlikely), in which case the user can manually look through the lowest_to_highest_distances_tot file to find the next most suitable pair of minima from the two sub-databases in question to connect.
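The selection rule (take the shortest remaining distance, and skip any line whose two sub-databases are already linked, directly or through earlier choices) is the same idea as Kruskal's algorithm for a minimum spanning tree. The Python sketch below illustrates that rule; it is my reading of the procedure rather than a copy of connect_sub_databases.sh, and the function name is hypothetical.

```python
def choose_connections(sorted_lines):
    """Pick one proposed connection per merge of sub-database groups.

    sorted_lines must already be ordered from shortest to longest
    distance. A line is skipped when its two sub-databases are already
    linked, directly or through earlier choices.
    """
    parent = {}

    def find(label):
        # Follow parent links to the representative of label's group.
        parent.setdefault(label, label)
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    chosen = []
    for line in sorted_lines:
        fields = line.split()
        if not fields:
            continue
        sub_a, sub_b = fields[0].split("_")
        root_a, root_b = find(sub_a), find(sub_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups
            chosen.append((fields[0], int(fields[1]), int(fields[2])))
    return chosen
```

Applied to the first ten lines of lowest_to_highest_distances_tot shown earlier, this keeps lines 1 to 4 and line 7. With N sub-databases it can return at most N-1 connections, which is where the figure of 14 for 15 sub-databases comes from.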

The 14 connections listed above are those recommended to be made. The script therefore creates 14 folders in which these connections are to be attempted, named 00003_02150, 09000_12405 etc.

Within each sub-folder, a PATHSAMPLE calculation using RETAINSP is performed first. This ensures that only the minima pertaining to the two sub-databases we are trying to connect are included (e.g. 00003 and 02150 in the case of the sub-folder 00003_02150) and that the numbering scheme in min.data is consistent with that in place when CONNECTUNC LOWESTTEST was used. Once this is done, the pathdata file is altered so that, rather than RETAINSP, the keywords CONNECTPAIRS connectfile and CYCLES 1 are used.

What CONNECTPAIRS does is to choose specific minima from min.data for connection attempts. It requires an argument naming a file that lists the minima we want to connect. Typically, this argument is connectfile, and therefore we require a file called connectfile. This file lists the minima we wish to connect according to their line number in min.data. As shown above (and copied immediately below) for the connection we wish to make between the sub-databases 00003 and 02150, the minima we are interested in are 2657 and 2663.

00003_02150      2657    2663 connectlowest> Distance:     0.39352080

Therefore, connectfile, upon opening, should appear simply as:

2657 2663

Thus, an OPTIM job gets launched, which tries to connect minima 2657 (in the 00003 sub-database) and 2663 (in the 02150 sub-database). Because these minima are separated by a really short distance of 0.394, it will hopefully (this is not foolproof!) be a simple connection.
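Producing connectfile from a chosen line is a one-step transformation: take the second and third columns. A Python sketch follows, with a hypothetical function name, assuming the line layout used throughout this page.

```python
def connectfile_text(line):
    """Return the connectfile contents for one proposed connection line."""
    fields = line.split()
    # Columns 2 and 3 are the two minima to be connected.
    return f"{int(fields[1])} {int(fields[2])}\n"
```

Converting the columns through int() means a malformed line raises an error rather than silently writing a bad connectfile.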

In this example, following the use of the script, we should end up with 14 folders, each with a connected path between two sub-databases.

How to execute steps 2 and 3?

Steps 2 and 3 above clearly involve a number of operations. A script has been written which does all of this (i.e. chooses which connections to make and then attempts them) for you. This script is called connect_sub_databases.sh and can be found in the svn at ~/svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases. An annotated version of this script is also available.

The following files are required in the folder from which you launch this script (assuming you are using the AMBER interface):

coords.inpcrd, coord.mdcrd, coords.prmtop, min.in, min.A, min.B, min.data, odata.connect, pathdata, perm.allow, points.min, points.ts, ts.data, untrap_sub_script, relevant_connected_*

Where relevant_connected_* are all of the sub-databases found in step 1. This could be any number of files. Note that untrap_sub_script just happens to be the name of the sub-script I used for my calculations. You can either rename your sub-script to untrap_sub_script or alter connect_sub_databases.sh before launching your calculations. It is essential that your pathdata file is correctly formatted before launching the script too. Two lines

RETAINSP
! CYCLES 1

must be included. An example of a pathdata file I used is as follows:

EXEC           /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS           1
NATOMS         5430
SEED           1
DIRECTION      AB
CONNECTIONS    1
TEMPERATURE    0.592
PLANCK         9.536D-14

PERMDIST
ETOL           8D-4
GEOMDIFFTOL    0.2D0
ITOL           0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12

Make sure that this pathdata points towards a valid binary and that the number of atoms is consistent with the system you are examining.
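The reason the two lines RETAINSP and "! CYCLES 1" need to be present is presumably that the script later swaps them for CONNECTPAIRS connectfile and CYCLES 1, as described in step 3. The Python sketch below shows one way such a swap could be done; it is a guess at the mechanism for illustration, not the script's actual code, and the function name is hypothetical.

```python
def switch_to_connectpairs(pathdata_text, connect_file="connectfile"):
    """Swap RETAINSP for CONNECTPAIRS and activate the CYCLES 1 line."""
    new_lines = []
    for line in pathdata_text.splitlines():
        stripped = line.strip()
        if stripped == "RETAINSP":
            new_lines.append(f"CONNECTPAIRS {connect_file}")
        elif stripped.startswith("!") and stripped[1:].split() == ["CYCLES", "1"]:
            new_lines.append("CYCLES 1")  # uncomment the prepared line
        else:
            new_lines.append(line)  # every other line is left untouched
    return "\n".join(new_lines) + "\n"
```

If either of the two lines is missing or formatted differently, a swap of this kind finds nothing to change, which is why the pathdata format matters.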

Step 4: Merging the sub-databases

Following the use of the script, we should end up with x number of folders, each with a connected path between two sub-databases. The total number of folders will encompass the total number of connections which needed to be made in order to connect all of the sub-databases we were interested in.

Assuming all of the connections were successfully made, all we need to do now is to merge these new connected databases together. This can be achieved using the MERGEDB keyword in PATHSAMPLE.
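Before merging, it can help to collect the names of the connection folders so that none is missed. The Python sketch below assumes the folders follow the five-digit naming seen in the example (00003_02150 and so on); the function name is hypothetical, and the pattern should be adjusted if your labels have a different width.

```python
import re
from pathlib import Path

# Folder names such as "00003_02150" (five digits, underscore, five digits).
PAIR_FOLDER = re.compile(r"^\d{5}_\d{5}$")


def connection_folders(root):
    """Return the names of sub-folders that look like '00003_02150'."""
    return sorted(p.name for p in Path(root).iterdir()
                  if p.is_dir() and PAIR_FOLDER.match(p.name))
```

The returned list gives the folders whose databases are to be merged into the original.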

Summary for those who don't like screeds of writing

It is possible that your PATHSAMPLE database contains many sub-databases not necessarily connected to one another. Therefore, a lot of information is lost when you try to construct disconnectivity graphs.

To retrieve this information, we should therefore connect these sub-databases. This can be done efficiently by first identifying which minima contained in respective sub-databases are closest to each other, and then trying to connect these first.

To identify the sub-databases present in your database in the first place, launch the script find_connections.sh using disconnectionDPS with the PRINTCONNECTED keyword included and the argument to CONNECTMIN set to 1. This produces a set of relevant_connected_* files, each one representing a sub-database and listing the minima present in it.

Following this step, the script connect_sub_databases.sh is launched. It determines the distances between all of the possible minima in all of the possible sub-databases. It then prioritises calculations in order to connect the maximum number of sub-databases in the minimum possible number of steps, with those of shortest distance being attempted first.

The outcome to the launch of this latter script is that a number of folders are created, each providing a connected path between two sub-databases. The data from these folders can be merged into the original using the PATHSAMPLE keyword MERGEDB, thus connecting all of the previously unconnected sub-databases.

This methodology is particularly useful for cases where you have a protein with a cofactor and various sites within a pocket that you think the cofactor can attach to. It provides an efficient method to connect these sites within the pocket, having already sampled each.

--adk44 14.30, 16 May 2019 (BST)