Difference between revisions of "Connecting Sub-databases"
Revision as of 17:56, 14 May 2019
Definitions
For the purposes of this tutorial, I am defining sub-databases to be sets of connected minima and transition states within a larger database.
Context and Motivation
In databases containing many thousands of minima and transition states (TSs), it is unlikely that these will all be connected to one another. This is particularly the case when the database has been grown using methods such as ADDPATH and MERGEDB. Instead, the database is more likely to consist of many sub-databases of varying size. A disconnectivity graph cannot plot more than one set of connected minima (i.e. more than one sub-database) at a time, so when one is constructed, much of the data in the min.data, points.min, points.ts and ts.data files is ignored. The sub-database that gets plotted depends on the numerical argument to the keyword CONNECTMIN in the dinfo file. This argument identifies a minimum by its line number in min.data: an argument of 12, for example, corresponds to line 12 of min.data. Only this minimum, plus any others it is connected to, is plotted on the disconnectivity graph.
The question, therefore, is how to connect minima already present in the min.data file efficiently. It is most important to connect sub-databases that contain many minima. Connecting all the sub-databases with only 2 minima in them, for example, would probably be a waste of time, as doing so adds little information.
Another consideration to make is that we want the connection attempts between sub-databases to be efficient. Namely, we want to try to connect sub-databases which are closer to one another (or, more specifically, sub-databases which have at least one minimum which is close in chemical space to a minimum in another sub-database). This is especially important for large systems (such as large proteins with cofactors) as trying to connect minima far apart in space can be very slow or even break down due to memory issues.
Step 1: Using disconnectionDPS to determine the breakdown of sub-databases within your database
Requirements
A folder containing the files min.data, points.min, points.ts, ts.data, dinfo, the script find_connections.sh and the binary disconnectionDPS, plus any other auxiliary files you may need.
Method
In dinfo, you need to use the keyword PRINTCONNECTED. An example dinfo file that I've used is:
! REQUIRED KEYWORDS
DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data
! OPTIONAL KEYWORDS
NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
PRINTCONNECTED ensures that a file called connected is written, which lists all of the minima plotted in the disconnectivity graph (i.e. all of the minima present in the sub-database considered). In the example above, because the argument to CONNECTMIN is 1, this means that minimum 1 - and all those minima to which 1 is connected - gets plotted.
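To look at a different sub-database, only the argument to CONNECTMIN needs to change. Here is a minimal Python sketch of that edit, assuming dinfo holds one keyword per line. The helper name set_connectmin is hypothetical and is not part of disconnectionDPS.

```python
def set_connectmin(dinfo_text: str, minimum: int) -> str:
    """Return dinfo_text with the CONNECTMIN argument replaced by `minimum`."""
    out = []
    found = False
    for line in dinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "CONNECTMIN":
            out.append(f"CONNECTMIN {minimum}")
            found = True
        else:
            out.append(line)
    if not found:
        # Assumption: append the keyword if the file did not contain it.
        out.append(f"CONNECTMIN {minimum}")
    return "\n".join(out) + "\n"


example = "DELTA 0.25\nCONNECTMIN 1\nPRINTCONNECTED\n"
updated = set_connectmin(example, 3)
print(updated, end="")
```

All other keywords, including PRINTCONNECTED, pass through unchanged.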
This gives us information on only one sub-database present in the file. To find out about all of them, the script find_connections.sh is used.
This script cycles through the min.data file, executing a disconnectionDPS command for every iteration of the argument to CONNECTMIN. The connected file produced is then renamed connected_* where * is the argument to CONNECTMIN when that disconnectionDPS command was executed. For a min.data file with 17603 lines (and therefore 17603 minima) for example, the argument to CONNECTMIN therefore ranges from CONNECTMIN 1 to CONNECTMIN 17603. If a minimum is already present in a previous connected_* file then that argument is skipped. For example, if a disconnectionDPS command when the argument to CONNECTMIN was set to CONNECTMIN 1 gave a sub-database with minima 1 and 2 (i.e. the minima on lines 1 and 2 in min.data) in it, then a disconnectionDPS attempt using CONNECTMIN 2 will not be attempted as minimum 2 is already assigned to the sub-database described in connected_1. The next iteration will be using CONNECTMIN 3.
As mentioned previously, this script cycles until all the minima in min.data have been considered.
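The book-keeping described above can be sketched in Python. This is not find_connections.sh itself: connected_to is a hypothetical stand-in for running disconnectionDPS with a given CONNECTMIN and reading back the connected file, and the toy data is invented for illustration.

```python
from typing import Callable, Dict, Set


def find_sub_databases(
    n_minima: int, connected_to: Callable[[int], Set[int]]
) -> Dict[int, Set[int]]:
    """Map each seed minimum to its sub-database, skipping assigned minima."""
    sub_databases: Dict[int, Set[int]] = {}
    assigned: Set[int] = set()
    for seed in range(1, n_minima + 1):  # CONNECTMIN 1 .. CONNECTMIN n_minima
        if seed in assigned:
            continue  # already listed in an earlier connected_* file
        members = connected_to(seed)
        sub_databases[seed] = members
        assigned |= members
    return sub_databases


# Toy stand-in for disconnectionDPS: minima 1 and 2 are connected, as are 3 and 5.
toy = {1: {1, 2}, 2: {1, 2}, 3: {3, 5}, 4: {4}, 5: {3, 5}}
result = find_sub_databases(5, lambda seed: toy[seed])
print(sorted(result))  # seeds that would produce connected_* files
```

In the toy case the seeds are 1, 3 and 4: minima 2 and 5 are skipped because they already belong to the sub-databases found from seeds 1 and 3.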
Another feature of find_connections.sh is that, when a connected_* file exceeds a set number of minima (I think 10 is sensible), it gets copied to a corresponding relevant_connected_* file. For example, if connected_3 has 50 minima then it exceeds 10, so its contents are copied to a file called relevant_connected_3. This piece of book-keeping allows the user to identify larger sub-databases more easily (and so the ones that s/he is more likely to want to connect to one another).
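A rough Python sketch of this size filter follows. It assumes each connected_* file lists one minimum per non-empty line, which I have not checked against the real output format. The function name copy_relevant is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_relevant(folder: Path, threshold: int = 10) -> list:
    """Copy connected_* files with more than `threshold` minima to relevant_connected_*."""
    copied = []
    for path in sorted(folder.glob("connected_*")):
        # Assumption: one minimum per non-empty line of a connected_* file.
        n_minima = sum(1 for line in path.read_text().splitlines() if line.strip())
        if n_minima > threshold:
            target = folder / f"relevant_{path.name}"
            shutil.copy(path, target)
            copied.append(target.name)
    return copied


# Demonstration in a throwaway folder: one large and one small sub-database.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "connected_3").write_text("\n".join(str(i) for i in range(1, 51)) + "\n")
    (folder / "connected_7").write_text("7\n8\n")
    demo = copy_relevant(folder)
print(demo)
```

Only connected_3 (50 minima) is copied. connected_7, with 2 minima, is left alone.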
A few notes on use: to use this script, it is sensible to copy the min.data, points.min, points.ts and ts.data files of the database you are interested in to another folder. The only other files you need are the script itself, the relevant binary and dinfo (plus perhaps some case-specific auxiliary files). It should be ensured that before executing the binary, the argument to CONNECTMIN in dinfo is 1. Also, PRINTCONNECTED must be included as a keyword.
Step 2: RETAINSP and CONNECTUNC LOWESTTEST
So, we now have a list of files relevant_connected_* corresponding to sub-databases which we would like to connect.
Remember, though, we want to connect them efficiently!
Before attempting any connections, then, it is advisable to get a flavour of the distances separating these sub-databases from one another (or, at least, the shortest distance possible between any two minima of all of the sub-databases).
To do this, we need to limit the min.data file (and ts.data) so that only those minima corresponding to the two sub-databases we are interested in are considered. We can use the keyword found in PATHSAMPLE, RETAINSP, for this purpose. By using an adapted version of CONNECTUNC with a new argument called LOWESTTEST, we can identify sensible connections to make, without actually attempting the connection.
This works as long as min.A and min.B both correspond to minima in the AB set (this is accounted for in the script I've written, connect_sub_databases.sh). What this does is find the unconnected minima (i.e. those in the set which is not the AB set) of lowest energy. It then loops through all the minima in the AB set, printing the distance between each pair of minima without actually attempting the connection. A further loop operates so that all unconnected minima are considered too.
Once all minima are considered, the loop is abruptly exited by the STOP statement.
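The shape of this double loop can be sketched in Python. This is only an illustration of the looping order described above, not the PATHSAMPLE source: the function name, the data layout and the use of a plain Euclidean distance are all assumptions, and the real distance metric may well differ.

```python
import math
from typing import Dict, Iterator, Sequence, Tuple

Minimum = Tuple[float, Sequence[float]]  # (energy, coordinates)


def lowest_test_distances(
    ab_set: Dict[int, Minimum], unconnected: Dict[int, Minimum]
) -> Iterator[Tuple[int, int, float]]:
    """Yield (unconnected minimum, AB minimum, distance) without connecting anything."""
    # Outer loop: unconnected minima in order of increasing energy.
    for u in sorted(unconnected, key=lambda m: unconnected[m][0]):
        # Inner loop: every minimum in the AB set.
        for a in sorted(ab_set):
            # Assumption: plain Euclidean distance stands in for the real metric.
            dist = math.dist(unconnected[u][1], ab_set[a][1])
            yield u, a, dist


# Invented toy data: two minima in each set, in two dimensions.
ab = {1: (-10.0, (0.0, 0.0)), 2: (-9.0, (3.0, 4.0))}
unc = {7: (-8.0, (0.0, 0.0)), 9: (-12.0, (3.0, 0.0))}
pairs = list(lowest_test_distances(ab, unc))
for u, a, dist in pairs:
    print(u, a, f"Distance: {dist:.8f}")
```

Minimum 9 is visited first because it has the lowest energy of the unconnected minima.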
By grepping (don't worry about executing these commands yourself as they are all contained in the script connect_sub_databases.sh):
grep "connectlowest> Distance: " pathsample_connectunc_test.out > distances
sed -e "s/^/$dirname /g" distances > distances_tmp
we are able to build up a list of all the connections proposed by CONNECTUNC LOWESTTEST between the two chosen sub-databases. This information then gets concatenated into an overall file called distances_tot in the folder where the script was originally launched. Eventually, once all pairs of sub-databases have been considered, we should have a massive file listing all of the potential connections between all of the minima in all of the sub-databases, along with the distances separating them. An example of a few lines from such a file is as follows:
00003_00303 359 4550 connectlowest> Distance: 42.42963734
00003_00303 341 147 connectlowest> Distance: 39.39663225
00003_02150 2280 1932 connectlowest> Distance: 75.54181654
Another nice feature of the CONNECTUNC LOWESTTEST keyword and argument is that, alongside the distance, the specific minima (DMIN1 and DMIN2) from the two sub-databases being considered are listed. They appear as the second and third fields of each line (359 and 4550 in the line below):
00003_00303 359 4550 connectlowest> Distance: 42.42963734
This will come in handy, as described in Step 3.
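For anyone wanting to post-process such lines, here is a small Python sketch that splits one line into its parts. The field layout is inferred from the example lines above, and the names Proposal and parse_proposal are hypothetical.

```python
from typing import NamedTuple


class Proposal(NamedTuple):
    sub_db_a: str    # first sub-database label, e.g. "00003"
    sub_db_b: str    # second sub-database label, e.g. "00303"
    dmin1: int       # first minimum listed on the line
    dmin2: int       # second minimum listed on the line
    distance: float


def parse_proposal(line: str) -> Proposal:
    """Split one distances_tot line into sub-database labels, minima and distance."""
    fields = line.split()
    sub_db_a, sub_db_b = fields[0].split("_")
    return Proposal(sub_db_a, sub_db_b, int(fields[1]), int(fields[2]), float(fields[-1]))


p = parse_proposal("00003_00303 359 4550 connectlowest> Distance: 42.42963734")
print(p.dmin1, p.dmin2, p.distance)
```

The sketch makes no claim about which of the two minima belongs to which sub-database. It only records them in the order they appear.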
Step 3: Organising Calculations to Attempt
Clearly, a pair of minima separated by 39.397 is a more feasible calculation to attempt than one separated by 42.430 or 75.542. The script we have (connect_sub_databases.sh) therefore reorganises distances_tot to list the proposed connections from shortest distance to longest. We give this reorganised file the rather unimaginative name of lowest_to_highest_distances_tot.
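The reordering itself is just a numerical sort on the last field of each line. Here is a minimal Python sketch using the three example lines from Step 2. How connect_sub_databases.sh actually performs the sort is not shown here, so treat this as one possible way of doing it.

```python
def sort_by_distance(lines):
    """Return proposal lines ordered from shortest to longest distance."""
    return sorted(lines, key=lambda line: float(line.split()[-1]))


distances_tot = [
    "00003_00303 359 4550 connectlowest> Distance: 42.42963734",
    "00003_00303 341 147 connectlowest> Distance: 39.39663225",
    "00003_02150 2280 1932 connectlowest> Distance: 75.54181654",
]
ordered = sort_by_distance(distances_tot)
for line in ordered:
    print(line)
```

Converting the last field to a float matters: sorting the distances as text would put 5.00815137 after 44.36489142.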
The rest of the script is concerned with connecting all of the sub-databases in as efficient a way, using as few steps, as possible. This is probably best illustrated by an example:
I have 15 sub-databases I wish to connect. The minima comprising each can be found in:
relevant_connected_00003
relevant_connected_00164
relevant_connected_00303
relevant_connected_02150
relevant_connected_06061
relevant_connected_06274
relevant_connected_06610
relevant_connected_06913
relevant_connected_07339
relevant_connected_09000
relevant_connected_09969
relevant_connected_10040
relevant_connected_12405
relevant_connected_14191
relevant_connected_14775
Here are the first ten lines of lowest_to_highest_distances_tot. Lines 1-4 and 7 are connections that were attempted, whilst lines 5, 6 and 8-10 were skipped because they turn out to be superfluous (why attempt line 5, for example, when line 4 is attempting to connect the same two sub-databases?):
00003_02150 2657 2663 connectlowest> Distance: 0.39352080
09000_12405 3003 3033 connectlowest> Distance: 0.84958725
09000_10040 1228 1251 connectlowest> Distance: 1.01130262
09000_14191 3176 3209 connectlowest> Distance: 1.07183817
09000_14191 3194 3209 connectlowest> Distance: 1.81036433
09000_14191 3193 3187 connectlowest> Distance: 1.88481550
09000_14775 3450 3457 connectlowest> Distance: 2.41249957
09000_14191 3203 3187 connectlowest> Distance: 2.42913148
09000_14191 3177 3209 connectlowest> Distance: 2.45715932
00003_02150 2572 2663 connectlowest> Distance: 2.82747537
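This selection rule closely resembles Kruskal's algorithm for a minimum spanning tree: walk down the sorted list and keep a proposal only if its two sub-databases are not yet linked. The Python sketch below uses a small union-find structure to track which sub-databases are already joined. Whether connect_sub_databases.sh does it this way, and whether it also skips pairs that are only linked indirectly through a third sub-database, is an assumption on my part.

```python
def choose_connections(sorted_lines):
    """Keep the shortest proposal that joins two not-yet-linked sub-databases."""
    parent = {}

    def find(label):
        # Return the representative of the group containing `label`.
        parent.setdefault(label, label)
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    chosen = []
    for line in sorted_lines:
        a, b = line.split()[0].split("_")
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups
            chosen.append(line)
    return chosen


first_ten = [
    "00003_02150 2657 2663 connectlowest> Distance: 0.39352080",
    "09000_12405 3003 3033 connectlowest> Distance: 0.84958725",
    "09000_10040 1228 1251 connectlowest> Distance: 1.01130262",
    "09000_14191 3176 3209 connectlowest> Distance: 1.07183817",
    "09000_14191 3194 3209 connectlowest> Distance: 1.81036433",
    "09000_14191 3193 3187 connectlowest> Distance: 1.88481550",
    "09000_14775 3450 3457 connectlowest> Distance: 2.41249957",
    "09000_14191 3203 3187 connectlowest> Distance: 2.42913148",
    "09000_14191 3177 3209 connectlowest> Distance: 2.45715932",
    "00003_02150 2572 2663 connectlowest> Distance: 2.82747537",
]
chosen = choose_connections(first_ten)
for line in chosen:
    print(line)
```

On these ten lines the sketch keeps lines 1-4 and 7. With 15 sub-databases, a rule of this kind needs at most 14 successful connections to join everything.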
Using these principles, the sub-databases were therefore connected as follows.
The first line of lowest:
00003_02150 2657 2663 connectlowest> Distance: 0.39352080
After next line:
09000_12405 3003 3033 connectlowest> Distance: 0.84958725
09000_10040 1228 1251 connectlowest> Distance: 1.01130262
09000_14191 3176 3209 connectlowest> Distance: 1.07183817
09000_14775 3450 3457 connectlowest> Distance: 2.41249957
00003_00164 134 159 connectlowest> Distance: 5.00815137
06061_06913 402 296 connectlowest> Distance: 5.01723232
00003_07339 3893 3899 connectlowest> Distance: 5.68186344
06610_07339 135 137 connectlowest> Distance: 7.04874883
00003_00303 670 811 connectlowest> Distance: 24.67395896
06061_06274 257 459 connectlowest> Distance: 31.59473639
09000_09969 1317 1946 connectlowest> Distance: 40.39286979
00003_06061 4149 4481 connectlowest> Distance: 44.36489142
00003_09000 7369 2632 connectlowest> Distance: 71.98718590