Difference between revisions of "Connecting Sub-databases"
Revision as of 15:53, 14 May 2019
Definitions
For the purposes of this tutorial, I am defining sub-databases to be sets of connected minima and transition states within a larger database.
Context and Motivation
In databases containing many thousands of minima and TSs, it is unlikely that these will all be connected to one another. This is particularly the case when the database has been grown using methods such as ADDPATH and MERGEDB. Instead, the database is more likely to consist of many sub-databases of varying size. A disconnectivity graph cannot plot more than one set of connected minima (i.e. more than one sub-database) at a time, so when one is constructed a lot of the data present in the min.data, points.min, points.ts and ts.data files gets ignored. The sub-database that is plotted depends on the numerical argument to the keyword CONNECTMIN in the dinfo file. This argument is the index of a minimum, as listed in the min.data file: for example, an argument of 12 corresponds to the minimum on line 12 of min.data. Only this minimum, plus any others it is connected to, is plotted on the disconnectivity graph.
The question, therefore, is how to efficiently connect minima already present in the min.data file. It is particularly important to connect sub-databases that contain a lot of minima. Connecting all the sub-databases with only 2 minima in them, for example, would probably be a waste of time, as doing so adds very little information.
We also want the connection attempts between sub-databases to be efficient. Namely, we want to try to connect sub-databases which are close to one another (or, more specifically, sub-databases which have at least one minimum that is close in chemical space to a minimum in another sub-database). This is especially important for large systems (such as large proteins with cofactors), as trying to connect minima far apart in space can be very slow or can even fail due to memory issues.
Step 1: Using disconnectionDPS to determine the breakdown of sub-databases within your database
Requirements
A folder containing the files min.data, points.min, points.ts, ts.data, dinfo, the script find_connections.sh and the binary disconnectionDPS, plus any other auxiliary files you may need.
Method
In dinfo, you need to use the keyword PRINTCONNECTED. An example dinfo file that I've used is:
! REQUIRED KEYWORDS
DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data
! OPTIONAL KEYWORDS
NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
PRINTCONNECTED ensures that a file called connected is written, which lists all of the minima plotted in the disconnectivity graph (i.e. all of the minima present in the sub-database considered). In the example above, the argument to CONNECTMIN is 1, so minimum 1 and all the minima to which it is connected get plotted.
This gives us information on only one of the sub-databases present in the database. To find out about all of them, the script find_connections.sh is used.
This script cycles through the min.data file, executing a disconnectionDPS command for each value of the argument to CONNECTMIN. The connected file produced is then renamed connected_*, where * is the argument to CONNECTMIN used for that disconnectionDPS command. For a min.data file with 17603 lines (and therefore 17603 minima), for example, the argument ranges from CONNECTMIN 1 to CONNECTMIN 17603. If a minimum is already present in a previous connected_* file, then that argument is skipped. For example, if the disconnectionDPS command with CONNECTMIN 1 gave a sub-database containing minima 1 and 2 (i.e. the minima on lines 1 and 2 in min.data), then disconnectionDPS will not be run with CONNECTMIN 2, as minimum 2 is already assigned to the sub-database described in connected_1. The next iteration will use CONNECTMIN 3.
As mentioned previously, this script cycles until all the minima in min.data have been considered.
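The cycle described above can be sketched in Python as follows. This is only an illustration of the skip logic, not the contents of find_connections.sh: the function connected_set is a hypothetical stand-in for running disconnectionDPS with a given CONNECTMIN argument and reading back the connected file, and toy_links is made-up connectivity data.

```python
def connected_set(start, links):
    """Hypothetical stand-in for one disconnectionDPS run with CONNECTMIN = start.

    Returns every minimum reachable from `start` through the toy `links` map.
    """
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in links.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def find_sub_databases(n_minima, links):
    """Cycle over CONNECTMIN = 1..n_minima, skipping minima already assigned."""
    assigned = set()
    sub_databases = {}  # the key plays the role of the * in connected_*
    for connectmin in range(1, n_minima + 1):
        if connectmin in assigned:
            continue  # already listed in an earlier connected_* file
        members = connected_set(connectmin, links)
        sub_databases[connectmin] = sorted(members)
        assigned.update(members)
    return sub_databases


# Toy database of 5 minima: 1 and 2 are connected; 3, 4 and 5 are connected.
toy_links = {1: [2], 2: [1], 3: [5], 4: [5], 5: [3, 4]}
result = find_sub_databases(5, toy_links)
print(result)  # {1: [1, 2], 3: [3, 4, 5]}
```

In this toy case only CONNECTMIN 1 and CONNECTMIN 3 trigger a run, which would correspond to the files connected_1 and connected_3.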
Another feature of find_connections.sh is that any connected_* file containing more than a set number of minima (I think 10 is sensible) is copied to a corresponding relevant_connected_* file. For example, if connected_3 has 50 minima, then it exceeds 10 and so its contents are copied to another file called relevant_connected_3. This is a piece of book-keeping which lets the user identify the larger sub-databases more easily (and so the ones that they are more likely to want to connect to one another).
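A minimal Python sketch of this book-keeping step is given below. It is not taken from find_connections.sh. It assumes, hypothetically, that each connected_* file lists one minimum per non-empty line (the real file layout may differ), and the function name copy_relevant and the demonstration files are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_relevant(directory, threshold=10):
    """Copy each connected_* file with more than `threshold` entries
    to relevant_connected_*.

    Assumption: one minimum per non-empty line in each connected_* file.
    """
    copied = []
    # sorted() collects the matches first, so new copies are never re-scanned
    for path in sorted(Path(directory).glob("connected_*")):
        with path.open() as handle:
            n_minima = sum(1 for line in handle if line.strip())
        if n_minima > threshold:
            target = path.with_name("relevant_" + path.name)
            shutil.copy(path, target)
            copied.append(target.name)
    return copied


# Demonstration on made-up files in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    Path(scratch, "connected_1").write_text("1\n2\n")  # 2 minima
    fifty = "\n".join(str(i) for i in range(3, 53)) + "\n"  # 50 minima
    Path(scratch, "connected_3").write_text(fifty)
    copied_names = copy_relevant(scratch, threshold=10)
    print(copied_names)  # ['relevant_connected_3']
```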
A few notes on use: it is sensible to copy the min.data, points.min, points.ts and ts.data files of the database you are interested in to another folder before using this script. The only other files you need are the script itself, the relevant binary and dinfo (plus perhaps some case-specific auxiliary files). Before executing the script, make sure that the argument to CONNECTMIN in dinfo is 1. Also, PRINTCONNECTED must be included as a keyword.
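These two dinfo requirements can be checked before starting a long run. The Python sketch below is a hypothetical helper, not part of the script; it assumes one keyword per line in dinfo, with lines starting with ! treated as comments, as in the example dinfo file above.

```python
def check_dinfo(text):
    """Return a list of problems found in the dinfo text (empty if none)."""
    keywords = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("!"):
            continue  # blank line or comment
        parts = stripped.split()
        keywords[parts[0].upper()] = parts[1:]
    problems = []
    if "PRINTCONNECTED" not in keywords:
        problems.append("PRINTCONNECTED is missing")
    if keywords.get("CONNECTMIN") != ["1"]:
        problems.append("CONNECTMIN is not set to 1")
    return problems


good_dinfo = """! REQUIRED KEYWORDS
DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data
! OPTIONAL KEYWORDS
NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
"""
bad_dinfo = "DELTA 0.25\nCONNECTMIN 12\n"

print(check_dinfo(good_dinfo))  # []
print(check_dinfo(bad_dinfo))
```

For a real run, the text would be read from the dinfo file in the working folder, for example with open("dinfo").read().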