Docswiki - User contributions [en]

CHECKSPMUTATE

2020-06-17T09:53:51Z

Adk44:

== Purpose ==

It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as a fully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.

What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.

CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, e.g., [[SHORTCUT]], [[SHORTCUT 2 BARRIER]] and [[UNTRAP]], this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).

== Preparation ==

Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory. Note that the Python scripts can be found in '''/svn/SCRIPTS/AMBER/BHmutation_steps'''.

*'''aa_ringdata.pyc''' list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
*'''amino_acids.pyc''' list of parameters for all residues.
*'''atomnumberlog''' list of indices of the first atom of the residues to be mutated.
*'''coordinates_mut.pyc''' script which mutates the selected residue.
*'''coords.inpcrd''' ensure this is for the system we are CHANGING TO.
*'''coords.mdcrd''' for use with min.in, may not be required.
*'''coords.prmtop''' ensure this is for the system we are CHANGING TO.
*'''min.A''' needs to be present, although the index listed in it is unimportant.
*'''min.B''' needs to be present, although the index listed in it is unimportant.
*'''min.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''min.in''' for use with [[AMBER]]. Defines certain aspects of the model, such as solvent being used.
*'''mutate_aa.py''' organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
*'''newreslog''' list of codes for residues which we are CHANGING TO.
*'''nresidueslog''' the total number of mutations to be made to the system.
*'''odata.checksp''' list of conditions for optimisation carried out on each stationary point by [[OPTIM]].
*'''oldreslog''' list of codes for residues which we are CHANGING FROM.
*'''original_protein.pdb''' pdb file for the system we are CHANGING FROM. All lines which do not describe an ATOM (e.g. TITLE, TER and END) must be removed, so that the number of lines in the file corresponds to the number of atoms in the system.
*'''pathdata''' organises [[OPTIM]] jobs. Certain keywords are required, described below.
*'''perm.allow''' full description of the groups of permutable atoms in the system we are CHANGING TO.
*'''points.min''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''points.ts''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''resnumberlog''' list of indices of the residues to be mutated.
*'''ts.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''submission_script''' script for executing your binary.
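
The clean-up of '''original_protein.pdb''' described in the list above could be sketched in Python as follows. This is only an illustration, not part of the distributed scripts: the function names are hypothetical, and treating HETATM records as atom lines (for cofactors) is an assumption that should be checked against your own pdb file.

```python
from pathlib import Path

# Assumption: cofactor atoms may be stored as HETATM records, so both record
# types are kept. Narrow this to ("ATOM",) if your file uses ATOM throughout.
ATOM_RECORDS = ("ATOM", "HETATM")


def strip_non_atom_lines(pdb_text, records=ATOM_RECORDS):
    """Return only the lines of a PDB file that describe an atom."""
    kept = [line for line in pdb_text.splitlines()
            if line[:6].strip() in records]
    return "".join(line + "\n" for line in kept)


def clean_pdb_file(source, target="original_protein.pdb", expected_natoms=None):
    """Write a stripped copy of `source` and return the number of atom lines."""
    cleaned = strip_non_atom_lines(Path(source).read_text())
    natoms = cleaned.count("\n")
    # Optional sanity check against the NATOMS value used in pathdata.
    if expected_natoms is not None and natoms != expected_natoms:
        raise ValueError(f"{natoms} atom lines found, expected {expected_natoms}")
    Path(target).write_text(cleaned)
    return natoms
```

Passing the NATOMS value of the original system as <code>expected_natoms</code> gives a quick check that the number of lines matches the number of atoms.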

== Example of Mutation ==

=== [Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH] ===

To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using [[DIJKSTRA]] and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.

*'''nresidueslog''' as we are only making one mutation, this is simply:
<pre>
1
</pre>
*'''resnumberlog''' this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
101
</pre>
*'''atomnumberlog''' this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
1569
</pre>
*'''oldreslog''' this is the residue code for the residue we are mutating from. If the code has only three letters (i.e. the residue is not terminal, so the code carries no N or C prefix), ensure that a space is put after it so that the code is four characters overall.
<pre>
PHE
</pre>
*'''newreslog''' this is the residue code for the residue we are mutating to. If the code has only three letters (i.e. the residue is not terminal, so the code carries no N or C prefix), ensure that a space is put after it so that the code is four characters overall.
<pre>
ALA
</pre>
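
The five files above are short enough to write by hand, but for longer lists of mutations they could be generated with a script. The following is a minimal sketch only: the function names are hypothetical, and it assumes that '''original_protein.pdb''' contains atom lines only, describes a single chain, and follows the standard PDB column layout (residue sequence number in columns 23-26).

```python
from pathlib import Path


def pad_code(code):
    """Pad a residue code with trailing spaces to four characters."""
    if not 3 <= len(code) <= 4:
        raise ValueError(f"unexpected residue code: {code!r}")
    return code.ljust(4)


def first_atom_index(pdb_lines, residue_number):
    """1-based position of the first atom line belonging to a residue.

    Assumes atom lines only, with the residue number in columns 23-26.
    """
    for index, line in enumerate(pdb_lines, start=1):
        if int(line[22:26]) == residue_number:
            return index
    raise ValueError(f"residue {residue_number} not found")


def write_mutation_logs(directory, mutations, pdb_lines):
    """mutations: list of (residue_number, old_code, new_code) tuples."""
    directory = Path(directory)
    columns = {
        "resnumberlog": [str(num) for num, _, _ in mutations],
        "atomnumberlog": [str(first_atom_index(pdb_lines, num))
                          for num, _, _ in mutations],
        "oldreslog": [pad_code(old) for _, old, _ in mutations],
        "newreslog": [pad_code(new) for _, _, new in mutations],
    }
    (directory / "nresidueslog").write_text(f"{len(mutations)}\n")
    for name, entries in columns.items():
        (directory / name).write_text("".join(e + "\n" for e in entries))
```

For the F101A example this would be called as <code>write_mutation_logs(".", [(101, "PHE", "ALA")], Path("original_protein.pdb").read_text().splitlines())</code>. It is worth comparing the resulting atomnumberlog against the pdb file by eye before running anything.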

=== pathdata and odata.checksp files ===

To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.

When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.

An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5491

AMBER12
</pre>
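
The numbers in this example are self-consistent: a non-terminal PHE residue has 20 atoms in the AMBER force field and ALA has 10, so NATOMS_NEW = 5501 - 20 + 10 = 5491. Rather than counting by hand, the value for NATOMS_NEW can be read from the coords.prmtop of the new system, since the first integer of the <code>%FLAG POINTERS</code> section of an AMBER topology file is the number of atoms. A small sketch (the function name is hypothetical):

```python
def natoms_from_prmtop(prmtop_text):
    """Return NATOM, the first integer in the %FLAG POINTERS section."""
    lines = iter(prmtop_text.splitlines())
    for line in lines:
        if line.startswith("%FLAG POINTERS"):
            break
    else:
        raise ValueError("no %FLAG POINTERS section found")
    for line in lines:
        if line.startswith("%FLAG"):
            break  # reached the next section without finding any data
        if line.startswith("%") or not line.strip():
            continue  # skip the %FORMAT / %COMMENT lines
        # POINTERS uses the (10I8) format, so the first field is 8 wide.
        return int(line[:8])
    raise ValueError("POINTERS section is empty")
```

Usage would be along the lines of <code>natoms_from_prmtop(open("coords.prmtop").read())</code>, with the result copied into the NATOMS_NEW line of pathdata.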

The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!

=== Before and After Mutation ===

Here is a rendering of [wt HemS + haem + NADH], alongside a rendering of the same minimum reoptimised after the F101A mutation was made.

<gallery mode=packed widths="540px" heights="360px">
File:wt_HemS_wiki.png|wt HemS
File:F101A_Hems_wiki.png|F101A HemS
</gallery>

== Example of Transformation to a Homologue ==

=== [Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH] ===

This is a more complex problem than the single-point mutation described above. HemS and ChuS are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition, ChuS is two residues shorter than HemS, with one such deletion occurring approximately midway through the chain, and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

First, we need to align and compare the two systems.

==== wt HemS + haem + NADH ====

NSER ILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP
LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP
ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU
VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG
HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU
GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO
GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU
ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS
ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN
VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE
MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL
THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU
THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER
LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYS HEM NAD

==== wt ChuS + haem + NADH ====

NASN HIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP
ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP
ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU
THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG
GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP
ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR
PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR
VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS
ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS
VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE
MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL
VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU
GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER
LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPRO XXX HEM NAD

Here, red font signifies a residue which has been mutated, green represents a residue which has been deleted (given the code name XXX), and purple represents the two cofactors.

We now need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define all the mutations/insertions/deletions we are making, and where in the chain they take place. In this example, we are making 113 mutations/insertions/deletions (to be precise, 111 mutations and 2 deletions) in total:

*'''nresidueslog''' for 113 mutations, this is:
<pre>
113
</pre>
*'''resnumberlog''' this is a list of the indices of the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
2
4
.
.
336
337
338
</pre>
*'''atomnumberlog''' this is a list of the indices of the first atoms in the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
14
54
.
.
5307
5321
5335
</pre>
*'''oldreslog''' this is the list of residue codes for the residues we are mutating from. If a code has only three letters (i.e. the residue is not terminal, so the code carries no N or C prefix), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code here is XXX or NXXX or CXXX.
<pre>
NSER
ILE
GLU
.
.
ASN
ASN
CLYS
</pre>
*'''newreslog''' this is the list of residue codes for the residues we are mutating to. If a code has only three letters (i.e. the residue is not terminal, so the code carries no N or C prefix), ensure that a space is put after it so that the code is four characters overall. If we are deleting a residue then the residue code here is XXX or NXXX or CXXX.
<pre>
NASN
HIE
THR
.
.
ILE
CPRO
XXX
</pre>
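
With over a hundred entries, compiling these lists by eye is error-prone. The comparison of the two aligned sequences could be sketched as below. This is an illustration under stated assumptions, not the procedure I actually used: the function names are hypothetical, the codes must be separated by whitespace, and only mutations and deletions are handled (insertions, i.e. XXX in the original sequence, are rejected because their indexing is not covered here).

```python
COFACTORS = {"HEM", "NAD"}  # cofactor codes as they appear in the listings


def parse_codes(text):
    """Split a block of whitespace-separated residue codes into a list."""
    return text.split()


def list_changes(old_codes, new_codes, skip=COFACTORS):
    """Return (residue_number, old_code, new_code) for each differing position.

    residue_number counts positions in the original system, starting at 1.
    A deleted residue appears with XXX as its new code.
    """
    if len(old_codes) != len(new_codes):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    for number, (old, new) in enumerate(zip(old_codes, new_codes), start=1):
        if old.endswith("XXX"):
            raise NotImplementedError("insertions are not handled in this sketch")
        if old in skip or old == new:
            continue
        changes.append((number, old, new))
    return changes
```

The length of the returned list is the value for nresidueslog, and its three columns give resnumberlog, oldreslog and newreslog respectively (remembering to pad three-letter codes to four characters).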

=== pathdata and odata.checksp files ===

Same as for the mutation example above. I would expect, for my example here with ChuS, a pathdata file like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5464

AMBER12
</pre>

== Final Notes ==

We therefore now have a method for transforming the stationary points of a database of interest into those of a mutant or close homologue. This allows a wild type to be compared directly against mutants of interest, or one protein against a homologue. In the examples above, I have been interested in comparing the pathway of a cofactor as it enters a pocket against other pathways in which key residues have been removed, and against a homologue of suspected similar reactivity.

The advantage of this method is that time and computational resources can be significantly saved in transforming one database to a system of similar properties.

It must be noted, however, that following the use of CHECKSPMUTATE, the new pathway might have gaps, as not all of the stationary points will necessarily have converged. These gaps will need to be filled using the [[CONNECTPAIRS]] keyword. The full procedure that I used to fill such gaps is detailed in [[Pathway Gap Filling Post-CHECKSPMUTATE]].

'''It must also be noted''' that the pathway for the new system will not necessarily be (and is indeed unlikely to be) the optimal one. Therefore, post CHECKSPMUTATE, an appropriate number of rounds of [[SHORTCUT]]/[[SHORTCUT 2 BARRIER]]/[[UNTRAP]] etc. will need to be performed. It is not advised that CHECKSPMUTATE be used on proteins which are significantly different.

Particular care must be taken when inserting residues within the chain. Please avoid doing so if at all possible. I have written the code so that a maximum of two such residues can be inserted at the same point along the chain. This is already likely to cause problematic steric clashes, and so adding more is unlikely to be feasible. Care must also be taken that these new residues, once added and optimised, have the correct cis-trans isomerism and chirality.

=== Potential Uses ===

CHECKSPMUTATE could potentially be used to shed light on some of the following problems:

* Comparing a wt protein against some mutations to determine how such mutations affect protein folding. Care will need to be taken that sufficient sampling for alternative pathways is performed.
* Comparing a wt protein against some mutations to determine how such mutations affect protein-cofactor interactions.
* Comparing a protein against a close homologue to see whether it engages in the same manner of protein folding. Care will need to be taken that sufficient sampling for alternative pathways is performed.
* Comparing a protein against a close homologue to see whether it engages in similar protein-cofactor interactions.
* Transformation of a cofactor (for example from NADH to NADPH) to see whether this affects protein-cofactor interactions.
* Comparing a DNA strand against an RNA/XNA strand. Note that this would require some different python scripts, also found in '''/svn/SCRIPTS/AMBER/BHmutation_steps''', such as '''nucleic_acids.py'''.

=== Acknowledgements ===

Though adk44 wrote the code for CHECKSPMUTATE, he interfaced this code with kr366's pre-existent python scripts (with minor alterations) which had originally been written to allow mutations as part of kr366's [[Mutational BH steps]] routine.

--adk44 16.45, 2 June 2020 (BST)

CHECKSPMUTATE

2020-06-11T19:26:30Z

Adk44: /* Potential Uses */

== Purpose ==

It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as a fully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.

What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.

CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing for user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, eg, [[SHORTCUT]], [[SHORTCUT 2 BARRIER]] and [[UNTRAP]], this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).

== Preparation ==

Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory. Note that the python scripts can be found in '''/svn/SCRIPTS/AMBER/BHmutation_steps'''.

*'''aa_ringdata.pyc''' list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
*'''amino_acids.pyc''' list of parameters for all residues.
*'''atomnumberlog''' list of indices of the first atom of the residues to be mutated.
*'''coordinates_mut.pyc''' script which mutates the selected residue.
*'''coords.inpcrd''' ensure this is for the system we are CHANGING TO.
*'''coords.mdcrd''' for use with min.in, may not be required.
*'''coords.prmtop''' ensure this is for the system we are CHANGING TO.
*'''min.A''' needs to be present, although the index listed in it is unimportant
*'''min.B''' needs to be present, although the index listed in it is unimportant
*'''min.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''min.in''' for use with [[AMBER]]. Defines certain aspects of the model, such as solvent being used.
*'''mutate_aa.py''' organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
*'''newreslog''' list of codes for residues which we are CHANGING TO.
*'''nresidueslog''' list of the total number of mutations to be made to the system.
*'''odata.checksp''' list of conditions for optimisation carried out on each stationary point by [[OPTIM]].
*'''oldreslog''' list of codes for residues which we are CHANGING FROM.
*'''original_protein.pdb''' pdb file for the system we are CHANGING FROM. All lines which do not dscribe an ATOM (e.g. TITLE, TER and END) are removed, so that the number of lines of the file should correspond to the number of atoms in the system.
*'''pathdata''' organises [[OPTIM]] jobs. Certain keywords are required, described below.
*'''perm.allow''' full description of the groups of permutable atoms in the system we are CHANGING TO.
*'''points.min''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''points.ts''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''resnumberlog''' list of indices of the residues to be mutated.
*'''ts.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''submission_script''' script for executing your binary.

== Example of Mutation ==

=== [Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH] ===

To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using [[DIJKSTRA]] and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.

*'''nresidueslog''' as we are only making one mutation, this is simply:
<pre>
1
</pre>
*'''resnumberlog''' this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
101
</pre>
*'''atomnumberlog''' this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
1569
</pre>
*'''oldreslog''' this is the residue code for the residue we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
<pre>
PHE
</pre>
*'''newreslog''' this is the residue code for the residue we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
<pre>
ALA
</pre>

=== pathdata and odata.checksp files ===

To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.

When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.

An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5491

AMBER12
</pre>

The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!

=== Before and After Mutation ===

Here is a rendering of [wt HemS + haem + NADH], alongside a rendering of the same minimum reoptimised after the F101A mutation was made.

<gallery mode=packed widths="540px" heights="360px">
File:wt_HemS_wiki.png|wt HemS
File:F101A_Hems_wiki.png|F101A HemS
</gallery>

== Example of Transformation to a Homologue ==

=== [Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH] ===

This is a more complex problem than the single-point mutation described above. HemS and Chus are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition to that, ChuS is two residues shorter than HemS, with one such deletion occurring approximately midway through the chain, and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

First, we need to align and compare the two systems.

==== wt HemS + haem + NADH ====

NSERILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP
LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP
ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU
VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG
HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU
GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO
GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU
ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS
ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN
VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE
MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL
THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU
THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER
LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYSHEM NAD

==== wt ChuS + haem + NADH ====

NASNHIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP
ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP
ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU
THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG
GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP
ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR
PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR
VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS
ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS
VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE
MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL
VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU
GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER
LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPROXXX HEM NAD

Here, red font signifies a residue which has been mutated, green represents a residue which has been deleted (given the code name XXX), and purple represents the two cofactors.

We now need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define all the mutations/insertions/deletions we are making, and where in the chain they take place. In this example, we are making 113 mutations/insertions/deletions (to be precise, 111 mutations and 2 deletions) in total:

*'''nresidueslog''' for 113 mutations, this is:
<pre>
113
</pre>
*'''resnumberlog''' this is a list of the indices of the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
2
4
.
.
336
337
338
</pre>
*'''atomnumberlog''' this is a list of the indices of the first atoms in the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
14
54
.
.
5307
5321
5335
</pre>
*'''oldreslog''' this is the residue code for the residues we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NSER
ILE
GLU
.
.
ASN
ASN
CLYS
</pre>
*'''newreslog''' this is the residue code for the residues we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NASN
HIE
THR
.
.
ILE
CPRO
XXX
</pre>
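
As a rough sketch of how these five files fit together, the following Python helper builds their contents from a single list of mutations. The function names and the tuple layout are hypothetical (they are not part of the scripts in /svn/SCRIPTS), and the exact whitespace that CHECKSPMUTATE expects, beyond the four-character residue codes described above, is an assumption.

```python
def pad_residue_code(code):
    """Pad a three-letter code with one trailing space; leave four-letter codes alone."""
    if len(code) not in (3, 4):
        raise ValueError("unexpected residue code: %r" % code)
    return code.ljust(4)


def build_mutation_logs(mutations):
    """Return {file name: file contents} for the five log files.

    Each mutation is (residue index, first atom index, old code, new code),
    with both indices referring to the ORIGINAL system.
    """
    columns = {
        "resnumberlog": [str(m[0]) for m in mutations],
        "atomnumberlog": [str(m[1]) for m in mutations],
        "oldreslog": [pad_residue_code(m[2]) for m in mutations],
        "newreslog": [pad_residue_code(m[3]) for m in mutations],
    }
    logs = {name: "\n".join(values) + "\n" for name, values in columns.items()}
    logs["nresidueslog"] = "%d\n" % len(mutations)
    return logs


# The first three entries and the last entry of the ChuS example above.
example = [(1, 1, "NSER", "NASN"), (2, 14, "ILE", "HIE"),
           (4, 54, "GLU", "THR"), (338, 5335, "CLYS", "XXX")]
logs = build_mutation_logs(example)
```

Keeping all of the mutations in one list means the five files cannot drift out of step with each other; each entry of the returned dictionary can then be written to a file of the same name.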

=== pathdata and odata.checksp files ===

Same as for the mutation example above. I would expect, for my example here with ChuS, a pathdata file like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5464

AMBER12
</pre>

== Final Notes ==

We therefore now have a method to transform the stationary points of a database of interest into those of a mutant or close homologue. This should allow a wild type to be compared directly against mutants of interest, or two protein homologues to be compared against each other. In the examples above, I have been interested in comparing the pathway of a cofactor as it enters a pocket against other pathways in which key residues have been removed, and against a homologue of suspected similar reactivity.

The advantage of this method is that time and computational resources can be significantly saved in transforming one database to a system of similar properties.

It must be noted, however, that following the use of CHECKSPMUTATE, the new pathway might have gaps, as not all of the stationary points will necessarily have converged. These gaps will need to be filled with the [[CONNECTPAIRS]] keyword. The full procedure that I used to fill such gaps is detailed in [[Pathway Gap Filling Post-CHECKSPMUTATE]].

IT MUST ALSO BE NOTED that the pathway for the new system will not necessarily be (and is indeed unlikely to be) the optimal one. Therefore, post CHECKSPMUTATE, an appropriate number of rounds of [[SHORTCUT]]/[[SHORTCUT 2 BARRIER]]/[[UNTRAP]] etc will need to be performed. It is not advised that CHECKSPMUTATE be used on proteins which are significantly different.

=== Potential Uses ===

CHECKSPMUTATE could potentially be used to shed light on some of the following problems:

* Comparing a wt protein against some mutations to determine how such mutations affect protein folding. Care will need to be taken that sufficient sampling for alternative pathways is performed.
* Comparing a wt protein against some mutations to determine how such mutations affect protein-cofactor interactions.
* Comparing a protein against a close homologue to see whether it engages in the same manner of protein folding. Care will need to be taken that sufficient sampling for alternative pathways is performed.
* Comparing a protein against a close homologue to see whether it engages in similar protein-cofactor interactions.
* Transformation of a cofactor (for example from NADH to NADPH) to see whether this affects protein-cofactor interactions.
* Comparing a DNA strand against an RNA/XNA strand. Note that this would require some different python scripts, also found in '''/svn/SCRIPTS/AMBER/BHmutation_steps''', such as '''nucleic_acids.py'''.

=== Acknowledgements ===

Though adk44 wrote the code for CHECKSPMUTATE, he interfaced this code with kr366's pre-existent python scripts (with minor alterations) which had originally been written to allow mutations as part of kr366's [[Mutational BH steps]] routine.

--adk44 16.45, 2 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T13:54:47Z

Adk44:

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown in the diagram below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When this happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, 14 were found to have been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/PATHSAMPLE/checkspmutate/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
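
The idea behind this step can be sketched in a few lines of Python. This is only an illustration, not the logic of duplicates.sh or duplicates.py: it assumes the (minimum index, random number) pairs have already been extracted from pathsample_checkmin.out in launch order, and it treats the whole list as a single batch. The function name is hypothetical.

```python
def find_overwritten(assignments):
    """Return the minima whose output was overwritten by a later job.

    `assignments` is a list of (minimum index, random number) pairs in the
    order the OPTIM jobs were launched. When a random number is reused, the
    earlier job's output is lost, so that earlier minimum is reported.
    """
    last_with_number = {}
    overwritten = []
    for minimum, number in assignments:
        if number in last_with_number:
            overwritten.append(last_with_number[number])
        last_with_number[number] = minimum
    return sorted(overwritten)


# Jobs for minima 1, 3 and 4 all drew 501: only the last of them survives.
jobs = [(1, 501), (2, 777), (3, 501), (4, 501)]
lost = find_overwritten(jobs)
```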

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/PATHSAMPLE/checkspmutate/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points to our new mutated system/homologue (bearing in mind that not all will have converged) as well as reoptimised any overwritten ones, we now need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also, required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/PATHSAMPLE/checkspmutate/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
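
The ordered concatenation can be pictured with the following Python sketch. It is not a reimplementation of organise_mindatainfo_min.sh: it assumes the **** suffix is a plain integer that reflects the desired order, and it leaves out the splicing-in of the overwritten reoptimisations from the sub-directories. The function names are hypothetical.

```python
from pathlib import Path

PREFIX = "min.data.info."


def numbered_info_files(directory):
    """Return the min.data.info.<integer> files in ascending numerical order."""
    found = []
    for path in Path(directory).iterdir():
        suffix = path.name[len(PREFIX):]
        # Skips min.data.info.total and anything else without an integer suffix.
        if path.name.startswith(PREFIX) and suffix.isdigit():
            found.append((int(suffix), path))
    return [path for _, path in sorted(found)]


def concatenate_info_files(directory, output_name="min.data.info.total"):
    """Concatenate the numbered files into one file and return their names in order."""
    paths = numbered_info_files(directory)
    with open(Path(directory) / output_name, "w") as out:
        for path in paths:
            out.write(path.read_text())
    return [path.name for path in paths]
```

Sorting on the integer value rather than on the file name matters here: a plain alphabetical sort would place min.data.info.10 before min.data.info.2.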

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/PATHSAMPLE/checkspmutate/readts''').

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
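
A minimal Python sketch of that renaming step is given below. The function name is hypothetical; the checks simply guard against running it in the wrong directory or clobbering an existing ts.data or points.ts.

```python
from pathlib import Path

# READMIN output name -> name expected for a transition state database.
RENAMES = {"points.min": "points.ts", "min.data": "ts.data"}


def rename_readmin_output(directory):
    """Rename the READMIN output in `directory` to its TS equivalents."""
    directory = Path(directory)
    # Check everything first so that a failure leaves the directory untouched.
    for old, new in RENAMES.items():
        if not (directory / old).exists():
            raise FileNotFoundError("%s not found in %s" % (old, directory))
        if (directory / new).exists():
            raise FileExistsError("%s already exists in %s" % (new, directory))
    for old, new in RENAMES.items():
        (directory / old).rename(directory / new)
```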

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in '''/svn/SCRIPTS/PATHSAMPLE/checkspmutate/all_opt_TSs''', are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves each one to a subdirectory whose index corresponds to where that TS appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to it. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the database of minima obtained in the readmin directory has not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in '''/svn/SCRIPTS/PATHSAMPLE/checkspmutate/generate_connectfile'''.

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
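
Keeping CYCLES in step with connectfile by hand is easy to forget, so a small helper along the following lines may be useful. It is a sketch under the rule stated above (one cycle per non-empty line of connectfile); the function name is hypothetical and it operates on the text of the two files rather than on the files themselves.

```python
def set_cycles(pathdata_text, connectfile_text):
    """Return pathdata text with CYCLES set to the number of connectfile entries."""
    n_pairs = sum(1 for line in connectfile_text.splitlines() if line.strip())
    lines = []
    replaced = False
    for line in pathdata_text.splitlines():
        fields = line.split()
        # Commented-out lines such as "! CYCLES 0" start with "!" and are left alone.
        if fields and fields[0] == "CYCLES":
            line = "CYCLES %d" % n_pairs
            replaced = True
        lines.append(line)
    if not replaced:
        raise ValueError("no active CYCLES keyword found in pathdata")
    return "\n".join(lines) + "\n"


pathdata = "NATOMS 5464\n! CYCLES 0\nCONNECTPAIRS connectfile\nCYCLES 244\n\nAMBER12\n"
connectfile = "12 13\n40 41\n\n97 101\n"
updated = set_cycles(pathdata, connectfile)
```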

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
</pre>

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. This may or may not occur with your system. In case it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
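
The scheme can be illustrated with a short Python sketch. It makes several simplifying assumptions that differ from the real workflow: the distance is a plain Euclidean distance between coordinate tuples (no permutational alignment), the minima connected by each TS are supplied as ready-made index pairs, and every planned connection is assumed to succeed, whereas in practice two sub-databases only merge once [[OPTIM]] actually connects them. The function names are hypothetical.

```python
import math
from itertools import combinations


def find_root(parent, i):
    """Follow parent links to the representative of i's sub-database."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
        i = parent[i]
    return i


def plan_connections(coords, ts_pairs):
    """Return the pairs of minima to attempt, shortest distance first.

    `coords` maps minimum index -> coordinate tuple; `ts_pairs` lists the two
    minima joined by each transition state.
    """
    # Group minima into sub-databases: connected components of the TS graph.
    parent = {i: i for i in coords}
    for a, b in ts_pairs:
        parent[find_root(parent, a)] = find_root(parent, b)

    # Distances between every pair of minima in different sub-databases.
    candidates = sorted(
        (math.dist(coords[a], coords[b]), a, b)
        for a, b in combinations(sorted(coords), 2)
        if find_root(parent, a) != find_root(parent, b)
    )

    plan = []
    for _, a, b in candidates:
        root_a, root_b = find_root(parent, a), find_root(parent, b)
        if root_a != root_b:  # skip pairs whose sub-databases are already joined
            plan.append((a, b))
            parent[root_a] = root_b
    return plan


coords = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (5.0, 0.0), 4: (6.0, 0.0), 5: (20.0, 0.0)}
plan = plan_connections(coords, ts_pairs=[(1, 2), (3, 4)])
```

In this toy example the three sub-databases {1, 2}, {3, 4} and {5} are joined with only two attempts, and the long 1 to 5 gap is never attempted because the shorter 4 to 5 pair does the same job.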

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in '''/svn/SCRIPTS/DISCONNECT''') and connect_sub_databases.sh (which can be found in '''/svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases''').

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running [[disconnectionDPS]] in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the [[disconnectionDPS]] binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
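
The selection rule described above can be written down compactly. The sketch below works on in-memory lists of minima rather than on the '''connected_*''' files themselves, since their exact layout is not reproduced here; the function name, the dictionary layout and the default cutoff argument are illustrative only.

```python
def select_relevant(sub_databases, endpoints, cutoff=10):
    """Pick the sub-databases worth connecting.

    `sub_databases` maps a connected_* name to its list of minima. A
    sub-database is kept if it has at least `cutoff` minima, or if it
    contains one of the two end-point minima of the pathway of interest.
    """
    endpoints = set(endpoints)
    relevant = {}
    for name, minima in sub_databases.items():
        members = set(minima)
        if len(members) >= cutoff or members & endpoints:
            relevant["relevant_" + name] = sorted(members)
    return relevant


groups = {
    "connected_1": list(range(1, 13)),  # 12 minima, above the cutoff
    "connected_2": [20, 21],            # small and holds no end point: dropped
    "connected_3": [30],                # small, but holds an end point: kept
}
kept = select_relevant(groups, endpoints=(1, 30))
```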

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall one in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
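
Editing pathdata by hand for every subdirectory becomes tedious when there are many of them. The following Python sketch generates the pathdata text for each merge from one template. It assumes that all of the connection subdirectories follow the five-digit_five-digit naming of the single example above (00025_00102), which is an inference rather than something stated; the function name is hypothetical, and running the PATHSAMPLE binary after writing each pathdata is left out.

```python
import re

# Naming pattern inferred from the example subdirectory 00025_00102.
SUBDIR_PATTERN = re.compile(r"\d{5}_\d{5}")


def merge_pathdata_files(template_text, directory_names):
    """Return [(subdirectory, pathdata text)] with MERGEDB pointing at each one."""
    results = []
    for name in sorted(directory_names):
        if not SUBDIR_PATTERN.fullmatch(name):
            continue  # ignore anything that is not a connection subdirectory
        lines = []
        for line in template_text.splitlines():
            fields = line.split()
            if fields and fields[0] == "MERGEDB":
                line = "MERGEDB ./" + name
            lines.append(line)
        results.append((name, "\n".join(lines) + "\n"))
    return results


template = "NATOMS 5464\nMERGEDB ./00025_00102\n\nAMBER12\n"
merges = merge_pathdata_files(template, ["00030_00007", "startfrompath", "00025_00102"])
```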

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:33:08Z

Adk44: /* A Few Notes */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown in the diagram below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When this happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, 14 were found to have been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
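The detection step can be illustrated with a minimal sketch. This is not duplicates.sh or duplicates.py from the svn repository: it assumes the (minimum index, random number) pairs have already been extracted from pathsample_checkmin.out in launch order, since the exact layout of that log is not reproduced here, and the function name is hypothetical.

```python
from collections import defaultdict


def find_overwritten(jobs):
    """Return the indices of minima whose OPTIM job was overwritten.

    jobs: iterable of (minimum_index, random_number) pairs in launch order.
    When several jobs share a random number, only the last one launched
    survives, so every earlier one is reported.
    """
    by_number = defaultdict(list)
    for minimum_index, random_number in jobs:
        by_number[random_number].append(minimum_index)

    overwritten = []
    for indices in by_number.values():
        # Everything except the final job with this number was lost.
        overwritten.extend(indices[:-1])
    return sorted(overwritten)
```

The returned list plays the role of checkminfile: it names the minima that need to be reoptimised.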

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a file containing the coordinates of all of the reoptimised minima/TSs. First, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
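The ordered concatenation can be sketched roughly as follows. This is not organise_mindatainfo_min.sh itself. It assumes that a mapping from each original stationary-point index to the file holding its reoptimised structure has already been built (for instance from pathsample_checkmin.out and checkminfile); both the mapping and the function name are hypothetical.

```python
from pathlib import Path


def concatenate_in_order(index_to_file, output_path):
    """Concatenate per-job files in ascending order of original index.

    index_to_file: dict mapping original index -> path of its output file.
    Indices whose file does not exist (e.g. unconverged runs) are skipped.
    Returns the list of indices actually written.
    """
    written = []
    with open(output_path, "w") as out:
        for index in sorted(index_to_file):
            path = Path(index_to_file[index])
            if not path.is_file():
                continue
            text = path.read_text()
            out.write(text)
            # Keep entries separated even if a file lacks a final newline.
            if text and not text.endswith("\n"):
                out.write("\n")
            written.append(index)
    return written
```

Returning the indices that were written makes it easy to see which stationary points are missing from min.data.info.total, and therefore where gaps in the pathway are likely to appear.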

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readts''').

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
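A cautious way of doing the renaming is sketched below. The function name is hypothetical, and the checks are an added safeguard rather than part of the original procedure: nothing is renamed unless both source files exist and neither target would be overwritten.

```python
import os


def relabel_as_ts(directory="."):
    """Rename READMIN output so that it describes transition states."""
    renames = (("points.min", "points.ts"), ("min.data", "ts.data"))

    # Check everything first so a failure cannot leave the
    # directory half renamed.
    for old, new in renames:
        if not os.path.exists(os.path.join(directory, old)):
            raise FileNotFoundError(old)
        if os.path.exists(os.path.join(directory, new)):
            raise FileExistsError(new)

    for old, new in renames:
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```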

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in '''/svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs''', are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts each transition state in the database, moves it to a subdirectory indexed by its position in ts.data, and then performs the descent calculations to determine the two minima directly connected to it. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).
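One piece of this loop, preparing a pathdata file for each transition state, might look like the sketch below. It is not descents.sh itself. It assumes that the integer after EXTRACTTS selects a transition state by its position in ts.data (consistent with the EXTRACTTS 1 line in the template above), and all function names are hypothetical.

```python
def count_entries(data_text):
    """Number of non-blank lines, i.e. stationary points listed."""
    return sum(1 for line in data_text.splitlines() if line.strip())


def pathdata_for_ts(template, ts_index):
    """Copy of the template with the active EXTRACTTS line set to ts_index.

    Lines commented out with '!' are left untouched.
    """
    out = []
    replaced = False
    for line in template.splitlines():
        fields = line.split()
        if fields and fields[0] == "EXTRACTTS":
            out.append(f"EXTRACTTS {ts_index}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        raise ValueError("template has no active EXTRACTTS line")
    return "\n".join(out) + "\n"


def per_ts_pathdata(template, ts_data_text):
    """Map each 1-based TS index to its own pathdata text."""
    n_ts = count_entries(ts_data_text)
    return {i: pathdata_for_ts(template, i) for i in range(1, n_ts + 1)}
```

Each entry of the returned dictionary would then be written to the pathdata file in the subdirectory for that transition state before the extraction and the [[OPTIM]] PATH run are launched there.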

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. All of the TSs and their associated minima will now be added to the database in the startfrompath directory. We therefore no longer have just a database of transition states, but one of transition states and their associated minima (note that the database of minima obtained in the readmin directory has not yet been added. This will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile'''.

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.
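To make the idea of a gap concrete, here is a minimal sketch of one way gaps could be identified. It is not the logic of generate_connectfile.py, which is not reproduced here. It assumes that the fourth and fifth columns of each ts.data line hold the indices of the two minima connected by that transition state, that consecutive lines are neighbours along the pathway, and that a gap is recorded as the second minimum of one TS paired with the first minimum of the next. The last choice in particular is arbitrary and purely illustrative.

```python
def parse_ts_minima(ts_data_text):
    """Return [(min1, min2), ...] from columns 4 and 5 of each line."""
    pairs = []
    for line in ts_data_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            pairs.append((int(fields[3]), int(fields[4])))
    return pairs


def find_gaps(ts_minima):
    """Pairs of minima to connect between consecutive transition states.

    Two neighbouring transition states are already joined if they share
    a minimum; otherwise one pair of minima is proposed to bridge them.
    """
    gaps = []
    for (a1, a2), (b1, b2) in zip(ts_minima, ts_minima[1:]):
        if {a1, a2} & {b1, b2}:
            continue
        gaps.append((a2, b1))
    return gaps
```

Writing each returned pair on its own line gives a file with one connection attempt per line, which is the shape connectfile is described as having later in this article.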

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of the whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. If this happens with your system (as it did with some of mine), the methods outlined in [[Connecting Sub-databases]] could be very helpful. In short, all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains minima that are all connected to one another within the overall database, while no minimum in one sub-database is connected to any minimum in another. Once the minima are grouped in this way, the method compares every minimum in each sub-database with every minimum in all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases become connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
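The ordering logic of this scheme can be sketched as below. This is an illustration of the idea only, not connect_sub_databases.sh: the distance function and the try_connect callback (which stands in for launching a real connection attempt and reporting whether it succeeded) are hypothetical placeholders supplied by the caller.

```python
from itertools import combinations


def connect_sub_databases(sub_databases, distance, try_connect):
    """Attempt connections between sub-databases, shortest gaps first.

    sub_databases: list of lists of minimum indices.
    distance(a, b): distance between minima a and b.
    try_connect(a, b): True if the connection attempt succeeded.
    Returns (attempted_pairs, number_of_sub_databases_remaining).
    """
    owner = {m: label for label, group in enumerate(sub_databases) for m in group}
    parent = list(range(len(sub_databases)))

    def find(x):
        # Follow parent links to the representative of a merged group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every pair of minima drawn from two different sub-databases.
    candidates = []
    for group_i, group_j in combinations(sub_databases, 2):
        for a in group_i:
            for b in group_j:
                candidates.append((distance(a, b), a, b))
    candidates.sort()

    attempted = []
    remaining = len(sub_databases)
    for _, a, b in candidates:
        if remaining <= 1:
            break
        root_a, root_b = find(owner[a]), find(owner[b])
        if root_a == root_b:
            continue  # already connected through an earlier success
        attempted.append((a, b))
        if try_connect(a, b):
            parent[root_b] = root_a
            remaining -= 1
    return attempted, remaining
```

Because merged sub-databases share a representative, a failed attempt on the closest pair simply falls through to the next-closest pair between the same two sub-databases, which is the behaviour described above.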

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in '''/svn/SCRIPTS/disconnect''') and connect_sub_databases.sh (which can be found in '''/svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases''').

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running [[disconnectionDPS]] in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the [[disconnectionDPS]] binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases, and usually only consider those containing 10 minima or more. It's up to you what cutoff, if any, you wish to have. For the sub-databases above the cutoff, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
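The selection rule can be written down compactly. The sketch below assumes the contents of each '''connected_*''' file have already been read into a list of minimum indices; the function name and arguments are hypothetical.

```python
def select_relevant(sub_databases, cutoff=10, endpoints=()):
    """Keep sub-databases that meet the size cutoff or hold an endpoint.

    sub_databases: list of lists of minimum indices.
    endpoints: minima at the two ends of the pathway of interest, which
    must be retained whatever the size of their sub-database.
    """
    wanted = set(endpoints)
    return [
        group
        for group in sub_databases
        if len(group) >= cutoff or wanted & set(group)
    ]
```

For example, with a cutoff of 10, a three-member sub-database is normally dropped but is kept if it contains one of the two end-point minima.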

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the relevant_connected directory. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
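When there are many subdirectories, the successive MERGEDB lines can be generated rather than typed by hand. The sketch below assumes that the connection subdirectories are the only entries named as two integers joined by an underscore (as in 00025_00102); that naming rule and the function name are assumptions for illustration.

```python
import re


def mergedb_lines(directory_names):
    """Return one MERGEDB line per connection subdirectory, sorted by name.

    Names that do not look like <digits>_<digits> are ignored.
    """
    pattern = re.compile(r"\d+_\d+")
    return [
        f"MERGEDB ./{name}"
        for name in sorted(directory_names)
        if pattern.fullmatch(name)
    ]
```

Each returned line replaces the MERGEDB line in pathdata for one PATHSAMPLE run, so the merges are still carried out one subdirectory at a time.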

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

However, it must be noted that this new pathway is probably not optimal. Some barriers are likely to be very high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, to keep the refined data separate from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories, and copy over the large points.min and points.ts files. In which case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:30:34Z

Adk44: /* Execution */

== Introduction ==

This is a recommended procedure to follow after running [[CHECKSPMUTATE]] when the database being reoptimised describes a pathway.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points will reoptimise after mutation (because of steric clashes, for example). There will therefore be gaps in our new, mutated pathway, hence the need for post-processing to fill them.

Please note that the method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The diagram below shows the directories used for CHECKSPMUTATE and its post-processing. The bash scripts are set up to move between these directories, so they will need to be adapted if your directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in the checkmin and checkts directories, for minima and transition states respectively. [[PATHSAMPLE]] assigns each [[OPTIM]] job a random number, so two or more [[OPTIM]] jobs within the same [[PATHSAMPLE]] batch can be assigned the same number. When this happens, the earlier job gets overwritten by the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else appears to have had a problem with it before (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, so my fix is to reoptimise the stationary points whose files were overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), 14 of the 1235 reoptimised minima had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a file containing the coordinates of all of the reoptimised minima/TSs. First, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readts''').

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in '''/svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs''', are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).
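The per-TS setup can be sketched roughly as below. This is not the contents of descents.sh: the function name prepare_ts_dirs and its arguments are hypothetical, and the sketch assumes that ts.data holds one transition state per line and that the template pathdata contains an EXTRACTTS line, as in the example above. It only creates the indexed sub-directories and their pathdata files; extracting the TS and running the [[OPTIM]] descents are left out.

```shell
#!/bin/bash
# Hypothetical helper: for each transition state listed in ts.data ($2),
# create an indexed sub-directory under $3 containing a copy of the
# template pathdata ($1) with the EXTRACTTS index set accordingly.
prepare_ts_dirs() {
    local template="$1" tsdata="$2" outdir="$3" n i
    # Assumption: one transition state per non-empty line of ts.data.
    n=$(awk 'NF { c++ } END { print c + 0 }' "$tsdata")
    for ((i = 1; i <= n; i++)); do
        mkdir -p "$outdir/$i"
        sed "s/^EXTRACTTS .*/EXTRACTTS $i/" "$template" > "$outdir/$i/pathdata"
    done
    echo "$n"    # report how many sub-directories were prepared
}
```

Commented-out lines in the template (those starting with `!`) are left untouched, since the substitution only matches lines that begin with EXTRACTTS.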

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we no longer have just a database of transition states, but one of transition states and their associated minima (note that the database of minima obtained in the readmin directory has not yet been added. This will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile'''.

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
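A small sketch of keeping CYCLES in step with connectfile follows. The function names count_connections and set_cycles are hypothetical, and the sketch assumes one connection attempt per non-empty line of connectfile, as the sentence above suggests. How that count should map onto CYCLES for your own run is for you to decide.

```shell
#!/bin/bash
# Hypothetical helper: count the non-empty lines in connectfile ($1).
count_connections() {
    awk 'NF { c++ } END { print c + 0 }' "$1"
}

# Hypothetical helper: set the active CYCLES line in pathdata ($1) to $2.
# Commented-out lines such as "! CYCLES 1" are left untouched.
set_cycles() {
    local pathdata="$1" n="$2"
    sed "s/^CYCLES .*/CYCLES $n/" "$pathdata" > "$pathdata.tmp" \
        && mv "$pathdata.tmp" "$pathdata"
}
```

For example, `set_cycles pathdata "$(count_connections connectfile)"` would rewrite the CYCLES line before the job is submitted.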

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. This may or may not happen with your system. If it does (as happened with some of mine), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming the problem. In short, all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped in this way, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ "$STATUS" -eq 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in '''/svn/SCRIPTS/disconnect''') and connect_sub_databases.sh (which can be found in '''/svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases''').

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running [[disconnectionDPS]] in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the [[disconnectionDPS]] binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
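The size cutoff can be applied with something like the sketch below. The function name keep_large_subdatabases is hypothetical, and the sketch assumes that each '''connected_*''' file lists one minimum per line. Whether find_connections.sh already does this for you depends on the version of the script you have.

```shell
#!/bin/bash
# Hypothetical helper: copy every connected_* file in directory $1 that
# lists at least $2 minima (default 10) to relevant_connected_*.
keep_large_subdatabases() {
    local dir="$1" cutoff="${2:-10}" f n
    for f in "$dir"/connected_*; do
        [ -e "$f" ] || continue                 # skip if the glob matched nothing
        # Assumption: one minimum per non-empty line.
        n=$(awk 'NF { c++ } END { print c + 0 }' "$f")
        if [ "$n" -ge "$cutoff" ]; then
            cp "$f" "$dir/relevant_$(basename "$f")"
        fi
    done
}
```

A sub-database below the cutoff that contains one of the two end-point minima still has to be copied by hand, as described above.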

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.
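The ordering step on its own is simple to illustrate. The sketch below is not taken from connect_sub_databases.sh: the function name order_pairs_by_distance is hypothetical, as is the three-column input format (first minimum, second minimum, distance) assumed here.

```shell
#!/bin/bash
# Hypothetical helper: given a file ($1) with lines of the form
#   <minimum A> <minimum B> <distance>
# print the pairs ordered from shortest to longest distance.
order_pairs_by_distance() {
    # -g sorts by general numeric value, so exponents such as 1.5e-2 are handled;
    # LC_ALL=C fixes the decimal point to '.'.
    LC_ALL=C sort -k3,3g "$1"
}
```

Connection attempts would then be made by reading the sorted list from the top, skipping any pair whose two sub-databases have already been joined by an earlier successful attempt.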

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
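Changing the MERGEDB argument by hand for every subdirectory gets tedious, so it can be scripted along the lines below. The function name set_mergedb is hypothetical, and the sketch assumes pathdata already contains a MERGEDB line and that the subdirectories are named like 00025_00102. The PATHSAMPLE call in the commented loop is a placeholder for whatever binary you use.

```shell
#!/bin/bash
# Hypothetical helper: point the MERGEDB line of pathdata ($1)
# at subdirectory $2 of the current directory.
set_mergedb() {
    local pathdata="$1" subdir="$2"
    sed "s|^MERGEDB .*|MERGEDB ./$subdir|" "$pathdata" > "$pathdata.tmp" \
        && mv "$pathdata.tmp" "$pathdata"
}

# Possible loop over all pair subdirectories (binary name is a placeholder):
# for d in [0-9]*_[0-9]*/; do
#     set_mergedb pathdata "${d%/}"
#     /home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > "pathsample_merge_${d%/}.out"
# done
```

Each merge is run one after another in the same directory, so the database grows with every pass.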

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

However, it must be noted that this new pathway is probably not optimal. Some barriers are likely to be very high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, to keep the resulting data separate from the unrefined pathway so that we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles with this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:29:57Z

Adk44: /* Files Required */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown in the figure that follows. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]], with each job being assigned a random number, it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the earlier job gets overwritten by the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, it was found that 14 of these had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
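The core of the duplicate check can be sketched as below. This is not the contents of duplicates.sh or its python helpers: the function name find_overwritten is hypothetical, and the sketch assumes that the job index and random number of each OPTIM job have already been extracted from pathsample_checkmin.out into a two-column file, in launch order. The extraction itself depends on the log format and is not shown.

```shell
#!/bin/bash
# Hypothetical helper: input file ($1) has one line per OPTIM job, in launch
# order, of the form
#   <stationary point index> <random number>
# Print the index of every job whose random number was reused by a later job,
# i.e. the jobs whose output would have been overwritten.
find_overwritten() {
    awk '{
        if ($2 in last) print last[$2]   # earlier job with this number is overwritten
        last[$2] = $1                    # remember the most recent job using it
    }' "$1"
}
```

If three jobs share a number, the first two are reported, since only the last one to run keeps its output.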

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points to our new mutated system/homologue (bearing in mind that not all will have converged) as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, execute organise_mindatainfo_min.sh. This produces the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readts''').

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in '''/svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs''', are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we no longer have just a database of transition states, but one of transition states and their associated minima (note that the database of minima obtained in the readmin directory has not yet been added. This will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile'''.

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
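As a rough aid when choosing CYCLES, the pairs in connectfile can be counted beforehand. The sketch below assumes that each non-blank line of connectfile holds one pair of minima indices; the function name is purely illustrative and is not part of PATHSAMPLE.

```python
def count_connection_pairs(connectfile_text):
    """Count candidate connections in the text of a connectfile.

    Assumption: one pair of integer minima indices per non-blank line.
    """
    pairs = []
    for line in connectfile_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((int(fields[0]), int(fields[1])))
    return len(pairs)


# Hypothetical connectfile contents, used only for illustration.
example = "1 5\n5 9\n\n12 40\n"
print(count_connection_pairs(example))
```

How the resulting count translates into a CYCLES value still depends on how many connection attempts are launched per cycle in your own setup.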

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
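An alternative to modifying [[OPTIM]] would be to filter the list of pairs before it is handed to CONNECTPAIRS. The sketch below is not what CUDAOPTIM_stop_large_gap does internally; it is a swapped-in illustration that assumes the coordinates of each minimum are already available as flat lists, and it uses a plain Euclidean distance with no alignment or permutational matching (unlike PERMDIST). All names are hypothetical.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two flat coordinate lists (no alignment)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def drop_large_gaps(pairs, coords, max_distance):
    """Keep only the pairs whose separation does not exceed max_distance.

    pairs:  list of (i, j) minima indices, e.g. as read from connectfile
    coords: dict mapping a minimum index to its flat coordinate list (assumed input)
    """
    return [(i, j) for i, j in pairs
            if euclidean(coords[i], coords[j]) <= max_distance]


# Toy coordinates, for illustration only.
coords = {1: [0.0, 0.0, 0.0], 2: [3.0, 4.0, 0.0], 3: [0.0, 0.0, 1.0]}
print(drop_large_gaps([(1, 2), (1, 3), (2, 3)], coords, max_distance=2.0))
```

Because the distance here ignores overall rotation and permutational isomers, it will generally overestimate the separation that OPTIM would compute, so any cutoff chosen this way should be treated as approximate.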

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. This may or may not happen with your system. If it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming the problem. In short, all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, and no minimum in one sub-database is connected to any minimum in another. Once the minima are grouped, the method compares every minimum in each sub-database with all of the minima in all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
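The ordering logic described above can be sketched with a union-find structure. This is only an illustration of the idea, not the code in connect_sub_databases.sh: it assumes the existing connections and the inter-minimum distances have already been extracted, and the optional <code>attempt</code> callback (hypothetical) stands in for a real connection run.

```python
def find(parent, i):
    """Return the representative of the sub-database containing minimum i."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def plan_connections(n_minima, ts_edges, distances, attempt=lambda a, b: True):
    """Attempt connections shortest-first, skipping already-joined sub-databases.

    n_minima:  number of minima, indexed 1..n_minima
    ts_edges:  (min1, min2) pairs already linked by a transition state
    distances: dict {(a, b): distance} for candidate pairs (assumed precomputed)
    attempt:   callback returning True if the connection succeeded (assumption)
    """
    parent = list(range(n_minima + 1))
    for a, b in ts_edges:
        parent[find(parent, a)] = find(parent, b)

    attempted = []
    for (a, b), _ in sorted(distances.items(), key=lambda item: item[1]):
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a == root_b:
            continue  # these two sub-databases are already connected
        attempted.append((a, b))
        if attempt(a, b):
            parent[root_a] = root_b  # the two sub-databases merge
    return attempted


# Three toy sub-databases: {1, 2}, {3, 4} and {5, 6}.
edges = [(1, 2), (3, 4), (5, 6)]
dists = {(2, 3): 1.5, (1, 4): 2.0, (4, 5): 3.0, (2, 6): 4.0}
print(plan_connections(6, edges, dists))
```

In this toy case the pair (1, 4) is skipped because (2, 3) has already joined the same two sub-databases, which is exactly the saving the scheme is designed to provide.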

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file with the PRINTCONNECTED keyword included, such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for PATHSAMPLE, sub_script_connections, such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in '''/svn/SCRIPTS/disconnect''') and connect_sub_databases.sh (which can be found in '''/svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases''').

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I find it a bit of a waste of time to consider really small sub-databases, and tend to consider only those with 10 or more minima. It's up to you what cutoff, if any, you wish to use. For the sub-databases that meet the cutoff, the '''connected_*''' files listing their minima are copied to '''relevant_connected_*''' files. Once find_connections.sh has finished, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
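The selection rule can be written down compactly. The sketch below assumes the contents of each '''connected_*''' file have already been read into a list of minima indices; the file names and the function are hypothetical and only illustrate the cutoff plus the exception for the pathway end points.

```python
def select_relevant(groups, cutoff=10, endpoints=()):
    """Return the relevant_connected_* names that should be kept.

    groups:    dict mapping a connected_* file name to its list of minima
    cutoff:    minimum number of minima for a sub-database to be considered
    endpoints: minima at the two ends of the pathway; their sub-databases
               are always retained, whatever their size
    """
    keep = []
    for name, minima in sorted(groups.items()):
        big_enough = len(minima) >= cutoff
        holds_endpoint = any(m in minima for m in endpoints)
        if big_enough or holds_endpoint:
            keep.append("relevant_" + name)
    return keep


# Toy sub-databases, for illustration only.
groups = {
    "connected_1": list(range(1, 13)),  # 12 minima, above the cutoff
    "connected_2": [13, 14],            # too small, no end point
    "connected_3": [15],                # too small, but holds an end point
}
print(select_relevant(groups, cutoff=10, endpoints=(1, 15)))
```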

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the relevant_connected directory. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
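Changing the MERGEDB argument by hand quickly becomes tedious when there are many subdirectories. A minimal sketch of automating the substitution is given below; it assumes pathdata contains exactly one active MERGEDB line, and the function name and the second subdirectory name are hypothetical.

```python
def set_mergedb(pathdata_text, subdir):
    """Return pathdata text with the active MERGEDB line pointing at subdir.

    Lines commented out with '!' are left untouched.
    """
    new_lines = []
    replaced = False
    for line in pathdata_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "MERGEDB":
            new_lines.append("MERGEDB ./" + subdir)
            replaced = True
        else:
            new_lines.append(line)
    if not replaced:
        raise ValueError("no active MERGEDB line found in pathdata")
    return "\n".join(new_lines) + "\n"


# Shortened pathdata template, for illustration only.
template = "NATOMS 5464\nMERGEDB ./00025_00102\n\nAMBER12\n"
print(set_mergedb(template, "00031_00200"))
```

Each generated pathdata would then be written out and PATHSAMPLE run once per subdirectory, one merge at a time.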

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

However, it must be noted that this new pathway is probably not optimal. Some barriers are likely to be awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, to keep the resulting data separate from the unrefined pathway so that we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles with this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:25:33Z

Adk44: /* Files Required */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points will reoptimise post-mutation (there could be steric clashes, etc.). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note that the method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these directories, so they will need to be adapted if the directories are named or organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run the [[CHECKSPMUTATE]] calculations for minima and transition states in the checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] (each job is given a random number), it is possible for two or more [[OPTIM]] jobs to be assigned the same random number within the same [[PATHSAMPLE]] batch. When this happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else appears to have reported it (I can only assume that others have either not run into it or have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, so my fix is to reoptimise the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), 14 of the 1235 minima which were reoptimised had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also require duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
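The idea behind the duplicate search can be illustrated as follows. This is not the code in duplicates.py: it assumes the (minimum index, random number) pairs have already been parsed from pathsample_checkmin.out in launch order and that they all belong to a single batch, and the function name and numbers are hypothetical.

```python
def overwritten_minima(assignments):
    """Return the minima whose OPTIM output was overwritten by a later job.

    assignments: list of (minimum_index, job_number) in launch order,
                 assumed to come from one PATHSAMPLE batch.
    A job's output survives only if it was the last to use its number.
    """
    last_user = {}
    for minimum, job_number in assignments:
        last_user[job_number] = minimum  # later jobs replace earlier ones
    return sorted(m for m, job_number in assignments
                  if last_user[job_number] != m)


# Toy log contents: minima 1 and 3 share a number, as do minima 2 and 5.
log = [(1, 4821), (2, 77), (3, 4821), (4, 90), (5, 77)]
print(overwritten_minima(log))
```

The indices returned here play the role of the entries in checkminfile, i.e. the minima that must be reoptimised separately.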

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), and have reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a file of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such files by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs; it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
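The ordering used when building min.data.info.total can be sketched as below. This is an illustration rather than the contents of organise_mindatainfo_min.sh: it assumes the text of each reoptimised structure has already been matched to the index of the original minimum it came from (including those recovered from the overwritten reoptimisations), and the names are hypothetical.

```python
def concatenate_in_order(blocks_by_minimum, n_original):
    """Concatenate reoptimised entries in the order of the original database.

    blocks_by_minimum: dict mapping original minimum index -> text block
    n_original:        number of minima in the original, non-mutated database
    Minima with no block (i.e. those that did not converge) are skipped.
    """
    parts = []
    for index in range(1, n_original + 1):
        block = blocks_by_minimum.get(index)
        if block is None:
            continue
        parts.append(block if block.endswith("\n") else block + "\n")
    return "".join(parts)


# Toy blocks: minimum 2 failed to converge, so it is absent.
blocks = {3: "entry for minimum 3\n", 1: "entry for minimum 1\n", 4: "entry for minimum 4"}
print(concatenate_in_order(blocks, n_original=4), end="")
```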

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readts''').

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem: all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in '''/svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs''', are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts each transition state in the database, moves it to a subdirectory whose index corresponds to its position in ts.data, and then performs the descent calculations to determine the two minima directly connected to that TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. All of the TSs and their associated minima will now be added to the database in the startfrompath directory. Thus we no longer have just a database of transition states, but one of transition states and their associated minima (note that the database of minima obtained in the readmin directory has not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile'''.

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. This generates connectfile.
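To give a feel for what 'identifying the gaps' means, the sketch below walks along an ordered list of transition states and reports where consecutive ones do not share a minimum. It is not the logic of generate_connectfile.py: it assumes the two minima attached to each TS have already been read from ts.data and are stored as (start, end) in the direction of the pathway, which is a simplification, and the function name is hypothetical.

```python
def find_gaps(ts_minima):
    """Return pairs of minima that need connecting to make the path continuous.

    ts_minima: list of (start_minimum, end_minimum) for each TS, ordered
               along the pathway (assumed orientation).
    A gap exists wherever one TS ends at a different minimum from the one
    at which the next TS starts.
    """
    gaps = []
    for (_, end), (start, _) in zip(ts_minima, ts_minima[1:]):
        if end != start:
            gaps.append((end, start))
    return gaps


# Toy pathway with two breaks: between minima 3 and 4, and between 6 and 8.
path = [(1, 2), (2, 3), (4, 5), (5, 6), (8, 9)]
print(find_gaps(path))
```

Each reported pair corresponds to one line that would need to appear in a file such as connectfile.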

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a larger pool of existing, potentially relevant minima to draw upon when trying to fill the gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of the whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. This may or may not happen with your system. If it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming the problem. In short, all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, and no minimum in one sub-database is connected to any minimum in another. Once the minima are grouped, the method compares every minimum in each sub-database with all of the minima in all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file with the PRINTCONNECTED keyword included, such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for PATHSAMPLE, sub_script_connections, such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is added to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

However, it must be noted that this new pathway is probably not optimal. Some barriers are likely to be awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should therefore be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, to keep the resulting data separate from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:24:57Z

Adk44: /* Execution */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points will reoptimise after mutation (there could be steric clashes, etc.). Thus, there will be gaps in our new, mutated pathway, hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so they will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] (each job is assigned a random number), it is possible for two or more [[OPTIM]] jobs to be assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into this problem, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, so my fix involves reoptimising the overwritten stationary points. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), 14 of the 1235 minima which were reoptimised had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
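The overwrite detection described above amounts to finding job numbers that occur more than once. The following is a minimal sketch of that idea only: it assumes the log has already been parsed into (minimum index, job number) pairs in launch order, it ignores batch boundaries, and the function name is hypothetical. It does not reproduce the actual format of pathsample_checkmin.out or the logic of duplicates.py.

```python
from collections import defaultdict


def find_overwritten(jobs):
    """Return the indices of minima whose job number was reused later.

    jobs: list of (minimum_index, job_number) pairs in launch order.
    Assumption: every job except the last one holding a given number
    has had its output overwritten.
    """
    by_number = defaultdict(list)
    for min_index, job_number in jobs:
        by_number[job_number].append(min_index)

    overwritten = []
    for indices in by_number.values():
        overwritten.extend(indices[:-1])  # all but the final holder
    return sorted(overwritten)


# Hypothetical data: minima 2 and 5 were both assigned job number 4711.
example_jobs = [(1, 1032), (2, 4711), (3, 88), (5, 4711)]
overwritten_minima = find_overwritten(example_jobs)
```

The resulting list plays the role of checkminfile: one entry per minimum that needs to be reoptimised again.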

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs; it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also, required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
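The ordering step can be pictured as follows. This is only a sketch under stated assumptions: it assumes that the suffix of each min.data.info.**** file is the job number recorded in pathsample_checkmin.out, and that a mapping from job number to original minimum index has already been extracted from that log. The function name and the mapping are hypothetical; organise_mindatainfo_min.sh may well work differently.

```python
def order_info_files(job_to_minimum):
    """Return min.data.info.* file names ordered by original minimum index.

    job_to_minimum: dict mapping job number (assumed to be the file
    suffix) to the index of the minimum in the original database.
    """
    ordered = sorted(job_to_minimum.items(), key=lambda item: item[1])
    return ["min.data.info.%d" % job for job, _ in ordered]


# Hypothetical mapping: job 1032 reoptimised minimum 1, and so on.
example_mapping = {4711: 2, 88: 3, 1032: 1}
ordered_files = order_info_files(example_mapping)
```

Concatenating the files in the returned order would give a min.data.info.total file whose entries follow the ordering of the original pathway.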

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readts''').

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem: all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
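The renaming step is simple enough to do by hand, but it is easy to forget one of the two files. A small helper along these lines could be used; the function name is hypothetical and it only touches files that exist in the given directory.

```python
from pathlib import Path


def rename_readmin_output(directory):
    """Rename READMIN output in `directory` so it reads as a TS database.

    Returns the (old, new) name pairs that were actually renamed.
    Note: an existing points.ts or ts.data in the directory is replaced.
    """
    directory = Path(directory)
    renamed = []
    for old, new in (("points.min", "points.ts"), ("min.data", "ts.data")):
        source = directory / old
        if source.exists():
            source.replace(directory / new)
            renamed.append((old, new))
    return renamed
```

Calling `rename_readmin_output("readts")` after the READMIN run would leave points.ts and ts.data in readts, ready to be copied to all_opt_TSs.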

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in '''/svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs''', are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves each one to a subdirectory indexed by its position in ts.data, and then performs the descent calculations to determine the two minima directly connected to that TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. All of the TSs and their associated minima are then added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the minima obtained in the readmin directory have not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile:

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.
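To illustrate what "identifying the gaps" means, here is a rough sketch. It is not the logic of generate_connectfile.py, which is not shown here. It assumes that the fourth and fifth columns of each ts.data line hold the indices of the two minima connected by that TS, that the TSs appear in pathway order, and that a gap exists wherever two consecutive TSs share no minimum. Pairing the second minimum of one TS with the first minimum of the next is an arbitrary illustrative choice, and the function names are hypothetical.

```python
def parse_ts_minima(ts_data_text):
    """Read the two connected minima (columns 4 and 5) from ts.data lines."""
    pairs = []
    for line in ts_data_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            pairs.append((int(fields[3]), int(fields[4])))
    return pairs


def find_gaps(pairs):
    """Return pairs of minima to connect where consecutive TSs do not touch."""
    gaps = []
    for previous, following in zip(pairs, pairs[1:]):
        if not set(previous) & set(following):
            gaps.append((previous[1], following[0]))
    return gaps


# Hypothetical three-line ts.data: TS 2 and TS 3 share no minimum.
example_ts_data = """\
-15100.0 0.0 1 1 2 1.0 1.0 1.0
-15090.0 0.0 1 2 3 1.0 1.0 1.0
-15080.0 0.0 1 4 5 1.0 1.0 1.0
"""
example_gaps = find_gaps(parse_ts_minima(example_ts_data))
```

Each pair returned would correspond to one line of connectfile, i.e. one connection attempt.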

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the number of connection attempts that have been specified. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, this method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
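The scheme described above is essentially a shortest-first joining of connected components, and can be sketched with a disjoint-set structure. This is an illustration only, not the code in connect_sub_databases.sh. It assumes the existing connections are given as pairs of minima, that distances between candidate minima are already known, and (optimistically) that every attempted connection succeeds. All names are hypothetical.

```python
class DisjointSet:
    """Tracks which sub-database each minimum currently belongs to."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        """Join the groups of a and b; return False if already joined."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        self.parent[root_b] = root_a
        return True


def plan_connections(minima, ts_pairs, distances):
    """List the connection attempts to make, shortest distance first.

    minima:    iterable of minimum indices
    ts_pairs:  (min1, min2) pairs already connected by a transition state
    distances: dict mapping (a, b) to the distance between minima a and b
    Pairs whose sub-databases are already joined are skipped.
    """
    groups = DisjointSet(minima)
    for a, b in ts_pairs:
        groups.union(a, b)

    attempts = []
    for (a, b), _ in sorted(distances.items(), key=lambda item: item[1]):
        if groups.union(a, b):  # assumes the attempt succeeds
            attempts.append((a, b))
    return attempts


# Hypothetical database: {1, 2} and {3, 4} are connected, 5 is isolated.
example_attempts = plan_connections(
    minima=[1, 2, 3, 4, 5],
    ts_pairs=[(1, 2), (3, 4)],
    distances={(2, 3): 5.0, (1, 4): 2.0, (4, 5): 3.0, (1, 2): 0.5},
)
```

In this example the long (2, 3) attempt is never made, because the shorter (1, 4) connection has already joined the same two sub-databases.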

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
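The selection rule in the paragraph above (keep sub-databases at or above the cutoff, plus any that contain an end point of the pathway) can be written down as a short sketch. It assumes the contents of the connected_* files have already been read into lists of minimum indices; the function name and the example data are hypothetical.

```python
def select_sub_databases(groups, cutoff=10, endpoints=()):
    """Return the names of the sub-databases worth keeping.

    groups:    dict mapping a connected_* file name to its minima indices
    cutoff:    minimum number of minima for a sub-database to be kept
    endpoints: minima that must be retained regardless of group size
    """
    keep = []
    for name, members in groups.items():
        has_endpoint = any(minimum in members for minimum in endpoints)
        if len(members) >= cutoff or has_endpoint:
            keep.append(name)
    return sorted(keep)


# Hypothetical groups: connected_3 is tiny but holds end point 30.
example_groups = {
    "connected_1": list(range(1, 13)),
    "connected_2": [20, 21],
    "connected_3": [30],
}
kept = select_sub_databases(example_groups, cutoff=10, endpoints=(1, 30))
```

The names returned are the files that would be copied to relevant_connected_* files.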

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is added to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
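Editing pathdata by hand for every subdirectory gets tedious when there are many of them. A sketch of how the MERGEDB line could be swapped programmatically is given below; it only manipulates text, does not run PATHSAMPLE, and the function name and the second subdirectory name are hypothetical.

```python
def pathdata_for_merge(template, subdirectory):
    """Return pathdata text with its MERGEDB line pointing at `subdirectory`."""
    lines = []
    for line in template.splitlines():
        if line.split()[:1] == ["MERGEDB"]:
            line = "MERGEDB ./%s" % subdirectory
        lines.append(line)
    return "\n".join(lines) + "\n"


# Shortened, hypothetical template based on the pathdata file above.
example_template = "NATOMS 5464\nMERGEDB ./00025_00102\nAMBER12\n"
next_pathdata = pathdata_for_merge(example_template, "00031_00250")
```

Writing the returned text to pathdata and running PATHSAMPLE once per subdirectory would merge each set of connections in turn.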

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

However, it must be noted that this new pathway is probably not optimal. Some barriers are likely to be awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should therefore be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, to keep the resulting data separate from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:24:05Z

Adk44: /* Execution */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points will reoptimise after mutation (there could be steric clashes, etc.). Thus, there will be gaps in our new, mutated pathway, hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so they will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have had a problem with it before (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, 14 were found to have been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
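The duplicate-detection idea can be sketched in a few lines of Python. This is only an illustration of the logic, not the contents of duplicates.py: the function name <code>find_overwritten</code> is hypothetical, and the (index, random number) pairs are assumed to have already been parsed from the log for a single PATHSAMPLE batch, since the log format is not reproduced here.

```python
from collections import defaultdict


def find_overwritten(jobs):
    """Return the indices of stationary points whose output was overwritten.

    `jobs` is a list of (stationary_point_index, random_number) pairs in
    launch order, assumed to belong to one PATHSAMPLE batch.
    """
    by_number = defaultdict(list)
    for index, number in jobs:
        by_number[number].append(index)

    overwritten = []
    for indices in by_number.values():
        # Only the last job launched with a given number keeps its output;
        # every earlier job sharing that number has been overwritten.
        overwritten.extend(indices[:-1])
    return sorted(overwritten)


# Made-up example: jobs 1, 3 and 6 share a number, as do jobs 2 and 5.
jobs = [(1, 48213), (2, 10977), (3, 48213), (4, 70001), (5, 10977), (6, 48213)]
print(find_overwritten(jobs))
```

The returned indices play the role of the entries in checkminfile, i.e. the minima that need to be reoptimised.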

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
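For orientation, the ordered concatenation can be sketched as follows. This is a minimal illustration rather than the contents of organise_mindatainfo_min.sh: it assumes the files are named min.data.info.N with N an integer index, and it does not handle the extra files produced by the overwritten reoptimisations, which the real script has to slot in. The function name <code>concatenate_in_order</code> is hypothetical.

```python
import re
import tempfile
from pathlib import Path


def concatenate_in_order(directory, output_name="min.data.info.total"):
    """Concatenate min.data.info.N files in increasing N; return the order used."""
    directory = Path(directory)
    pattern = re.compile(r"^min\.data\.info\.(\d+)$")

    numbered = []
    for path in directory.iterdir():
        match = pattern.match(path.name)
        if match:  # the output file itself never matches (no integer suffix)
            numbered.append((int(match.group(1)), path))
    # Sort numerically, so that 10 comes after 2 (a plain text sort would not).
    numbered.sort(key=lambda item: item[0])

    with open(directory / output_name, "w") as out:
        for _, path in numbered:
            out.write(path.read_text())
    return [number for number, _ in numbered]


# Small self-contained demonstration with dummy file contents.
with tempfile.TemporaryDirectory() as tmp:
    for n in (10, 2, 1):
        (Path(tmp) / f"min.data.info.{n}").write_text(f"block {n}\n")
    order = concatenate_in_order(tmp)
    total = (Path(tmp) / "min.data.info.total").read_text()
print(order)
print(total)
```

Sorting on the integer suffix is what keeps the stationary points in the same order as in the original, non-mutated pathway.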

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readts''').

In pathdata, stick with READMIN (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
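The renaming step is trivial to do by hand, but if it is scripted it is worth guarding against clobbering an existing ts.data or points.ts. A minimal sketch (the function name <code>rename_readmin_output</code> is hypothetical, and the demonstration uses a temporary directory with dummy files):

```python
import tempfile
from pathlib import Path


def rename_readmin_output(directory):
    """Rename the READMIN output so that it is treated as a TS database."""
    directory = Path(directory)
    renames = (("points.min", "points.ts"), ("min.data", "ts.data"))
    # Check everything first so that a failure leaves the directory untouched.
    for old, new in renames:
        if not (directory / old).exists():
            raise FileNotFoundError(old)
        if (directory / new).exists():
            raise FileExistsError(new)
    for old, new in renames:
        (directory / old).rename(directory / new)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("points.min", "min.data"):
        (Path(tmp) / name).write_text("dummy\n")
    rename_readmin_output(tmp)
    print(sorted(p.name for p in Path(tmp).iterdir()))
```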

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in '''/svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs''', are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the minima obtained in the readmin directory have not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile:

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.
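To give a feel for what "identifying the gaps" means, here is a rough sketch. It is not the logic of generate_connectfile.py, which is not reproduced in this article. It assumes that the transition states are listed in pathway order, that each is described by the pair of minima it connects, and that a gap exists wherever two consecutive transition states share no minimum. The choice of which two minima to pair up across a gap is also just an assumption made for the illustration, as is the function name <code>find_gaps</code>.

```python
def find_gaps(ts_minima):
    """Return pairs of minima that would need connecting to close each gap.

    `ts_minima` is a list of (minimum_a, minimum_b) tuples, one per
    transition state, in pathway order.
    """
    gaps = []
    for (a1, b1), (a2, b2) in zip(ts_minima, ts_minima[1:]):
        if {a1, b1} & {a2, b2}:
            continue  # consecutive TSs share a minimum: no gap here
        # Assumption: pair the second minimum of the earlier TS with the
        # first minimum of the later TS.
        gaps.append((b1, a2))
    return gaps


# Made-up example: gaps between minima 3 and 5, and between 7 and 9.
ts_minima = [(1, 2), (2, 3), (5, 6), (6, 7), (9, 10)]
for pair in find_gaps(ts_minima):
    print(*pair)
```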

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
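A quick way to obtain the number of connection attempts is to count the non-empty lines of connectfile. The sketch below works on the file contents as a string; the function name <code>count_connection_attempts</code> and the example contents are made up for illustration.

```python
def count_connection_attempts(connectfile_text):
    """Count the non-blank lines, i.e. the connection attempts requested."""
    return sum(1 for line in connectfile_text.splitlines() if line.strip())


# Made-up contents: three pairs of minima and a stray blank line.
example = "3 5\n7 9\n\n12 40\n"
print(count_connection_attempts(example))
```

For a real run, the text would be read from disk, e.g. with <code>open("connectfile").read()</code>, and the result used to choose the CYCLES value.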

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the number of connection attempts that has been specified. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
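The ordering-and-skipping logic described above can be sketched with a small union-find structure. This is a toy illustration under stated assumptions, not the code behind connect_sub_databases.sh: each minimum is represented by a short coordinate tuple and the distance is a plain Euclidean one, whereas the real procedure works with the distances between full structures. The function name and the <code>attempt</code> callback (which stands in for a connection run and returns True on success) are hypothetical.

```python
import math
from itertools import combinations


def connect_sub_databases(sub_databases, coords, attempt):
    """Try to join sub-databases, shortest inter-group distances first.

    sub_databases: list of lists of minima indices (one list per group)
    coords:        dict mapping minimum index -> coordinate tuple (toy data)
    attempt:       callable(a, b) -> bool, True if the connection succeeded
    Returns the list of pairs for which a connection was actually attempted.
    """
    group_of = {m: g for g, members in enumerate(sub_databases) for m in members}
    parent = list(range(len(sub_databases)))

    def find(g):
        # Follow parent links to the representative of the merged group.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    # All pairs of minima from different sub-databases, shortest first.
    candidates = sorted(
        (math.dist(coords[a], coords[b]), a, b)
        for a, b in combinations(sorted(group_of), 2)
        if group_of[a] != group_of[b]
    )

    tried = []
    for _, a, b in candidates:
        root_a, root_b = find(group_of[a]), find(group_of[b])
        if root_a == root_b:
            continue  # these two groups are already joined: skip the pair
        tried.append((a, b))
        if attempt(a, b):
            parent[root_a] = root_b
    return tried


# Toy example: three sub-databases laid out along a line.
sub_databases = [[1, 2], [3, 4], [5]]
coords = {1: (0.0,), 2: (1.0,), 3: (1.5,), 4: (4.0,), 5: (4.2,)}
print(connect_sub_databases(sub_databases, coords, lambda a, b: True))
```

In the toy example only two attempts are needed, (4, 5) and then (2, 3), after which every remaining pair is skipped because all three sub-databases have been joined.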

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
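The selection rule (keep a sub-database if it meets the cutoff, or if it contains one of the two end-point minima) can be sketched as below. The groups are held in memory as a dictionary, since the exact layout of the '''connected_*''' files is not reproduced here; the function name <code>select_relevant</code> and the example data are made up.

```python
def select_relevant(groups, endpoints, cutoff=10):
    """Return the names of the sub-databases worth keeping.

    groups:    dict mapping a name (e.g. "connected_1") to its minima indices
    endpoints: the minima at the two ends of the pathway of interest
    cutoff:    minimum number of minima for a sub-database to be kept
    """
    keep = []
    for name, members in groups.items():
        big_enough = len(members) >= cutoff
        holds_endpoint = any(e in members for e in endpoints)
        if big_enough or holds_endpoint:
            keep.append(name)
    return keep


groups = {
    "connected_1": list(range(1, 41)),   # 40 minima, contains end point 1
    "connected_2": [41, 42, 43],         # too small, no end point
    "connected_3": list(range(44, 56)),  # 12 minima
    "connected_4": [56],                 # tiny, but contains end point 56
}
print(select_relevant(groups, endpoints=(1, 56)))
```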

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
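Editing pathdata by hand for every subdirectory gets tedious, so the MERGEDB line can be rewritten programmatically. The sketch below only builds the text of each pathdata file; running PATHSAMPLE after each rewrite is left out. It assumes the subdirectories follow the five-digit_five-digit naming of the example above, and both function names are hypothetical.

```python
import re


def is_pair_directory(name):
    """True for names such as 00025_00102 (pattern assumed from the example)."""
    return re.fullmatch(r"\d{5}_\d{5}", name) is not None


def pathdata_for(template, subdirectory):
    """Return the pathdata text with its MERGEDB line pointing at subdirectory."""
    lines = []
    for line in template.splitlines():
        if line.split()[:1] == ["MERGEDB"]:
            line = f"MERGEDB ./{subdirectory}"
        lines.append(line)
    return "\n".join(lines) + "\n"


# Abbreviated template; a real one would carry the full keyword list.
template = "NATOMS 5464\nNOFRQS\n\nMERGEDB ./00025_00102\n\nAMBER12\n"
names = ["00025_00102", "00031_00477", "startfrompath"]

for name in filter(is_pair_directory, names):
    print(pathdata_for(template, name))
```

Each generated text would be written to pathdata in relevant_connected, followed by one PATHSAMPLE run, before moving on to the next subdirectory.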

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:23:14Z

Adk44: /* Files Required */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points will reoptimise after mutation (there could be steric clashes, etc.). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so they will need to be adapted if the directories are named or organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have had a problem with it before (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, 14 were found to have been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readts''').

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
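The renaming step is simple enough to do by hand, but a small guard against overwriting an existing ts.data can be worthwhile. The following is a minimal Python 3 sketch; the function name is hypothetical and it assumes it is run on the readts directory after READMIN has finished.

```python
import os


def rename_readmin_output_as_ts(directory="."):
    """Rename points.min -> points.ts and min.data -> ts.data in directory."""
    renames = {"points.min": "points.ts", "min.data": "ts.data"}
    # Check everything first so that we never end up with a half-renamed pair.
    for old, new in renames.items():
        if not os.path.isfile(os.path.join(directory, old)):
            raise FileNotFoundError(old + " not found in " + directory)
        if os.path.exists(os.path.join(directory, new)):
            raise FileExistsError(new + " already exists in " + directory)
    for old, new in renames.items():
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```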

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in '''/svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs''', are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we no longer have just a database of transition states, but one of transition states and their associated minima. Note that the minima obtained in the readmin directory have not yet been added; this will come later.

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile''':

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
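To get the number of connections being attempted, the lines in connectfile can simply be counted. A minimal Python 3 sketch follows; the function name is hypothetical, and it assumes that each non-blank line of connectfile describes one connection attempt. How this count translates into the CYCLES argument for your particular setup is left for you to judge.

```python
def count_connection_attempts(connectfile_path):
    """Return the number of non-blank lines in connectfile."""
    with open(connectfile_path) as handle:
        return sum(1 for line in handle if line.strip())
```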

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the given number of connection attempts that have been specified. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, this method then compares each and every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
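The ordering logic described above can be sketched in a few lines of Python 3. This is an illustration of the idea only, not the code in connect_sub_databases.sh: the function names are hypothetical, a plain Euclidean distance stands in for whatever distance measure the real scripts use, and try_connect is a placeholder for launching an actual connection attempt. The sketch also treats connectivity as transitive (if A joins B and B joins C, attempts between A and C are skipped), which is an assumption on my part.

```python
import math
from itertools import combinations


def find(parent, item):
    """Return the representative of item's group (union-find with path halving)."""
    while parent[item] != item:
        parent[item] = parent[parent[item]]
        item = parent[item]
    return item


def plan_connections(sub_databases, coords, try_connect):
    """Attempt connections between sub-databases, shortest distance first.

    sub_databases: list of lists of minimum indices.
    coords: dict mapping minimum index -> coordinate tuple.
    try_connect: callable (a, b) -> bool, True if the attempt succeeded.
    Returns the list of (a, b) pairs actually attempted, in order.
    """
    parent = list(range(len(sub_databases)))
    candidates = []
    for (i, group_i), (j, group_j) in combinations(enumerate(sub_databases), 2):
        for a in group_i:
            for b in group_j:
                candidates.append((math.dist(coords[a], coords[b]), a, b, i, j))
    candidates.sort()  # shortest gaps first

    attempted = []
    for _, a, b, i, j in candidates:
        if find(parent, i) == find(parent, j):
            continue  # these sub-databases are already joined; skip
        attempted.append((a, b))
        if try_connect(a, b):
            parent[find(parent, i)] = find(parent, j)
    return attempted
```

With three sub-databases, a successful attempt between the closest pair removes every other candidate between those two groups from consideration, which is exactly what saves the long, unpromising connection attempts.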

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
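The selection of sub-databases can be sketched as below. This is a hedged Python 3 illustration rather than part of find_connections.sh: the function name is hypothetical, and it assumes that each '''connected_*''' file is a plain list of minimum indices separated by whitespace. Files at or above the cutoff, or containing one of the pathway end-point minima, are copied to '''relevant_connected_*'''.

```python
import shutil
from pathlib import Path


def promote_relevant(directory, cutoff=10, endpoints=()):
    """Copy qualifying connected_* files to relevant_connected_* files.

    A file qualifies if it lists at least `cutoff` minima, or if it
    contains any of the end-point minima given in `endpoints`.
    Returns the names of the files created.
    """
    directory = Path(directory)
    endpoints = {str(e) for e in endpoints}
    promoted = []
    for path in sorted(directory.glob("connected_*")):
        minima = path.read_text().split()
        if len(minima) >= cutoff or endpoints.intersection(minima):
            target = directory / ("relevant_" + path.name)
            shutil.copyfile(path, target)
            promoted.append(target.name)
    return promoted
```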

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
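Editing pathdata by hand for every subdirectory gets tedious, so the MERGEDB line can be generated programmatically. The following Python 3 sketch is an assumption-laden illustration: the function name is hypothetical, and it assumes that all of the subdirectories follow the five-digit_five-digit naming of the example above and that the template pathdata already contains a single MERGEDB line. Writing each variant to pathdata and running PATHSAMPLE in between is left out.

```python
import re
from pathlib import Path


def pathdata_variants(template_text, base_directory):
    """Return [(subdirectory, pathdata_text), ...], one per NNNNN_NNNNN subdirectory.

    In each variant, the argument of the MERGEDB line in template_text is
    replaced by the relative path to that subdirectory.
    """
    name_pattern = re.compile(r"^\d{5}_\d{5}$")
    names = sorted(
        p.name
        for p in Path(base_directory).iterdir()
        if p.is_dir() and name_pattern.match(p.name)
    )
    variants = []
    for name in names:
        text = re.sub(
            r"(?m)^MERGEDB\s+\S+$",
            lambda match, name=name: "MERGEDB ./" + name,
            template_text,
        )
        variants.append((name, text))
    return variants
```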

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:21:41Z

Adk44: /* Note on readts */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown in the diagram below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job gets overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into this problem, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose files were overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, it was found that 14 of these had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
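The core of the duplicate search can be sketched as follows. This is a Python 3 illustration of the idea, not the contents of duplicates.py: the function name is hypothetical, and because the layout of pathsample_checkmin.out is not reproduced here, the sketch assumes that the log has already been parsed into (stationary point index, job number) pairs in launch order.

```python
from collections import defaultdict


def find_overwritten(assignments):
    """Return the indices whose output was overwritten by a later job.

    assignments: iterable of (stationary_point_index, job_number) pairs,
    in the order in which the OPTIM jobs were launched. When several jobs
    share a job number, only the last one keeps its output.
    """
    by_number = defaultdict(list)
    for index, number in assignments:
        by_number[number].append(index)
    overwritten = []
    for indices in by_number.values():
        overwritten.extend(indices[:-1])  # all but the last were overwritten
    return sorted(overwritten)
```

For example, if minima 1, 3 and 6 were all assigned job number 500, then minima 1 and 3 need to be reoptimised, while minimum 6 does not.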

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points to our new mutated system/homologue (bearing in mind that not all will have converged) as well as reoptimised any overwritten ones, we now need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readts''').

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we no longer have just a database of transition states, but one of transition states and their associated minima. Note that the minima obtained in the readmin directory have not yet been added; this will come later.

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile''':

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of the whole procedure. We now run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
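As a quick check, the value to give CYCLES can be read off connectfile itself. The following is only a minimal sketch in Python 3, assuming that connectfile holds one pair of minima per line and that blank lines should be ignored. The helper name count_connection_pairs is hypothetical and is not one of the svn scripts.

```python
def count_connection_pairs(text):
    """Count the non-blank lines in the contents of a connectfile."""
    return sum(1 for line in text.splitlines() if line.strip())


# Illustrative contents only; a real connectfile comes from det_connections.sh.
example = "12 13\n13 27\n\n27 40\n"
print("CYCLES", count_connection_pairs(example))
```

For a real run, pass in open("connectfile").read() instead of the example string.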

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the number of connection attempts specified. This may or may not happen with your system. If it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful. In short, all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, and none of its minima is connected to any minimum in another sub-database. Once the minima have been grouped, the method compares every minimum in each sub-database with every minimum in all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ "$STATUS" -eq 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases, and usually only consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For the sub-databases that pass the cutoff, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
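The selection rule just described can be sketched as follows. This is an illustration in Python 3 and not what find_connections.sh actually does: the function name select_relevant, the dictionary of groups and the endpoint indices are all assumptions made for the example.

```python
def select_relevant(groups, endpoints, cutoff=10):
    """Keep groups with at least `cutoff` minima, plus any holding an endpoint.

    groups maps a connected_* file name to the list of minima it contains;
    endpoints is the set holding the first and last minima of the pathway.
    """
    keep = []
    for name, minima in groups.items():
        if len(minima) >= cutoff or endpoints & set(minima):
            keep.append(name)
    return sorted(keep)


# Hypothetical sub-databases: one large, one small, one holding an endpoint.
groups = {
    "connected_1": list(range(1, 13)),
    "connected_2": [20, 21],
    "connected_3": [30],
}
print(select_relevant(groups, endpoints={1, 30}))
```

Here connected_3 is kept despite holding a single minimum, because that minimum is one of the two ends of the pathway.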

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. To do this, the MERGEDB keyword is added to pathdata within the relevant_connected directory. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
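Editing pathdata by hand for every subdirectory gets tedious, so the substitution could be scripted. A minimal sketch in Python 3, assuming pathdata contains exactly one MERGEDB line; the function name pathdata_for and the second subdirectory name are hypothetical.

```python
import re


def pathdata_for(template, subdir):
    """Return the pathdata text with the MERGEDB argument set to ./subdir."""
    pattern = re.compile(r"^MERGEDB[ \t]+\S+$", flags=re.MULTILINE)
    return pattern.sub(lambda match: "MERGEDB ./" + subdir, template)


# Shortened pathdata for illustration; the full keyword list is given above.
template = "NATOMS 5464\nMERGEDB ./00025_00102\nAMBER12\n"
print(pathdata_for(template, "00031_00077"))
```

The returned text would be written to pathdata before each PATHSAMPLE run, one run per subdirectory.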

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:21:03Z

Adk44: /* Files Required */

== Introduction ==

This is a recommended procedure to follow after using [[CHECKSPMUTATE]] when it is a pathway that was reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points will reoptimise after mutation (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note that the method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in the checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]], with each job being assigned a random number, it is possible for two or more [[OPTIM]] jobs to be assigned the same random number within the same [[PATHSAMPLE]] batch. When this happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, so my fix is to reoptimise the stationary points whose output files were overwritten. Typically, their number is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), 14 of the 1235 minima which were reoptimised had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
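The idea behind this check can be sketched as below. This is not the code in duplicates.py, and the parsing of pathsample_checkmin.out is deliberately left out because the log format is not reproduced here. The sketch (Python 3) assumes you already have, for one batch, a list of (minimum index, job number) pairs in launch order; the function name overwritten_minima and the numbers in the example are made up.

```python
def overwritten_minima(jobs):
    """Return the minima whose output was overwritten within one batch.

    jobs is a list of (minimum_index, job_number) tuples in launch order.
    A minimum counts as overwritten when a later job in the same batch was
    given the same job number.
    """
    last_owner = {}
    lost = []
    for minimum, number in jobs:
        if number in last_owner:
            lost.append(last_owner[number])
        last_owner[number] = minimum
    return lost


# Hypothetical batch: minima 1 and 3 were both given job number 4821.
batch = [(1, 4821), (2, 1177), (3, 4821), (4, 9032)]
print(overwritten_minima(batch))
```

Only the earlier minimum of each clash is reported, since the later job's output is the one that survives.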

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and pre_sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a file of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs; it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in '''/svn/SCRIPTS/CHECKSPMUTATE/readmin'''.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem: all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
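The renaming can of course be done by hand, but if you prefer to script it, a minimal sketch follows. The function name rename_readmin_output is hypothetical. Note that os.rename silently replaces an existing points.ts or ts.data on Linux, so only run it in the readts directory.

```python
import os


def rename_readmin_output(directory):
    """Rename the READMIN output files so they describe transition states."""
    for old, new in (("points.min", "points.ts"), ("min.data", "ts.data")):
        os.rename(os.path.join(directory, old), os.path.join(directory, new))


# Usage, from the directory containing readts:
# rename_readmin_output("readts")
```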

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves each one to a subdirectory named after its index in ts.data, and then performs the descent calculations to determine the two minima directly connected to that TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory and execute addpath.sh. All of the TSs and their associated minima will now be added to the database in the startfrompath directory. We therefore no longer have just a database of transition states, but one of transition states and their associated minima (note that the minima obtained in the readmin directory have not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are also required, and can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile:

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.
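To give a feel for what "identifying the gaps" means, here is a rough sketch in Python 3. It is not the code in generate_connectfile.py. It assumes that the fourth and fifth columns of each ts.data line hold the indices of the two minima that the TS connects, that the TSs are listed in pathway order, and that each line of connectfile holds one pair of minima. The choice of which minimum to pair with which is arbitrary here, and the real scripts may choose differently. The function name find_gaps is hypothetical.

```python
def find_gaps(ts_data_text):
    """Return (min_a, min_b) pairs for consecutive TSs that share no minimum."""
    ends = []
    for line in ts_data_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            # Columns 4 and 5 are assumed to hold the two connected minima.
            ends.append((int(fields[3]), int(fields[4])))
    gaps = []
    for (a1, b1), (a2, b2) in zip(ends, ends[1:]):
        if {a1, b1} & {a2, b2}:
            continue  # consecutive TSs already share a minimum
        gaps.append((b1, a2))
    return gaps


# Made-up ts.data lines: TS 1 and TS 2 share minimum 2, TS 2 and TS 3 do not.
example = (
    "-15000.1 0.0 1 1 2 1.0 1.0 1.0\n"
    "-15001.2 0.0 1 2 3 1.0 1.0 1.0\n"
    "-15002.3 0.0 1 5 6 1.0 1.0 1.0\n"
)
for min_a, min_b in find_gaps(example):
    print(min_a, min_b)
```

Each printed pair is one gap, i.e. one connection that would need to be attempted.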

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of the whole procedure. We now run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the number of connection attempts specified. This may or may not happen with your system. If it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful. In short, all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, and none of its minima is connected to any minimum in another sub-database. Once the minima have been grouped, the method compares every minimum in each sub-database with every minimum in all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
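The bookkeeping described above amounts to a union-find over the minima. The following is a minimal sketch in Python 3 and not the implementation in connect_sub_databases.sh. The names group_minima and plan_attempts, the distances dictionary and the try_connect callback (a stand-in for a real [[OPTIM]] connection run) are all assumptions made for illustration.

```python
def group_minima(n_minima, ts_pairs):
    """Union-find grouping of minima 1..n_minima joined by transition states."""
    parent = list(range(n_minima + 1))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in ts_pairs:
        parent[find(a)] = find(b)
    return parent, find


def plan_attempts(n_minima, ts_pairs, distances, try_connect):
    """Attempt connections between sub-databases, shortest distance first.

    distances maps (min_a, min_b) to a distance; try_connect(a, b) returns
    True if the connection attempt succeeded.
    """
    parent, find = group_minima(n_minima, ts_pairs)
    attempted = []
    for (a, b), _dist in sorted(distances.items(), key=lambda kv: kv[1]):
        if find(a) == find(b):
            continue  # these two sub-databases are already joined
        attempted.append((a, b))
        if try_connect(a, b):
            parent[find(a)] = find(b)
    return attempted


# Three hypothetical sub-databases: {1, 2}, {3, 4} and {5, 6}.
ts_pairs = [(1, 2), (3, 4), (5, 6)]
distances = {(2, 3): 1.5, (1, 4): 4.0, (4, 5): 2.0, (2, 6): 9.0}
print(plan_attempts(6, ts_pairs, distances, lambda a, b: True))
```

With every attempt succeeding, only the two shortest pairs are tried; the longer pairs are skipped because their sub-databases have already been joined by then.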

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall one in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:19:40Z

Adk44: /* Note on checkts */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the former job gets overwritten by that of the latter. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have had a problem with it before (I can only assume that nobody else has run into this problem, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, it was found that 14 of these had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also require duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
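The idea behind this step can be illustrated with a minimal Python sketch. This is not the code in duplicates.py; it assumes that the (job index, random number) pairs have already been parsed out of pathsample_checkmin.out (the parsing is not shown, as it depends on the log format) and that all of the pairs belong to a single batch. The function name find_overwritten is hypothetical.

```python
from collections import defaultdict


def find_overwritten(assignments):
    """Return the job indices whose output was overwritten by a later job.

    `assignments` is a list of (job_index, random_number) pairs in launch
    order. Assumption: the pairs were parsed from the PATHSAMPLE log
    beforehand and all come from the same batch.
    """
    jobs_by_number = defaultdict(list)
    for job_index, random_number in assignments:
        jobs_by_number[random_number].append(job_index)

    overwritten = []
    for jobs in jobs_by_number.values():
        # Every job except the last one to use this number has lost its output.
        overwritten.extend(jobs[:-1])
    return sorted(overwritten)
```

The returned list plays the role of checkminfile: each entry is a minimum that needs to be reoptimised separately.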

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs''' are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points to our new mutated system/homologue (bearing in mind that not all will have converged) as well as reoptimised any overwritten ones, we now need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.
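The ordering described above can be sketched in Python as follows. This is only an illustration of the bookkeeping, not the contents of the organise script. It assumes we already have two mappings from the original index of each minimum to the file holding its reoptimised structure: one from the main run and one from the separate reoptimisations of the overwritten minima. The function name and the file names in the usage note are hypothetical.

```python
def ordered_info_files(main_run_files, rerun_files):
    """Return info file names in the order of the original minima indices.

    Both arguments map an original minimum index to a file name
    (assumption: these mappings were built beforehand from the run logs).
    Entries in `rerun_files` take precedence, since they replace output
    that was overwritten in the main run. Minima that never converged are
    simply absent from both mappings.
    """
    merged = dict(main_run_files)
    merged.update(rerun_files)
    return [merged[index] for index in sorted(merged)]
```

For example, ordered_info_files({1: "min.data.info.4821", 3: "min.data.info.1077"}, {2: "2/min.data.info.9000"}) lists the file for minimum 1 first, then the rerun file for minimum 2, then the file for minimum 3. Concatenating the files in that order would give a min.data.info.total that follows the original pathway.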

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also, required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves each one to a subdirectory whose index corresponds to where that TS appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to it. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima. Note that the minima obtained in the readmin directory have not yet been added; this will come later.

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.
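To make the notion of a 'gap' concrete, here is a small Python sketch. It is not what det_connections.sh or generate_connectfile.py actually do. It assumes that the TSs are listed in pathway order, that each TS is described by the pair of minima it connects, and that a gap exists wherever two consecutive TSs share no minimum. The choice of which two minima to pair up across a gap is purely illustrative.

```python
def find_gaps(ts_connections):
    """Return pairs of minima that would need connecting to close the gaps.

    `ts_connections` is a list of (min_a, min_b) pairs, one per transition
    state, in pathway order (assumption). Two consecutive TSs are treated
    as joined if they have a minimum in common.
    """
    gaps = []
    for (a1, b1), (a2, b2) in zip(ts_connections, ts_connections[1:]):
        if {a1, b1} & {a2, b2}:
            continue  # the two TSs share a minimum, so there is no gap here
        # Illustrative choice: pair the second minimum of the earlier TS
        # with the first minimum of the later TS.
        gaps.append((b1, a2))
    return gaps
```

Each returned pair would correspond to one connection to be attempted, in the spirit of a line in connectfile.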

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. This may or may not happen with your system. If it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming the problem. In short, all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
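The scheme can be summarised with a short Python sketch. It is not the code in connect_sub_databases.sh. It assumes the distances between minima in different sub-databases have already been computed, and, for simplicity, that every connection attempt succeeds (in practice a failed attempt leaves the two sub-databases separate, and the next-shortest pair is tried). The function name plan_connections is hypothetical.

```python
def plan_connections(sub_database_of, distances):
    """Return the connection attempts to make, shortest distance first.

    `sub_database_of` maps each minimum index to its sub-database label.
    `distances` is a list of (distance, min_i, min_j) tuples for minima in
    different sub-databases. Assumption: every attempt succeeds.
    """
    # Union-find over sub-database labels keeps track of which
    # sub-databases have already been joined together.
    parent = {label: label for label in set(sub_database_of.values())}

    def find(label):
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path compression
            label = parent[label]
        return label

    attempts = []
    for distance, min_i, min_j in sorted(distances):
        root_i = find(sub_database_of[min_i])
        root_j = find(sub_database_of[min_j])
        if root_i == root_j:
            continue  # already connected, so this attempt is skipped
        attempts.append((min_i, min_j))
        parent[root_j] = root_i
    return attempts
```

With three sub-databases, at most two successful attempts are needed, and they are always the shortest available pairs that join sub-databases not yet connected.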

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
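The selection rule in this paragraph can be written down as a short Python sketch. It is an illustration only, not part of find_connections.sh. It assumes the contents of each '''connected_*''' file have already been read into a list of minima indices; the function name and the numbered file names in the usage note are hypothetical.

```python
def select_relevant(sub_databases, endpoints, cutoff=10):
    """Return the names of the sub-databases worth keeping.

    `sub_databases` maps a file name to the list of minima it contains.
    A sub-database is kept if it has at least `cutoff` minima, or if it
    contains one of the pathway `endpoints`, whatever its size.
    """
    relevant = []
    for name, minima in sorted(sub_databases.items()):
        if len(minima) >= cutoff or set(minima) & set(endpoints):
            relevant.append(name)
    return relevant
```

For instance, with a 12-member connected_1, a 2-member connected_2 and a 3-member connected_3 that contains the final minimum of the pathway, only connected_1 and connected_3 would be copied to '''relevant_connected_*''' files.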

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall one in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
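If there are many subdirectories, editing pathdata by hand quickly becomes tedious. A minimal Python sketch of the substitution is given below. It only rewrites the MERGEDB line of a pathdata template held as a list of lines; reading and writing the file and launching PATHSAMPLE for each subdirectory are left out. The function name and the subdirectory 00031_00077 used in the usage note are hypothetical.

```python
def pathdata_for_merge(template_lines, subdirectory):
    """Return pathdata lines with MERGEDB pointing at `subdirectory`.

    `template_lines` is the existing pathdata file split into lines.
    Every other line is passed through unchanged.
    """
    updated = []
    for line in template_lines:
        if line.split()[:1] == ["MERGEDB"]:
            updated.append("MERGEDB ./" + subdirectory)
        else:
            updated.append(line)
    return updated
```

Calling pathdata_for_merge(lines, "00031_00077") on the pathdata shown above would replace the line MERGEDB ./00025_00102 with MERGEDB ./00031_00077 and leave the rest of the file as it is.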

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:19:17Z

Adk44: /* Execution */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown in the diagram below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in the checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job gets overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have had a problem with it before (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix is to reoptimise the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, it was found that 14 of these had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Make sure reoptimise.sh assigns an appropriate binary when manipulating the submission script. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.
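The idea behind the duplicate search can be sketched in a few lines of Python. This is only an illustration, not the contents of duplicates.py: the function name and the (index, random number) input format are my own assumptions, the parsing of pathsample_checkmin.out is left out because it depends on the log layout, and the list is assumed to cover a single PATHSAMPLE batch.

```python
from collections import defaultdict


def find_overwritten(assignments):
    """Return the indices of jobs whose output was overwritten.

    `assignments` is a list of (stationary_point_index, random_number)
    pairs in launch order (hypothetical format). When several jobs share
    a random number, only the last one launched keeps its output, so
    every earlier job with that number is reported.
    """
    by_number = defaultdict(list)
    for index, number in assignments:
        by_number[number].append(index)

    overwritten = []
    for indices in by_number.values():
        # All jobs except the last one with this number lost their output.
        overwritten.extend(indices[:-1])
    return sorted(overwritten)


# Made-up example: minima 1 and 2 share a number with a later job.
example = [(1, 4711), (2, 1234), (3, 1234), (4, 9001), (5, 4711)]
print(find_overwritten(example))
```

The returned indices play the role of the entries in checkminfile, i.e. the minima that reoptimise.sh then has to redo.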

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points to our new mutated system/homologue (bearing in mind that not all will have converged) as well as reoptimised any overwritten ones, we now need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
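For anyone adapting this step, the concatenation can be sketched in Python as below. This is a rough illustration rather than a copy of organise_mindatainfo_min.sh: the function name is made up, and it assumes you have already worked out the list of file suffixes in the order you want (for instance from pathsample_checkmin.out and checkminfile).

```python
import os
import tempfile


def concatenate_in_order(directory, suffixes, output_name="min.data.info.total"):
    """Concatenate min.data.info.<suffix> files in the order given.

    Files that do not exist are skipped, and their suffixes are returned
    so that the gaps can be inspected afterwards.
    """
    missing = []
    out_path = os.path.join(directory, output_name)
    with open(out_path, "w") as out:
        for suffix in suffixes:
            path = os.path.join(directory, "min.data.info.%s" % suffix)
            if not os.path.exists(path):
                missing.append(suffix)
                continue
            with open(path) as handle:
                out.write(handle.read())
    return missing


# Small self-contained demonstration in a temporary directory.
demo_dir = tempfile.mkdtemp()
for suffix, text in (("12", "first block\n"), ("7", "second block\n")):
    with open(os.path.join(demo_dir, "min.data.info." + suffix), "w") as fh:
        fh.write(text)
print(concatenate_in_order(demo_dir, ["12", "99", "7"]))
```

Returning the missing suffixes rather than stopping is a deliberate choice here, since the text above already expects some reoptimisations not to be available.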

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
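The renaming step is trivial, but it is easy to clobber an existing file by accident, so a cautious version might look like the sketch below. The function name is my own, and the refusal to overwrite is an added safety check rather than anything the procedure requires.

```python
import os
import tempfile


def rename_readmin_output(directory):
    """Rename READMIN output so that it can serve as a TS database.

    points.min becomes points.ts and min.data becomes ts.data. The
    function refuses to overwrite files that are already there.
    """
    pairs = (("points.min", "points.ts"), ("min.data", "ts.data"))
    for old, new in pairs:
        destination = os.path.join(directory, new)
        if os.path.exists(destination):
            raise FileExistsError("refusing to overwrite %s" % destination)
    for old, new in pairs:
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

It would be called once from within readts, e.g. rename_readmin_output("."), after the READMIN run has finished.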

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves each one to a subdirectory whose index corresponds to where that TS appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to it. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one transition state in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima. Note that the minima in the database obtained in the readmin directory have not yet been added; this will come later.

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile:

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
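To find out how many connections connectfile asks for, a quick count of its non-blank lines is enough. The sketch below assumes one connection per line, which is how I have described the file above; the sample lines are made up, and how the count is turned into a value for CYCLES is left to you.

```python
def count_connections(lines):
    """Count the non-blank lines, i.e. the listed connection attempts."""
    return sum(1 for line in lines if line.strip())


# Made-up sample standing in for the contents of connectfile.
sample = ["12 13\n", "13 40\n", "\n", "40 41\n"]
print(count_connections(sample))
```

For a real file, pass the open file object instead of the sample list, e.g. count_connections(open("connectfile")).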

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the given number of connection attempts that have been specified. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
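The grouping and ranking described above can be sketched in Python as follows. Please treat this as an illustration under stated assumptions rather than as what my scripts do: the grouping here uses a simple union-find over the pairs of minima joined by each TS, whereas in practice I use disconnectionDPS with PRINTCONNECTED, and the distance is a plain Euclidean distance between made-up coordinates with no permutational alignment. All function names are hypothetical.

```python
import itertools
import math


def group_into_sub_databases(n_minima, ts_connections):
    """Group minima 1..n_minima into sets connected through TSs."""
    parent = list(range(n_minima + 1))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in ts_connections:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(1, n_minima + 1):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def distance(p, q):
    """Plain Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def ranked_candidate_pairs(groups, coords):
    """Return (distance, min_a, min_b) across groups, shortest first."""
    candidates = []
    for group_a, group_b in itertools.combinations(groups, 2):
        for a in group_a:
            for b in group_b:
                candidates.append((distance(coords[a], coords[b]), a, b))
    return sorted(candidates)


# Toy example: five minima, two TSs, 2D "coordinates".
groups = group_into_sub_databases(5, [(1, 2), (4, 5)])
coords = {1: (0, 0), 2: (1, 0), 3: (3, 0), 4: (10, 0), 5: (4, 0)}
ranked = ranked_candidate_pairs(groups, coords)
print(groups)
print(ranked[0])
```

Connection attempts would then be made from the top of the ranked list, skipping any pair whose two sub-databases have already been joined by an earlier success.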

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases, and so I only consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also copied to a '''relevant_connected_*''' file).
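The selection rule above (keep the large sub-databases, plus whichever ones contain the two end points of the pathway) can be written down as a short sketch. The function name and the in-memory list-of-lists representation are assumptions for illustration; find_connections.sh itself works on the '''connected_*''' files.

```python
def select_relevant(sub_databases, endpoints, cutoff=10):
    """Keep groups with at least `cutoff` minima, or holding an endpoint."""
    relevant = []
    for group in sub_databases:
        holds_endpoint = any(minimum in group for minimum in endpoints)
        if len(group) >= cutoff or holds_endpoint:
            relevant.append(group)
    return relevant


# Made-up example: one large group, one small group, one lone end point.
sub_databases = [list(range(1, 13)), [13, 14], [15]]
print(select_relevant(sub_databases, endpoints=(1, 15)))
```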

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
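Editing pathdata by hand for every subdirectory gets tedious, so the rewrite of the MERGEDB line can be automated along the lines of the sketch below. Both function names are hypothetical, and the directory-name pattern (two five-digit indices joined by an underscore) is inferred only from the 00025_00102 example above, so adjust it if your directories are named differently.

```python
import re


def pathdata_for_merge(template, subdirectory):
    """Return pathdata text whose MERGEDB line points at `subdirectory`."""
    lines = []
    replaced = False
    for line in template.splitlines():
        if line.split()[:1] == ["MERGEDB"]:
            lines.append("MERGEDB ./%s" % subdirectory)
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        raise ValueError("template has no MERGEDB line")
    return "\n".join(lines) + "\n"


def is_connection_directory(name):
    """True for names such as 00025_00102 (assumed naming pattern)."""
    return re.fullmatch(r"\d{5}_\d{5}", name) is not None


# Shortened stand-in for the pathdata file shown above.
template = "NATOMS 5464\nMERGEDB ./00025_00102\n\nAMBER12\n"
print(pathdata_for_merge(template, "00031_00077"))
```

In a loop over the subdirectories, each rewritten pathdata would be written out and the PATHSAMPLE binary run once before moving on to the next one.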

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:17:22Z

Adk44: /* checkmin/checkts */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown in the diagram below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in the checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job gets overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have had a problem with it before (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix is to reoptimise the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, it was found that 14 of these had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in '''/svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima'''.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points to our new mutated system/homologue (bearing in mind that not all will have converged) as well as reoptimised any overwritten ones, we now need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
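The concatenation step can be pictured with the short Python 3 sketch below. It is not the contents of organise_mindatainfo_min.sh. It assumes that each min.data.info.**** file ends in a numeric suffix giving the index of the stationary point, and that the files should be joined in ascending index order. The function name concatenate_info_files and the overrides argument (used here to swap in the output of a later reoptimisation of an overwritten minimum) are hypothetical.

```python
import re
from pathlib import Path


def concatenate_info_files(directory, output_name="min.data.info.total", overrides=None):
    """Join min.data.info.<index> files in ascending index order.

    ASSUMPTION: the suffix of each file is an integer index.
    overrides maps an index to a replacement file, e.g. the result of
    reoptimising a minimum whose first output was overwritten.
    Returns the list of indices in the order they were written.
    """
    directory = Path(directory)
    pattern = re.compile(r"^min\.data\.info\.(\d+)$")

    # Collect the per-minimum files, keyed by their numeric suffix.
    indexed = {}
    for path in directory.iterdir():
        match = pattern.match(path.name)
        if match:
            indexed[int(match.group(1))] = path

    # Replace (or add) entries that were reoptimised separately.
    for index, path in (overrides or {}).items():
        indexed[int(index)] = Path(path)

    order = sorted(indexed)
    with (directory / output_name).open("w") as out:
        for index in order:
            text = indexed[index].read_text()
            if text and not text.endswith("\n"):
                text += "\n"  # keep one record per line block
            out.write(text)
    return order
```

Sorting on the integer suffix rather than on the file name avoids min.data.info.10 being placed before min.data.info.2.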

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the minima obtained in the readmin directory have not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile, are also required:

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.
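To give a feel for what identifying the gaps involves, here is a minimal Python 3 sketch. It is not the code in generate_connectfile.py, and it rests on two assumptions: that columns 4 and 5 of each ts.data line hold the indices of the two minima connected by that TS, and that a "gap" exists wherever two consecutive TSs share no minimum. The choice of which two minima to pair up across a gap is arbitrary here, and the function names are hypothetical.

```python
def read_ts_connections(ts_data_text):
    """Return one (minimum, minimum) pair per line of ts.data.

    ASSUMPTION: fields 4 and 5 of each line are the indices of the two
    minima connected by the transition state.
    """
    pairs = []
    for line in ts_data_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            pairs.append((int(fields[3]), int(fields[4])))
    return pairs


def find_gaps(pairs):
    """List the pairs of minima that would need connecting.

    Two consecutive transition states are treated as joined if they share
    a minimum. Otherwise the second minimum of the earlier TS is paired
    with the first minimum of the later TS (an arbitrary choice made for
    this illustration).
    """
    gaps = []
    for (a_first, a_second), (b_first, b_second) in zip(pairs, pairs[1:]):
        if {a_first, a_second} & {b_first, b_second}:
            continue  # the two TSs already share a minimum
        gaps.append((a_second, b_first))
    return gaps
```

Each pair returned by find_gaps would correspond to one line of a connectfile-like list, under the assumptions above.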

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
</pre>

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the number of connection attempts that has been specified. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
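The ordering logic described above can be sketched in a few lines of Python 3. This is an illustration only, not the connect_sub_databases.sh script. It assumes the sub-databases are already available as lists of minimum indices, uses a plain Euclidean distance between coordinate vectors as a stand-in for whatever distance measure is actually used, and takes a hypothetical try_connect callback in place of a real connection attempt.

```python
import itertools
import math


def order_and_connect(groups, coords, try_connect):
    """Attempt connections between sub-databases, shortest distance first.

    groups      : list of lists of minimum indices, one list per sub-database
    coords      : dict mapping minimum index -> sequence of coordinates
    try_connect : callable(a, b) -> bool; hypothetical stand-in for a
                  connection attempt between minima a and b
    Returns the list of (a, b) pairs that were actually attempted.
    """
    group_of = {m: g for g, members in enumerate(groups) for m in members}
    parent = list(range(len(groups)))  # union-find over sub-databases

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    # Distances between every pair of minima in different sub-databases.
    candidates = []
    for a, b in itertools.combinations(sorted(group_of), 2):
        if group_of[a] != group_of[b]:
            candidates.append((math.dist(coords[a], coords[b]), a, b))
    candidates.sort()

    attempted = []
    for _, a, b in candidates:
        root_a, root_b = find(group_of[a]), find(group_of[b])
        if root_a == root_b:
            continue  # these sub-databases are already joined: skip
        attempted.append((a, b))
        if try_connect(a, b):
            parent[root_a] = root_b  # the two sub-databases are now one
    return attempted
```

Tracking merged sub-databases with a union-find structure means that a successful connection between two sub-databases also rules out later attempts via any third sub-database that has since been joined to them.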

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases, and usually only consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
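The selection rule above (keep large sub-databases, plus any that contain an end point of the pathway) can be written as a small Python 3 sketch. It works on lists of minimum indices rather than on the '''connected_*''' files themselves, since their layout is not described here, and the function name select_relevant is hypothetical.

```python
def select_relevant(groups, endpoints, min_size=10):
    """Return the sub-databases worth keeping.

    groups    : list of lists of minimum indices, one list per sub-database
    endpoints : indices of the minima at the start and end of the pathway
    min_size  : cutoff below which a sub-database is normally discarded
    A sub-database is kept if it is large enough OR contains an end point.
    """
    endpoints = set(endpoints)
    return [
        group
        for group in groups
        if len(group) >= min_size or endpoints & set(group)
    ]
```

For example, with min_size=10 a two-member sub-database containing the first minimum of the pathway is still kept, whereas a nine-member sub-database containing neither end point is dropped.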

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
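Editing pathdata by hand for every subdirectory gets tedious, so the repetition can be scripted. The Python 3 sketch below only prepares the text of each pathdata file; running the PATHSAMPLE binary after each rewrite is left out. It assumes that the connection subdirectories are all named as two numbers joined by an underscore, as in the 00025_00102 example, and both function names are hypothetical.

```python
import re


def merge_directories(names):
    """Pick out names that look like '<number>_<number>', sorted.

    ASSUMPTION: every connection subdirectory follows the naming seen in
    the example above (e.g. 00025_00102).
    """
    pattern = re.compile(r"^\d+_\d+$")
    return sorted(name for name in names if pattern.match(name))


def pathdata_for_merge(template_text, subdirectory):
    """Return pathdata text with the MERGEDB argument pointing at subdirectory."""
    lines = []
    replaced = False
    for line in template_text.splitlines():
        if line.split()[:1] == ["MERGEDB"]:
            lines.append("MERGEDB ./" + subdirectory)
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        raise ValueError("template pathdata has no MERGEDB line")
    return "\n".join(lines) + "\n"
```

A driver loop would call merge_directories on the contents of relevant_connected, write pathdata_for_merge(template, name) to pathdata for each name in turn, and run PATHSAMPLE once per rewrite.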

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

However, it must be noted that this new pathway is probably not optimal. Some barriers are likely to be awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:15:57Z

Adk44: /* Rationale */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job gets overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into this problem, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the overwritten stationary points. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, it was found that 14 of these had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also require duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.
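The idea behind spotting the overwritten minima can be illustrated with the Python 3 sketch below. It is not duplicates.py, and it does not parse pathsample_checkmin.out, since the layout of that log is not described here. Instead it assumes the log has already been reduced to (batch, minimum index, random number) records in launch order. The function name find_overwritten is hypothetical.

```python
from collections import defaultdict


def find_overwritten(assignments):
    """Return the indices of minima whose OPTIM output was overwritten.

    assignments : iterable of (batch, minimum_index, job_number) tuples in
                  launch order (ASSUMED to have been extracted from the
                  PATHSAMPLE output beforehand).
    Within one batch, if several jobs share the same job_number, every job
    except the last one launched is treated as overwritten.
    """
    by_key = defaultdict(list)
    for batch, index, number in assignments:
        by_key[(batch, number)].append(index)

    overwritten = []
    for indices in by_key.values():
        overwritten.extend(indices[:-1])  # all but the final job are lost
    return sorted(overwritten)
```

The returned indices play the role of the entries in checkminfile, under the assumptions stated above.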

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs). It is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files. I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the minima obtained in the readmin directory have not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile.

The following files, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile, are also required:

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
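
A quick sanity check before setting CYCLES can be sketched in Python. It assumes each non-blank line of connectfile holds one pair of minima indices separated by whitespace; the helper name is hypothetical, and how the resulting count maps onto CYCLES is left to you.

```python
def count_connection_pairs(connectfile_text):
    """Count the pairs listed in connectfile, checking each line as we go."""
    count = 0
    for number, line in enumerate(connectfile_text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError("line %d does not hold exactly two indices" % number)
        int(fields[0]), int(fields[1])  # raises ValueError if not integers
        count += 1
    return count
```

For example, the text of connectfile can be read with open("connectfile").read() and passed straight to this function.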

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
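
The scheme just described can be sketched in Python as a shortest-first loop with a union-find structure tracking which sub-databases have been joined. This is only an illustration, not the code in connect_sub_databases.sh: the names are hypothetical, distance stands in for whatever distance calculation is used, and try_connect stands in for a real connection attempt. The sketch also skips pairs whose sub-databases have been joined indirectly through a third one, which goes slightly beyond the description above.

```python
from itertools import combinations


def plan_connections(groups, distance, try_connect):
    """Attempt inter-group connections from shortest to longest distance.

    groups:      list of lists of minima indices, one list per sub-database
    distance:    callable (i, j) -> float (placeholder)
    try_connect: callable (i, j) -> bool, True if the attempt succeeded
    Returns the pairs that were attempted, in order.
    """
    parent = list(range(len(groups)))

    def find(g):
        # Follow parent links to the representative of g's merged group.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    # Every pair of minima drawn from two different sub-databases.
    candidates = []
    for ga, gb in combinations(range(len(groups)), 2):
        for i in groups[ga]:
            for j in groups[gb]:
                candidates.append((distance(i, j), i, j, ga, gb))
    candidates.sort()

    attempted = []
    for _, i, j, ga, gb in candidates:
        ra, rb = find(ga), find(gb)
        if ra == rb:
            continue  # these sub-databases are already connected
        attempted.append((i, j))
        if try_connect(i, j):
            parent[ra] = rb  # merge the two sub-databases
    return attempted
```

If an attempt fails, the loop simply moves on to the next-shortest pair between the same two sub-databases, which is the point of the scheme.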

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
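
The selection rule above can be written down as a short Python sketch. It assumes the contents of each '''connected_*''' file have already been read into a list of minima indices; the function name and the dictionary layout are hypothetical.

```python
def select_relevant(groups, cutoff=10, endpoints=()):
    """Pick the sub-databases worth keeping.

    groups:    dict mapping a connected_* file name to its list of minima
    cutoff:    minimum number of minima for a sub-database to be kept
    endpoints: minima that must be retained whatever the group size
    Returns a dict keyed by the corresponding relevant_connected_* name.
    """
    keep = {}
    for name, minima in groups.items():
        big_enough = len(minima) >= cutoff
        holds_endpoint = any(m in minima for m in endpoints)
        if big_enough or holds_endpoint:
            keep["relevant_" + name] = minima
    return keep
```

Passing the first and last minima of the pathway as endpoints guards against them being dropped with a small sub-database.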

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
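
Editing pathdata by hand for every subdirectory gets tedious, so the substitution can be sketched in Python. This is only an illustration under the assumption that pathdata contains exactly one active MERGEDB line; the function name is hypothetical, and running PATHSAMPLE after each rewrite is still up to you.

```python
def pathdata_for_merge(template, subdir):
    """Return the pathdata text with its MERGEDB line pointing at subdir."""
    lines = []
    replaced = False
    for line in template.splitlines():
        # Commented-out lines start with "!" and so are left untouched.
        if line.split()[:1] == ["MERGEDB"]:
            lines.append("MERGEDB ./" + subdir)
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        raise ValueError("template has no MERGEDB line")
    return "\n".join(lines) + "\n"
```

Looping this over the subdirectory names gives one pathdata per merge, to be run one after another.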

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-04T09:14:36Z

Adk44: /* Introduction */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method described below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown in the figure below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. In that case, the earlier job gets overwritten by the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume nobody else has run into this problem, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising these overwritten files. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, it was found that 14 of these had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
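
The detection step can be sketched in Python as follows. This is not the code in duplicates.py: parsing of pathsample_checkmin.out is left out because the log format is not reproduced here, so the sketch assumes the log has already been reduced to (stationary point index, job number) pairs in launch order, all from one batch. The function name is hypothetical.

```python
def overwritten_points(assignments):
    """Find stationary points whose output was lost to a reused job number.

    assignments: iterable of (point_index, job_number) in launch order
    Returns the sorted indices of the points that were overwritten.
    """
    last_owner = {}  # job number -> most recent point that used it
    lost = []
    for point, job_number in assignments:
        if job_number in last_owner:
            # The earlier job with this number is overwritten by this one.
            lost.append(last_owner[job_number])
        last_owner[job_number] = point
    return sorted(lost)
```

The indices returned play the role of the entries in checkminfile under these assumptions.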

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points to our new mutated system/homologue (bearing in mind that not all will have converged) as well as reoptimised any overwritten ones, we now need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
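
The ordering logic behind the concatenation step can be pictured with a short Python sketch. It is not the organising script itself: it assumes the contents of the min.data.info.**** files have already been read into dictionaries keyed by the index of the stationary point in the original database, and all names are hypothetical.

```python
def build_total(order, first_pass, reruns):
    """Assemble the text of min.data.info.total in the original order.

    order:      stationary point indices in original database order
    first_pass: dict index -> text block from the first CHECKSPMUTATE run
    reruns:     dict index -> text block from the repeat reoptimisations
    Points absent from both (e.g. not converged) are skipped.
    Returns (total_text, indices_kept).
    """
    parts = []
    kept = []
    for index in order:
        # Prefer the repeat run, since the first output was overwritten.
        block = reruns.get(index, first_pass.get(index))
        if block is None:
            continue
        parts.append(block if block.endswith("\n") else block + "\n")
        kept.append(index)
    return "".join(parts), kept
```

The list of kept indices is handy for checking which original stationary points survived the mutation.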

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts each of the transition states in the database, moves it to a subdirectory named with the index at which it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to that TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one transition state in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the minima obtained in the readmin directory have not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
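
The selection of '''relevant_connected_*''' files could be automated along the following lines. This is a hedged sketch only: it assumes that each '''connected_*''' file lists one minimum per line with the index as the first field, which may not match the files written by your disconnectionDPS binary, and the function names are hypothetical.

```python
import shutil
from pathlib import Path


def read_minima(path):
    """Read the first whitespace-separated field of each non-blank line as a
    minimum index. This file layout is an assumption."""
    minima = set()
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if fields:
            minima.add(int(fields[0]))
    return minima


def select_relevant(directory, endpoints, cutoff=10):
    """Copy connected_* files to relevant_connected_* when they hold at least
    `cutoff` minima or contain one of the pathway end points."""
    kept = []
    for path in sorted(Path(directory).glob("connected_*")):
        minima = read_minima(path)
        if len(minima) >= cutoff or minima & set(endpoints):
            target = path.with_name("relevant_" + path.name)
            shutil.copyfile(path, target)
            kept.append(target.name)
    return kept
```

Passing the indices of the two end-point minima as <code>endpoints</code> ensures that their sub-databases are kept even when they fall beneath the cutoff.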

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is added to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
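
Changing the MERGEDB argument by hand for every subdirectory is tedious, so the edit could be scripted. The sketch below is an illustration rather than part of the existing scripts: the function names are hypothetical, and it assumes that every subdirectory to be merged is named like 00025_00102 (two integer fields separated by an underscore). It only prepares the pathdata text; running PATHSAMPLE after each rewrite is left to you.

```python
from pathlib import Path


def pathdata_for_merge(template_text, subdirectory):
    """Return pathdata text in which the MERGEDB line points at `subdirectory`."""
    lines = template_text.splitlines()
    found = False
    for n, line in enumerate(lines):
        if line.split()[:1] == ["MERGEDB"]:
            lines[n] = f"MERGEDB ./{subdirectory}"
            found = True
    if not found:
        raise ValueError("template has no MERGEDB line to replace")
    return "\n".join(lines) + "\n"


def merge_subdirectories(directory):
    """Names of subdirectories that look like '00025_00102' (an assumed pattern)."""
    names = []
    for path in sorted(Path(directory).iterdir()):
        parts = path.name.split("_")
        if path.is_dir() and len(parts) == 2 and all(p.isdigit() for p in parts):
            names.append(path.name)
    return names
```

A driver loop would call <code>merge_subdirectories</code> once, then for each name write the output of <code>pathdata_for_merge</code> to pathdata and run the merge before moving on to the next.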

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

CHECKSPMUTATE

2020-06-04T09:10:47Z

Adk44: /* pathdata and odata.checksp files */

== Purpose ==

It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as a fully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.

What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.

CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing for user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, eg, [[SHORTCUT]], [[SHORTCUT 2 BARRIER]] and [[UNTRAP]], this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).

== Preparation ==

Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory. Note that the python scripts can be found in '''/svn/SCRIPTS/AMBER/BHmutation_steps'''.

*'''aa_ringdata.pyc''' list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
*'''amino_acids.pyc''' list of parameters for all residues.
*'''atomnumberlog''' list of indices of the first atom of the residues to be mutated.
*'''coordinates_mut.pyc''' script which mutates the selected residue.
*'''coords.inpcrd''' ensure this is for the system we are CHANGING TO.
*'''coords.mdcrd''' for use with min.in, may not be required.
*'''coords.prmtop''' ensure this is for the system we are CHANGING TO.
*'''min.A''' needs to be present, although the index listed in it is unimportant
*'''min.B''' needs to be present, although the index listed in it is unimportant
*'''min.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''min.in''' for use with [[AMBER]]. Defines certain aspects of the model, such as solvent being used.
*'''mutate_aa.py''' organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
*'''newreslog''' list of codes for residues which we are CHANGING TO.
*'''nresidueslog''' list of the total number of mutations to be made to the system.
*'''odata.checksp''' list of conditions for optimisation carried out on each stationary point by [[OPTIM]].
*'''oldreslog''' list of codes for residues which we are CHANGING FROM.
*'''original_protein.pdb''' pdb file for the system we are CHANGING FROM. All lines which do not describe an ATOM (e.g. TITLE, TER and END) are removed, so that the number of lines in the file corresponds to the number of atoms in the system.
*'''pathdata''' organises [[OPTIM]] jobs. Certain keywords are required, described below.
*'''perm.allow''' full description of the groups of permutable atoms in the system we are CHANGING TO.
*'''points.min''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''points.ts''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''resnumberlog''' list of indices of the residues to be mutated.
*'''ts.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''submission_script''' script for executing your binary.
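
As an illustration of how '''original_protein.pdb''' and '''atomnumberlog''' relate, the following Python sketch strips the non-ATOM lines from a pdb file and then looks up the index of the first atom of each residue. It is not one of the scripts in the svn directory, and the function names are hypothetical. It assumes standard fixed-column PDB records (record name in columns 1-6, residue sequence number in columns 23-26) and unique residue numbers. If your cofactors are written as HETATM records, that name would need adding to <code>records</code>.

```python
def keep_atom_records(pdb_lines, records=("ATOM",)):
    """Keep only lines whose record name (columns 1-6) is in `records`."""
    return [line for line in pdb_lines if line[:6].strip() in records]


def first_atom_indices(atom_lines):
    """Map residue sequence number (PDB columns 23-26) to the 1-based line
    number of its first atom in the stripped file."""
    first = {}
    for index, line in enumerate(atom_lines, start=1):
        residue = int(line[22:26])
        first.setdefault(residue, index)  # only the first atom is recorded
    return first
```

Because the stripped file has exactly one line per atom, the line number doubles as the atom index with respect to the original system.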

== Example of Mutation ==

=== [Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH] ===

To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using [[DIJKSTRA]] and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.

*'''nresidueslog''' as we are only making one mutation, this is simply:
<pre>
1
</pre>
*'''resnumberlog''' this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
101
</pre>
*'''atomnumberlog''' this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
1569
</pre>
*'''oldreslog''' this is the residue code for the residue we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
<pre>
PHE
</pre>
*'''newreslog''' this is the residue code for the residue we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
<pre>
ALA
</pre>
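
For more than a handful of mutations it may be easier to generate these five files from a single list. The sketch below is a hypothetical helper, not part of the existing scripts. It assumes one value per line, as in the examples above, and pads three-letter residue codes to four characters.

```python
from pathlib import Path


def pad_code(code):
    """Pad a residue code to four characters (e.g. 'PHE' -> 'PHE ')."""
    if not 3 <= len(code) <= 4:
        raise ValueError(f"unexpected residue code: {code!r}")
    return code.ljust(4)


def write_mutation_logs(directory, mutations):
    """`mutations` is a list of (residue_index, first_atom_index, old_code,
    new_code) tuples, with all indices referring to the original system."""
    directory = Path(directory)
    columns = {
        "resnumberlog": [str(m[0]) for m in mutations],
        "atomnumberlog": [str(m[1]) for m in mutations],
        "oldreslog": [pad_code(m[2]) for m in mutations],
        "newreslog": [pad_code(m[3]) for m in mutations],
    }
    (directory / "nresidueslog").write_text(f"{len(mutations)}\n")
    for name, values in columns.items():
        (directory / name).write_text("\n".join(values) + "\n")
```

For the F101A example, the call would be <code>write_mutation_logs(".", [(101, 1569, "PHE", "ALA")])</code>.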

=== pathdata and odata.checksp files ===

To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.

When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.

An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5491

AMBER12
</pre>

The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!
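
To avoid a mismatch between the CHECKMIN/CHECKTS argument and the database, the count can be read straight from min.data or ts.data. The following is a small sketch under two assumptions: that these files hold one stationary point per non-blank line, and that CYCLES should be set to the same number for CHECKTS as it is for CHECKMIN in the example above. The function names are hypothetical.

```python
def count_entries(data_path):
    """Count non-blank lines in min.data or ts.data (assumed to hold one
    stationary point per line)."""
    with open(data_path) as handle:
        return sum(1 for line in handle if line.strip())


def check_keywords(data_path, transition_states=False):
    """Return the two pathdata lines that depend on the database size."""
    n = count_entries(data_path)
    keyword = "CHECKTS" if transition_states else "CHECKMIN"
    return [f"{keyword} 1 {n}", f"CYCLES {n}"]
```

The returned lines can then be pasted into pathdata in place of the hand-typed values.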

=== Before and After Mutation ===

Here is a rendering of [wt HemS + haem + NADH], alongside a rendering of the same minimum reoptimised after the F101A mutation was made.

<gallery mode=packed widths="540px" heights="360px">
File:wt_HemS_wiki.png|wt HemS
File:F101A_Hems_wiki.png|F101A HemS
</gallery>

== Example of Transformation to a Homologue ==

=== [Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH] ===

This is a more complex problem than the single-point mutation described above. HemS and ChuS are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition to that, ChuS is two residues shorter than HemS, with one such deletion occurring approximately midway through the chain, and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

First, we need to align and compare the two systems.

==== wt HemS + haem + NADH ====

NSERILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP
LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP
ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU
VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG
HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU
GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO
GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU
ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS
ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN
VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE
MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL
THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU
THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER
LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYSHEM NAD

==== wt ChuS + haem + NADH ====

NASNHIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP
ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP
ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU
THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG
GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP
ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR
PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR
VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS
ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS
VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE
MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL
VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU
GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER
LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPROXXX HEM NAD

Here, red font signifies a residue which has been mutated, green represents a residue which has been deleted (given the code name XXX), and purple represents the two cofactors.

We now need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define all the mutations/insertions/deletions we are making, and where in the chain they take place. In this example, we are making 113 mutations/insertions/deletions (to be precise, 111 mutations and 2 deletions) in total:

*'''nresidueslog''' for 113 mutations, this is:
<pre>
113
</pre>
*'''resnumberlog''' this is a list of the indices of the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
2
4
.
.
336
337
338
</pre>
*'''atomnumberlog''' this is a list of the indices of the first atoms in the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
14
54
.
.
5307
5321
5335
</pre>
*'''oldreslog''' this is the residue code for the residues we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NSER
ILE
GLU
.
.
ASN
ASN
CLYS
</pre>
*'''newreslog''' this is the residue code for the residues we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall. If we are deleting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NASN
HIE
THR
.
.
ILE
CPRO
XXX
</pre>
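
With over a hundred changes, picking out the differing positions by eye is error-prone. The comparison of the two aligned listings could be sketched as below. This is an illustrative helper with hypothetical names, not one of the kr366/adk44 scripts. It assumes the residue codes are laid out in four-character fields as in the listings above, and it covers only mutations and deletions (as in the ChuS example), not insertions. The first-atom indices for atomnumberlog would still have to be looked up separately.

```python
def parse_residues(text):
    """Split a block of residue codes written in four-character fields,
    so that 'NSERILE TYR' becomes ['NSER', 'ILE', 'TYR']."""
    residues = []
    for line in text.splitlines():
        for start in range(0, len(line), 4):
            code = line[start:start + 4].strip()
            if code:
                residues.append(code)
    return residues


def list_changes(old_aligned, new_aligned, gap="XXX"):
    """Return (residue_index, old_code, new_code) for every aligned position
    where the codes differ. Indices refer to the original system."""
    if len(old_aligned) != len(new_aligned):
        raise ValueError("aligned sequences must have the same length")
    changes = []
    for index, (old, new) in enumerate(zip(old_aligned, new_aligned), start=1):
        if old.endswith(gap):
            raise ValueError("insertions are not covered by this sketch")
        if old != new:
            changes.append((index, old, new))
    return changes
```

Applied to the first five residues of the two listings, this reports changes at positions 1, 2, 4 and 5, consistent with the start of the resnumberlog shown above.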

=== pathdata and odata.checksp files ===

Same as for the mutation example above. I would expect, for my example here with ChuS, a pathdata file like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5464

AMBER12
</pre>

== Final Notes ==

We therefore now have a method with which to transform the stationary points of a database of interest into those of a mutant or close homologue. This should allow a wild type to be compared directly against mutants of interest, or two protein homologues to be compared against each other. In the examples above, I have been interested in comparing the pathway of a cofactor as it enters a pocket against other pathways in which key residues have been removed, and against a homologue of suspected similar reactivity.

The advantage of this method is that time and computational resources can be significantly saved in transforming one database to a system of similar properties.

It must be noted, however, that following the use of CHECKSPMUTATE, the new pathway might have gaps, as not all of the stationary points will necessarily have converged. These gaps will need to be filled with the [[CONNECTPAIRS]] keyword. The full procedure that I used to fill such gaps is detailed in [[Pathway Gap Filling Post-CHECKSPMUTATE]].

IT MUST ALSO BE NOTED that the pathway for the new system will not necessarily be (and indeed is unlikely to be) the optimal one. Therefore, post CHECKSPMUTATE, an appropriate number of rounds of [[SHORTCUT]]/[[SHORTCUT 2 BARRIER]]/[[UNTRAP]] etc will need to be performed. It is not advised that CHECKSPMUTATE be used on proteins which are significantly different.

=== Potential Uses ===

CHECKSPMUTATE could potentially be used to shed light on some of the following problems:

* Comparing a wt protein against some mutations to determine how such mutations affect protein folding.
* Comparing a wt protein against some mutations to determine how such mutations affect protein-cofactor interactions.
* Comparing a protein against a close homologue to see whether it engages in the same manner of protein folding.
* Comparing a protein against a close homologue to see whether it engages in similar protein-cofactor interactions.
* Transformation of a cofactor (for example from NADH to NADPH) to see whether this affects protein-cofactor interactions.
* Comparing a DNA strand against an RNA/XNA strand. Note that this would require some different python scripts, also found in '''/svn/SCRIPTS/AMBER/BHmutation_steps''', such as '''nucleic_acids.py'''.

=== Acknowledgements ===

Though adk44 wrote the code for CHECKSPMUTATE, he interfaced this code with kr366's pre-existent python scripts (with minor alterations) which had originally been written to allow mutations as part of kr366's [[Mutational BH steps]] routine.

--adk44 16.45, 2 June 2020 (BST)

CHECKSPMUTATE

2020-06-04T09:09:35Z

Adk44: /* Making auxiliary files */

== Purpose ==

It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as a fully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.

What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.

CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing for user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, eg, [[SHORTCUT]], [[SHORTCUT 2 BARRIER]] and [[UNTRAP]], this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).

== Preparation ==

Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory. Note that the python scripts can be found in '''/svn/SCRIPTS/AMBER/BHmutation_steps'''.

*'''aa_ringdata.pyc''' list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
*'''amino_acids.pyc''' list of parameters for all residues.
*'''atomnumberlog''' list of indices of the first atom of the residues to be mutated.
*'''coordinates_mut.pyc''' script which mutates the selected residue.
*'''coords.inpcrd''' ensure this is for the system we are CHANGING TO.
*'''coords.mdcrd''' for use with min.in, may not be required.
*'''coords.prmtop''' ensure this is for the system we are CHANGING TO.
*'''min.A''' needs to be present, although the index listed in it is unimportant
*'''min.B''' needs to be present, although the index listed in it is unimportant
*'''min.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''min.in''' for use with [[AMBER]]. Defines certain aspects of the model, such as solvent being used.
*'''mutate_aa.py''' organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
*'''newreslog''' list of codes for residues which we are CHANGING TO.
*'''nresidueslog''' list of the total number of mutations to be made to the system.
*'''odata.checksp''' list of conditions for optimisation carried out on each stationary point by [[OPTIM]].
*'''oldreslog''' list of codes for residues which we are CHANGING FROM.
*'''original_protein.pdb''' pdb file for the system we are CHANGING FROM. All lines which do not describe an ATOM (e.g. TITLE, TER and END) are removed, so that the number of lines in the file corresponds to the number of atoms in the system.
*'''pathdata''' organises [[OPTIM]] jobs. Certain keywords are required, described below.
*'''perm.allow''' full description of the groups of permutable atoms in the system we are CHANGING TO.
*'''points.min''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''points.ts''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''resnumberlog''' list of indices of the residues to be mutated.
*'''ts.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''submission_script''' script for executing your binary.

== Example of Mutation ==

=== [Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH] ===

To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using [[DIJKSTRA]] and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.

*'''nresidueslog''' as we are only making one mutation, this is simply:
<pre>
1
</pre>
*'''resnumberlog''' this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
101
</pre>
*'''atomnumberlog''' this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
1569
</pre>
*'''oldreslog''' this is the residue code for the residue we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
<pre>
PHE
</pre>
*'''newreslog''' this is the residue code for the residue we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
<pre>
ALA
</pre>

=== pathdata and odata.checksp files ===

To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.

When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.

An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5491

AMBER12
</pre>

The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!

=== Before and After Mutation ===

Here is a rendering of [wt HemS + haem + NADH], alongside a rendering of the same minimum reoptimised after the F101A mutation was made.

<gallery mode=packed widths="540px" heights="360px">
File:wt_HemS_wiki.png|wt HemS
File:F101A_Hems_wiki.png|F101A HemS
</gallery>

== Example of Transformation to a Homologue ==

=== [Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH] ===

This is a more complex problem than the single-point mutation described above. HemS and ChuS are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition, ChuS is two residues shorter than HemS, with one deletion occurring approximately midway through the chain and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

First, we need to align and compare the two systems.

==== wt HemS + haem + NADH ====

NSERILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP
LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP
ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU
VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG
HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU
GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO
GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU
ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS
ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN
VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE
MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL
THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU
THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER
LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYSHEM NAD

==== wt ChuS + haem + NADH ====

NASNHIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP
ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP
ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU
THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG
GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP
ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR
PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR
VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS
ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS
VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE
MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL
VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU
GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER
LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPROXXX HEM NAD

Here, red font signifies a residue which has been mutated, green represents a residue which has been deleted (given the code name XXX), and purple represents the two cofactors.

We now need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define all the mutations/insertions/deletions we are making, and where in the chain they take place. In this example, we are making 113 mutations/insertions/deletions (to be precise, 111 mutations and 2 deletions) in total:

*'''nresidueslog''' for 113 mutations, this is:
<pre>
113
</pre>
*'''resnumberlog''' this is a list of the indices of the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
2
4
.
.
336
337
338
</pre>
*'''atomnumberlog''' this is a list of the indices of the first atoms in the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
14
54
.
.
5307
5321
5335
</pre>
*'''oldreslog''' this is the residue code for the residues we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NSER
ILE
GLU
.
.
ASN
ASN
CLYS
</pre>
*'''newreslog''' this is the residue code for the residues we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall. If we are deleting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NASN
HIE
THR
.
.
ILE
CPRO
XXX
</pre>
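With more than a hundred changes, it may be easier to generate resnumberlog, oldreslog and newreslog from the two aligned sequences than to type them. The following is a minimal sketch, not the procedure actually used for this page: <code>diff_alignment</code> is a hypothetical helper, and it assumes the old sequence contains no XXX placeholders (i.e. no insertions, as in this example), so that the position in the alignment equals the residue index in the original system.

```python
def diff_alignment(old_seq, new_seq):
    """Compare two aligned residue lists of equal length.

    Returns (resnumbers, old_codes, new_codes) for every position that
    differs. Indices are 1-based and refer to old_seq, which is assumed to
    contain no XXX placeholders.
    """
    if len(old_seq) != len(new_seq):
        raise ValueError("sequences must be aligned to the same length")
    resnumbers, old_codes, new_codes = [], [], []
    for index, (old, new) in enumerate(zip(old_seq, new_seq), start=1):
        if old != new:
            resnumbers.append(index)
            old_codes.append(old.ljust(4))  # pad to four characters
            new_codes.append(new.ljust(4))
    return resnumbers, old_codes, new_codes


# First five aligned residues of each chain, taken from the listings above.
hems = ["NSER", "ILE", "TYR", "GLU", "GLN"]
chus = ["NASN", "HIE", "TYR", "THR", "ARG"]

resnumbers, old_codes, new_codes = diff_alignment(hems, chus)
print(resnumbers)
print(old_codes)
print(new_codes)
```

A deleted residue is handled like any other difference, provided the new sequence carries XXX at that position.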

=== pathdata and odata.checksp files ===

Same as for the mutation example above. I would expect, for my example here with ChuS, a pathdata file like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5464

AMBER12
</pre>
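Before submitting a job, it can be worth checking that the mutation-related keywords described above are all present. The sketch below is a hedged illustration only: <code>parse_pathdata</code> and <code>check_mutation_keywords</code> are hypothetical helpers, and the checks cover only the requirements stated on this page, not everything the program itself validates.

```python
def parse_pathdata(text):
    """Map each keyword in a pathdata file to its list of arguments."""
    keywords = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            keywords[parts[0].upper()] = parts[1:]
    return keywords


def check_mutation_keywords(keywords):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if "CHECKSP_MUT" not in keywords:
        problems.append("CHECKSP_MUT is missing")
    if "NATOMS_NEW" not in keywords:
        problems.append("NATOMS_NEW is missing")
    if "CHECKMIN" not in keywords and "CHECKTS" not in keywords:
        problems.append("neither CHECKMIN nor CHECKTS is present")
    if "NATOMS_CHAIN" in keywords and "NATOMS" in keywords:
        if int(keywords["NATOMS_CHAIN"][0]) > int(keywords["NATOMS"][0]):
            problems.append("NATOMS_CHAIN exceeds NATOMS")
    return problems


# Mutation-related lines from the ChuS pathdata example above.
example = """NATOMS 5501
NATOMS_CHAIN 5357
CHECKMIN 1 1235
CHECKSP_MUT
NATOMS_NEW 5464
"""
problems = check_mutation_keywords(parse_pathdata(example))
print(problems or "no problems found")
```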

== Final Notes ==

We therefore now have a method by which to transform the stationary points of a database of interest into those of a mutant or close homologue. This should allow a wild type to be compared directly against mutants of interest, or one protein against a homologue. In the examples above, I have been interested in comparing the pathway of a cofactor as it enters a pocket against other pathways in which key residues have been removed, and against a homologue of suspected similar reactivity.

The advantage of this method is that time and computational resources can be significantly saved in transforming one database to a system of similar properties.

It must be noted, however, that following the use of CHECKSPMUTATE, the new pathway might have gaps, as not all of the stationary points will necessarily have converged. These gaps will need to be filled using the [[CONNECTPAIRS]] keyword. The full procedure that I used to fill such gaps is detailed in [[Pathway Gap Filling Post-CHECKSPMUTATE]].

IT MUST ALSO BE NOTED that the pathway for the new system will not necessarily be (and is indeed unlikely to be) the optimal one. Therefore, post CHECKSPMUTATE, an appropriate number of rounds of [[SHORTCUT]]/[[SHORTCUT 2 BARRIER]]/[[UNTRAP]] etc. will need to be performed. It is not advised that CHECKSPMUTATE be used on proteins which are significantly different.

=== Potential Uses ===

CHECKSPMUTATE could potentially be used to shed light on some of the following problems:

* Comparing a wt protein against some mutations to determine how such mutations affect protein folding.
* Comparing a wt protein against some mutations to determine how such mutations affect protein-cofactor interactions.
* Comparing a protein against a close homologue to see whether it engages in the same manner of protein folding.
* Comparing a protein against a close homologue to see whether it engages in similar protein-cofactor interactions.
* Transformation of a cofactor (for example from NADH to NADPH) to see whether this affects protein-cofactor interactions.
* Comparing a DNA strand against an RNA/XNA strand. Note that this would require some different python scripts, also found in '''/svn/SCRIPTS/AMBER/BHmutation_steps''', such as '''nucleic_acids.py'''.

=== Acknowledgements ===

Though adk44 wrote the code for CHECKSPMUTATE, he interfaced this code with kr366's pre-existent python scripts (with minor alterations) which had originally been written to allow mutations as part of kr366's [[Mutational BH steps]] routine.

--adk44 16.45, 2 June 2020 (BST)

CHECKSPMUTATE

2020-06-04T09:04:42Z

Adk44: /* Preparation */

== Purpose ==

It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as a fully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.

What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.

CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing for user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, eg, [[SHORTCUT]], [[SHORTCUT 2 BARRIER]] and [[UNTRAP]], this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).

== Preparation ==

Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory. Note that the python scripts can be found in '''/svn/SCRIPTS/AMBER/BHmutation_steps'''.

*'''aa_ringdata.pyc''' list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
*'''amino_acids.pyc''' list of parameters for all residues.
*'''atomnumberlog''' list of indices of the first atom of the residues to be mutated.
*'''coordinates_mut.pyc''' script which mutates the selected residue.
*'''coords.inpcrd''' ensure this is for the system we are CHANGING TO.
*'''coords.mdcrd''' for use with min.in, may not be required.
*'''coords.prmtop''' ensure this is for the system we are CHANGING TO.
*'''min.A''' needs to be present, although the index listed in it is unimportant
*'''min.B''' needs to be present, although the index listed in it is unimportant
*'''min.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''min.in''' for use with [[AMBER]]. Defines certain aspects of the model, such as solvent being used.
*'''mutate_aa.py''' organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
*'''newreslog''' list of codes for residues which we are CHANGING TO.
*'''nresidueslog''' list of the total number of mutations to be made to the system.
*'''odata.checksp''' list of conditions for optimisation carried out on each stationary point by [[OPTIM]].
*'''oldreslog''' list of codes for residues which we are CHANGING FROM.
*'''original_protein.pdb''' pdb file for the system we are CHANGING FROM. All lines which do not describe an ATOM (e.g. TITLE, TER and END) are removed, so that the number of lines in the file corresponds to the number of atoms in the system.
*'''pathdata''' organises [[OPTIM]] jobs. Certain keywords are required, described below.
*'''perm.allow''' full description of the groups of permutable atoms in the system we are CHANGING TO.
*'''points.min''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''points.ts''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''resnumberlog''' list of indices of the residues to be mutated.
*'''ts.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''submission_script''' script for executing your binary.
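Since the list above is long, a short script can confirm that nothing has been forgotten before a job is submitted. This is a sketch under stated assumptions: <code>missing_files</code> is a hypothetical helper, aa_ringdata.pyc and coords.mdcrd are left out because the list marks them as not always required, and the submission script is left out because its real file name will differ between users.

```python
from pathlib import Path

# File names taken from the list above; conditional or user-named files
# (aa_ringdata.pyc, coords.mdcrd, the submission script) are not checked.
REQUIRED_FILES = [
    "amino_acids.pyc", "atomnumberlog", "coordinates_mut.pyc",
    "coords.inpcrd", "coords.prmtop", "min.A", "min.B", "min.data",
    "min.in", "mutate_aa.py", "newreslog", "nresidueslog", "odata.checksp",
    "oldreslog", "original_protein.pdb", "pathdata", "perm.allow",
    "points.min", "points.ts", "resnumberlog", "ts.data",
]


def missing_files(directory):
    """Return the required file names that are absent from directory."""
    directory = Path(directory)
    return [name for name in REQUIRED_FILES if not (directory / name).exists()]


# Check the current working directory.
print(missing_files("."))
```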

== Example of Mutation ==

=== [Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH] ===

To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using [[DIJKSTRA]] and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.

*'''nresidueslog''' as we are only making one mutation, this is simply:
<pre>
1
</pre>
*'''resnumberlog''' this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
101
</pre>
*'''atomnumberlog''' this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
1569
</pre>
*'''oldreslog''' this is the residue code for the residue we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
<pre>
PHE
</pre>
*'''newreslog''' this is the residue code for the residue we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
<pre>
ALA
</pre>
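The value for atomnumberlog can be read off original_protein.pdb, because that file has exactly one line per atom. The sketch below is illustrative only: <code>first_atom_indices</code> and <code>atom_line</code> are hypothetical helpers, the toy residues are far smaller than real ones, and the code assumes standard fixed-column PDB records in which the residue number occupies columns 23 to 26.

```python
def first_atom_indices(pdb_lines):
    """Map residue number -> 1-based index of the first atom in that residue.

    Assumes one atom per line, so the line number is the atom index.
    """
    first = {}
    for atom_index, line in enumerate(pdb_lines, start=1):
        resnum = int(line[22:26])  # PDB residue sequence number, columns 23-26
        first.setdefault(resnum, atom_index)
    return first


def atom_line(serial, name, resname, resnum):
    """Build the leading columns of a PDB ATOM record (toy data only)."""
    return f"ATOM  {serial:5d} {name:<4s} {resname:>3s} A{resnum:4d}"


# Toy system: residue 1 has two atoms and residue 2 has three.
toy_pdb = [
    atom_line(1, "N", "SER", 1),
    atom_line(2, "CA", "SER", 1),
    atom_line(3, "N", "ILE", 2),
    atom_line(4, "CA", "ILE", 2),
    atom_line(5, "C", "ILE", 2),
]
first = first_atom_indices(toy_pdb)
print(first)
```

For the real system, looking up residue 101 in the returned dictionary should reproduce the 1569 entered above.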

=== pathdata and odata.checksp files ===

To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.

When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.

An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5491

AMBER12
</pre>

The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!

=== Before and After Mutation ===

Here is a rendering of [wt HemS + haem + NADH], alongside a rendering of the same minimum reoptimised after the F101A mutation was made.

<gallery mode=packed widths="540px" heights="360px">
File:wt_HemS_wiki.png|wt HemS
File:F101A_Hems_wiki.png|F101A HemS
</gallery>

== Example of Transformation to a Homologue ==

=== [Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH] ===

This is a more complex problem than the single-point mutation described above. HemS and ChuS are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition, ChuS is two residues shorter than HemS, with one deletion occurring approximately midway through the chain and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

First, we need to align and compare the two systems.

==== wt HemS + haem + NADH ====

NSERILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP
LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP
ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU
VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG
HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU
GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO
GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU
ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS
ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN
VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE
MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL
THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU
THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER
LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYSHEM NAD

==== wt ChuS + haem + NADH ====

NASNHIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP
ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP
ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU
THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG
GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP
ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR
PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR
VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS
ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS
VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE
MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL
VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU
GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER
LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPROXXX HEM NAD

Here, red font signifies a residue which has been mutated, green marks the position in the original protein of a residue which has been deleted (given the code name XXX), and purple represents the two cofactors.

We now need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define all the mutations/insertions/deletions we are making, and where in the chain they take place. In this example, we are making 113 mutations/insertions/deletions (to be precise, 111 mutations and 2 deletions) in total:

*'''nresidueslog''' for 113 mutations, this is:
<pre>
113
</pre>
*'''resnumberlog''' this is a list of the indices of the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
2
4
.
.
336
337
338
</pre>
*'''atomnumberlog''' this is a list of the indices of the first atoms in the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
14
54
.
.
5307
5321
5335
</pre>
*'''oldreslog''' this is the residue code for the residues we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NSER
ILE
GLU
.
.
ASN
ASN
CLYS
</pre>
*'''newreslog''' this is the residue code for the residues we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall. If we are deleting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NASN
HIE
THR
.
.
ILE
CPRO
XXX
</pre>
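With 113 entries per file, it is easy for the logs to fall out of step with one another. The following sketch checks the two rules stated above: every log has as many entries as nresidueslog declares, and every residue code is four characters long. <code>check_logs</code> is a hypothetical helper, and the example uses only the first three entries of each file.

```python
def check_logs(nresidues_text, resnumbers, atomnumbers, old_codes, new_codes):
    """Return a list of inconsistencies between the five log files."""
    n = int(nresidues_text.strip())
    problems = []
    named = [
        ("resnumberlog", resnumbers),
        ("atomnumberlog", atomnumbers),
        ("oldreslog", old_codes),
        ("newreslog", new_codes),
    ]
    for name, entries in named:
        if len(entries) != n:
            problems.append(f"{name} has {len(entries)} entries, expected {n}")
    for name, codes in (("oldreslog", old_codes), ("newreslog", new_codes)):
        for code in codes:
            if len(code) != 4:
                problems.append(f"{name} code {code!r} is not four characters")
    return problems


# First three entries of each log from the example above.
problems = check_logs(
    "3",
    [1, 2, 4],
    [1, 14, 54],
    ["NSER", "ILE ", "GLU "],
    ["NASN", "HIE ", "THR "],
)
print(problems or "logs are consistent")
```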

=== pathdata and odata.checksp files ===

Same as for the mutation example above. I would expect, for my example here with ChuS, a pathdata file like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5464

AMBER12
</pre>
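The value of NATOMS_NEW can be read directly from the coords.prmtop of the new system rather than counted by hand, since the first integer in the POINTERS section of an AMBER topology file is the number of atoms. The sketch below is a minimal illustration: <code>natom_from_prmtop</code> is a hypothetical helper, and the sample lines are toy data in which only the first number matters.

```python
def natom_from_prmtop(lines):
    """Return NATOM, the first integer of the POINTERS section of a prmtop."""
    lines = iter(lines)
    for line in lines:
        if line.startswith("%FLAG POINTERS"):
            for data in lines:
                if not data.startswith("%"):  # skip the %FORMAT line
                    return int(data[0:8])     # POINTERS uses 8-wide integers
    raise ValueError("no POINTERS section found")


# Toy excerpt; every value except the first pointer is made up.
sample = [
    "%VERSION  VERSION_STAMP = V0001.000",
    "%FLAG TITLE",
    "%FORMAT(20a4)",
    "default_name",
    "%FLAG POINTERS",
    "%FORMAT(10I8)",
    "    5464      16    2700    2800    6200    3800   11900    9100       0       0",
]
natoms_new = natom_from_prmtop(sample)
print(f"NATOMS_NEW {natoms_new}")
```

For a real file, pass the open file object instead of the sample list, for example <code>natom_from_prmtop(open("coords.prmtop"))</code>.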

== Final Notes ==

We therefore now have a method by which to transform the stationary points of a database of interest into those of a mutant or close homologue. This should allow a wild type to be compared directly against mutants of interest, or one protein against a homologue. In the examples above, I have been interested in comparing the pathway of a cofactor as it enters a pocket against other pathways in which key residues have been removed, and against a homologue of suspected similar reactivity.

The advantage of this method is that time and computational resources can be significantly saved in transforming one database to a system of similar properties.

It must be noted, however, that following the use of CHECKSPMUTATE, the new pathway might have gaps, as not all of the stationary points will necessarily have converged. These gaps will need to be filled using the [[CONNECTPAIRS]] keyword. The full procedure that I used to fill such gaps is detailed in [[Pathway Gap Filling Post-CHECKSPMUTATE]].

IT MUST ALSO BE NOTED that the pathway for the new system will not necessarily be (and is indeed unlikely to be) the optimal one. Therefore, post CHECKSPMUTATE, an appropriate number of rounds of [[SHORTCUT]]/[[SHORTCUT 2 BARRIER]]/[[UNTRAP]] etc. will need to be performed. It is not advised that CHECKSPMUTATE be used on proteins which are significantly different.

=== Potential Uses ===

CHECKSPMUTATE could potentially be used to shed light on some of the following problems:

* Comparing a wt protein against some mutations to determine how such mutations affect protein folding.
* Comparing a wt protein against some mutations to determine how such mutations affect protein-cofactor interactions.
* Comparing a protein against a close homologue to see whether it engages in the same manner of protein folding.
* Comparing a protein against a close homologue to see whether it engages in similar protein-cofactor interactions.
* Transformation of a cofactor (for example from NADH to NADPH) to see whether this affects protein-cofactor interactions.
* Comparing a DNA strand against an RNA/XNA strand. Note that this would require some different python scripts, also found in '''/svn/SCRIPTS/AMBER/BHmutation_steps''', such as '''nucleic_acids.py'''.

=== Acknowledgements ===

Though adk44 wrote the code for CHECKSPMUTATE, he interfaced this code with kr366's pre-existent python scripts (with minor alterations) which had originally been written to allow mutations as part of kr366's [[Mutational BH steps]] routine.

--adk44 16.45, 2 June 2020 (BST)

CHECKSPMUTATE

2020-06-04T09:03:21Z

Adk44: /* Potential Uses */

== Purpose ==

It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as a fully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.

What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.

CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing for user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, eg, [[SHORTCUT]], [[SHORTCUT 2 BARRIER]] and [[UNTRAP]], this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).

== Preparation ==

Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory. Note that the python scripts can be found in '''/svn/SCRIPTS/AMBER/BHmutation_steps'''.

*'''aa_ringdata.pyc''' list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
*'''amino_acids.pyc''' list of parameters for all residues.
*'''atomnumberlog''' list of indices of the first atom of the residues to be mutated.
*'''coordinates_mut.pyc''' script which mutates the selected residue.
*'''coords.inpcrd''' ensure this is for the system we are CHANGING TO.
*'''coords.mdcrd''' for use with min.in, may not be required.
*'''coords.prmtop''' ensure this is for the system we are CHANGING TO.
*'''min.A''' needs to be present, although the index listed in it is unimportant
*'''min.B''' needs to be present, although the index listed in it is unimportant
*'''min.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''min.in''' for use with [[AMBER]]. Defines certain aspects of the model, such as solvent being used.
*'''mutate_aa.py''' organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
*'''newreslog''' list of codes for residues which we are CHANGING TO.
*'''nresidueslog''' list of the total number of mutations to be made to the system.
*'''odata.checksp''' list of conditions for optimisation carried out on each stationary point by [[OPTIM]].
*'''oldreslog''' list of codes for residues which we are CHANGING FROM.
*'''original_protein.pdb''' pdb file for the system we are CHANGING FROM. All lines which do not describe an ATOM (e.g. TITLE, TER and END) are removed, so that the number of lines in the file corresponds to the number of atoms in the system.
*'''pathdata''' organises [[OPTIM]] jobs. Certain keywords are required, described below.
*'''perm.allow''' full description of the groups of permutable atoms in the system we are CHANGING TO.
*'''points.min''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''points.ts''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''resnumberlog''' list of indices of the residues to be mutated.
*'''ts.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''submission_script''' script for executing your binary.
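Preparing original_protein.pdb amounts to filtering a PDB file down to its atom records. The sketch below shows one way to do this. It assumes that cofactor atoms written as HETATM records should be kept as well, so that the line count still equals the number of atoms in the whole system. <code>keep_atom_records</code> is a hypothetical helper and the sample lines are toy data.

```python
def keep_atom_records(lines):
    """Keep only the lines that describe an atom.

    Assumption: HETATM records (often used for cofactors) count as atoms.
    """
    return [line for line in lines if line.startswith(("ATOM", "HETATM"))]


# Toy PDB content; the atom lines are truncated for brevity.
raw = [
    "TITLE     wt HemS + haem + NADH",
    "ATOM      1  N   SER A   1",
    "ATOM      2  CA  SER A   1",
    "TER",
    "HETATM    3 FE   HEM A 339",
    "END",
]
cleaned = keep_atom_records(raw)
print(len(cleaned), "atom lines kept")
```

After filtering the real file, the number of lines kept should equal NATOMS for the original system.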

== Example of Mutation ==

=== [Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH] ===

To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using [[DIJKSTRA]] and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.

*'''nresidueslog''' as we are only making one mutation, this is simply:
<pre>
1
</pre>
*'''resnumberlog''' this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
101
</pre>
*'''atomnumberlog''' this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
1569
</pre>
*'''oldreslog''' this is the residue code for the residue we are mutating from. If the code has only three letters (i.e. the residue is not terminal, so the code has no N or C prefix), ensure that a space is put after it so that the code is four characters overall.
<pre>
PHE
</pre>
*'''newreslog''' this is the residue code for the residue we are mutating to. If the code has only three letters (i.e. the residue is not terminal, so the code has no N or C prefix), ensure that a space is put after it so that the code is four characters overall.
<pre>
ALA
</pre>
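
The five log files above are short enough to write by hand, but the four-character padding is easy to get wrong. The following is a minimal sketch of how they might be generated; the function names and the tuple layout are hypothetical and not part of the existing scripts.

```python
def pad_code(code):
    """Pad a residue code with trailing spaces to four characters."""
    code = code.strip()
    if not 3 <= len(code) <= 4:
        raise ValueError(f"unexpected residue code: {code!r}")
    return code.ljust(4)


def build_mutation_logs(mutations):
    """Build the contents of the five log files.

    mutations: list of (residue_index, first_atom_index, old_code, new_code),
    with both indices given with respect to the original system.
    Returns a dict mapping file name to file contents.
    """
    logs = {
        "nresidueslog": [str(len(mutations))],
        "resnumberlog": [str(m[0]) for m in mutations],
        "atomnumberlog": [str(m[1]) for m in mutations],
        "oldreslog": [pad_code(m[2]) for m in mutations],
        "newreslog": [pad_code(m[3]) for m in mutations],
    }
    return {name: "\n".join(lines) + "\n" for name, lines in logs.items()}


# The F101A example from this section.
logs = build_mutation_logs([(101, 1569, "PHE", "ALA")])
```

Each entry of the returned dictionary can then be written to a file of the same name in the working directory, for instance with `pathlib.Path(name).write_text(text)`.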

=== pathdata and odata.checksp files ===

To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.

When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.

An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5491

AMBER12
</pre>
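
Before submitting, it may be worth checking that the mutation-related keywords described above are all present. The sketch below is only an illustration: the function name is hypothetical, it treats the first word of each line as the keyword, and it does not reproduce how the program itself parses pathdata.

```python
def mutation_keyword_problems(pathdata_text):
    """Return a list of problems with the mutation-related keywords."""
    keywords = {}
    for line in pathdata_text.splitlines():
        fields = line.split()
        if fields:
            keywords[fields[0].upper()] = fields[1:]

    problems = []
    if "CHECKMIN" not in keywords and "CHECKTS" not in keywords:
        problems.append("neither CHECKMIN nor CHECKTS is present")
    for name in ("CHECKSP_MUT", "NATOMS_NEW"):
        if name not in keywords:
            problems.append(f"{name} is missing")
    # NATOMS_CHAIN is only needed when cofactors follow the protein chain.
    if "NATOMS" in keywords and "NATOMS_CHAIN" in keywords:
        if int(keywords["NATOMS_CHAIN"][0]) > int(keywords["NATOMS"][0]):
            problems.append("NATOMS_CHAIN exceeds NATOMS")
    return problems


# Mutation-related lines taken from the example pathdata file.
example = """NATOMS 5501
NATOMS_CHAIN 5357
CHECKMIN 1 1235
CHECKSP_MUT
NATOMS_NEW 5491
"""
problems = mutation_keyword_problems(example)
```

An empty list means none of these particular checks failed; it says nothing about the other keywords in the file.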

The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!

=== Before and After Mutation ===

Here is a rendering of [wt HemS + haem + NADH], alongside a rendering of the same minimum reoptimised after the F101A mutation was made.

<gallery mode=packed widths="540px" heights="360px">
File:wt_HemS_wiki.png|wt HemS
File:F101A_Hems_wiki.png|F101A HemS
</gallery>

== Example of Transformation to a Homologue ==

=== [Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH] ===

This is a more complex problem than the single-point mutation described above. HemS and ChuS are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition to that, ChuS is two residues shorter than HemS, with one such deletion occurring approximately midway through the chain, and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

First, we need to align and compare the two systems.

==== wt HemS + haem + NADH ====

NSERILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP
LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP
ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU
VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG
HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU
GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO
GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU
ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS
ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN
VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE
MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL
THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU
THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER
LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYSHEM NAD

==== wt ChuS + haem + NADH ====

NASNHIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP
ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP
ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU
THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG
GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP
ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR
PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR
VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS
ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS
VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE
MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL
VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU
GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER
LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPROXXX HEM NAD

Here, red font signifies a residue which has been mutated, green marks a residue which has been deleted (shown with the placeholder code XXX at its position in the original protein), and purple represents the two cofactors.

We now need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define all the mutations/insertions/deletions we are making, and where in the chain they take place. In this example, we are making 113 mutations/insertions/deletions (to be precise, 111 mutations and 2 deletions) in total:

*'''nresidueslog''' for 113 mutations, this is:
<pre>
113
</pre>
*'''resnumberlog''' this is a list of the indices of the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
2
4
.
.
336
337
338
</pre>
*'''atomnumberlog''' this is a list of the indices of the first atoms in the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
14
54
.
.
5307
5321
5335
</pre>
*'''oldreslog''' this is the residue code for the residues we are mutating from. If a code has only three letters (i.e. the residue is not terminal, so the code has no N or C prefix), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NSER
ILE
GLU
.
.
ASN
ASN
CLYS
</pre>
*'''newreslog''' this is the residue code for the residues we are mutating to. If a code has only three letters (i.e. the residue is not terminal, so the code has no N or C prefix), ensure that a space is put after it so that the code is four characters overall. If we are deleting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NASN
HIE
THR
.
.
ILE
CPRO
XXX
</pre>
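
With over a hundred changes, compiling resnumberlog, oldreslog and newreslog by eye is error-prone. The sketch below shows one way the two aligned listings above could be compared automatically. The function names are hypothetical, the four-character fixed-width layout is inferred from the listings, and insertions (XXX in the original sequence) are deliberately rejected because their indexing is not covered here.

```python
def split_codes(line):
    """Split a fixed-width line of four-character residue codes."""
    line = line.rstrip()
    return [line[i:i + 4].strip() for i in range(0, len(line), 4)]


def diff_aligned(old_codes, new_codes):
    """Return (residue_index, old_code, new_code) for every differing position.

    Indices are 1-based positions in the original sequence, which matches
    the alignment position as long as the original contains no XXX gaps.
    """
    if len(old_codes) != len(new_codes):
        raise ValueError("aligned sequences must have the same length")
    if any(code.endswith("XXX") for code in old_codes):
        raise ValueError("insertions are not handled in this sketch")
    return [(index, old, new)
            for index, (old, new) in enumerate(zip(old_codes, new_codes), start=1)
            if old != new]


# First five residues of the HemS and ChuS listings above.
old = split_codes("NSERILE TYR GLU GLN")
new = split_codes("NASNHIE TYR THR ARG")
changes = diff_aligned(old, new)
```

The first three entries reproduce the opening lines of resnumberlog, oldreslog and newreslog shown above; the codes still need padding to four characters before being written out.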

=== pathdata and odata.checksp files ===

Same as for the mutation example above. I would expect, for my example here with ChuS, a pathdata file like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5464

AMBER12
</pre>

== Final Notes ==

We therefore now have a method with which to transform the stationary points of a database of interest into those of a mutant or close homologue. This should allow a wild type to be compared directly against mutants of interest, or one protein against a homologue. In the examples above, I have been interested in comparing the pathway of a cofactor as it enters a pocket against other pathways in which key residues have been removed, and against a homologue of suspected similar reactivity.

The advantage of this method is that time and computational resources can be significantly saved in transforming one database to a system of similar properties.

It must be noted, however, that following the use of CHECKSPMUTATE, the new pathway might have gaps, as not all of the stationary points will necessarily have converged. These gaps will need to be filled using the [[CONNECTPAIRS]] keyword. The full procedure that I used to fill such gaps is detailed in [[Pathway Gap Filling Post-CHECKSPMUTATE]].

IT MUST ALSO BE NOTED that the pathway for the new system will not necessarily be (and is indeed unlikely to be) the optimal one. Therefore, post CHECKSPMUTATE, an appropriate number of rounds of [[SHORTCUT]]/[[SHORTCUT 2 BARRIER]]/[[UNTRAP]] etc will need to be performed. It is not advised that CHECKSPMUTATE be used on proteins which are significantly different.

=== Potential Uses ===

CHECKSPMUTATE could potentially be used to shed light on some of the following problems:

* Comparing a wt protein against some mutations to determine how such mutations affect protein folding.
* Comparing a wt protein against some mutations to determine how such mutations affect protein-cofactor interactions.
* Comparing a protein against a close homologue to see whether it engages in the same manner of protein folding.
* Comparing a protein against a close homologue to see whether it engages in similar protein-cofactor interactions.
* Transformation of a cofactor (for example from NADH to NADPH) to see whether this affects protein-cofactor interactions.
* Comparing a DNA strand against an RNA/XNA strand. Note that this would require some different python scripts, also found in '''/svn/SCRIPTS/AMBER/BHmutation_steps''', such as '''nucleic_acids.py'''.

=== Acknowledgements ===

Though adk44 wrote the code for CHECKSPMUTATE, he interfaced this code with kr366's pre-existent python scripts (with minor alterations) which had originally been written to allow mutations as part of kr366's [[Mutational BH steps]] routine.

--adk44 16.45, 2 June 2020 (BST)

CHECKSPMUTATE

2020-06-04T09:00:31Z

Adk44: /* Preparation */

== Purpose ==

It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as a fully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.

What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.

CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing for user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, eg, [[SHORTCUT]], [[SHORTCUT 2 BARRIER]] and [[UNTRAP]], this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).

== Preparation ==

Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory. Note that the python scripts can be found in '''/svn/SCRIPTS/AMBER/BHmutation_steps'''.

*'''aa_ringdata.pyc''' list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
*'''amino_acids.pyc''' list of parameters for all residues.
*'''atomnumberlog''' list of indices of the first atom of the residues to be mutated.
*'''coordinates_mut.pyc''' script which mutates the selected residue.
*'''coords.inpcrd''' ensure this is for the system we are CHANGING TO.
*'''coords.mdcrd''' for use with min.in, may not be required.
*'''coords.prmtop''' ensure this is for the system we are CHANGING TO.
*'''min.A''' needs to be present, although the index listed in it is unimportant
*'''min.B''' needs to be present, although the index listed in it is unimportant
*'''min.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''min.in''' for use with [[AMBER]]. Defines certain aspects of the model, such as solvent being used.
*'''mutate_aa.py''' organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
*'''newreslog''' list of codes for residues which we are CHANGING TO.
*'''nresidueslog''' list of the total number of mutations to be made to the system.
*'''odata.checksp''' list of conditions for optimisation carried out on each stationary point by [[OPTIM]].
*'''oldreslog''' list of codes for residues which we are CHANGING FROM.
*'''original_protein.pdb''' pdb file for the system we are CHANGING FROM. All lines which do not describe an ATOM (e.g. TITLE, TER and END) are removed, so that the number of lines in the file corresponds to the number of atoms in the system.
*'''pathdata''' organises [[OPTIM]] jobs. Certain keywords are required, described below.
*'''perm.allow''' full description of the groups of permutable atoms in the system we are CHANGING TO.
*'''points.min''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''points.ts''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''resnumberlog''' list of indices of the residues to be mutated.
*'''ts.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''submission_script''' script for executing your binary.
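
Given the length of the list above, a quick check that nothing has been forgotten can save a failed submission. The sketch below is an illustration only. The function name is hypothetical, and the choice of which files to treat as mandatory is an assumption: aa_ringdata.pyc and coords.mdcrd are left out because the list marks them as conditional, and the submission script is left out because its name is up to you.

```python
import os

# File names taken from the list above (conditional ones omitted).
REQUIRED = (
    "amino_acids.pyc", "atomnumberlog", "coordinates_mut.pyc",
    "coords.inpcrd", "coords.prmtop", "min.A", "min.B", "min.data",
    "min.in", "mutate_aa.py", "newreslog", "nresidueslog",
    "odata.checksp", "oldreslog", "original_protein.pdb", "pathdata",
    "perm.allow", "points.min", "points.ts", "resnumberlog", "ts.data",
)


def missing_inputs(present_names):
    """Return the required file names absent from present_names."""
    present = set(present_names)
    return [name for name in REQUIRED if name not in present]


# Check the current working directory.
missing = missing_inputs(os.listdir("."))
```

This only confirms that the files exist; whether each one belongs to the system we are changing from or changing to still has to be checked by hand.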

== Example of Mutation ==

=== [Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH] ===

To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using [[DIJKSTRA]] and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.

*'''nresidueslog''' as we are only making one mutation, this is simply:
<pre>
1
</pre>
*'''resnumberlog''' this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
101
</pre>
*'''atomnumberlog''' this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
1569
</pre>
*'''oldreslog''' this is the residue code for the residue we are mutating from. If the code has only three letters (i.e. the residue is not terminal, so the code has no N or C prefix), ensure that a space is put after it so that the code is four characters overall.
<pre>
PHE
</pre>
*'''newreslog''' this is the residue code for the residue we are mutating to. If the code has only three letters (i.e. the residue is not terminal, so the code has no N or C prefix), ensure that a space is put after it so that the code is four characters overall.
<pre>
ALA
</pre>

=== pathdata and odata.checksp files ===

To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.

When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.

An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5491

AMBER12
</pre>

The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!

=== Before and After Mutation ===

Here is a rendering of [wt HemS + haem + NADH], alongside a rendering of the same minimum reoptimised after the F101A mutation was made.

<gallery mode=packed widths="540px" heights="360px">
File:wt_HemS_wiki.png|wt HemS
File:F101A_Hems_wiki.png|F101A HemS
</gallery>

== Example of Transformation to a Homologue ==

=== [Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH] ===

This is a more complex problem than the single-point mutation described above. HemS and ChuS are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition to that, ChuS is two residues shorter than HemS, with one such deletion occurring approximately midway through the chain, and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

First, we need to align and compare the two systems.

==== wt HemS + haem + NADH ====

NSERILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP
LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP
ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU
VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG
HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU
GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO
GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU
ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS
ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN
VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE
MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL
THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU
THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER
LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYSHEM NAD

==== wt ChuS + haem + NADH ====

NASNHIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP
ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP
ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU
THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG
GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP
ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR
PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR
VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS
ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS
VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE
MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL
VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU
GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER
LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPROXXX HEM NAD

Here, red font signifies a residue which has been mutated, green marks a residue which has been deleted (shown with the placeholder code XXX at its position in the original protein), and purple represents the two cofactors.

We now need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define all the mutations/insertions/deletions we are making, and where in the chain they take place. In this example, we are making 113 mutations/insertions/deletions (to be precise, 111 mutations and 2 deletions) in total:

*'''nresidueslog''' for 113 mutations, this is:
<pre>
113
</pre>
*'''resnumberlog''' this is a list of the indices of the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
2
4
.
.
336
337
338
</pre>
*'''atomnumberlog''' this is a list of the indices of the first atoms in the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
14
54
.
.
5307
5321
5335
</pre>
*'''oldreslog''' this is the residue code for the residues we are mutating from. If a code has only three letters (i.e. the residue is not terminal, so the code has no N or C prefix), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NSER
ILE
GLU
.
.
ASN
ASN
CLYS
</pre>
*'''newreslog''' this is the residue code for the residues we are mutating to. If a code has only three letters (i.e. the residue is not terminal, so the code has no N or C prefix), ensure that a space is put after it so that the code is four characters overall. If we are deleting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NASN
HIE
THR
.
.
ILE
CPRO
XXX
</pre>
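
The entries of atomnumberlog can be read off original_protein.pdb, since that file holds one line per atom. The following is a sketch under stated assumptions: the function name is hypothetical, the file is taken to follow the standard fixed-column PDB layout (residue sequence number in columns 23 to 26), and residue numbers are assumed to run consecutively without restarting.

```python
def first_atom_indices(atom_lines):
    """Map residue number -> 1-based index of that residue's first atom.

    atom_lines: the lines of a PDB file containing atom records only,
    so that the line number equals the atom index.
    """
    first = {}
    for atom_index, line in enumerate(atom_lines, start=1):
        residue_number = int(line[22:26])  # PDB columns 23-26
        first.setdefault(residue_number, atom_index)
    return first


# Made-up three-atom fragment in fixed-column PDB format.
lines = [
    "ATOM      1  N   SER     1",
    "ATOM      2  CA  SER     1",
    "ATOM      3  N   ILE     2",
]
first = first_atom_indices(lines)
# atomnumberlog entries for the residues listed in resnumberlog:
atom_log = [first[residue] for residue in (1, 2)]
```

Looking up each index from resnumberlog in the returned dictionary gives the corresponding atomnumberlog entry, both with respect to the original system.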

=== pathdata and odata.checksp files ===

Same as for the mutation example above. I would expect, for my example here with ChuS, a pathdata file like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5464

AMBER12
</pre>

== Final Notes ==

We therefore now have a method with which to transform the stationary points of a database of interest into those of a mutant or close homologue. This should allow a wild type to be compared directly against mutants of interest, or one protein against a homologue. In the examples above, I have been interested in comparing the pathway of a cofactor as it enters a pocket against other pathways in which key residues have been removed, and against a homologue of suspected similar reactivity.

The advantage of this method is that time and computational resources can be significantly saved in transforming one database to a system of similar properties.

It must be noted, however, that following the use of CHECKSPMUTATE, the new pathway might have gaps, as not all of the stationary points will necessarily have converged. These gaps will need to be filled using the [[CONNECTPAIRS]] keyword. The full procedure that I used to fill such gaps is detailed in [[Pathway Gap Filling Post-CHECKSPMUTATE]].

IT MUST ALSO BE NOTED that the pathway for the new system will not necessarily be (and is indeed unlikely to be) the optimal one. Therefore, post CHECKSPMUTATE, an appropriate number of rounds of [[SHORTCUT]]/[[SHORTCUT 2 BARRIER]]/[[UNTRAP]] etc will need to be performed. It is not advised that CHECKSPMUTATE be used on proteins which are significantly different.

=== Potential Uses ===

CHECKSPMUTATE could potentially be used to shed light on some of the following problems:

* Comparing a wt protein against some mutations to determine how such mutations affect protein folding.
* Comparing a wt protein against some mutations to determine how such mutations affect protein-cofactor interactions.
* Comparing a protein against a close homologue to see whether it engages in the same manner of protein folding.
* Comparing a protein against a close homologue to see whether it engages in similar protein-cofactor interactions.
* Transformation of a cofactor (for example from NADH to NADPH) to see whether this affects protein-cofactor interactions.

=== Acknowledgements ===

Though adk44 wrote the code for CHECKSPMUTATE, he interfaced this code with kr366's pre-existent python scripts (with minor alterations) which had originally been written to allow mutations as part of kr366's [[Mutational BH steps]] routine.

--adk44 16.45, 2 June 2020 (BST)

CHECKSPMUTATE

2020-06-04T08:57:10Z

Adk44:

== Purpose ==

It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as a fully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.

What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.

CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing for user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, eg, [[SHORTCUT]], [[SHORTCUT 2 BARRIER]] and [[UNTRAP]], this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).

== Preparation ==

Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory:

*'''aa_ringdata.pyc''' list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
*'''amino_acids.pyc''' list of parameters for all residues.
*'''atomnumberlog''' list of indices of the first atom of the residues to be mutated.
*'''coordinates_mut.pyc''' script which mutates the selected residue.
*'''coords.inpcrd''' ensure this is for the system we are CHANGING TO.
*'''coords.mdcrd''' for use with min.in, may not be required.
*'''coords.prmtop''' ensure this is for the system we are CHANGING TO.
*'''min.A''' needs to be present, although the index listed in it is unimportant
*'''min.B''' needs to be present, although the index listed in it is unimportant
*'''min.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''min.in''' for use with [[AMBER]]. Defines certain aspects of the model, such as solvent being used.
*'''mutate_aa.py''' organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
*'''newreslog''' list of codes for residues which we are CHANGING TO.
*'''nresidueslog''' list of the total number of mutations to be made to the system.
*'''odata.checksp''' list of conditions for optimisation carried out on each stationary point by [[OPTIM]].
*'''oldreslog''' list of codes for residues which we are CHANGING FROM.
*'''original_protein.pdb''' pdb file for the system we are CHANGING FROM. All lines which do not describe an ATOM (e.g. TITLE, TER and END) are removed, so that the number of lines in the file corresponds to the number of atoms in the system.
*'''pathdata''' organises [[OPTIM]] jobs. Certain keywords are required, described below.
*'''perm.allow''' full description of the groups of permutable atoms in the system we are CHANGING TO.
*'''points.min''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''points.ts''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''resnumberlog''' list of indices of the residues to be mutated.
*'''ts.data''' ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using [[DIJKSTRA]].
*'''submission_script''' script for executing your binary.
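
As a rough illustration of how the entries in atomnumberlog relate to original_protein.pdb, here is a minimal Python sketch. It is not one of the scripts in the svn repository; the function name is hypothetical, and it assumes standard fixed-column PDB formatting (residue number in columns 23-26) and that only lines beginning with ATOM count as atoms.

```python
def first_atom_indices(pdb_lines):
    """Map residue number -> 1-based index of the first ATOM record of that residue.

    Assumption: only lines starting with 'ATOM' are atoms; everything else
    (TITLE, TER, END, ...) is skipped, as in the cleaned original_protein.pdb.
    """
    first = {}
    atom_index = 0
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atom_index += 1
        resnum = int(line[22:26])  # residue sequence number, PDB columns 23-26
        first.setdefault(resnum, atom_index)  # keep only the first atom seen
    return first
```

For a single mutation at residue 101, the value to put in atomnumberlog would then be `first_atom_indices(open("original_protein.pdb"))[101]`, assuming the file follows the layout described above.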

== Example of Mutation ==

=== [Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH] ===

To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using [[DIJKSTRA]] and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.

*'''nresidueslog''' as we are only making one mutation, this is simply:
<pre>
1
</pre>
*'''resnumberlog''' this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
101
</pre>
*'''atomnumberlog''' this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
<pre>
1569
</pre>
*'''oldreslog''' this is the residue code for the residue we are mutating from. If the code is only three letters long (i.e. the residue is not terminal, so its code is not prefixed by N or C), ensure that a space is put after it so that the code is four characters overall.
<pre>
PHE
</pre>
*'''newreslog''' this is the residue code for the residue we are mutating to. If the code is only three letters long (i.e. the residue is not terminal, so its code is not prefixed by N or C), ensure that a space is put after it so that the code is four characters overall.
<pre>
ALA
</pre>
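
The five files above are small enough to write by hand, but for several mutations it may be easier to generate them. The following is only a sketch under my own assumptions: the function names are hypothetical, and it assumes one entry per line, as in the examples shown, with three-letter residue codes padded to four characters by a trailing space.

```python
from pathlib import Path


def pad_code(code):
    """Pad a residue code with trailing spaces so that it is four characters long."""
    if not 3 <= len(code) <= 4:
        raise ValueError("expected a 3- or 4-character residue code, got %r" % code)
    return code.ljust(4)


def write_mutation_logs(directory, mutations):
    """Write nresidueslog, resnumberlog, atomnumberlog, oldreslog and newreslog.

    mutations: list of (residue_index, first_atom_index, old_code, new_code),
    with both indices given with respect to the original system.
    """
    d = Path(directory)
    columns = {
        "resnumberlog": [str(m[0]) for m in mutations],
        "atomnumberlog": [str(m[1]) for m in mutations],
        "oldreslog": [pad_code(m[2]) for m in mutations],
        "newreslog": [pad_code(m[3]) for m in mutations],
    }
    (d / "nresidueslog").write_text("%d\n" % len(mutations))
    for name, entries in columns.items():
        (d / name).write_text("\n".join(entries) + "\n")
```

For the F101A example, the call would be `write_mutation_logs(".", [(101, 1569, "PHE", "ALA")])`.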

=== pathdata and odata.checksp files ===

To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.

When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.

An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5491

AMBER12
</pre>

The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!
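
Before submitting, it can be worth checking that the mutation-related keywords described above are all present in pathdata. This is a minimal sketch, not part of PATHSAMPLE: the function names are hypothetical, and it assumes pathdata holds one keyword per line, with blank lines and lines starting with ! ignored.

```python
def parse_pathdata(text):
    """Return {KEYWORD: [arguments]} for the contents of a pathdata file."""
    keywords = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        parts = line.split()
        keywords[parts[0].upper()] = parts[1:]
    return keywords


def missing_mutation_keywords(keywords, has_cofactors):
    """List the keywords that a CHECKSPMUTATE run needs but pathdata lacks."""
    required = ["CHECKSP_MUT", "NATOMS_NEW"]
    if has_cofactors:
        required.append("NATOMS_CHAIN")  # only needed when cofactors follow the chain
    missing = [k for k in required if k not in keywords]
    if "CHECKMIN" not in keywords and "CHECKTS" not in keywords:
        missing.append("CHECKMIN or CHECKTS")
    return missing
```

An empty list from `missing_mutation_keywords(parse_pathdata(open("pathdata").read()), has_cofactors=True)` means that nothing obvious has been forgotten; it says nothing about whether the values themselves are correct.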

=== Before and After Mutation ===

Here is a rendering of [wt HemS + haem + NADH], alongside a rendering of the same minimum reoptimised after the F101A mutation was made.

<gallery mode=packed widths="540px" heights="360px">
File:wt_HemS_wiki.png|wt HemS
File:F101A_Hems_wiki.png|F101A HemS
</gallery>

== Example of Transformation to a Homologue ==

=== [Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH] ===

This is a more complex problem than the single-point mutation described above. HemS and ChuS are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition to that, ChuS is two residues shorter than HemS, with one such deletion occurring approximately midway through the chain, and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in [[Preparing an AMBER topology file for a protein plus ligand system]] and [[Symmetrising AMBER topology files]]. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.

=== Making auxiliary files ===

First, we need to align and compare the two systems.

==== wt HemS + haem + NADH ====

NSER ILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP
LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP
ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU
VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG
HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU
GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO
GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU
ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS
ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN
VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE
MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL
THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU
THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER
LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYS HEM NAD

==== wt ChuS + haem + NADH ====

NASN HIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP
ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP
ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU
THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN
GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU
PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG
GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP
ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR
PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR
VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS
ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS
VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE
MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL
VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU
GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER
LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU
GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPRO XXX HEM NAD

Here, red font signifies a residue which has been mutated, green signifies a residue which has been deleted (given the code name XXX and shown at its position in the original protein), and purple signifies the two cofactors.

We now need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define all the mutations/insertions/deletions we are making, and where in the chain they take place. In this example, we are making 113 mutations/insertions/deletions (to be precise, 111 mutations and 2 deletions) in total:

*'''nresidueslog''' for 113 mutations, this is:
<pre>
113
</pre>
*'''resnumberlog''' this is a list of the indices of the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
2
4
.
.
336
337
338
</pre>
*'''atomnumberlog''' this is a list of the indices of the first atoms in the residues we are mutating. In cases where the number of residues changes, this index is always with respect to the original system.
<pre>
1
14
54
.
.
5307
5321
5335
</pre>
*'''oldreslog''' this is the residue code for the residues we are mutating from. If a code is only three letters long (i.e. the residue is not terminal, so its code is not prefixed by N or C), ensure that a space is put after it so that the code is four characters overall. If we are inserting a residue then the residue code is XXX or NXXX or CXXX.
<pre>
NSER
ILE
GLU
.
.
ASN
ASN
CLYS
</pre>
*'''newreslog''' this is the residue code for the residues we are mutating to. If a code is only three letters long (i.e. the residue is not terminal, so its code is not prefixed by N or C), ensure that a space is put after it so that the code is four characters overall. If we are deleting a residue (as with the final residue in this example) then the residue code is XXX or NXXX or CXXX.
<pre>
NASN
HIE
THR
.
.
ILE
CPRO
XXX
</pre>
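
With over a hundred changes, picking out the differing positions by eye is error-prone. The following sketch compares two aligned lists of residue codes and returns the entries for resnumberlog, oldreslog and newreslog. It is only an illustration: the function name is hypothetical, it assumes the two lists are already aligned to the same length with XXX marking deleted residues in the new system, and it does not attempt to handle insertions.

```python
def diff_aligned(old_codes, new_codes):
    """Return (residue_index, old_code, new_code) for each aligned position that differs.

    Indices are 1-based and refer to the original system. Deletions appear as
    'XXX' in new_codes. Insertions ('XXX' in old_codes) are not handled here.
    """
    if len(old_codes) != len(new_codes):
        raise ValueError("aligned sequences must have the same length")
    changes = []
    for index, (old, new) in enumerate(zip(old_codes, new_codes), start=1):
        if "XXX" in old:
            raise ValueError("insertions are not handled in this sketch")
        if old != new:
            changes.append((index, old, new))
    return changes


# First four and last residue of the HemS/ChuS alignment shown above,
# used purely as a shortened illustration.
example_old = ["NSER", "ILE", "TYR", "GLU", "CLYS"]
example_new = ["NASN", "HIE", "TYR", "THR", "XXX"]
example_changes = diff_aligned(example_old, example_new)
```

The length of the returned list is the number to put in nresidueslog. The corresponding atomnumberlog entries still have to be looked up in original_protein.pdb.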

=== pathdata and odata.checksp files ===

Same as for the mutation example above. I would expect, for my example here with ChuS, a pathdata file like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CHECKMIN 1 1235
CYCLES 1235
CHECKSP_MUT
NATOMS_NEW 5464

AMBER12
</pre>

== Final Notes ==

We therefore now have a method with which to transform the stationary points of a database of interest into those of a mutant or close homologue. This should allow a wild type to be compared directly against mutants of interest, or two protein homologues against each other. In the examples above, I have been interested in comparing the pathway of a cofactor as it enters a pocket against other pathways in which key residues have been removed, and against a homologue of suspected similar reactivity.

The advantage of this method is that time and computational resources can be significantly saved in transforming one database to a system of similar properties.

It must be noted, however, that following the use of CHECKSPMUTATE, the new pathway might have gaps, as not all of the stationary points will necessarily have converged. These gaps will need to be filled with the [[CONNECTPAIRS]] keyword. The full procedure that I used to fill such gaps is detailed in [[Pathway Gap Filling Post-CHECKSPMUTATE]].

IT MUST ALSO BE NOTED that the pathway for the new system will not necessarily be (and is indeed unlikely to be) the optimal one. Therefore, post CHECKSPMUTATE, an appropriate number of rounds of [[SHORTCUT]]/[[SHORTCUT 2 BARRIER]]/[[UNTRAP]] etc. will need to be performed. It is not advised that CHECKSPMUTATE be used on proteins which are significantly different.

=== Potential Uses ===

CHECKSPMUTATE could potentially be used to shed light on some of the following problems:

* Comparing a wt protein against some mutations to determine how such mutations affect protein folding.
* Comparing a wt protein against some mutations to determine how such mutations affect protein-cofactor interactions.
* Comparing a protein against a close homologue to see whether it engages in the same manner of protein folding.
* Comparing a protein against a close homologue to see whether it engages in similar protein-cofactor interactions.
* Transformation of a cofactor (for example from NADH to NADPH) to see whether this affects protein-cofactor interactions.

=== Acknowledgements ===

Though adk44 wrote the code for CHECKSPMUTATE, he interfaced this code with kr366's pre-existent python scripts (with minor alterations) which had originally been written to allow mutations as part of kr366's [[Mutational BH steps]] routine.

--adk44 16.45, 2 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-03T17:18:31Z

Adk44: /* Cleanining Up the Pathway */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method listed below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run the [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] (each job is assigned a random number), it is possible for two or more [[OPTIM]] jobs to be assigned the same random number within the same [[PATHSAMPLE]] batch. When this happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported a problem with it before (I can only assume that nobody else has run into it, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose files were overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, 14 were found to have been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.
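
The idea behind identifying the overwritten stationary points can be sketched as follows. This is not the code in duplicates.py; it is a minimal illustration which assumes that the (stationary point index, random number) pairs have already been extracted from the PATHSAMPLE output in launch order, and that all of them belong to the same batch. The function name is hypothetical.

```python
from collections import defaultdict


def overwritten_jobs(assignments):
    """Return the indices of stationary points whose output was overwritten.

    assignments: iterable of (stationary_point_index, random_number) pairs in
    launch order. When several jobs share a random number, every job except
    the last one with that number is taken to have been overwritten.
    """
    by_number = defaultdict(list)
    for index, number in assignments:
        by_number[number].append(index)
    lost = []
    for indices in by_number.values():
        lost.extend(indices[:-1])  # all but the final job survive no longer
    return sorted(lost)
```

The returned indices correspond to the stationary points that would need to be listed for reoptimisation.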

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a file of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs; it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also, required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
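
The concatenation step itself is straightforward. The sketch below is not the organise_mindatainfo_min.sh script; it simply illustrates joining a list of files in a chosen order while recording any that are absent. The function name is hypothetical, and working out the correct order of the min.data.info.**** files (so that it matches the original pathway) is assumed to have been done beforehand.

```python
from pathlib import Path


def concatenate_in_order(ordered_paths, output_path):
    """Concatenate files in the given order and return the names of missing ones.

    A missing file is skipped rather than treated as an error, on the
    assumption that it belongs to a reoptimisation which produced no output.
    """
    missing = []
    with open(output_path, "w") as out:
        for path in map(Path, ordered_paths):
            if not path.is_file():
                missing.append(path.name)
                continue
            text = path.read_text()
            if text and not text.endswith("\n"):
                text += "\n"  # keep entries from running into one another
            out.write(text)
    return missing
```

The list of missing names gives a quick indication of how many stationary points are absent from min.data.info.total.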

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts each of the transition states in the database, moves it to a subdirectory named with the index at which it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to it. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the minima obtained in the readmin directory have not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
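
Counting the connections is trivial, but for completeness here is a small sketch. The function name is hypothetical, and it assumes that each non-blank line of connectfile describes one connection attempt. How the count translates into a value for CYCLES is left to the user.

```python
def count_connection_attempts(connectfile_text):
    """Count the non-blank lines in the contents of a connectfile."""
    return sum(1 for line in connectfile_text.splitlines() if line.strip())
```

For example, `count_connection_attempts(open("connectfile").read())` gives the number of connections being attempted.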

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the given number of connection attempts that have been specified. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima have been grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
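
The ordering and skipping logic described above can be sketched in a few lines of Python. This is not the implementation described in [[Connecting Sub-databases]]; it is a minimal illustration using a union-find structure. The function name and the distance and try_connect callbacks are hypothetical stand-ins for the real distance calculation and connection attempt.

```python
def connect_subdatabases(groups, distance, try_connect):
    """Attempt connections between sub-databases, shortest distance first.

    groups: list of lists of minima ids, one list per sub-database.
    distance(a, b) -> float; try_connect(a, b) -> True if a path was found.
    Returns the pairs for which a connection was attempted, in order.
    """
    # Union-find: every minimum starts out pointing at its group's first member.
    parent = {m: members[0] for members in groups for m in members}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    # All pairs of minima drawn from two different sub-databases.
    pairs = [(distance(a, b), a, b)
             for i in range(len(groups))
             for j in range(i + 1, len(groups))
             for a in groups[i]
             for b in groups[j]]
    pairs.sort(key=lambda p: p[0])

    attempted = []
    for _, a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # these sub-databases are already connected, so skip
        attempted.append((a, b))
        if try_connect(a, b):
            parent[root_a] = root_b  # merge the two sub-databases
    return attempted
```

If an attempt fails, the two sub-databases stay separate, and the next-shortest pair between them is tried when it comes up in the sorted list.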

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
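The cutoff rule above can be sketched in a few lines of Python. This is only an illustration and not what find_connections.sh actually does: the function name, the dictionary layout (file name mapped to a list of minima indices) and the example indices are all my own assumptions.

```python
def select_relevant(sub_databases, endpoints, cutoff=10):
    """Pick the sub-databases worth keeping.

    sub_databases: dict mapping a (hypothetical) connected_* file name to
                   the list of minima indices it contains.
    endpoints:     the minima at the two ends of the pathway of interest.
    A sub-database is kept if it has at least `cutoff` minima, or if it
    contains one of the endpoints regardless of its size.
    """
    wanted = set(endpoints)
    relevant = {}
    for name, minima in sub_databases.items():
        if len(minima) >= cutoff or wanted & set(minima):
            relevant[name] = minima
    return relevant


# Hypothetical example: the second group is tiny but holds endpoint 40.
groups = {
    "connected_1": list(range(1, 31)),
    "connected_2": [40, 41],
    "connected_3": [50, 51, 52],
}
print(sorted(select_relevant(groups, endpoints=(1, 40))))
```

The endpoint test is what stops a small sub-database containing the start or end of the pathway from being thrown away by the size cutoff.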

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the relevant_connected directory. To do this, the MERGEDB keyword is added to pathdata within relevant_connected. To merge the information contained in the subdirectory 00025_00102, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
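Editing pathdata by hand for every subdirectory gets tedious, so the repetition could be scripted. Below is a minimal sketch, assuming the subdirectories all follow the five-digit pattern of 00025_00102 and that pathdata already contains one MERGEDB line. Both helper names are hypothetical, and running PATHSAMPLE after each rewrite is left to you.

```python
import re


def is_pair_directory(name):
    """True for names shaped like 00025_00102 (assumed naming pattern)."""
    return re.fullmatch(r"\d{5}_\d{5}", name) is not None


def pathdata_for_merge(template, subdirectory):
    """Return pathdata text with its MERGEDB line pointing at `subdirectory`.

    Lines commented out with '!' are left untouched, as are blank lines.
    """
    lines = []
    replaced = False
    for line in template.splitlines():
        if line.split()[:1] == ["MERGEDB"]:
            lines.append(f"MERGEDB ./{subdirectory}")
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        raise ValueError("template has no MERGEDB line")
    return "\n".join(lines) + "\n"


# Hypothetical, shortened template.
template = "NATOMS 5464\n\nMERGEDB ./00025_00102\n\nAMBER12\n"
print(pathdata_for_merge(template, "00031_00200"))
```

In practice one would loop over the entries of the relevant_connected directory, keep those for which is_pair_directory is true, write the returned text to pathdata and then run the PATHSAMPLE binary once per subdirectory.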

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles with this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-03T17:18:00Z

Adk44: /* readmin/readts */

== Introduction ==

This is a recommended procedure to follow after running [[CHECKSPMUTATE]] when the database being reoptimised describes a pathway.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method listed below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in the checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible for two or more [[OPTIM]] jobs to be assigned the same random number within the same [[PATHSAMPLE]] batch. When this happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into this problem, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the overwritten stationary points. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), 14 of the 1235 minima which were reoptimised had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.
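The idea behind the duplicate search can be illustrated as follows. This is a sketch only, not the contents of duplicates.py: I am assuming the log has already been parsed into (minimum index, job number) pairs in launch order, and the function name and the example numbers are made up. It also ignores batch boundaries, which the real scripts may need to respect.

```python
from collections import defaultdict


def find_overwritten(assignments):
    """Return the minima whose OPTIM output was overwritten.

    assignments: list of (minimum_index, job_number) pairs in launch order.
    When a job number is reused, every earlier holder of that number is
    treated as overwritten; only the last job launched keeps its output.
    """
    by_job = defaultdict(list)
    for minimum, job in assignments:
        by_job[job].append(minimum)
    overwritten = []
    for minima in by_job.values():
        overwritten.extend(minima[:-1])  # all but the last are lost
    return sorted(overwritten)


# Hypothetical launch log: job number 4821 is handed out twice.
log = [(1, 4821), (2, 1177), (3, 4821), (4, 9034)]
print(find_overwritten(log))  # prints [1]
```

The returned indices play the role of checkminfile: they are the minima that need to be sent to reoptimise.sh.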

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also, required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
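For anyone adapting the procedure, the ordering-and-concatenation step could look something like the sketch below. It is not a copy of organise_mindatainfo_min.sh: I am assuming you have already worked out which min.data.info.**** file belongs to which original minimum, and stored that as two dictionaries (first pass and redone overwritten minima). The function names are hypothetical.

```python
def ordered_paths(first_pass, redone):
    """Merge two {minimum_index: path} maps and sort by minimum index.

    Entries from `redone` (the reoptimised overwritten minima) take
    precedence if an index appears in both maps. Minima that never
    converged simply have no entry and are skipped.
    """
    merged = dict(first_pass)
    merged.update(redone)
    return [merged[index] for index in sorted(merged)]


def concatenate_in_order(paths, out_path):
    """Write the contents of `paths`, in the given order, to `out_path`."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                text = handle.read()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep records on separate lines
```

Calling concatenate_in_order(ordered_paths(first_pass, redone), "min.data.info.total") would then produce a file in which the stationary points appear in their original order.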

==== Note on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
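The renaming can of course be done by hand with mv. If you would rather script it with a guard against clobbering existing TS files, a minimal Python sketch (the function name is my own) is:

```python
import os

# READMIN writes minima-style files; these are the names we want instead.
RENAMES = (("points.min", "points.ts"), ("min.data", "ts.data"))


def rename_readmin_output(directory="."):
    """Rename READMIN output in `directory` so it serves as a TS database.

    Both renames are checked before either is performed, so the directory
    is never left half-renamed.
    """
    pairs = [
        (os.path.join(directory, old), os.path.join(directory, new))
        for old, new in RENAMES
    ]
    for source, target in pairs:
        if not os.path.exists(source):
            raise FileNotFoundError(source)
        if os.path.exists(target):
            raise FileExistsError(target)
    for source, target in pairs:
        os.rename(source, target)
```

Run it as rename_readmin_output("readts") (or with whatever your readts directory is called) after the READMIN calculation has finished.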

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts each transition state in the database, moves it to a subdirectory named after its index in ts.data, and then performs the descent calculations to determine the two minima directly connected to it. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the minima obtained in the readmin directory have not yet been added; this will come later).

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.
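To give a feel for what "identifying the gaps" means, here is one plausible way of doing it. I stress that this is an assumption-laden sketch and not necessarily what det_connections.sh and the generate_connectfile scripts do: it assumes the transition states are listed in pathway order and that the two minima each TS connects have already been read out of ts.data. The function name is hypothetical.

```python
def find_gaps(ts_minima):
    """Locate breaks in a pathway.

    ts_minima: list of (min_a, min_b) pairs, one per transition state, in
               the order the transition states appear along the pathway.
    Two consecutive transition states are joined if they share a minimum.
    Returns (position, pair_here, pair_next) for every place where
    consecutive transition states share no minimum.
    """
    gaps = []
    for position in range(len(ts_minima) - 1):
        here = ts_minima[position]
        there = ts_minima[position + 1]
        if not set(here) & set(there):
            gaps.append((position, here, there))
    return gaps


# Hypothetical pathway: the second and third TSs share no minimum.
pathway = [(1, 2), (2, 3), (5, 6), (6, 7)]
print(find_gaps(pathway))  # prints [(1, (2, 3), (5, 6))]
```

Each reported gap then has to be turned into a pair of minima for connectfile. Which minimum to pick from each side is a choice this sketch deliberately leaves open.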

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
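A quick way of getting that number is to count the non-blank lines of connectfile. The helper below is a small sketch with a name of my own choosing. It accepts any iterable of lines, so an open file handle works too.

```python
def count_connection_attempts(lines):
    """Count the non-blank lines, i.e. the connection attempts requested."""
    return sum(1 for line in lines if line.strip())


# Hypothetical connectfile contents with a stray blank line.
example = ["12 87\n", "\n", "301 644\n"]
print(count_connection_attempts(example))  # prints 2
```

With a real file this would be used as `with open("connectfile") as handle: print(count_connection_attempts(handle))`, and the result compared against the argument given to CYCLES.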

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the specified number of connection attempts. This may or may not occur with your system. If it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful. In short, all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, and no minimum in one sub-database is connected to any minimum in another. Once the minima are grouped, every minimum from each sub-database is compared with every minimum from all of the other sub-databases, and the distances between them are found. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
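The scheme just described can be sketched with a disjoint-set (union-find) structure. This is an illustration of the logic only, written sequentially. The real workflow launches connection attempts as separate jobs in sub-directories, and the names below (DisjointSets, plan_attempts, try_connect) are hypothetical.

```python
class DisjointSets:
    """Minimal union-find used to track which minima are already connected."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]  # path halving
            item = self.parent[item]
        return item

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def plan_attempts(minima, ts_connections, distances, try_connect):
    """Attempt connections shortest-first, skipping already-joined groups.

    ts_connections: pairs of minima already linked by a transition state.
    distances:      (distance, min_a, min_b) triples for candidate pairs.
    try_connect:    callable returning True if the attempt succeeded.
    Returns the list of pairs for which an attempt was actually made.
    """
    sets = DisjointSets(minima)
    for a, b in ts_connections:
        sets.union(a, b)
    attempted = []
    for _, a, b in sorted(distances):
        if sets.find(a) == sets.find(b):
            continue  # these two sub-databases are already connected
        attempted.append((a, b))
        if try_connect(a, b):
            sets.union(a, b)
    return attempted


# Hypothetical database: sub-databases {1, 2}, {3, 4} and {5}.
candidates = [(0.5, 2, 3), (0.9, 1, 4), (1.2, 4, 5), (2.0, 1, 5)]
print(plan_attempts([1, 2, 3, 4, 5], [(1, 2), (3, 4)], candidates,
                    try_connect=lambda a, b: True))
# prints [(2, 3), (4, 5)]
```

In the example, the long 1-4 and 1-5 attempts are never made because shorter connections have already joined the relevant sub-databases, which is exactly the saving this scheme is after.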

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the relevant_connected directory. To do this, the MERGEDB keyword is added to pathdata within relevant_connected. To merge the information contained in the subdirectory 00025_00102, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles with this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-03T17:17:23Z

Adk44:

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method listed below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The diagram below shows the directories used for CHECKSPMUTATE and its post-processing. The bash scripts are set up to move between these, so they will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into this problem, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, 14 had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.
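The duplicate check itself can be sketched in a few lines of Python. This is only an illustration of the idea, not the contents of duplicates.py: it assumes the log has already been parsed into (minimum index, random number) pairs in launch order, and the function name and the example numbers are hypothetical.

```python
from collections import defaultdict


def find_overwritten(assignments):
    """Return the indices of minima whose output was overwritten.

    `assignments` is a list of (minimum_index, random_number) pairs in the
    order the jobs were launched (an assumed, pre-parsed form of the log).
    When several jobs share a random number, only the last one survives.
    """
    by_number = defaultdict(list)
    for minimum_index, random_number in assignments:
        by_number[random_number].append(minimum_index)

    overwritten = []
    for indices in by_number.values():
        # Every launch except the final one with this number is lost.
        overwritten.extend(indices[:-1])
    return sorted(overwritten)


# Hypothetical example: minima 1, 3 and 6 share one number, 2 and 5 another.
example = [(1, 48213), (2, 10977), (3, 48213), (4, 55002), (5, 10977), (6, 48213)]
print(find_overwritten(example))
```

The returned indices play the role of checkminfile: they are the minima that need to be reoptimised again.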

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
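The concatenation step can be pictured with the following Python sketch. It is not the bash script from the repository: the function name is hypothetical, the caller is assumed to supply the file names already sorted into the desired order, and the numeric suffixes in the demonstration are made up. Files that do not exist (for example, because a reoptimisation did not converge) are skipped and reported.

```python
import os
import tempfile


def concatenate_in_order(paths, output_path):
    """Concatenate the files in `paths`, in the order given, into `output_path`.

    Paths that do not exist are skipped; the list of skipped paths is returned
    so that the caller can see which stationary points are missing.
    """
    missing = []
    with open(output_path, "w") as out:
        for path in paths:
            if not os.path.exists(path):
                missing.append(path)
                continue
            with open(path) as handle:
                out.write(handle.read())
    return missing


# Small demonstration with throwaway files standing in for real output.
workdir = tempfile.mkdtemp()
names = ["min.data.info.1", "min.data.info.2", "min.data.info.3"]
for name in names[:2]:  # pretend the third reoptimisation produced no file
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write("contents of " + name + "\n")

total = os.path.join(workdir, "min.data.info.total")
missing = concatenate_in_order([os.path.join(workdir, n) for n in names], total)
print(len(missing), "file(s) missing")
```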

==== Notes on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. All of the TSs and their associated minima are then added to the database in the startfrompath directory. We now have a database not just of transition states, but of transition states and their associated minima. Note that the minima obtained in the readmin directory have not yet been added; this will come later.

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
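As a rough aid for choosing that number, the lines in connectfile can be counted with a short Python helper. This is a sketch under the assumption that each non-blank line of connectfile describes one connection attempt; the helper name and the example pairs are hypothetical.

```python
def count_connection_attempts(lines):
    """Count the non-blank lines, i.e. the number of requested connections."""
    return sum(1 for line in lines if line.strip())


# In practice the lines would come from open("connectfile"); an in-memory
# list with made-up pairs of minima is used here so the example runs anywhere.
example_lines = ["1 2\n", "2 5\n", "\n", "5 9\n"]
print(count_connection_attempts(example_lines))
```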

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the given number of connection attempts that have been specified. Some such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
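The ordering-and-skipping logic described above can be sketched in Python with a simple union-find structure. This is an illustration of the scheme only, not the contents of connect_sub_databases.sh: the function name is hypothetical, and `distance` and `try_connect` are placeholder callbacks standing in for the real distance calculation and the real connection job.

```python
def plan_connections(sub_databases, distance, try_connect):
    """Attempt connections between sub-databases, shortest gaps first.

    sub_databases: list of lists of minimum indices.
    distance(a, b): returns the distance between minima a and b.
    try_connect(a, b): returns True if the connection attempt succeeded.
    Returns the list of (a, b) pairs that were actually attempted.
    """
    group_of = {m: g for g, members in enumerate(sub_databases) for m in members}
    parent = list(range(len(sub_databases)))

    def find(group):
        # Follow parent links to the representative of a merged group.
        while parent[group] != group:
            parent[group] = parent[parent[group]]
            group = parent[group]
        return group

    # All pairs of minima drawn from two different sub-databases.
    candidates = []
    for i, a_members in enumerate(sub_databases):
        for b_members in sub_databases[i + 1:]:
            for a in a_members:
                for b in b_members:
                    candidates.append((distance(a, b), a, b))
    candidates.sort()

    attempted = []
    for _, a, b in candidates:
        group_a, group_b = find(group_of[a]), find(group_of[b])
        if group_a == group_b:
            continue  # these sub-databases are already joined, so skip
        attempted.append((a, b))
        if try_connect(a, b):
            parent[group_a] = group_b
    return attempted


# Toy example: minima placed on a line, every attempt assumed to succeed.
position = {1: 0.0, 2: 1.0, 3: 5.0, 4: 6.0, 5: 20.0}
groups = [[1, 2], [3, 4], [5]]
print(plan_connections(groups, lambda a, b: abs(position[a] - position[b]),
                       lambda a, b: True))
```

In the toy example only two attempts are needed to join three sub-databases, because every longer candidate between already-joined groups is skipped.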

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
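The selection rule in the paragraph above can be written down as a small Python function. This is a sketch only: the function name, the dictionary layout and the file suffixes in the example are hypothetical, and the cutoff of 10 is simply the value I tend to use.

```python
def select_relevant(sub_databases, endpoints, cutoff=10):
    """Pick the sub-databases worth keeping.

    sub_databases: dict mapping a file name to the list of minima it contains.
    endpoints: the two minima at the start and end of the pathway of interest.
    A sub-database is kept if it has at least `cutoff` minima, or if it
    contains one of the endpoints regardless of its size.
    """
    keep = []
    for name, members in sub_databases.items():
        if len(members) >= cutoff or any(e in members for e in endpoints):
            keep.append(name)
    return sorted(keep)


# Hypothetical example: connected_2 is small but holds an endpoint (minimum 40).
example = {
    "connected_1": list(range(1, 31)),   # 30 minima, includes endpoint 1
    "connected_2": [40, 41, 42],         # below the cutoff, includes endpoint 40
    "connected_3": [50, 51],             # below the cutoff, no endpoint
}
print(select_relevant(example, endpoints=(1, 40)))
```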

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. To do this, the MERGEDB keyword is added to pathdata within the relevant_connected directory. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
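Changing the MERGEDB argument by hand for every subdirectory gets tedious, so the edit can be automated. The Python sketch below shows one way to do it, assuming pathdata contains exactly one MERGEDB line as in the example above; the function name and the second subdirectory name are hypothetical, and running PATHSAMPLE after each rewrite is left to the user.

```python
def pathdata_for_merge(template_lines, subdirectory):
    """Return pathdata lines with MERGEDB pointing at `subdirectory`.

    Every line is copied unchanged except the one whose first word is
    MERGEDB, which is rewritten to reference the given subdirectory.
    """
    result = []
    for line in template_lines:
        if line.split()[:1] == ["MERGEDB"]:
            result.append("MERGEDB ./" + subdirectory)
        else:
            result.append(line)
    return result


# Abbreviated template; a real pathdata file has more keywords.
template = ["NATOMS 5464", "NOFRQS", "", "MERGEDB ./00025_00102", "", "AMBER12"]
for subdirectory in ["00025_00102", "00031_00077"]:
    print("\n".join(pathdata_for_merge(template, subdirectory)))
    print("---")
```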

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

HOWEVER, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

== A Few Notes ==

As mentioned at the top of this article, this method to plug in the gaps of a new pathway is highly idiosyncratic, and was particularly useful for the system I was working on. I don't necessarily anticipate it being useful for everybody, and that's why I haven't really generalised the various scripts I've used. Should this prove more popular than I suspect, I would be happy to generalise this code.

One thing to note is that, if you have a really big database describing a large system, it may not be practical to create so many directories and copy over the large points.min and points.ts files. In that case, some of my scripts will have to be adapted. Again, if anybody struggles doing this, I would be happy to step in and help.

--adk44 18.15, 3 June 2020 (BST)

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-03T17:11:46Z

Adk44:

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method listed below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The diagram below shows the directories used for CHECKSPMUTATE and its post-processing. The bash scripts are set up to move between these, so they will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the output of the earlier job is overwritten by that of the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that nobody else has run into this problem, or that they have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, and so my fix involves reoptimising the stationary points whose output was overwritten. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, 14 had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order that their equivalents from the original, non-mutated pathway were in, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also, required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Notes on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.
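The renaming step is easy to forget, so it may be worth scripting. A minimal Python sketch follows; the function name is hypothetical and it assumes that both output files sit in the readts directory.

```python
import os


def rename_readmin_output(directory="."):
    """Rename the READMIN output so that it is treated as TS data."""
    renames = {"points.min": "points.ts", "min.data": "ts.data"}
    for old, new in renames.items():
        # os.replace overwrites an existing target, so run this only once.
        os.replace(os.path.join(directory, old), os.path.join(directory, new))
```

Calling rename_readmin_output("readts") after the READMIN run would leave points.ts and ts.data in that folder.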

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we no longer have just a database of transition states, but one of transition states and their associated minima. Note that the database of minima obtained in the readmin directory has not yet been added; this will come later.

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.
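The scripts themselves are not reproduced here, but one plausible way of spotting gaps can be sketched in Python. This is an assumption-laden illustration and not the logic of generate_connectfile.py: the function name find_gaps is hypothetical, and it assumes the transition states are listed in pathway order, each as an (entry minimum, exit minimum) pair.

```python
def find_gaps(pathway):
    """Return the pairs of minima that still need connecting.

    `pathway` is a list of (entry_min, exit_min) index pairs, one per
    transition state, in pathway order (an assumed input form). A gap
    exists wherever one TS does not exit into the minimum that the
    next TS starts from.
    """
    gaps = []
    for (_, exit_min), (entry_min, _) in zip(pathway, pathway[1:]):
        if exit_min != entry_min:
            gaps.append((exit_min, entry_min))
    return gaps


# Made-up example: the path breaks between minima 3 and 5, and 7 and 9.
print(find_gaps([(1, 2), (2, 3), (5, 6), (6, 7), (9, 10)]))  # [(3, 5), (7, 9)]
```

Each returned pair corresponds to one connection that would need to be attempted later on.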

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
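A quick way to get that number is to count the lines of connectfile. The Python sketch below does so; the function name is hypothetical, and it assumes that every non-blank line describes one connection attempt.

```python
def count_connections(path="connectfile"):
    """Count the non-blank lines of a connectfile."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())
```

The count is a starting point for choosing the argument to CYCLES rather than a fixed rule.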

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the number of connection attempts that have been specified. Such instances may or may not occur with your system. In case they do (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are then connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between two minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
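The ordering and skipping logic described above can be sketched in Python as follows. This is a simplified illustration and not the connect_sub_databases.sh script: the function and argument names are hypothetical, a plain Euclidean distance stands in for whatever distance measure the real procedure uses, and try_connect is a placeholder for launching an actual connection attempt.

```python
import itertools
import math


def plan_connections(sub_databases, coords, try_connect):
    """Attempt connections between sub-databases, shortest gap first.

    sub_databases: dict mapping a label to a list of minimum indices
    coords:        dict mapping a minimum index to a coordinate tuple
    try_connect:   callable(a, b) -> True if the attempt succeeded
    Returns the list of (a, b) pairs that were actually attempted.
    """
    # Union-find over sub-database labels tracks which are already joined.
    parent = {label: label for label in sub_databases}

    def find(label):
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    # Every pair of minima drawn from two different sub-databases.
    candidates = []
    for (la, mins_a), (lb, mins_b) in itertools.combinations(sub_databases.items(), 2):
        for a in mins_a:
            for b in mins_b:
                candidates.append((math.dist(coords[a], coords[b]), a, b, la, lb))
    candidates.sort(key=lambda item: item[0])

    attempted = []
    for _, a, b, la, lb in candidates:
        if find(la) == find(lb):
            continue  # these sub-databases are already connected
        attempted.append((a, b))
        if try_connect(a, b):
            parent[find(la)] = find(lb)
    return attempted
```

Because the candidates are sorted before any attempt is made, a long gap between two sub-databases is only tried if every shorter candidate between them has already failed.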

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ "$STATUS" -eq 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases. I only tend to consider sub-databases with 10 minima or more in them. It's up to you what cutoff, if any, you wish to have. For these sub-databases, the '''connected_*''' files which list the minima comprising them are copied to '''relevant_connected_*''' files. Following the completion of find_connections.sh, the '''connected_*''' can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
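The selection rule can be written down compactly. The Python sketch below is only an illustration of the cutoff described above: the function name is hypothetical, and it assumes that the contents of each '''connected_*''' file have already been read into a list of minimum indices.

```python
def select_relevant(groups, endpoints, cutoff=10):
    """Pick the sub-databases worth keeping.

    groups:    dict mapping a connected_* file name to its minima
    endpoints: indices of the two minima at the ends of the pathway
    cutoff:    smallest sub-database size that is kept regardless
    """
    keep = []
    for name, minima in sorted(groups.items()):
        big_enough = len(minima) >= cutoff
        holds_endpoint = any(end in minima for end in endpoints)
        if big_enough or holds_endpoint:
            keep.append(name)
    return keep


# Made-up example: connected_3 is tiny but holds an endpoint (minimum 30).
groups = {"connected_1": list(range(1, 13)), "connected_2": [20, 21], "connected_3": [30]}
print(select_relevant(groups, endpoints=(1, 30)))  # ['connected_1', 'connected_3']
```

The names returned are the files that would be copied to '''relevant_connected_*''' files.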

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.

Once all of these connection attempts have been made (and hopefully proven successful), the information in each of these subdirectories needs to be merged into the overall database in the directory relevant_connected. Therefore, within the relevant_connected directory, the MERGEDB keyword is introduced to pathdata. To merge the information contained in the subdirectory 00025_00102 into relevant_connected, for example, pathdata would look something like this:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ./00025_00102

AMBER12
</pre>

This process is repeated for all of the other subdirectories, changing the argument to MERGEDB each time.
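Editing pathdata by hand for every subdirectory becomes tedious, so the substitution could be scripted. The Python sketch below only rewrites the text of pathdata; the function name is hypothetical, and PATHSAMPLE still has to be run after each rewrite.

```python
import re


def pathdata_for(template, subdirectory):
    """Return pathdata text with the MERGEDB argument swapped.

    `template` is the text of a pathdata file that already contains a
    MERGEDB line; `subdirectory` is the name of the directory to merge.
    """
    return re.sub(
        r"(?m)^MERGEDB\s+\S+$",
        lambda match: "MERGEDB ./" + subdirectory,
        template,
    )
```

For example, applying pathdata_for to the file shown above with the argument "00031_00200" (a made-up subdirectory name) changes only the MERGEDB line and leaves every other keyword untouched.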

== We Now Have a Connected Pathway ==

Hooray! We should now have a fully connected pathway for our new, mutated system, akin to the one we had in our original, non-mutated system. We can start to compare the two.

== Cleaning Up the Pathway ==

However, it must be noted that this new pathway is probably not optimal. Some barriers are probably awfully high, and the pathway can probably be shortened. A series of SHORTCUT/SHORTCUT 2 BARRIER and UNTRAP runs should probably be performed to refine the pathway. I'd recommend doing this in yet another directory, perhaps called s_s2b_and_u, just to differentiate the resulting data from the unrefined pathway so we can better keep track of the changes such refinements make.

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-03T17:02:05Z

Adk44: /* relevant_connected */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.

Please note this method listed below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing. The bash scripts are set up to move between these, so will need to be adapted if the directories are named/organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]] - with each job being assigned a random number - it is possible that two or more [[OPTIM]] jobs get assigned the same random number within the same [[PATHSAMPLE]] batch. When that happens, the earlier job gets overwritten by the later one. This seems to be a fairly significant bug within [[PATHSAMPLE]] but nobody else seems to have had a problem with it before (I can only assume that others have either not run into it or have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine and so my fix involves reoptimising these overwritten files. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, it was found that 14 of these had been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also require duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima.

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points to our new mutated system/homologue (bearing in mind that not all will have converged), as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also, required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script, organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, the organise_mindatainfo_min.sh file is executed. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.

==== Notes on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (as there isn't actually a READTS keyword available). This is not a big problem - all that we need to do after running the READMIN calculation is to rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts all of the transition states in the database, moves them to a subdirectory with an index corresponding to where it appears in ts.data, and then performs the descent calculations to determine the two minima directly connected to this TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory, and execute addpath.sh. Now, all of the TSs and their associated minima shall be added to the database in the startfrompath directory. Thus we no longer have just a database of transition states, but one of transition states and their associated minima. Note that the database of minima obtained in the readmin directory has not yet been added; this will come later.

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the number of connection attempts that have been specified. This may or may not happen with your system. If it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases, and only consider sub-databases with 10 minima or more. It's up to you what cutoff, if any, you wish to have. For the sub-databases that meet the cutoff, the '''connected_*''' files which list their minima are copied to '''relevant_connected_*''' files. Once find_connections.sh has completed, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).
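
The cutoff rule can be sketched in a few lines of Python. This is only an illustration of the selection logic, not what find_connections.sh does internally: it works on in-memory lists of minima indices rather than on '''connected_*''' files, and the function name <code>select_relevant</code> and its arguments are hypothetical.

```python
def select_relevant(sub_databases, endpoints, cutoff=10):
    """Keep sub-databases with at least `cutoff` minima, plus any that
    contain one of the pathway end points, whatever their size.

    sub_databases : list of lists of minima indices (assumed input format)
    endpoints     : the two minima at the start and end of the pathway
    """
    endpoints = set(endpoints)
    return [
        group for group in sub_databases
        if len(group) >= cutoff or endpoints.intersection(group)
    ]


# A 12-member group passes the cutoff; the 3-member group is kept only
# because it contains end point 31; the 2-member group is dropped.
groups = [list(range(1, 13)), [20, 21], [30, 31, 32]]
kept = select_relevant(groups, endpoints=(1, 31), cutoff=10)
print(kept)
```

Changing <code>cutoff</code> to 0 keeps every sub-database, which corresponds to using no cutoff at all.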

With the minima now grouped into sub-databases, it is time to try to connect these sub-databases. Therefore, connect_sub_databases.sh should now be executed. Before doing so, though, ensure that the name of the submission script in this bash script is consistent with the name of the submission script in your directory.

This bash script first of all calculates all of the distances between all of the unconnected minima, before rearranging them shortest to longest. Sub-directories are then created, and connections attempted within them to connect all of the various sub-databases. If any of this is unclear, please consult [[Connecting Sub-databases]] for more details.
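
The shortest-first ordering, and the skipping of pairs whose sub-databases have already been joined, can be sketched as follows. This is a simplified stand-in for connect_sub_databases.sh under stated assumptions: distances are plain Euclidean distances between supplied coordinates (the real workflow would use the distance measure of the PATHSAMPLE/OPTIM setup), and <code>attempt</code> is a hypothetical callback standing in for an actual connection job.

```python
import itertools
import math


def connection_schedule(groups, coords, attempt):
    """Try inter-group connections shortest first, skipping joined groups.

    groups  : list of lists of minima indices (the sub-databases)
    coords  : dict mapping minimum index -> list of coordinates
    attempt : callable(min_a, min_b) -> bool, a stand-in for a real
              connection attempt (hypothetical interface)
    Returns the list of pairs for which an attempt was actually made.
    """
    label = {m: g for g, members in enumerate(groups) for m in members}

    def distance(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    # All pairs of minima drawn from two different sub-databases.
    candidates = []
    for ga, gb in itertools.combinations(range(len(groups)), 2):
        for a in groups[ga]:
            for b in groups[gb]:
                candidates.append((distance(coords[a], coords[b]), a, b))
    candidates.sort()

    attempted = []
    for _, a, b in candidates:
        if label[a] == label[b]:
            continue  # these two sub-databases are already connected
        attempted.append((a, b))
        if attempt(a, b):
            old, new = label[b], label[a]
            for m in label:
                if label[m] == old:
                    label[m] = new  # merge the two sub-databases
    return attempted


groups = [[1, 2], [3], [4]]
coords = {1: [0.0], 2: [1.0], 3: [1.5], 4: [10.0]}
print(connection_schedule(groups, coords, lambda a, b: True))
# prints [(2, 3), (3, 4)]
```

With every attempt succeeding, only two of the five candidate pairs are tried, because each success makes the remaining pairs between the same two sub-databases redundant.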

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-03T16:56:03Z

Adk44: /* relevant_connected */

== Introduction ==

This is a recommended procedure to be used following the use of [[CHECKSPMUTATE]], if it was a pathway which was being reoptimised.

[[CHECKSPMUTATE]] mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus mutated forms or a close homologue can be directly compared against a wild type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.

It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points will reoptimise post-mutation (there could be steric clashes, etc.). Thus, there will be gaps in our new, mutated pathway, hence the need for post-processing to fill these gaps.

Please note this method listed below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.

== Method ==

The directories used for CHECKSPMUTATE and its post-processing are shown below. The bash scripts are set up to move between these, so they will need to be adapted if the directories are named or organised differently.

[[Image:Mind_Map_directories_for_group_wiki.png|1200px|center]]

=== checkmin/checkts ===

==== Rationale ====

Ordinarily, I will have run [[CHECKSPMUTATE]] calculations in the checkmin and checkts directories respectively. Because of the way [[OPTIM]] jobs are assigned by [[PATHSAMPLE]], with each job being assigned a random number, it is possible for two or more [[OPTIM]] jobs to be assigned the same random number within the same [[PATHSAMPLE]] batch. When this happens, the former job gets overwritten by the latter. This seems to be a fairly significant bug within [[PATHSAMPLE]], but nobody else seems to have reported it (I can only assume that others have either not run into it or have come up with their own workarounds). I didn't want to tamper with the cycle2.f90 routine, so my fix involves reoptimising the overwritten stationary points. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by [[CHECKSPMUTATE]]. For example, with my [wt ChuS + haem + NADH] system (please see [[CHECKSPMUTATE]] for details), of the 1235 minima which were reoptimised, 14 were found to have been overwritten.

==== Files Required ====

To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:

* aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data

Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1

AMBER12
</pre>

and

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
</pre>

were included.

Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima

==== Execution ====

First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which gives a log of all of the random numbers each respective OPTIM job was assigned.

The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.
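
As a rough illustration of how overwritten jobs can be identified, here is a minimal Python sketch. It does not reproduce duplicates.py or duplicates2.py, and it does not parse pathsample_checkmin.out, whose layout is not shown here. Instead it assumes the log has already been reduced to (stationary point index, job identifier) pairs from a single batch, in launch order; the name <code>find_overwritten</code> is hypothetical.

```python
from collections import defaultdict


def find_overwritten(assignments):
    """Return indices of stationary points whose output was overwritten.

    `assignments` is a list of (stationary_point_index, job_id) pairs in
    launch order. This input format is an assumption for illustration; it
    is not the actual layout of pathsample_checkmin.out.
    """
    by_job = defaultdict(list)
    for index, job_id in assignments:
        by_job[job_id].append(index)

    overwritten = []
    for indices in by_job.values():
        # When several jobs share an identifier, only the last one launched
        # keeps its output; every earlier one needs to be reoptimised.
        overwritten.extend(indices[:-1])
    return sorted(overwritten)


log = [(1, 4821), (2, 977), (3, 4821), (4, 15), (5, 977), (6, 4821)]
print(find_overwritten(log))
# prints [1, 2, 3]
```

The returned indices play the role of the entries in checkminfile: each one would then be reoptimised in its own sub-directory.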

==== Note on checkts ====

Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.

=== readmin/readts ===

==== Rationale ====

Now that we've reoptimised all of the stationary points for our new mutated system/homologue (bearing in mind that not all will have converged), as well as any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for our new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs; it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents in the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.

==== Files Required ====

From checkmin/all_launched_simult, copy the following files to the readmin directory:

* checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow

From checkmin, copy all of the min.data.info.**** files to the readmin directory.

Also required:

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

READMIN min.data.info.total

AMBER12
</pre>

And the bash script organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin.

==== Execution ====

First, execute organise_mindatainfo_min.sh. This gives the min.data.info.total file.

Then, execute a PATHSAMPLE binary to run READMIN. This shall give you min.data and points.min files for your new, mutated system.
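
The ordered concatenation step can be sketched in Python as below. This is not the organise_mindatainfo_min.sh script: it assumes each file name ends in an integer index that fixes its position in the original pathway, and it does not handle swapping in the results of the overwritten reoptimisations. The function name <code>concatenate_in_order</code> is hypothetical.

```python
import glob
import os
import re


def concatenate_in_order(directory, output_name="min.data.info.total"):
    """Concatenate min.data.info.<n> files in ascending order of <n>.

    Assumes each file name ends in an integer index, which is a guess at
    the naming scheme and may differ from the real files.
    Returns the list of input paths in the order they were written.
    """
    pattern = re.compile(r"min\.data\.info\.(\d+)$")
    numbered = []
    for path in glob.glob(os.path.join(directory, "min.data.info.*")):
        match = pattern.search(os.path.basename(path))
        if match:  # skips min.data.info.total and any non-numeric suffix
            numbered.append((int(match.group(1)), path))

    ordered = [path for _, path in sorted(numbered)]
    with open(os.path.join(directory, output_name), "w") as out:
        for path in ordered:
            with open(path) as handle:
                out.write(handle.read())
    return ordered
```

Sorting on the integer rather than on the file name matters here: a plain alphabetical sort would place min.data.info.10 before min.data.info.2 and so change the order of the stationary points.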

==== Notes on readts ====

This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.

Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out

Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).

In pathdata, stick with the READMIN keyword (there isn't actually a READTS keyword available). This is not a big problem: all that we need to do after running the READMIN calculation is rename points.min to points.ts and min.data to ts.data.

=== all_opt_TSs ===

==== Rationale ====

Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in [[OPTIM]].

==== Files Required ====

The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data

We also require an odata file of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
</pre>

A pathdata file of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

EXTRACTTS 1

AMBER12
</pre>

And a sub_script_CUDAOPTIM file of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
</pre>

In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:

* addpath.sh, descents.sh

==== Execution ====

First, descents.sh is executed. This extracts each transition state in the database, moves it to a subdirectory named after its index in ts.data, and then performs the descent calculations to determine the two minima directly connected to it. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one transition state in ts.data).

Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0

ADDPATH path.info.initial

AMBER12
</pre>

Then, move back into the all_opt_TSs directory and execute addpath.sh. All of the TSs and their associated minima will now be added to the database in the startfrompath directory. Thus we no longer have just a database of transition states, but one of transition states and their associated minima. Note that the minima obtained in the readmin directory have not yet been added; this will come later.

=== generate_connectfile ===

==== Rationale ====

This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.

==== Files Required ====

Copy ts.data from startfrompath into the sub-directory generate_connectfile

The following files are required, which can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile

* det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py

==== Execution ====

Simply execute det_connections.sh. connectfile shall be generated.
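
To make the idea of a "gap" concrete, here is a minimal Python sketch. It is not the logic of generate_connectfile.py, which is not reproduced on this page. It assumes that columns 4 and 5 of each ts.data line hold the indices of the two minima connected by that transition state, that the transition states are listed in pathway order, and that a gap exists wherever two consecutive transition states share no minimum. Which two minima to pair across a gap is an arbitrary choice here. The function names are hypothetical.

```python
def parse_ts_minima(lines):
    """Extract the two minimum indices from each ts.data line.

    Assumes columns 4 and 5 hold the indices of the connected minima.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 5:
            pairs.append((int(fields[3]), int(fields[4])))
    return pairs


def find_gaps(ts_pairs):
    """Return one pair of minima per gap between consecutive TSs.

    A gap is flagged when two consecutive transition states have no
    minimum in common. The pair returned is the second minimum of the
    earlier TS and the first minimum of the later one (an arbitrary
    choice made for this sketch).
    """
    gaps = []
    for (a1, a2), (b1, b2) in zip(ts_pairs, ts_pairs[1:]):
        if {a1, a2}.isdisjoint({b1, b2}):
            gaps.append((a2, b1))
    return gaps


example = [
    "-15100.0 0.0 1 1 2 1.0 1.0 1.0",
    "-15090.0 0.0 1 2 3 1.0 1.0 1.0",
    "-15080.0 0.0 1 5 6 1.0 1.0 1.0",
]
print(find_gaps(parse_ts_minima(example)))
# prints [(3, 5)]
```

In this toy input the first two transition states share minimum 2, so the only gap lies between minima 3 and 5; each such pair would become one line of a connectfile-style list.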

=== merge_minima_into_database ===

==== Rationale ====

Before trying to connect all of the minima as listed in connectfile, it makes sense to include all the minima contained in the readmin database. This gives [[OPTIM]] a greater number of pre-existent minima which could be of relevance to draw upon when trying to fill these gaps in the pathway.

==== Files Required ====

From startfrompath, copy the following files to the merge_minima_into_database directory:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Use the following pathdata file:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

MERGEDB ../../../readmin

AMBER12
</pre>

==== Execution ====

Run the PATHSAMPLE binary. The [[MERGEDB]] keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.

=== now_use_connectfile_to_fill_gaps ===

==== Rationale ====

This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within [[PATHSAMPLE]] to try to fill in the gaps in our new, mutated pathway.

==== Files Required ====

From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data

Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.

Also required are:

odata.connect, of the form:

<pre>
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
</pre>

pathdata of the form:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
</pre>

The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).

Also required is sub_script_connections, of the form:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
</pre>

==== Execution ====

Simply submit the submission script. This is a highly parallelizable calculation, and in the example above I have specified that 6 tasks can run at once, with one task per node. Therefore, six connection attempts can be run in parallel. This, of course, can be set to whatever you like.

It may be the case that some gaps are so large that there is little chance they will get connected. If there are a few of these, then such connections can clog up the overall CONNECTPAIRS process. One way to get round this is to adapt the [[OPTIM]] code so that gaps above a certain length are not even attempted. This is the purpose of my other executable in pathdata:

EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap

=== relevant_connected ===

==== Rationale ====

Connection attempts are not always successful. Often this is because the gaps are simply too large to traverse within the number of connection attempts that have been specified. This may or may not happen with your system. If it does (as happened with some of my systems), the methods outlined in [[Connecting Sub-databases]] could be very helpful in overcoming this problem. In short, this is a method in which all of the minima within a database are grouped into 'sub-databases'. Each sub-database contains all the minima connected to one another within the overall database, but all of the minima from one sub-database are unconnected from all of the minima in any other sub-database. Once the minima are grouped into these sub-databases, the method compares every minimum from each sub-database with all of the minima from all of the other sub-databases, finding the distances between them. These distances are then arranged from shortest to longest, and connection attempts are made in that order. If two minima from two different sub-databases are connected (and thus, by implication, all of the minima from these two sub-databases become connected), any further connection attempts between minima in these two sub-databases become unnecessary and so are skipped. This process continues until all the sub-databases are connected.

This organisational scheme is a useful way of circumventing the need to connect two minima a long way apart. It is often the case that two better candidate minima from the same two sub-databases are available, which are closer in space.
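
The grouping of minima into sub-databases is a connected-components problem, and can be sketched with a small union-find structure in Python. Note that this is a different technique from the one used in this procedure, which runs disconnectionDPS with PRINTCONNECTED in a loop. The sketch assumes the minima are numbered from 1 and that the connected pairs have already been read from ts.data; the function name is hypothetical.

```python
def group_into_sub_databases(n_minima, ts_pairs):
    """Group minima 1..n_minima into connected sets using union-find.

    ts_pairs : iterable of (min_a, min_b) index pairs, one per transition
               state (assumed to have been extracted from ts.data)
    Returns a list of lists of minima indices, largest group first.
    """
    parent = list(range(n_minima + 1))  # index 0 unused; minima are 1-based

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in ts_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for minimum in range(1, n_minima + 1):
        groups.setdefault(find(minimum), []).append(minimum)
    return sorted(groups.values(), key=len, reverse=True)


print(group_into_sub_databases(7, [(1, 2), (2, 3), (4, 5), (6, 6)]))
# prints [[1, 2, 3], [4, 5], [6], [7]]
```

Minima 6 and 7 end up in singleton groups: 6 is only connected to itself, and 7 appears in no transition state at all.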

==== Files Required ====

From now_use_connectfile_to_fill_gaps, the following should be copied over to relevant_connected:

* coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, odata.connect, perm.allow, points.min, points.ts, ts.data

The following are also required:

* A dinfo file, with PRINTCONNECTED included as a keyword such as:

<pre>
! REQUIRED KEYWORDS

DELTA 0.25
FIRST -15120.0
LEVELS 800
MINIMA min.data
TS ts.data

! OPTIONAL KEYWORDS

NCONNMIN 0
CONNECTMIN 1
LABELFORMAT F8.1
PRINTCONNECTED
</pre>

* A pathdata file, such as:

<pre>
EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

RETAINSP
! CYCLES 1

AMBER12
</pre>

* A submission script for pathsample, sub_script_connections such as:

<pre>
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_con
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Make a temporary directory on the node, copy job files there and change to that directory
TMP=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $TMP
cp ${SLURM_SUBMIT_DIR}/{connectfile,coords.inpcrd,coords.mdcrd,coords.prmtop,odata.connect,min.in,min.A,min.B,min.data,ts.data,points.min,points.ts,pathdata,perm.allow} $TMP
cd $TMP

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

# Copy all files back to the original submission directory
cp * $SLURM_SUBMIT_DIR
STATUS=$?
echo "$STATUS"
if [ $STATUS == 0 ];
then
echo "No error in cp"
cd $SLURM_SUBMIT_DIR
rm -rf $TMP
fi

echo Finished at `date`
</pre>

* The bash scripts, find_connections.sh (which can be found in /svn/SCRIPTS/disconnect) and connect_sub_databases.sh (which can be found in /svn/SCRIPTS/PATHSAMPLE/connecting_sub_databases).

==== Execution ====

First of all, the minima need to be grouped into sub-databases. I found that the easiest way to do this is by running disconnectionDPS in a loop until all the minima in the database are assigned. This is done by executing find_connections.sh. Before doing so, ensure that the argument to CONNECTMIN is set to 1 and the PRINTCONNECTED keyword is included. Also, make sure to change the disconnectionDPS binary in the bash script to whatever you've called your binary.

I tend to find it a bit of a waste of time to consider really small sub-databases, and only consider sub-databases with 10 minima or more. It's up to you what cutoff, if any, you wish to have. For the sub-databases that meet the cutoff, the '''connected_*''' files which list their minima are copied to '''relevant_connected_*''' files. Once find_connections.sh has completed, the '''connected_*''' files can simply be deleted, but keep the '''relevant_connected_*''' ones. Just make sure that the two minima at the beginning and end of the pathway that you are interested in are retained (i.e. if one of these minima is contained in a '''connected_*''' file beneath the cutoff, ensure that this file is also renamed to a '''relevant_connected_*''' file).

Pathway Gap Filling Post-CHECKSPMUTATE

2020-06-03T16:43:50Z

Adk44: /* relevant_connected */

Adk44: /* Method */