Pathway Gap Filling Post-CHECKSPMUTATE
Revision as of 16:07, 3 June 2020
Introduction
This is a recommended procedure to follow after using CHECKSPMUTATE, in cases where a pathway was being reoptimised.
CHECKSPMUTATE mutates a selected set of residues in a protein or protein+ligand system, and reoptimises all of the stationary points from the original system. Thus a mutated form or a close homologue can be directly compared against the wild-type protein. This is particularly useful when comparing a particular protein fold or protein+cofactor interaction. In these instances, we are interested in reoptimising only the stationary points comprising a particular pathway, and the database before mutation is set up accordingly.
It is almost inevitable (particularly if we are introducing bulky mutations) that not all of the stationary points post-mutation will reoptimise (there could be steric clashes etc.). Thus, there will be gaps in our new, mutated pathway. Hence the need for post-processing to fill these gaps.
Please note that the method listed below is highly idiosyncratic, and as such is only meant as a loose guide. It uses very simple bash scripts, which can be easily edited. Please feel free to adapt the procedure to your own needs/preferences.
Method
Below are the directories used for CHECKSPMUTATE and its post-processing. The bash scripts are set up to move between them, so they will need to be adapted if the directories are named or organised differently.
checkmin/checkts
Rationale
Ordinarily, I will have run CHECKSPMUTATE calculations in the checkmin and checkts directories respectively. Because of the way OPTIM jobs are assigned by PATHSAMPLE - with each job being assigned a random number - it is possible for two or more OPTIM jobs within the same PATHSAMPLE batch to be assigned the same random number, in which case the earlier job's files get overwritten by the later one's. This seems to be a fairly significant bug within PATHSAMPLE, though I can only assume nobody else has run into it, or that others have come up with their own workarounds. I didn't want to tamper with the cycle2.f90 routine, so my fix involves reoptimising these overwritten files again. Typically, the number of overwritten files is small compared to the overall number of reoptimisations first conducted by CHECKSPMUTATE. For example, with my [wt ChuS + haem + NADH] system (please see CHECKSPMUTATE for details), of the 1235 minima which were reoptimised, 14 were found to have been overwritten.
Files Required
To find out which files had been overwritten in the first place, a sub-directory (called all_launched_simult) was created within checkmin. The following files from checkmin were copied into this new folder:
- aa_ringdata.pyc, amino_acids.pyc, atomnumberlog, coordinates_mut.pyc, coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, mutate_aa.py, newreslog, nresidueslog, odata.checksp, odata (exactly the same as odata.checksp), original_protein.pdb, pathsample_checkmin.out, perm.allow, points.min, points.ts, resnumberlog, ts.data
Additionally, pre_pathdata and pre_sub_script_CUDAOPTIM files of the form:
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5501
NATOMS_CHAIN 5357
NATOMS_NEW 5464
CHECKSP_MUT
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
DUMMYRUN
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
CYCLES 1
AMBER12
and
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=test_top
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory

echo Finished at `date`
were included.
Before proceeding, we also required duplicates.sh, duplicates.py, duplicates2.py and reoptimise.sh, all of which can be found in /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/minima
Execution
First, execute duplicates.sh. This generates checkminfile, a list of all of the minima which were overwritten during the original CHECKSPMUTATE run. It identifies such minima by reading pathsample_checkmin.out (i.e. the output from the CHECKSPMUTATE calculation), which logs the random number each respective OPTIM job was assigned.
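The real duplicates.sh (and its helper Python scripts) lives in svn; as a rough, self-contained sketch of the underlying idea, one can pull out the trailing random number from each job line of the log and keep only the duplicates. The log line format below is invented for illustration and may not match pathsample_checkmin.out exactly:

```shell
#!/bin/bash
# Sketch of the idea behind duplicates.sh. PATHSAMPLE logs the random
# number assigned to each OPTIM job; any number appearing more than once
# means an earlier job's files were overwritten.
# Sample log lines (invented format):
cat > sample_log.txt <<'EOF'
OPTIM job assigned random number 48213
OPTIM job assigned random number 90177
OPTIM job assigned random number 48213
EOF
# Keep only the numbers that occur more than once
grep -oE '[0-9]+$' sample_log.txt | sort | uniq -d > checkminfile
cat checkminfile
```

With the sample log above, checkminfile ends up containing the single duplicated number.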
The script reoptimise.sh is then used to reoptimise these overwritten minima. pre_pathdata and pre_sub_script_CUDAOPTIM are first manipulated to ensure the correct minima are reoptimised. Each reoptimisation is carried out in a sub-directory named after the index of the minimum being reoptimised.
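The staging step of reoptimise.sh can be sketched as follows. The indices and placeholder file are stand-ins; the actual script also edits the pre_ template files and submits a job for each minimum:

```shell
#!/bin/bash
# Sketch: give each overwritten minimum listed in checkminfile its own
# subdirectory, named after the minimum's index, in which the
# reoptimisation is run. Indices here are invented examples.
printf '7\n42\n' > checkminfile
while read -r idx; do
  mkdir -p "$idx"
  : > "$idx/pathdata"   # placeholder for the per-minimum input files
done < checkminfile
ls -d 7 42
```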
Note on checkts
Because of slightly different requirements, make sure that the auxiliary files from /svn/SCRIPTS/CHECKSPMUTATE/all_launched_simult/TSs are used instead. Also, the file to be read in by duplicates.sh should be called pathsample_checkts.out rather than pathsample_checkmin.out.
readmin/readts
Rationale
Now that we have reoptimised all of the stationary points of our new mutated system/homologue (bearing in mind that not all will have converged), as well as reoptimised any overwritten ones, we need to create points.min, min.data, points.ts and ts.data files for the new system. The READMIN keyword can do this by reading in a list of coordinates for all of the reoptimised minima/TSs. Before doing that, we need to create such a file by concatenating all of the min.data.info.**** files into two large min.data.info.total files (one for minima, one for TSs - it is a quirk of the CHECKTS keyword that it also logs its optimised structures in min.data.info.**** files rather than ts.data.info.**** files). I like to keep my minima/TSs in the same order as their equivalents from the original, non-mutated pathway, and so concatenate these files in a specific way, whilst ensuring that those from the overwritten reoptimisations are also included.
Files Required
From checkmin/all_launched_simult, copy the following files to the readmin directory:
- checkminfile, coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, pathsample_checkmin.out, perm.allow
From checkmin, copy all of the min.data.info.**** files to the readmin directory.
Also required is a pathdata file of the form:
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
READMIN min.data.info.total
AMBER12
And the bash script organise_mindatainfo_min.sh, to be found in /svn/SCRIPTS/CHECKSPMUTATE/readmin
Execution
First, the organise_mindatainfo_min.sh script is executed. This generates the min.data.info.total file.
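Conceptually, the concatenation performed by organise_mindatainfo_min.sh can be sketched as below. The three sample files are stand-ins; the real script also folds in the reoptimised overwritten minima from all_launched_simult and preserves the original pathway ordering:

```shell
#!/bin/bash
# Sketch: stitch the numbered min.data.info.* files together in index
# order into one min.data.info.total file.
for i in 3 1 2; do echo "entry $i" > "min.data.info.$i"; done
rm -f min.data.info.total
# Sort on the numeric suffix (4th dot-separated field) so the entries
# keep their database order regardless of shell glob ordering.
for f in $(ls min.data.info.* | sort -t '.' -k 4 -n); do
  cat "$f" >> min.data.info.total
done
cat min.data.info.total
```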
Then, execute a PATHSAMPLE binary to run READMIN. This gives you min.data and points.min files for your new, mutated system.
Notes on readts
This is the same as for the readmin procedure above but with a few distinctions. First, carry out the calculations in the folder readts.
Rather than checkminfile and pathsample_checkmin.out, we need the files checktsfile and pathsample_checkts.out.
Rather than organise_mindatainfo_min.sh, use the script organise_mindatainfo_ts.sh (to be found in /svn/SCRIPTS/CHECKSPMUTATE/readts).
In pathdata, stick with the READMIN keyword (there isn't actually a READTS keyword available). This is not a big problem - all we need to do after running the READMIN calculation is rename points.min to points.ts and min.data to ts.data.
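The renaming step is just two moves; dummy files are created here so the sketch is self-contained:

```shell
#!/bin/bash
# After the READMIN run in readts, rename the minima-style output files
# to their transition-state equivalents (dummy files stand in for the
# real READMIN output).
touch points.min min.data
mv points.min points.ts
mv min.data ts.data
ls points.ts ts.data
```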
all_opt_TSs
Rationale
Though we now have a database of transition states for our new, mutated system, we have no idea which minima these TSs are directly connected to. To find this out, we need to use the PATH keyword in OPTIM.
Files Required
The following files are copied over from the readts directory to the all_opt_TSs directory. Note: take special care that ts.data and points.ts are copied from readts and NOT from checkts as we are now concerned with the database for the new, mutated system, not the old one.
- coords.inpcrd, coords.mdcrd, coords.prmtop, min.in, perm.allow, points.ts, ts.data
We also require an odata file of the form:
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
COMMENT PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
MAXMAX 2.0
DUMPALLPATHS
COMMENT DUMPSTRUCTURES
COMMENT MODE -1
DUMPDATA
NOFRQS
UPDATES 20 20
PATH 3
NOPOINTS
BFGSTS 200 20 100 0.01 50
ENDHESS
NOHESS
BFGSSTEPS 10000
BFGSMIN 1D-3
MAXBFGS 1.0
AMBER12 extractedts
A pathdata file of the form:
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0
EXTRACTTS 1
AMBER12
And a sub_script_CUDAOPTIM file of the form:
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=m_wt_ChuS
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

# Run the executable in the local node scratch directory
/home/$USER/bin/CUDAOPTIM_AMB12_gf_030718 > optim.out

echo Finished at `date`
In addition, the following scripts, which can be accessed in /svn/SCRIPTS/CHECKSPMUTATE/all_opt_TSs, are required:
- addpath.sh, descents.sh
Execution
First, descents.sh is executed. This extracts all of the transition states in the database, moves each one to a subdirectory indexed by its position in ts.data, and then performs the descent calculations to determine the two minima directly connected to that TS. Additionally, this script creates another subdirectory called startfrompath, in which the path.info file generated from the first TS and its two associated minima is used to create a new database (i.e. two minima should appear in min.data and one TS in ts.data).
Once all of the calculations are complete, move into the startfrompath directory and change pathdata to:
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
! STEP 1: creating the initial database from A9OPTIM path.info file
! STARTFROMPATH path.info.initial 1 2
! CYCLES 0
ADDPATH path.info.initial
AMBER12
Then, move back into the all_opt_TSs directory and execute addpath.sh. All of the TSs and their associated minima will then be added to the database in the startfrompath directory. Thus we now have a database not just of transition states, but of transition states and their associated minima (note that the database of minima obtained in the readmin directory has not yet been added - this will come later).
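Based on the description above, the loop at the heart of addpath.sh can be sketched as follows. The directory layout is a stand-in, and the PATHSAMPLE call is only echoed so the sketch runs on its own; the real script stages each path.info and actually runs the binary:

```shell
#!/bin/bash
# Sketch of the addpath.sh loop: for each per-TS subdirectory (named
# after the TS index in ts.data), stage its path.info into startfrompath
# as path.info.initial and run PATHSAMPLE there with the ADDPATH keyword.
mkdir -p startfrompath 1 2
echo "path for TS 1" > 1/path.info
echo "path for TS 2" > 2/path.info
for d in 1 2; do
  cp "$d/path.info" startfrompath/path.info.initial
  echo "would run PATHSAMPLE with ADDPATH in startfrompath for TS $d"
done
```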
generate_connectfile
Rationale
This is simply to identify the gaps along the pathway which need to be filled in order to be able to fully connect the first and last transition states in our new database. If the above procedures have been followed correctly, then these two transition states should correspond very closely (if not exactly) to the first and last transition states of the original, non-mutated database.
Files Required
Copy ts.data from startfrompath into the sub-directory generate_connectfile.
The following files are also required, and can be found in /svn/SCRIPTS/CHECKSPMUTATE/generate_connectfile:
- det_connections.sh, generate_connectfile_prep.py, generate_connectfile.py
Execution
Simply execute det_connections.sh. This generates connectfile.
merge_minima_into_database
Rationale
Before trying to connect all of the minima listed in connectfile, it makes sense to include all of the minima contained in the readmin database. This gives OPTIM a greater number of pre-existing minima to draw upon - some of which could be relevant - when trying to fill these gaps in the pathway.
Files Required
From startfrompath, copy the following files to the merge_minima_into_database directory:
- coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data
Use the following pathdata file:
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14
PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS
MERGEDB ../../../readmin
AMBER12
Execution
Run the PATHSAMPLE binary. The MERGEDB keyword merges the minima from the readmin directory into the present directory, thus expanding the number of minima in the latter.
now_use_connectfile_to_fill_gaps
Rationale
This is the business-end of this whole procedure. We are now going to run the CONNECTPAIRS algorithm within PATHSAMPLE to try to fill in the gaps in our new, mutated pathway.
Files Required
From merge_minima_into_database, the following files should be copied to now_use_connectfile_to_fill_gaps:
- coords.inpcrd, coords.mdcrd, coords.prmtop, min.A, min.B, min.data, min.in, perm.allow, points.min, points.ts, ts.data
Additionally, connectfile from generate_connectfile needs to be copied across to now_use_connectfile_to_fill_gaps.
Also required are:
odata.connect, of the form:
CUDA A
COMMENT DEBUG
COMMENT CUDATIME
NEWCONNECT 100 3 0.5 100.0 600 1 0.01
NEWNEB 10 500 0.005
ADJUSTK 5 5.0 1.03D0
NEBK 10.0
DIJKSTRA EXP
PERMDIST
EDIFFTOL 8D-4
GEOMDIFFTOL 0.2
PUSHOFF 0.2
COMMENT PERMDIST
BFGSTS 500 10 16 0.01 100
STEPS 1000
USEDIAG 2
DUMPALLPATHS
DUMPDATA

NOHESS
ENDHESS
ENDNUMHESS
NOFRQS
UPDATES 20 20
BFGSSTEPS 10000
BFGSMIN 1D-4
AMBER12 start
pathdata of the form:
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918
! EXEC /home/adk44/bin/CUDAOPTIM_stop_large_gap
CPUS 1
NATOMS 5464
SEED 1
DIRECTION AB
CONNECTIONS 1
TEMPERATURE 0.592
PLANCK 9.536D-14

SLURM
COPYFILES coords.inpcrd coords.mdcrd coords.prmtop min.in
COPYFILES perm.allow

PERMDIST
ETOL 8D-4
GEOMDIFFTOL 0.2D0
ITOL 0.1D0
NOINVERSION
NOFRQS

CONNECTPAIRS connectfile
CYCLES 244

AMBER12
The number of CYCLES depends on the number of connections we are attempting (i.e. the number of lines in connectfile).
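Since CYCLES should match the number of lines in connectfile, it can be set automatically. Both files below are invented miniature examples for illustration:

```shell
#!/bin/bash
# Sketch: count the connection attempts in connectfile and patch the
# CYCLES line in pathdata to match.
printf '1 2\n5 6\n9 10\n' > connectfile
echo 'CYCLES 244' > pathdata
n=$(grep -c '' connectfile)   # number of lines, i.e. connection attempts
sed -i "s/^CYCLES .*/CYCLES $n/" pathdata
cat pathdata
```

(The `sed -i` form assumes GNU sed, as found on typical Linux clusters.)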
Also required is sub_script_connections, of the form:
#!/bin/bash

# Request 1 TITAN Black GPU - use '--constraint=teslak20' for a Tesla or '--constraint=maxwell' to request a Maxwell GPU for single precision runs
#SBATCH --constraint=titanblack
#SBATCH --job-name=wt_ChuS_ncf
#SBATCH --ntasks=6 --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --mail-type=FAIL

hostname
echo "Time: `date`"
source /etc/profile.d/modules.sh

# Load the appropriate compiler modules on the node - should be the same as those used to compile the executable on pat
module add cuda/6.5
module add icc/64/2013_sp1/4/211
module add anaconda/python2/2.2.0 # Needed for python networkx module - must be python 2, not 3

# Set the GPU to exclusive process mode
sudo nvidia-smi -i $CUDA_VISIBLE_DEVICES -c 3

echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info

# Run the executable in the local node scratch directory
/home/$USER/bin/PATHSAMPLE_gf_CHECKSP_MUT > pathsample_connectfile.out

echo Finished at `date`
Execution
Simply submit the submission script. This is a highly parallelisable calculation; in the example above I have specified that 6 tasks can run at once, with one task per node, so six connection attempts can run in parallel. This, of course, can be set to whatever you like.