CHECKSPMUTATE
Purpose
It can take an awfully long time to create a large database or to fully optimise a specific feature of it such as afully connected pathway showing complex protein folding. This is particularly acute when considering large proteins/protein+ligand systems.
What if we are interested in examining how a Wild Type protein behaves with respect to some carefully selected mutants? Or in comparing one protein against a close homologue? It would seem like a colossal waste of time to create large databases for each of these similar cases, completely independently from each other. This is where CHECKSPMUTATE comes in.
CHECKSPMUTATE uses the CHECKSPODATA routine, which reoptimises the minima/transition states of a database. CHECKSPMUTATE extends this by allowing for user-selected sections of the coordinates of the stationary points comprising the database to be mutated before the reoptimisation takes place. Thus a database can be transformed so that it describes the behaviour of a mutated protein as opposed to the Wild Type. Though this new database will need to be tidied up through the use of, eg, SHORTCUT, SHORTCUT 2 BARRIER and UNTRAP, this process should be far quicker than starting a whole new database from scratch. In the example below, I shall show how a pathway describing the approach of the cofactor, NADH, towards another cofactor, haem, within the pocket of HemS (a pathway which took months to find and fully connect) could be quickly replicated in a system where the wt HemS has been replaced by a mutated form (or even by another protein entirely).
Preparation
Before running the reoptimisations, we need to prepare a series of auxiliary files. In all, we should have the following files in our directory:
- aa_ringdata.pyc list of parameters/definition of planes for residues with rings. Only required if we are mutating to a residue with a ring.
- amino_acids.pyc list of parameters for all residues.
- atomnumberlog list of indices of the first atom of the residues to be mutated.
- coordinates_mut.pyc script which mutates the selected residue.
- coords.inpcrd ensure this is for the system we are CHANGING TO.
- coords.mdcrd for use with min.in, may not be required.
- coords.prmtop ensure this is for the system we are CHANGING TO.
- min.A needs to be present, although the index listed in it is unimportant
- min.B needs to be present, although the index listed in it is unimportant
- min.data ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using DIJKSTRA.
- min.in for use with AMBER. Defines certain aspects of the model, such as solvent being used.
- mutate_aa.py organises the residues to be mutated, and does the mutations in conjunction with coordinates_mut.pyc
- newreslog list of codes for residues which we are CHANGING TO.
- nresidueslog list of the total number of mutations to be made to the system.
- odata.checksp list of conditions for optimisation carried out on each stationary point by OPTIM.
- oldreslog list of codes for residues which we are CHANGING FROM.
- original_protein.pdb pdb file for the system we are CHANGING FROM. All lines which do not dscribe an ATOM are (e.g. TITLE, TER and END) are removed, so that the number of lines of the file should correspond to the number of atoms in the system.
- pathdata organises OPTIM jobs. Certain keywords are required, described below.
- perm.allow full description of the groups of permutable atoms in the system we are CHANGING TO.
- points.min ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using DIJKSTRA.
- points.ts ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using DIJKSTRA.
- resnumberlog list of indices of the residues to be mutated.
- ts.data ensure this is for the system we are CHANGING FROM. Could be an entire database, or a section of it (such as a pathway) found using DIJKSTRA.
- submission_script script for executing your binary.
Example of Mutation
[Wild Type HemS + haem + NADH] to [F101A HemS + haem + NADH]
To use CHECKSPMUTATE, we first need a database of interest, or a subset of it. In my example, I had extracted the minima and transition states comprising the pathway I was interested in using DIJKSTRA and moved the new min.data, points.min, points.ts and ts.data files to a new directory. Therefore, each of the stationary points in my database described a stage along this pathway. I wanted to see how this pathway, describing the approach of NADH to haem within the wt HemS pocket, changed when certain mutations were made to the HemS structure. One such residue of interest was a phe-gate (which appeared to regulate the approach of NADH) and so a mutation from phenylalanine to alanine (F101A) was made. I made input files for the new mutated system using tleap, elsewhere described in Preparing an AMBER topology file for a protein plus ligand system and Symmetrising AMBER topology files. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.
Making auxiliary files
We need to prepare atomnumberlog, newreslog, nresidueslog, oldreslog and resnumberlog to define the mutation we are making, and where in the chain it takes place. In this example, we are only making one mutation - from phenylalanine to alanine at position 101.
- nresidueslog as we are only making one mutation, this is simply:
1
- resnumberlog this is the index of the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
101
- atomnumberlog this is the index of the first atom in the residue we are mutating. In cases (unlike this one) where the number of residues changes, this index is always with respect to the original system.
1569
- oldreslog this is the residue code for the residue we are mutating from. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
PHE
- newreslog this is the residue code for the residue we are mutating to. If only three letters (i.e. not terminal and so preceded by N--- or C---), ensure that a space is put after it so that the code is four characters overall.
ALA
pathdata and odata.checksp files
To invoke the reoptimisation process, either CHECKMIN (if reoptimising the minima of the database) or CHECKTS (if reoptimising the transition states) is required. The arguments should correspond to the number of minima or transition states present in the database.
When mutating, we also need to include CHECKSP_MUT, which invokes the process whereby our system is mutated before being reoptimised. We should also include NATOMS_NEW which is the number of atoms in the system to which we are mutating. Also, if our system includes cofactors (which are conventionally listed after the protein chain) then we should include NATOMS_CHAIN, the number of atoms in the protein chain.
An example pathdata file (for minima, as opposed to TSs) would therefore look something like this:
EXEC /home/adk44/bin/CUDAOPTIM_ppt_final_210918 CPUS 1 NATOMS 5501 NATOMS_CHAIN 5357 SEED 1 DIRECTION AB CONNECTIONS 1 TEMPERATURE 0.592 PLANCK 9.536D-14 PERMDIST ETOL 8D-4 GEOMDIFFTOL 0.2D0 ITOL 0.1D0 NOINVERSION NOFRQS CHECKMIN 1 1235 CYCLES 1235 CHECKSP_MUT NATOMS_NEW 5491 AMBER12
The odata.checksp files should be no different from what you originally used to find your original database. Make sure though that you use one suitable for finding minima with CHECKMIN, and one suitable for finding TSs with CHECKTS!
Example of Transformation to a Homologue
[Wild Type HemS + haem + NADH] to [Wild Type ChuS + haem + NADH]
This is a more complex problem than the single-point mutation described above. HemS and Chus are ~70% similar, and so ~30% of the residues of HemS need to be 'mutated' to transform the system to ChuS. In addition to that, ChuS is two residues shorter than HemS, with one such deletion occurring approximately midway through the chain, and the other at the end. As with the mutant example above, I made input files for the new system using tleap, elsewhere described in Preparing an AMBER topology file for a protein plus ligand system and Symmetrising AMBER topology files. This gave me coords.inpcrd, coords.prmtop and perm.allow files for my new system. These were moved to the same directory where I had the min.data, points.min, points.ts and ts.data files for the original system.
Making auxiliary file
First, we need to align and compare the two systems.
wt HemS + haem + NADH
NSERILE TYR GLU GLN TYR LEU GLN ALA LYS ALA ASP ASN PRO GLY LYS TYR ALA ARG ASP LEU ALA THR LEU MET GLY ILE SER GLU ALA GLU LEU THR HIE SER ARG VAL SER HIE ASP ALA LYS ARG LEU LYS GLY ASP ALA ARG ALA LEU LEU ALA ALA LEU GLU ALA VAL GLY GLU VAL LYS ALA ILE THR ARG ASN THR TYR ALA VAL HIE GLU GLN MET GLY ARG TYR GLU ASN GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ASN LEU ASP LEU ARG LEU PHE LEU ASN GLN TRP ALA SER ALA PHE THR LEU THR GLU GLU THR ARG HIE GLY VAL ARG HIE SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU HIE LYS VAL TYR VAL THR GLU GLN THR ASP MET PRO ALA TRP GLU ALA LEU LEU ALA GLN PHE ILE THR THR GLU ASN PRO GLU LEU GLN LEU GLU PRO LEU SER ALA PRO GLU VAL THR GLU PRO THR ALA THR ASP GLU ALA VAL ASP ALA GLU TRP ARG ALA MET THR ASP VAL HID GLU PHE PHE GLN LEU LEU LYS ARG ASN ASN LEU THR ARG GLN GLN ALA PHE ARG ALA VAL GLY ASN ASP LEU ALA TYR GLN VAL ASP ASN SER SER LEU THR GLN LEU LEU ASN ILE ALA GLN GLN GLU GLN ASN GLU ILE MET ILE PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY MET ILE GLU LYS VAL THR PRO HIE GLN ASP TRP ILE ASN VAL PHE ASN GLN ARG PHE THR LEU HIE LEU ILE GLU THR THR ILE ALA GLU SER TRP ILE THR ARG LYS PRO THR LYS ASP GLY PHE VAL THR SER LEU GLU LEU PHE ALA ALA ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU GLY GLN PRO GLU GLN THR GLN TRP ARG ASP GLN ILE ALA ARG LEU ASN ASN CLYS HEM NAD
wt ChuS + haem + NADH
NASNHIE TYR THR ARG TRP LEU GLU LEU LYS GLU GLN ASN PRO GLY LYS TYR ALA ARG ASP ILE ALA GLY LEU MET ASN ILE ARG GLU ALA GLU LEU ALA PHE ALA ARG VAL THR HIE ASP ALA TRP ARG MET HIE GLY ASP ILE ARG GLU ILE LEU ALA ALA LEU GLU SER VAL GLY GLU THR LYS CYS ILE CYS ARG ASN GLU TYR ALA VAL HIE GLU GLN VAL GLY THR PHE THR ASN GLN HIE LEU ASN GLY HIE ALA GLY LEU ILE LEU ASN PRO ARG ALA LEU ASP LEU ARG LEU PHE LEU ASN GLN TRP ALA SER VAL PHE HIE ILE LYS GLU ASN THR ALA ARG GLY GLU ARG GLN SER ILE GLN PHE PHE ASP HIE GLN GLY ASP ALA LEU LEU LYS VAL TYR ALA THR ASP ASN THR ASP MET ALA ALA TRP SER GLU LEU LEU ALA ARG PHE ILE THR ASP GLU ASN THR PRO LEU GLU LEU LYS ALA VAL ASP ALA PRO VAL VAL GLN THR XXX ARG ALA ASP ALA THR VAL VAL GLU GLN GLU TRP ARG ALA MET THR ASP VAL HID GLN PHE PHE THR LEU LEU LYS ARG HIE ASN LEU THR ARG GLN GLN ALA PHE ASN LEU VAL ALA ASP ASP LEU ALA CYS LYS VAL SER ASN SER ALA LEU ALA GLN ILE LEU GLU SER ALA GLN GLN ASP GLY ASN GLU ILE MET VAL PHE VAL GLY ASN ARG GLY CYS VAL GLN ILE PHE THR GLY VAL VAL GLU LYS VAL VAL PRO MET LYS GLY TRP LEU ASN ILE PHE ASN PRO THR PHE THR LEU HIE LEU LEU GLU GLU SER ILE ALA GLU ALA TRP VAL THR ARG LYS PRO THR SER ASP GLY TYR VAL THR SER LEU GLU LEU PHE ALA HIE ASP GLY THR GLN ILE ALA GLN LEU TYR GLY GLN ARG THR GLU GLY GLU GLN GLU GLN ALA GLN TRP ARG LYS GLN ILE ALA SER LEU ILE CPRO XXX HEM NAD
Here, red font signifies a residue which has been mutated, green represents a residue which has been deleted (given the code name XXX), and its position in the original protein, and purple represents the two cofactors.