NECI Re-write
Issues
- Code has been hacked together over time, with an emphasis on just getting it to work. Rarely have we gone back to tidy up.
- The code base has become increasingly hard to understand, maintain and develop.
Alex, George and James have all agreed to spend time cleaning it up.
Please sign contributions!
Proposals
- Use the only specifier when using modules (for me, this is non-negotiable; a sketch follows this list). --james 15:21, 6 December 2007 (GMT)
- Transfer include files to modules? How far do we want to go on this? --james 15:21, 6 December 2007 (GMT)
All the way! Let's get rid of those include files altogether. --alex 23:18, 19 February 2008 (GMT)
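A minimal sketch of both proposals above. All names here (IntegralData, nBasis, UMatStore, CalcEnergy) are invented for illustration, not taken from the NECI source:

```fortran
! Hypothetical module replacing an old include file; all names are
! illustrative, not taken from the NECI source.
module IntegralData
   implicit none
   integer :: nBasis                     ! number of basis functions
   real(8), allocatable :: UMatStore(:)  ! flattened integral storage
end module IntegralData

! A consumer imports only what it needs, so the compiler can catch name
! clashes and it is obvious where every symbol comes from.
subroutine CalcEnergy()
   use IntegralData, only: nBasis, UMatStore
   implicit none
   ! ... work with nBasis and UMatStore only ...
end subroutine CalcEnergy
```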
Tasks
- Rewrite the main structure (neci.F). This will be painful, but a good opportunity to document/excise old code and to modularise (I suppose I've got the knowledge for that, but less the inclination) --alex 16:36, 11 December 2007 (GMT)
I think this will be very useful - there is little commenting in the neci.F file, and it is difficult to see a logical thread through it. Many of the flags for the IF blocks are very cryptic, especially the BTEST ones, and if a variable has a different name from the read-input one, things get very tricky. --ghb24 17:51, 12 December 2007 (GMT)
In particular, the "new" stuff (CPMD, Dalton, star etc.) makes calls from NECI, whereas the Hubbard model etc. does a lot of work in neci.F itself. This could be moved to a separate file... --james 18:55, 18 February 2008 (GMT)
- Modularize the system-specific data. I think it would be useful to have some sort of global 'system' object which contains the details specifying the system (e.g. for Hubbard the size, t and U values etc.; other data for CPMD; for read-in integrals, this could just be symmetry info). A sketch follows this thread. --alex 16:36, 11 December 2007 (GMT)
Would it be useful to have separate modules for each system type, or just one covering everything in the System section? --james 13:55, 12 December 2007 (GMT)
Has this been covered by George's work on the input?--james 18:55, 18 February 2008 (GMT)
Yes and no. I was hoping it would end up a bit more modular. I'm making some drastic renaming decisions: e.g. gone is the sysread module, and out comes a System module which will contain all the relevant System input data, as well as the NECI-generated runtime data. --alex 23:17, 19 February 2008 (GMT)
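As a rough illustration of what such a System module might hold. The type and field names below are guesses for the sake of the sketch, not the actual code:

```fortran
module System
   implicit none
   ! Which kind of system is being treated; the values are illustrative.
   integer, parameter :: SYS_HUBBARD = 1, SYS_CPMD = 2, SYS_READ = 3

   type :: SystemData
      integer :: SystemType      ! one of the SYS_* constants
      ! Hubbard-specific parameters
      integer :: LatticeSize(3)  ! lattice dimensions
      real(8) :: HubT, HubU      ! hopping t and on-site repulsion U
      ! For read-in integrals, just the symmetry information
      integer :: nSym            ! number of symmetry representations
   end type SystemData

   type(SystemData) :: Sys      ! global system object, filled at input time
end module System
```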
- Modularize the basis. Again this will include lists of basis functions, energies and symmetries. Probably also the interfaces to access the 1- and 2-electron integrals. --alex 16:36, 11 December 2007 (GMT)
- Modularize the many-electron system. By this I'm thinking of how we deal with determinants. The initial code used just lists of electrons in the determinants. As the number of electrons increases, this means copying large amounts of essentially redundant data (very few electrons actually change in a particular process). I've already hacked some of the code (the star, I think) to deal with 'excitation-based' determinants - i.e. excitations with respect to the Hartree-Fock determinant; a sketch follows this thread. This sort of thing should be transparent to the actual users (i.e. subroutines) of the determinants. --alex 16:36, 11 December 2007 (GMT)
This would be a big bonus in how we deal with new code, and would almost certainly affect the speed, and possibly the scaling, of the program. --ghb24 17:51, 12 December 2007 (GMT)
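A minimal sketch of the 'excitation-based' representation described above, assuming storage up to double excitations; all names are hypothetical:

```fortran
module DeterminantData
   implicit none

   ! Reference (Hartree-Fock) determinant: the full list of occupied
   ! spin-orbitals, stored once rather than once per determinant.
   integer, allocatable :: RefDet(:)

   ! A determinant stored as an excitation from the reference: only the
   ! orbitals that change are kept (here up to doubles), so the storage
   ! per determinant is independent of the total number of electrons.
   type :: ExcitDet
      integer :: Level    ! excitation level: 0 = reference, 1 or 2
      integer :: From(2)  ! occupied orbitals i, j vacated in RefDet
      integer :: To(2)    ! virtual orbitals a, b now occupied
   end type ExcitDet
end module DeterminantData
```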
- Many have heard horror stories of the infamous excitation generators, but only Alex has confronted the beast. Is there a way that these can be improved and made accessible to mere mortals, or are the problems with interfacing with CPMD too hard? I'm not sure of the quality of the commenting in these, or of the generality of their use, but it might be worth thinking about... --ghb24 17:51, 12 December 2007 (GMT)
Now I understand better the murkier details of Fortran memory management, perhaps the excitation generator storage could be handled better. I'm still not quite sure whether Fortran is flexible enough. --alex 23:17, 19 February 2008 (GMT)
- Combine sym.inc, sym.F and the symmetry work in UMatCache (and other places?) into a symmetry module.--james 18:55, 18 February 2008 (GMT)
- Go through each of the input modules and clearly comment what each variable does.--james 18:55, 18 February 2008 (GMT)
Completed
The following have been completed and merged back into the trunk:
- Input routine and associated data now in modules. Nice one George!
- testcode, a replacement for the old TESTQMC suite, is now in use. It can be used to perform regular, automated testing of the code in the repository. Currently this is done for the trunk every night on keiko. It can also be done on branches if anyone wishes.
- Local modifications to CPMD also stored in modules.
- Sets of defaults that can be selected via a date.
- New memory book-keeping scheme in place. Currently the old routines just call the new ones with some of the options set to a default value.
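A hypothetical illustration of the wrapper pattern described in the last point; the routine names and argument lists below are invented, not the actual book-keeping API:

```fortran
! New book-keeping routine: callers say who allocated what and how big.
subroutine LogAlloc(ArrayName, Caller, nElements, ElementSize, tag)
   implicit none
   character(*), intent(in) :: ArrayName, Caller
   integer, intent(in) :: nElements, ElementSize
   integer, intent(out) :: tag
   ! ... record the allocation in the global ledger ...
   tag = 0
end subroutine LogAlloc

! Old entry point kept for compatibility: it simply forwards to the new
! routine, with defaults for the options it never knew about.
subroutine OldLogAlloc(nElements, tag)
   implicit none
   integer, intent(in) :: nElements
   integer, intent(out) :: tag
   call LogAlloc('unknown', 'unknown', nElements, 8, tag)
end subroutine OldLogAlloc
```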
Problems
- Documentation. Experience shows that documentation must be added at the time the code is written. If interfaces are not well-defined when writing code (i.e. the code is very experimental, and you're not quite sure what it can/will do) then keep it out of the repository. Once code is checked into the repository it should be documented and understandable (both in the interface and the general algorithm). References to papers in the code are good. If a concept is too difficult to describe in the code, then refer to a document kept with the code (otherwise it will be lost). --alex 16:59, 11 December 2007 (GMT)
- So, last night I was doing some development on ajax, and then realised the systems I wanted to test were on destiny, so I had to transfer the development across. To do this I checked the code in (after running the short purely-NECI test suite), and then updated my destiny version. It turns out that what I had coded was utter rubbish (which is why I needed the tests), but as it was new, it shouldn't affect anyone. Eventually I put an updated version in. This seems to be somewhat of an abuse of the repository; however, my other options were:
- scp the code across. This would lead to two copies of experimental code lying around, which I disliked the idea of.
- create a new branch, and use that to get it across. This seemed more complicated than I would want to consider to test some experimental ideas.
I suspect the solution is to have my own branch which I can use for experimental things and transfers, and then sync it to the trunk when things are ready. This seems like extra effort, but it may be necessary. --alex 17:28, 12 December 2007 (GMT)
Perhaps the solution is to have a dev version of the trunk, which mirrors the trunk very closely (i.e. is never more than a day out of sync), and which people can use for quick testing like this without worrying about breaking everyone's work? It would be quite easy to set this up. For more serious development work (which would take maybe a week or more to be ready to merge back into the trunk), we should be using a more "edgy" branch (which only those working on that specific area are using). How does that sound? --james 18:55, 18 February 2008 (GMT)
- Documentation 2. Everyone agreed to peer-review.
Two months on, how's this working out for everyone? Are we doing it ok, or do we need to do it better/change something? --james 18:55, 18 February 2008 (GMT)
- Programmers' Reference. There are automatic programmers' reference generation systems which take comments from the code and turn them into a TeX reference manual. doxygen is one such system. Does anyone know of it? --alex 16:59, 11 December 2007 (GMT)
Such a resource would be very helpful for new people joining the group/working on areas of the code not written by you. Doxygen cannot parse Fortran code, sadly. I assume there's a similar resource available for Fortran... --james 14:06, 12 December 2007 (GMT)
It might well be worth rewriting the INPUT_DOC into a .tex file, as well as the code comments, and expanding it, so that maths can be inserted easily and references sorted easily. I can give that a go if people think it's worthwhile. --ghb24 17:51, 12 December 2007 (GMT)
The beauty of generating a programmers' reference from the source is that it reduces duplicated effort: the comments make the source easy to read, and the same comments make the documentation complete. It's a neat solution and one I'll investigate. --james 17:58, 12 December 2007 (GMT)
A further thought on this: a good practice would be to comment at the beginning of each routine what its purpose is (I know that this is done in many places, but certainly not all!).--james 18:01, 12 December 2007 (GMT)
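For instance, a header comment in the suggested style, on a made-up routine:

```fortran
!  SumDiag: return the trace of a square matrix.
!  In:  A(n,n) - the matrix
!       n      - its dimension
!  Ret: the sum of the diagonal elements of A.
real(8) function SumDiag(A, n)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in) :: A(n,n)
   integer :: i
   SumDiag = 0.d0
   do i = 1, n
      SumDiag = SumDiag + A(i, i)
   end do
end function SumDiag
```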
- Scalability: we are (well, I am) looking at larger and larger systems. There are various places in the code (e.g. uhfdet) where a maximum system size is hard-coded in and causes seg faults etc. Allocatable arrays are the way forward... --james 16:18, 18 December 2007 (GMT)
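A sketch of the change being advocated; the before/after pattern is illustrative, not lifted from uhfdet:

```fortran
! Before: a hard-coded maximum that seg-faults on larger systems.
!   integer, parameter :: MaxBasis = 1000
!   real(8) :: Coeffs(MaxBasis)

! After: size the array at run time from the actual system.
subroutine SetupCoeffs(nBasis)
   implicit none
   integer, intent(in) :: nBasis
   real(8), allocatable :: Coeffs(:)
   integer :: ierr

   allocate(Coeffs(nBasis), stat=ierr)
   if (ierr /= 0) stop 'Allocation of Coeffs failed'
   ! ... fill and use Coeffs ...
   deallocate(Coeffs)
end subroutine SetupCoeffs
```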
Object-oriented approach?
- I have been thinking about objectifying code for some time. One problem is that Fortran objects (TYPEs) cannot contain variable-length data. In particular, a general Determinant type would need a field whose length is the number of electrons, which is a variable. One solution (which I dislike) is to include a variable-sized Fortran pointer array (see the sketch at the end of this section). This means that every time you allocate memory for a determinant (which is done quite a lot) you also then allocate a separate portion of memory for the actual list of electrons. This is messy and inefficient. Another solution is to consider another language. C++ (with templates) can handle this sort of object with ease. Since object files and linking conventions have been around for a good 30 years, I think most of the problems linking different languages have been sorted out in this time. Could we venture into cross-language programming? My experience from Q-Chem is mixed. It is written in Fortran, C and C++, and this seems to cause quite a bit of hassle -- not with linking, but with misunderstanding of interfaces. Worse still, there are C++ wrappers on Fortran routines which seem to misunderstand the data-encapsulation nature of objects. This can lead to computational overhead if handled badly. I'm undecided. --alex 16:59, 11 December 2007 (GMT)
My (limited) experience is that mixed code is frequently messy, nasty and best avoided if possible. Plus, only Alex has any real experience with C or C++... Could we have a general determinant type based on a reference determinant, which is regarded as "special" (i.e. contains the full list of electrons)? --james 14:06, 12 December 2007 (GMT)
The idea is to allow things to be as general as possible without incurring computational overhead. So, for example, for the 2-vertex star one only needs double excitations from a reference determinant, i.e. one only needs to store a reference and then the ij->ab excitation for each determinant. However, one can base such a star on multiple determinants (especially in the case where we have degenerate systems like metals), at which point you'll need to specify a different reference. Further, I want to develop the idea of an 'active space' of determinants from which one might excite. Much of the manipulation for this should go on under the hood, so that you can deal with a 'Determinant' without knowing quite how it is specified. This leads to nice, generally applicable code, without having to rewrite for specific cases. --alex 17:28, 12 December 2007 (GMT)
I tend to agree with the above - I guess, though, that as we get more adventurous, the solution which minimises memory requirements will be key. Being able to specify an active space would be good, a la CASSCF - would there be subtleties here when doing MC and normalising graphs when some excitations within a vertex level lie outside the active space? --ghb24 17:51, 12 December 2007 (GMT)
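For reference, the pointer-array workaround Alex dislikes (first comment in this section) looks something like this sketch, with invented names:

```fortran
module DetType
   implicit none
   type :: Determinant
      integer :: nEl                          ! number of electrons
      integer, pointer :: Orbs(:) => null()   ! separately allocated list
   end type Determinant
contains
   subroutine NewDet(D, nElectrons)
      type(Determinant), intent(out) :: D
      integer, intent(in) :: nElectrons
      D%nEl = nElectrons
      ! This second allocation per determinant is the overhead objected
      ! to above: every determinant costs two separate allocations.
      allocate(D%Orbs(nElectrons))
   end subroutine NewDet
end module DetType
```

With the TR 15581 extension (folded into Fortran 2003), an allocatable component (integer, allocatable :: Orbs(:)) would be safer than a pointer, though each determinant still pays for its own separate allocation.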