Revamping the modules system
The large number of different compilers, maths libraries, and MPI suites on the clusters causes confusion. The modules system works quite well for switching between them (it's a lot better than nothing, anyway), but the dependencies between modules sometimes cause confusion of their own. Can this be improved?
Default environment
One possible improvement would be to have all new user accounts load a module called 'basic' that loads a standard set of modules. The 'basic' module could then be updated centrally as needed. For example, I currently have a problem in that one of the clusters sets users up with 'score', 'pgi64' and 'ifort_em64t' by default. Later on we got a new compiler, so we now have an 'icc' module too. I want everyone to load that, because it makes sense: it adds to the functionality of the 'score' module. But unless I edit all the users' config files by hand, there is no way to do this. And if I have to do the editing, I might as well edit them to use 'basic'. Good idea or not?
This idea has now been implemented on the reinstalled destiny cluster, on mek-quake, and on the workstations. It works OK, but you have to be quite careful to purge the environment before loading it.
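For the record, the metamodule needs nothing clever. A minimal sketch, using the module names mentioned above (the help text is mine):

    #%Module1.0
    ## basic - site-standard default environment.
    ## Purge first: 'module purge' then 'module load basic'.
    proc ModulesHelp { } {
        puts stderr "Loads the standard set of default modules."
    }
    module-whatis "standard default environment"
    module load score
    module load pgi64
    module load ifort_em64t
    module load icc

Because the loads are done inside the modulefile, editing 'basic' centrally changes everyone's default set the next time they log in.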
MPI confusion
The MPI modules currently enforce the correct choice of compilers by failing to load if you try to load the wrong MPI for your loaded compilers, or if you haven't got compilers loaded at all. Personally I find I almost always have the wrong compilers loaded for the MPI I want, and have to mess about for a few minutes to fix it. Should the MPI modules automatically load the right compiler modules themselves?
It seems like a good idea at first, but then you have the problem of dealing with conflicts: either I make the load fail with an error (and then it's almost the same as the current situation, though it does deal with the rare case that no compilers were loaded at all), or I silently swap the compilers. The latter seems to me to be asking for trouble. I could make the module be verbose about what it is doing, but that would probably mean learning more Tcl, and could lead to problems when people load non-default modules in shell startup files.
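For what it's worth, the verbose auto-load variant is only a few lines of Tcl. A sketch, assuming a LAM build for the Portland compilers (the install paths here are made up):

    if { ![is-loaded pgi64] } {
        # note: if a rival compiler is already loaded, pgi64's own
        # conflict declarations will still make this load fail
        module load pgi64
        puts stderr "lam: loaded pgi64 to match this MPI build"
    }
    prepend-path PATH /usr/local/lam-pgi/bin
    prepend-path LD_LIBRARY_PATH /usr/local/lam-pgi/lib

The puts is exactly the sort of unexpected output that could upset shell startup files, which is the worry above.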
The problem with MPI is particularly acute because there is no one MPI library I can load as standard that will work for most (or even many) users, so most people who use MPI are going to have to fight with the modules.
The current attempt at a fix is a family of modules called 'environment', each version of which loads a compatible set of modules. For example, we have environment/32-bit/parallel/lam/pgi on mek-quake, plus many others. The idea is that most people will just load one version of the 'environment' module in their .bash_profile and never see the underlying MPI and compiler modules at all.
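Internally each 'environment' version is just another metamodule. A sketch of environment/32-bit/parallel/lam/pgi, with made-up names for the underlying modules:

    #%Module1.0
    ## environment/32-bit/parallel/lam/pgi (sketch)
    # only one environment at a time, whichever version it is
    conflict environment
    module load pgi
    module load lam-pgi

The conflict on the family name means loading a second environment fails cleanly instead of mixing two stacks.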
Libraries
Currently the library modules are very inconsistent. MKL forces you to load an Intel compiler module (and usually one for the same architecture as your MKL), and the Intel compiler modules contain local tweaks that point the compilers at the right MKL libs and includes when the MKL module is loaded. This is inconsistent because you can perfectly well use MKL with gcc, yet loading the MKL module does nothing to help gcc compiles find MKL.
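Roughly, the current MKL behaviour amounts to something like this (the path is invented for the sketch):

    # insist on one of the Intel compiler modules being loaded
    prereq icc ifort_em64t
    prepend-path LD_LIBRARY_PATH /opt/intel/mkl/lib/em64t

It's the prereq that shuts out gcc users, even though nothing in MKL itself requires an Intel compiler.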
Other library modules just set LD_LIBRARY_PATH and hope that the person loading them knows to add the right -L and -I options when compiling. That's not much help to the user. Experience has shown that explaining this in the module help doesn't work, because people don't expect to need to look there. Maybe those modules need to print instructions when they are loaded (and therefore must never, ever be loaded by default, or the unexpected output breaks other things...)
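Printing the instructions is easy enough. A library module could do something like the following, guarding on the mode so the text only appears at load time (paths invented):

    if { [module-info mode load] } {
        puts stderr "fftw2: compile with -I/usr/local/fftw2/include"
        puts stderr "       link with -L/usr/local/fftw2/lib -lfftw"
    }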
Some people have suggested having the library modules append to LIBDIRS and INCLUDEDIRS variables that all the compilers use, and that people could use in their Makefiles. This is very attractive. However, what if the library only works with a certain compiler (e.g. fftw2, which must be recompiled for each Fortran compiler)? I'd have to make that library module conflict with the other compiler modules, as opposed to simply requiring the right compiler. Then someone is going to have fftw2 loaded, forget about it, and suddenly find they can't load a new compiler. How likely is that to happen?
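As a sketch of the suggestion, an fftw2-for-Portland module would look something like this (paths invented, and note the conflicts rather than a simple prereq):

    prereq pgi64
    # shut out the other compilers entirely, since this build
    # of fftw2 only links correctly with the Portland compilers
    conflict icc ifort_em64t
    append-path LIBDIRS /usr/local/fftw2-pgi/lib
    append-path INCLUDEDIRS /usr/local/fftw2-pgi/include

The compilers' local tweaks, or people's Makefiles, would then have to turn the colon-separated lists into -L and -I options.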
The other problem is that some of the libraries work with most compilers, but there are combinations I know nothing about and have no time to test. What do I do with those? If I don't put dependencies on them, does that send a signal that they're expected to just work?
I am also concerned about ordering effects: by using LIBDIRS and INCLUDEDIRS I am forcing an ordering of -L and -I options onto people. Is that likely to go wrong? And what do I do when someone needs fine control of their linker? Tell them to mangle the variables by hand? Tell them to unload all the modules and do it all the old-fashioned way? Suggest they run ld by hand (total control, but really bad for anyone compiling MPI code)? Tell them to load modules in exactly the right order? Which of those is least bad?
Finally, what do I do about gcc and g77 to make them use the variables? I can't wrap the system binaries, because every other security patch would stamp on the wrappers, and I can't afford to uninstall the system compilers and maintain my own. The only thing that works is putting a wrapper somewhere early in the PATH, like /usr/local, and that has other problems: it can't live on NFS, because it has to be at the start of the PATH and then you're vulnerable to the network going down, and if it's in /usr/local I can't maintain it on the workstations without building more infrastructure. Does anyone here care about gcc/g77 anyway?
Conflicts
It's a pity conflicts aren't automatically symmetric. A lot of the confusion in the past has arisen from my forgetting to make a pair of modules conflict both ways. I really don't know what to do about that except test things more carefully.
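That is, the conflict command only protects the module that declares it, so both halves have to be written out, and it's the second half that keeps getting forgotten:

    # in the pgi32 modulefile (names illustrative):
    conflict pgi64
    # ...and the easily-forgotten mirror image in pgi64:
    conflict pgi32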
Address space confusion
I have not been consistent in handling the 32/64-bit conflict. The compilers are handled by separate modules for the 32-bit and 64-bit versions, which conflict as needed, whereas in some places the MKL libraries are handled by a single module with separate 32-bit and 64-bit versions. That's confusing for everyone and needs fixing. The fact that Portland ship a 32-bit compiler capable of producing 64-bit binaries complicates things even more.
Maybe I should just remove the 32-bit Portland compilers from the 64-bit machines and force everyone to use explicit compiler switches to generate 32-bit binaries. But then there's the problem of stopping people from trying to link 32-bit code against a 64-bit library.
Two module trees, 32/ and 64/, which conflict with each other? I never managed to make conflicts in .modulerc files work quite right, though.
Is it just easier to make the address length extremely explicit in the version of each module, load a consistent standard set by default, and let people work it out for themselves?
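Making the address length explicit in the version string would at least be cheap, and a .modulerc could still pick a sensible default per machine (version numbers invented for the sketch):

    #%Module1.0
    ## .modulerc for mkl: spell out the address length in the
    ## version, and make the 64-bit build the default on this host
    module-version mkl/8.0-64bit default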
Possibly the whole problem just needs pointing out clearly in the documentation for the machines where this is an issue.