Revamping the modules system
The large number of different compilers, maths libraries, and MPI suites on the clusters causes confusion. The modules system works quite well for switching between them (it's a lot better than nothing, anyway), but the dependencies between modules sometimes add confusion of their own. Can this be improved?
Default environment
One possible improvement would be to have all new user accounts load a module called 'basic' that loads a standard set of modules. The 'basic' module could then be updated centrally as needed. For example, I currently have a problem in that one of the clusters sets users up with 'score', 'pgi64', and 'ifort_em64t' by default. Later we got a new compiler, so we now have an 'icc' module too. I want everyone to load that, because it makes sense: it adds to the functionality of the 'score' module. But unless I edit all the users' config files by hand there is no way to do this, and if I have to do the editing, I might as well edit them to use 'basic'. Good idea or not?
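For concreteness, a minimal sketch of what 'basic' could look like as a meta-modulefile (the exact contents here are an assumption, using the module names above):

    #%Module1.0
    ## 'basic': the only module a new account loads directly.
    ## The standard set behind it can then be changed centrally.
    proc ModulesHelp { } {
        puts stderr "Loads the standard set of modules for this cluster."
    }
    module load score pgi64 ifort_em64t icc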
This idea has now been implemented on the reinstalled destiny cluster, mek-quake, and the workstations. It works OK for people who know exactly what they're doing, but you have to be quite careful to purge the environment before loading one of these package modules (i.e. it's best only done in shell startup files), and they seem to cause a certain amount of confusion among those who aren't used to the whole switching-environment thing. The jury is still out.
MPI confusion
The MPI modules currently enforce the correct choice of compilers by failing to load if you try to load the wrong MPI for your loaded compilers, or if you haven't got compilers loaded at all. Personally I find I almost always have the wrong compilers loaded for the MPI I want, and have to mess about for a few minutes to fix it. Should the MPI modules automatically load the right compiler modules themselves?
It seems like a good idea at first, but then you have the problem of dealing with conflicts: either I make the load fail with an error (which is almost the same as the current situation, though it does handle the rare case where no compilers were loaded at all), or I silently swap the compilers. The latter seems to me to be asking for trouble. I could make the module be verbose about what it is doing, but that would probably mean learning more Tcl, and could lead to problems when people load non-default modules in shell startup files.
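For reference, the current enforcement amounts to something like this in modulefile Tcl (the module names and paths here are illustrative, not the real files):

    #%Module1.0
    ## A LAM-for-PGI modulefile, roughly: 'prereq' makes the load fail
    ## unless the matching compiler is already present, and 'conflict'
    ## keeps the rival MPI flavours out.
    conflict        mpich lam/intel lam/gcc
    prereq          pgi64
    prepend-path    PATH            /usr/local/lam-pgi/bin
    prepend-path    LD_LIBRARY_PATH /usr/local/lam-pgi/lib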
The problem with MPI is particularly acute because there is no single MPI library I can load as standard that will work for most (or even many) users, so most people using MPI are going to have to fight with the modules.
The current attempt at a fix is a family of modules called 'environment', each of which loads a compatible set of things. For example, we have environment/32-bit/parallel/lam/pgi on mek-quake, plus many others. The idea is that most people will just load one version of the 'environment' module in their .bash_profile and never see the underlying MPI and compiler modules at all.
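Sketched as a modulefile (the underlying module names are assumed), each flavour is just a thin wrapper:

    #%Module1.0
    ## environment/32-bit/parallel/lam/pgi, roughly: refuse to coexist
    ## with any other flavour of 'environment', then pull in a
    ## known-compatible compiler and MPI pair.
    conflict    environment
    module load pgi32
    module load lam/pgi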
Having now tried this for a few months: it works rather well, provided the user sets their shell up with the right package and then forgets about it. Attempts to change horses midstream lead to grief!
Libraries
Currently the library modules are very inconsistent. MKL forces you to load an Intel compiler module (and usually one for the same architecture as your MKL), and the Intel compiler modules contain local tweaks that point the compilers at the right MKL libs and includes when the MKL module is loaded. This is inconsistent because you can perfectly well use MKL with gcc, yet loading the MKL module does nothing to help gcc compiles find MKL.
Other library modules just set LD_LIBRARY_PATH and hope that the person loading them knows to add the right -L and -I flags when compiling. That's not much help to the user. Experience has shown that explaining this in the module help doesn't work, because people don't expect to need to look there. Maybe those modules need to write out instructions on load (and therefore must never, ever be loaded by default, or that breaks other things...)
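If the modules are to nag instead, a bare library module can write to stderr on load, along these lines (library name and paths hypothetical):

    #%Module1.0
    ## Set the runtime path, then tell the user what to do about
    ## compiling, since nothing else will.
    prepend-path LD_LIBRARY_PATH /usr/local/somelib/lib
    if { [ module-info mode load ] } {
        puts stderr "somelib: compile with -I/usr/local/somelib/include"
        puts stderr "         and link with -L/usr/local/somelib/lib"
    }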
Some people have suggested having the library modules append to LIBDIRS and INCLUDEDIRS variables that all the compilers use, and that people could use in their Makefiles. This is very attractive. However, what about a library that only works with a certain compiler (e.g. fftw2, which must be recompiled for each Fortran compiler)? I'd have to make that library module conflict with the other compiler modules, as opposed to simply requiring the right compiler. Then someone's going to have fftw2 loaded, forget about it, and suddenly find they can't load a new compiler. How likely is that to happen? A sketch of what such a modulefile would look like follows below.
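To make the fftw2 case concrete, under the LIBDIRS/INCLUDEDIRS scheme its modulefile would have to look something like this (compiler module names and paths illustrative):

    #%Module1.0
    ## fftw2 built against PGI's Fortran: requiring pgi64 alone is not
    ## enough, because LIBDIRS is compiler-agnostic and any loaded
    ## compiler would pick the library up, so the other Fortran
    ## compilers have to be shut out entirely.
    prereq      pgi64
    conflict    ifort_em64t pathscale nag
    append-path LIBDIRS     /usr/local/fftw2-pgi/lib
    append-path INCLUDEDIRS /usr/local/fftw2-pgi/include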
The other problem is that some of the libs work with most compilers, but there are combinations I know nothing about and have no time to test. What do I do with those? If I don't put dependencies on them, does that send a signal that they're expected to just work?
I am also concerned about ordering effects: by using LIBDIRS and INCLUDEDIRS I am forcing an ordering of -L and -I options onto people. Is that likely to go wrong? And what do I do when someone needs fine control of their linker? Tell them to mangle the variables by hand, tell them to unload all the modules and do it the old-fashioned way, suggest they run ld by hand (total control, but really bad for anyone compiling MPI code), or tell them to load modules in exactly the right order? Which of those is least bad?
Finally, what do I do about gcc and g77 to make them use the variables? I can't wrap them, because every other security patch would stamp on the wrapper, and I can't afford to uninstall the system compilers and maintain my own. Only putting a dummy somewhere in the path, like /usr/local, deals with it, and that has other problems: it can't be on an NFS path, because it has to be at the start of PATH and then you're vulnerable to the network going down; and if it's in /usr/local I can't maintain it on the workstations without building more infrastructure. Does anyone here care about gcc/g77 anyway?
After musing on this for a few months I had another crack at it while working on the SuSE 10.1 image, and discovered the variables LIBRARY_PATH and CPATH. These do Exactly What I Want for gcc and friends: they act as lists of directories to search for libs and includes respectively, and don't require wrapping the compiler in any way at all. (Why is this stuff not better documented? I eventually found it in a locked filing cabinet stuck in a disused lavatory, etc. etc.) icc and ifort also support these out of the box, and it was a trivial fix to make Portland do so (a siterc file and some help from the Portland FAQ). NAG is just a wrapped gcc, so it supports them too, and anyone who needs full control can simply unset them and everything goes away. I haven't tackled Pathscale yet, but since I had to wrap it extensively to make it work at all, I already have the hooks even if it doesn't turn out to support them out of the box.
So basically all library modules now adjust these two variables, and the difficult ones (fftw, and acml if I ever install it) will just have to carry dependencies on the compiler modules.
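The standard library module pattern is then roughly this (paths illustrative):

    #%Module1.0
    ## An ordinary library module under the new scheme: the compilers
    ## pick up LIBRARY_PATH and CPATH by themselves at compile and link
    ## time; the dynamic linker still needs LD_LIBRARY_PATH at run time.
    prepend-path LIBRARY_PATH    /usr/local/gsl/lib
    prepend-path CPATH           /usr/local/gsl/include
    prepend-path LD_LIBRARY_PATH /usr/local/gsl/lib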
Not that I have exposed anyone other than myself to this yet of course...
Some months later
I implemented this (Pathscale and all) in the 10.2 image. No one's made any comment at all, so I guess at least it broke nothing. And personally I find it really useful.
Conflicts
It's a pity conflicts aren't automatically symmetric. A lot of past confusion has arisen from my forgetting to make a pair of modules conflict both ways. I really don't know what to do about that except test things more carefully.
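The trap, concretely: a 'conflict' line is only checked when the modulefile containing it is loaded, so both files need one (module names made up):

    ## In modulefile foo/1.0:
    conflict bar

    ## ...and the mirror image that is so easy to forget, in bar/1.0,
    ## without which 'module load bar' succeeds while foo is loaded:
    conflict foo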
Address space confusion
I have not been consistent about the 32/64-bit conflict. The compilers are handled by separate modules for the 32- and 64-bit versions, which conflict as needed, while in some places the MKL libs are done with one module and different versions for 32- and 64-bit. That's confusing for everyone and needs fixing. The fact that Portland ship 32-bit compilers capable of producing a 64-bit binary complicates things even more.
Maybe I should just remove the 32-bit Portland from the 64-bit machines and force everyone to use explicit compiler switches to generate a 32-bit binary. But then there's the problem of stopping people from trying to link to a 64-bit library when they're compiling 32-bit code.
A few months later I tried a variant of this by installing the new 64-bit cluster tardis without 32-bit Portland modules. I was fairly soon asked to install the 32-bit module. That answers that question pretty conclusively, I think.
Two module trees, 32/ and 64/, which conflict with each other? I never managed to make conflicts in .modulerc files work quite right, though.
Is it just easier to make the address length extremely explicit in the version of each module, load a consistent standard set by default, and let people work it out for themselves?
Possibly the whole problem just needs pointing out clearly in the documentation for the machines where this is an issue.
Current approach on the SuSE 10.1 image: give every module (except for programs like Mathematica, which genuinely work out which architecture you're on and Do The Right Thing) a 32 and a 64 subdirectory. 32-bit machines see a module tree with only the 32 subdirectories populated; 64-bit machines see both, but the defaults are always 64. That way people have to work quite hard to mix things up.
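The 64-bit default is then pinned per module with a .version file; e.g. for a pgi tree containing pgi/32 and pgi/64 (names assumed), pgi/.version on the 64-bit machines would read:

    #%Module1.0
    ## Make '64' the default, so a plain 'module load pgi' gets the
    ## native flavour; loading the 32-bit one takes deliberate effort.
    set ModulesVersion "64"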