Mek-quake Queueing system
Please write your thoughts about the mek-quake queueing system here. And please sign your posts! --Catherine 16:26, 21 June 2006 (BST)
My thoughts are that what works best on most of the other clusters to keep everyone satisfied is a fairly short maximum queue length, a limit on idle jobs, and priority that is strongly influenced by fairshare. This is what I have set up for the moment. We need to see how it works on this particular machine and then tweak it if people are unhappy. --Catherine 16:41, 21 June 2006 (BST)
Comment posted by Catherine on behalf of David Wales:
We are definitely going to need a two week queue. I don't think we need more than 32 processors, though. I think the ideal scenario is to keep the number of users to 8 or fewer, so that everyone can use up to 32 processors and no jobs will need to be queued.
My worry about a two-week queue is that it is only workable for the system's users if, as you say, there is no queue of waiting jobs. I am not sure that will be true on this system, but we'll see when the usage settles down. It also causes problems for systems administration. --Catherine 16:35, 21 June 2006 (BST)
I would like to have fewer nodes but for a longer time (2 weeks). That way I don't have to bother restarting and I don't have to wait in the queue forever. What about going down to a 16-node queue? That way the system should be better optimized (less dead time waiting for 32 nodes to become free). [User:gf247|Giorgio] 18:27 21 June 2006 (BST)
I don't know a lot about queueing systems, so I can only really talk about front-end problems. What has annoyed me about nimbus is that it is supposed to be for running parallel jobs, yet whenever I submit something (usually 2-10 nodes but sometimes up to 20, which I have to run on 10 nodes with 2 jobs per node, which is already not ideal) it can spend literally weeks in the queue while hundreds of serial jobs run. This leads me to use rama because I can get a job running sooner. However, rama is not optimised for parallel work, so although my jobs start, they run very slowly, and I no doubt annoy all the people wanting to run serial jobs there. Is there any way around this?
Jane