Mek-quake Queueing system
Please write your thoughts about the mek-quake queueing system here. And please sign your posts! --Catherine 16:26, 21 June 2006 (BST)
My thoughts are that what works best on most of the other clusters to keep everyone satisfied is a fairly short maximum queue length, a limit on idle jobs, and priority that is strongly influenced by fairshare. This is what I have set up for the moment. We need to see how it works on this particular machine and then tweak it if people are unhappy --Catherine 16:41, 21 June 2006 (BST)
Comment posted by Catherine on behalf of David Wales:
We are definitely going to need a two week queue. I don't think we need more than 32 processors, though. I think the ideal scenario is to keep the number of users to 8 or fewer, so that everyone can use up to 32 processors and no jobs will need to be queued.
My worry about a two week queue is that it is only workable for the system's users if, as you say, there is no queue of waiting jobs. I am not sure that will be true on this system, but we'll see when the usage settles down. It also causes problems for systems administration. --Catherine 16:35, 21 June 2006 (BST)
I would like to have fewer nodes but for a longer time (2 weeks). That way I don't have to bother restarting and I don't have to wait in the queue forever. How about we go down to a 16 node queue? The system should then be better optimized (less dead time waiting for 32 nodes to become free). [User:gf247|Giorgio] 18:27 21 June 2006 (BST)
The big trouble with longer queues from the users' point of view is that they increase the latency of the system. The fairshare might have put you at the top of the queue, but if the whole cluster is tied up running two week jobs that have already started, it can still be up to two weeks before your queued job starts to run. It also makes life difficult for me because I have to plan system work further in advance. I would really like to avoid two week queues if at all possible. A one week queue is already much longer than you have on nimbus: can we try it like this and see what happens first? --Catherine 09:57, 22 June 2006 (BST)
David has tried and still needs a two week queue, so I have put some on the system now: they are called hXX (for huge). In order to try to reduce the latency problem I've set a maximum outstanding processor time limit of 128 processor-weeks. This can be reduced if we have problems. --Catherine 17:44, 24 August 2006 (BST)
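To put the 128 processor-week figure in concrete terms, here is a rough Python sketch of the arithmetic. It assumes the limit is counted as processors multiplied by requested time, and that a scheduler might want the value in processor-seconds; the function names are made up for illustration and are not part of any scheduler.

```python
# Rough arithmetic only: function names are hypothetical, not a scheduler API.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def processor_weeks_to_seconds(processor_weeks):
    """Convert a limit in processor-weeks to processor-seconds."""
    return processor_weeks * SECONDS_PER_WEEK


def max_concurrent_jobs(limit_processor_weeks, procs_per_job, weeks_per_job):
    """How many identical jobs fit under the limit if all start together."""
    return limit_processor_weeks // (procs_per_job * weeks_per_job)


limit_ps = processor_weeks_to_seconds(128)
full_size_jobs = max_concurrent_jobs(128, 32, 2)

print(limit_ps)        # 77414400
print(full_size_jobs)  # 2
```

So under these assumptions the limit allows two freshly started 32 processor, two week jobs at once. As jobs run down their remaining time, the outstanding total shrinks and more work can start.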
David proposes a cpu limit per user.
My feeling is that 128 would be best but I'm prepared to try 64, and David wants 64. I also think MAXPS should be reduced to 64 cpu-weeks to avoid people submitting many 1 or 4 cpu jobs to the h queues and blocking the system that way. --Catherine 08:43, 1 November 2006 (GMT)
In the end we agreed on a cpu limit of 96 and on reducing MAXPS to 64 cpu-weeks --Catherine 14:35, 3 November 2006 (GMT)
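A small Python sketch of how a cpu limit of 96 and a MAXPS of 64 cpu-weeks might interact. It assumes both limits apply per user and that outstanding time is cpus multiplied by remaining requested time; the `Job` class and helper functions are hypothetical and only model the arithmetic, not the scheduler's real behaviour.

```python
from dataclasses import dataclass

# Values from the discussion; per-user scope is an assumption.
CPU_LIMIT = 96
MAXPS_CPU_WEEKS = 64


@dataclass
class Job:
    cpus: int
    weeks_remaining: float


def can_start(running, candidate):
    """True if starting the candidate keeps the user under both limits."""
    jobs = running + [candidate]
    total_cpus = sum(j.cpus for j in jobs)
    total_cpu_weeks = sum(j.cpus * j.weeks_remaining for j in jobs)
    return total_cpus <= CPU_LIMIT and total_cpu_weeks <= MAXPS_CPU_WEEKS


def count_startable(cpus, weeks):
    """Number of identical jobs one user could start at the same moment."""
    running = []
    while can_start(running, Job(cpus, weeks)):
        running.append(Job(cpus, weeks))
    return len(running)


one_cpu_two_week = count_startable(1, 2)      # MAXPS is the binding limit
four_cpu_two_week = count_startable(4, 2)     # MAXPS is the binding limit
one_cpu_one_day = count_startable(1, 1 / 7)   # the cpu limit is the binding limit

print(one_cpu_two_week, four_cpu_two_week, one_cpu_one_day)
```

Under these assumptions a user can start at most 32 one-cpu or 8 four-cpu two week jobs at once, rather than the 96 or 24 that the cpu limit alone would allow. For short jobs the cpu limit of 96 is what applies.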
I have moved a question about rama and nimbus from here to a new page for general queueing questions --Catherine 09:57, 22 June 2006 (BST)