Tardis scheduling policy

From CUC3

Now is a good time to review tardis's scheduling policy as we have had the machine in service for a few months. Whatever we decide upon needs to be given a fair chance to work, so after this consultation I am not intending to change the policy again until the Long Vac at the earliest. It's therefore important to have your say now.

Current policy

Fairshare and priority

Every individual user has a fairshare target of 20% of the machine. If you go over that then you get penalties; under it and you get bonuses. There are also two QOS (quality of service) groups, which also have fairshare targets. These are based on how the machine was funded: the 'stuart' group (Stuart Althorpe's research group) get 52% and the 'portfolio' group (everyone else) get 48%.

Fairshare targets and usage can be seen by running 'diagnose -f'

The fairshare calculation takes the last six weeks of usage into account, decaying it at a rate of 0.8 per week.
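As a rough illustration of what a decayed fairshare figure looks like, the sketch below weights six weekly windows by successive powers of 0.8 and expresses one user's decayed usage as a percentage of everyone's. The function names and the usage figures are made up for the example, and Maui's own calculation may differ in detail.

```python
def decayed_usage(window_usage, decay=0.8):
    """Weighted sum of per-window usage; window_usage[0] is the most recent week."""
    return sum(u * decay ** i for i, u in enumerate(window_usage))


def fairshare_percent(user_windows, total_windows, decay=0.8):
    """User's decayed usage as a percentage of the decayed usage of the whole machine."""
    total = decayed_usage(total_windows, decay)
    if total == 0:
        return 0.0
    return 100.0 * decayed_usage(user_windows, decay) / total


# Hypothetical processor-hours for six weekly windows, most recent first.
user = [500, 0, 0, 0, 0, 0]
everyone = [1000, 1000, 1000, 1000, 1000, 1000]

# Compare the result with the 20% personal target.
print(round(fairshare_percent(user, everyone), 1))
```

A user who has been idle for five weeks and then takes half the machine for a week still shows well under 20% here, which is why a long memory lets such a user keep a bonus for some time.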

Priority is currently: 20 * ( personal fairshare bonus/penalty + group fairshare bonus/penalty ) + job expansion factor.

Job expansion factor rises with time spent on the queue, but rises faster for short jobs. The reason for using that and not basic queue time is that it helps the very short (30 min) test jobs to run. It makes practically no difference when compared to the fairshare numbers, but ensures that every job eventually runs.
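A minimal sketch of the priority formula follows. It assumes the usual Maui definition of expansion factor (1 + queue time / requested wallclock limit) and assumes the fairshare bonus or penalty is expressed in percentage points relative to the target; the function names are hypothetical.

```python
def expansion_factor(queue_seconds, walltime_limit_seconds):
    """Grows with time on the queue, and grows faster for jobs with short limits."""
    return 1.0 + queue_seconds / walltime_limit_seconds


def priority(personal_fs, group_fs, queue_seconds, walltime_limit_seconds, fs_weight=20):
    """fs_weight * (personal + group fairshare bonus/penalty) + expansion factor."""
    xfactor = expansion_factor(queue_seconds, walltime_limit_seconds)
    return fs_weight * (personal_fs + group_fs) + xfactor


two_hours = 2 * 3600

# A 30 minute test job and a 24 hour job, both queued for two hours.
short_job = expansion_factor(two_hours, 30 * 60)
long_job = expansion_factor(two_hours, 24 * 3600)
print(short_job, round(long_job, 3))

# Assumed fairshare figures: 5 points under personal target, 2 points over group target.
print(priority(5, -2, two_hours, 30 * 60))
```

With a weight of 20, a few percentage points of fairshare difference outweigh any expansion factor a job is likely to reach, which matches the remark that expansion factor makes practically no difference next to fairshare.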

Priority calculations for all queued jobs can be seen by running 'diagnose -p'

Throttling policies

There is one throttling policy in use: any user may only have four jobs in the 'Idle' state at any given time. This avoids queue stuffing. However it does not help when one person has a very big fairshare bonus and submits a lot of jobs, because every job that gets to run is replaced in the queue immediately.
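The effect of the idle-job limit can be sketched as follows. This is an illustration only, with hypothetical job and user names; it assumes jobs over the limit are held as blocked until one of that user's idle jobs starts.

```python
from collections import defaultdict

MAX_IDLE_PER_USER = 4  # the limit described above


def classify_queue(queued_jobs, max_idle=MAX_IDLE_PER_USER):
    """Split queued (job_id, user) pairs, in submission order, into idle and blocked."""
    idle, blocked = [], []
    idle_count = defaultdict(int)
    for job_id, user in queued_jobs:
        if idle_count[user] < max_idle:
            idle_count[user] += 1
            idle.append(job_id)
        else:
            blocked.append(job_id)
    return idle, blocked


queue = [(f"a{i}", "alice") for i in range(6)] + [("b0", "bob")]
print(classify_queue(queue))

# Once a0 starts running it leaves the queue and a4 takes its idle slot.
queue_after_start = [job for job in queue if job[0] != "a0"]
print(classify_queue(queue_after_start))
```

The second call shows the weakness noted above: alice still holds four idle slots straight after one of her jobs starts.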


Reservation and Backfill

This policy isn't making much difference at the moment; I don't think it needs changing but I mention it for completeness. We make a reservation for the top queued job and then backfill other jobs around that (i.e. let them jump the queue if and only if they will not have any effect on the start time of the top queued job). This stops big parallel jobs being crowded out by small jobs, but only once they have got to the top of the queue. Without this the 32-processor jobs would almost never run.
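The backfill test can be sketched as a simple check. This is a simplified model rather than Maui's actual algorithm: it assumes a single reservation, and the parameter names are hypothetical.

```python
def can_backfill(job_procs, job_walltime, free_procs_now,
                 time_until_reservation, spare_procs_during_reservation):
    """Return True if starting the job now cannot delay the reserved top job.

    spare_procs_during_reservation is the number of processors expected to be
    free once the top job has started, beyond those the top job itself needs.
    """
    if job_procs > free_procs_now:
        return False  # does not fit on the machine right now
    if job_walltime <= time_until_reservation:
        return True   # will have finished before the top job is due to start
    # Otherwise it overlaps the reservation and must fit alongside the top job.
    return job_procs <= spare_procs_during_reservation


# Top job is due to start in one hour and will leave no processors spare.
print(can_backfill(4, 1800, 8, 3600, 0))  # 30 minute job
print(can_backfill(4, 7200, 8, 3600, 0))  # 2 hour job
```

Short jobs benefit most from this, since they are the ones that can finish inside the gap before the reservation.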

The current problem

The Althorpe group currently have only one user actively submitting jobs and little historical usage, so that user account gets a huge fairshare bonus and can crowd everyone else out of the machine. This is taking considerable time to correct because the fairshare memory is long and the deficit is large.

Things we could change and their likely effects

Maui is amazingly configurable; any policy you can come up with we can probably find a way to make Maui do. Here are a few possibilities:

  • Shorter or fewer fairshare windows, so the machine has a shorter memory. We could reduce the windows to 48 hours, which would mean the machine would only remember just under a fortnight of usage.
  • Dilute group fairshare (i.e. give personal fairshare a bigger multiplier than 20)
  • Drop group fairshare and possibly give Stuart's group bigger personal fairshare instead. The problem with this is working out how much is fair, as it would vary with the number of users in each category.
  • Max processors per person limit. This would have to be quite high otherwise it reduces utilization.
  • Max outstanding processor-seconds per person limit. Works well on machines with very variable job lengths. But again how much?
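To make two of these options concrete, the sketch below works out the fairshare memory for a given window length, assuming six windows are kept, and shows how per-person processor and processor-second limits would decide whether a job may start. The function names and the limit values are hypothetical and not proposals.

```python
def fairshare_memory_days(window_hours, depth=6):
    """Total usage history remembered, assuming `depth` windows are kept."""
    return window_hours * depth / 24


def may_start(job_procs, job_walltime, running, max_proc=None, max_ps=None):
    """Check a job against per-person limits.

    running: list of (procs, remaining_seconds) for the user's running jobs.
    max_proc: cap on processors in use; max_ps: cap on outstanding processor-seconds.
    """
    procs_in_use = sum(procs for procs, _ in running)
    ps_outstanding = sum(procs * secs for procs, secs in running)
    if max_proc is not None and procs_in_use + job_procs > max_proc:
        return False
    if max_ps is not None and ps_outstanding + job_procs * job_walltime > max_ps:
        return False
    return True


print(fairshare_memory_days(48))       # 48 hour windows
print(fairshare_memory_days(7 * 24))   # weekly windows

# One 16 processor job with two hours left; can an 8 processor, 1 hour job start?
running = [(16, 7200)]
print(may_start(8, 3600, running, max_proc=32))
print(may_start(8, 3600, running, max_ps=100000))
```

The processor-second limit treats one long job and many short jobs alike, which is why it suits machines with very variable job lengths.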

--Catherine 16:01, 16 March 2007 (GMT)


My suggestion is that we could increase the priority penalty for each job that is already running. In other words, if I already have 10 jobs running, my jobs in the queue should get a lower priority. This way, if the machine is empty I can still run many jobs, which would not be possible if a "Max processors per person limit" or "Max outstanding processor-seconds per person limit" were introduced.


--Ivan 10:47, 17 March 2007 (GMT)

My suggestion would be to limit the maximum processors per person (set high) if there are jobs from other users queuing, but allow one user to use the whole machine if it's empty. I also think a shorter fairshare window is needed. Usage of tardis has fluctuated a lot recently so I think it should reflect this.

--Chris 14:20, 19 March 2007 (GMT)


I think Ivan's suggestion is an excellent solution. A max processors/max processor-seconds approach could stop someone from using the full resources available if tardis is empty, yet wouldn't penalise queued jobs submitted by people who are occupying a large amount of the cluster.

A 6 week memory might be on the long side: I go through periods of running just short test jobs (mainly on my workstation) and periods when I'm running large numbers of fairly demanding calculations. I get these through very quickly because of my large fairshare factor (which is good for me, but maybe less good for those who have to wait whilst I occupy a huge chunk of a cluster for several days).

--james 17:05, 19 March 2007 (GMT)

A MAXPS or MAXPROC limit causes excess jobs to be blocked, so it does penalise queued jobs submitted by people who are occupying large parts of the cluster already, by not allowing them to run at all. We are using those policies on destiny and mek-quake successfully at the moment.

Ivan's suggestion seems to me to be equivalent to fairshare with a very short memory. This would work but I think setting the memory too short would be counterproductive.

Chris's suggestion of a variable MAXPROC limit is an interesting one; Maui can do something like this but I'm not sure of the details. I will look them up. The thing that approach suffers from is that you still get the latency of waiting for a slot after one person has filled up the entire machine, even if their queued jobs are then blocked.

--Catherine 09:03, 20 March 2007 (GMT)

Yes, I think that 6 weeks' memory is probably a bit too long: I would suggest we try 3 weeks and see how this works out.

Also, another suggestion: it would be useful if the script could be configured to send out a weekly or monthly automated summary of usage. This would help identify any problems that needed rectifying in the longer term.

--stuart 11:03, 20 March 2007 (GMT)