Tardis scheduling policy

From CUC3
Latest revision as of 17:02, 11 September 2008

We appear to need to take another look at tardis's scheduling policy. Greg and Catherine would like everyone to agree on a policy and then give it a reasonable trial before starting to tweak it again, so whatever we come up with we'd like to stick with until January 2009 at the earliest.

== Current policy ==

=== Priority ===

This determines the order of Idle jobs in the queue. Type 'showq' to see the sorted queue.

Priority is currently: 20 * ( personal fairshare result + group fairshare result ) + job expansion factor .
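As a rough sketch of the formula above (the 20x fairshare weight comes from the text; the fairshare results and expansion factor here are illustrative numbers, not real Maui output):

```python
def job_priority(personal_fairshare, group_fairshare, expansion_factor):
    # Priority formula from the text: a fairshare result is positive
    # when usage is under target and negative when over.
    return 20 * (personal_fairshare + group_fairshare) + expansion_factor

# A user under their personal target but in a heavily over-target group
# still ends up with strongly negative priority:
print(job_priority(personal_fairshare=5.0, group_fairshare=-8.0,
                   expansion_factor=1.2))
```

Note how the group term can swamp the personal term, which is exactly the "other people in your group can make your fairshare negative" effect described below.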

Every individual user has a fairshare target of 20% of the machine. If your usage goes over that target your fairshare result is negative; if it stays under, your result is positive. Everyone is also in one of four groups of users, each with its own group fairshare target. These are based on how the machine was funded: the 'stuart' group (Stuart Althorpe's research group) gets 38%, the 'michiel' group (Michiel Sprik) gets 20% (this includes their share of the Portfolio funding), the 'jochen' group (Jochen Blumberger) gets 14%, and the 'portfolio' group (everyone else) gets 28%. Your group result and your personal result are added, so other people in your group can make your fairshare negative by running lots of jobs.

Fairshare targets and usage can be seen by running 'diagnose -f'. The research group targets are labelled as QOS targets rather than group targets; this is a technical consequence of the way we set the system up. 'Group' to Maui means 'Unix group', not 'research group'.

The fairshare calculation works on 48-hour slots (the longest queue is 48 hours) and uses the last ten. Older slots count for less than recent ones.
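The decaying-slot idea can be sketched like this (the decay constant and usage figures are illustrative, not the live configuration):

```python
def decayed_usage(slot_usage, decay=0.8):
    """Weight the ten 48-hour slots so older ones count for less.
    slot_usage[0] is the most recent slot; decay=0.8 is an assumed
    illustrative constant, not the actual Maui setting."""
    weights = [decay ** i for i in range(len(slot_usage))]
    return sum(u * w for u, w in zip(slot_usage, weights)) / sum(weights)

# Ten slots of usage (fraction of machine), most recent first.
# The same total usage counts for more if it happened recently:
recent_heavy = [0.5] * 3 + [0.0] * 7
old_heavy    = [0.0] * 7 + [0.5] * 3
print(decayed_usage(recent_heavy) > decayed_usage(old_heavy))
```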

Job expansion factor rises with time spent on the queue, but rises faster for short jobs. The reason for using that and not basic queue time is that it helps the very short (30 min) test jobs to run. It makes practically no difference when compared to the fairshare numbers, but ensures that every job eventually runs.
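A sketch of why the expansion factor favours short jobs, assuming the usual Maui-style definition (queue time plus walltime limit, divided by walltime limit):

```python
def expansion_factor(queue_time_h, walltime_limit_h):
    """Assumed Maui-style expansion factor: grows with time spent
    queued, and much faster for jobs with short walltime limits."""
    return (queue_time_h + walltime_limit_h) / walltime_limit_h

# After 2 hours in the queue, a 30-minute test job has expanded far
# more than a 48-hour job, so it climbs the queue faster:
print(expansion_factor(2, 0.5))   # short test job
print(expansion_factor(2, 48))    # full-length job
```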

Priority calculations for all queued jobs can be seen by running 'diagnose -p'

=== Throttling policies ===

There is one throttling policy in use: any user may only have four jobs in the 'Idle' state at any given time. This avoids queue stuffing. However it does not help when one person has a very big fairshare bonus and submits a lot of jobs, because every job that gets to run is replaced in the queue immediately.
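The effect of the four-Idle-jobs limit can be sketched as follows (a simplified model, not the actual Maui code):

```python
def eligible_idle(queued_jobs, max_idle=4):
    """Sketch of the throttling policy: only the first max_idle queued
    jobs per user are counted as 'Idle'; later ones are held back."""
    counts = {}
    eligible = []
    for user, job in queued_jobs:
        counts[user] = counts.get(user, 0) + 1
        if counts[user] <= max_idle:
            eligible.append((user, job))
    return eligible

# alice queues 6 jobs, bob queues 2: only 4 of alice's are Idle,
# but as each of hers runs, the next one immediately becomes Idle.
queue = [("alice", j) for j in range(6)] + [("bob", j) for j in range(2)]
idle = eligible_idle(queue)
print(sum(1 for u, _ in idle if u == "alice"))
print(sum(1 for u, _ in idle if u == "bob"))
```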

=== Reservation and Backfill ===

The system makes a reservation for the top queued job and then backfills other jobs around that (ie lets them jump the queue if and only if they will not have any effect on the start time of the top queued job). This stops big parallel jobs being crowded out by small jobs, but only once they have got to the top of the queue. Without this the 64 proc jobs would almost never run.
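The backfill test can be sketched like this (a simplified single-reservation model; the real scheduler considers per-node layout as well):

```python
def can_backfill(job_procs, job_walltime, free_procs,
                 resv_start, resv_procs, now=0):
    """Sketch: a lower-priority job may jump the queue only if it
    cannot delay the top job's reservation, i.e. it fits in the
    currently free processors AND either finishes before the
    reservation starts or leaves enough processors for it."""
    if job_procs > free_procs:
        return False
    finishes_in_time = now + job_walltime <= resv_start
    leaves_room = free_procs - job_procs >= resv_procs
    return finishes_in_time or leaves_room

# A 64-proc job is reserved to start at t=10 and needs every
# processor; 16 procs are free now:
print(can_backfill(8, 6, 16, resv_start=10, resv_procs=64))   # fits
print(can_backfill(8, 20, 16, resv_start=10, resv_procs=64))  # would delay
```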


=== Node groups ===

We force multinode jobs to run on a set of nodes that are all attached to the same Infiniband switch. This can cause very large jobs (64 and 32-way) to take a long time to start because the system has to clear most of a switch for them. Other jobs may appear to jump the queue in this situation but if you check then those jobs will be being started on switches other than the one the system has picked for the top job. The system picks the optimum switch for the top job, ie the one where the job can start soonest if all the other jobs in the system run to their maximum walltime. The calculation is redone each time the system state changes.
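The switch choice can be sketched as picking, per switch, the earliest time enough nodes are free if every running job uses its full walltime (node counts and times below are made up for illustration):

```python
def earliest_start(node_free_times, nodes_needed):
    """Earliest time nodes_needed nodes are simultaneously free on one
    switch, assuming running jobs use their maximum walltime."""
    return sorted(node_free_times)[nodes_needed - 1]

def pick_switch(switches, nodes_needed):
    """Pick the switch where the top job can start soonest.
    'switches' maps switch name -> list of node free times (hours;
    0 means the node is free now)."""
    return min(switches,
               key=lambda s: earliest_start(switches[s], nodes_needed))

switches = {
    "ib0": [0, 0, 12, 36],   # two nodes free now, two busy for a while
    "ib1": [4, 4, 4, 4],     # all four nodes free in 4 hours
}
# A 4-node job starts at t=36 on ib0 but t=4 on ib1:
print(pick_switch(switches, 4))
```

This also shows why other jobs appear to jump the queue: while ib1 is being cleared, backfill can still start work on ib0.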

== The current problem ==

32- and 64-way jobs are taking a very long time to start, even with the above system.

I think the problem is this: one of these jobs reaches the top of the queue and the system books it a reservation and starts clearing space for it, but then a small, higher-priority job comes along and the top job loses its reservation. The small job takes some of the space. The system can't start the large job, so it backfills the rest of the queue into what space is left. Once the small job starts, the large job is back on top but has to start clearing space all over again... and then another small job comes along.

This is made worse by the tendency of groups which run small jobs to be under their fairshare target and the large job groups to be over.

=== Things we could change and their likely effects ===

* Increase the reservation depth to 2 or 3, causing the system to make reservations for the top 2 or 3 queued jobs, not just the top one. This will reduce overall utilization but on average help larger jobs, because they won't keep losing their reservations. Small jobs tend to run fast enough that reservations are not an issue for them.
* Give a straight priority boost to large jobs to try to keep them at the top of the queue. More complex to configure; we would need to decide how much is enough and define a 'large job'. Dilutes fairshare.
* Increase the multiplier we give to 'time spent waiting' in the priority calculation, so that if a job waits a very long time it can't be overtaken by jobs with better fairshare. This is simple and easy to understand, but will dilute the effect of fairshare to some extent. (added 4pmish 10/09/08, based on a suggestion from Nick Maclaren)
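The third option can be sketched by extending the priority formula with a weighted queue-time term (all weights and numbers here are illustrative, not the live Maui configuration):

```python
def priority_with_queue_weight(fairshare_term, expansion_factor,
                               queue_time_h, queue_weight):
    """Sketch: the existing formula plus an explicit queue-time term,
    so a long-waiting job eventually overtakes better-fairshare jobs.
    queue_weight is a hypothetical tuning knob."""
    return 20 * fairshare_term + expansion_factor + queue_weight * queue_time_h

# A large job with poor fairshare that has waited 40 hours versus a
# freshly submitted job with good fairshare:
stale = priority_with_queue_weight(-3, 1.8, 40, queue_weight=4)
fresh = priority_with_queue_weight(+3, 1.0, 1, queue_weight=4)
print(stale > fresh)
```

With queue_weight=0 this reduces to the current formula, which is why fairshare gets diluted as the weight rises.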

Please add any comments below. I'd like to get any changes agreed by 24th September.

--Catherine 14:04, 10 September 2008 (BST)


Increasing the reservation depth to 3 sounds like the easiest thing to try first. We can see how that works. If it reduces the overall usage too much we can drop back to 2. --Chris 12:10, 11 September 2008

After some discussion with the people who put in the money for the machine we've agreed to alter the fairshare groups to reflect current reality rather than the past. The portfolio group is to be abolished. A share of the machine worth 10K (about 4%) will go into an 'other' group; the rest will be split between Ali's and Michiel's groups. This leaves us with:

{| border=1
|-
|Group||Fairshare %
|-
|Michiel||30
|-
|Ali||17
|-
|Jochen||14
|-
|Stuart||35
|-
|Other||4
|}

We will also increase the reservation depth to 3. I will record the current usage and compare it with the usage in three months to see how it changes.

--Catherine 16:50, 11 September 2008 (BST)


I like the look of the above suggestion - lumping everyone together under the portfolio group was a little harsh on people who wanted to use the machine occasionally but were blocked out by others in the group. --ghb24 18:02, 11 September 2008 (BST)