Job Dispatching is unfair

Steps to reproduce

1. Have a cluster with a heterogeneous distribution of cores across nodes (some with 24 cores, some with 8)
2. Process jobs at scale
3. Note that the lower-powered workers are frequently maxed out, while the higher-powered ones have plenty of spare capacity

This is caused by the dispatcher blindly assuming that the sum of the current jobs is the best dispatch heuristic: sort nodes by current load, then dispatch to the node with the lowest. Instead, we should calculate each node's utilization ratio (current load / maximum load) at dispatch time and dispatch to the node with the lowest ratio.
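As a minimal sketch of the proposed heuristic (hypothetical helper names, not the actual Opencast dispatcher code), the selection rule is: lowest current/max ratio wins, with ties broken in favour of the node with the larger maximum load:

```python
def pick_node(nodes):
    """Pick the dispatch target from a list of dicts with hypothetical
    'current' and 'max' load keys: lowest load factor first, and on a
    tie, prefer the node with the largest maximum load."""
    return min(nodes, key=lambda n: (n["current"] / n["max"], -n["max"]))

cluster = [
    {"name": "small", "current": 2, "max": 8},   # load factor 0.25
    {"name": "large", "current": 4, "max": 24},  # load factor ~0.17
]
print(pick_node(cluster)["name"])  # -> large
```

Note that under the old sort-by-absolute-load rule the "small" node (load 2) would have won here, despite being proportionally busier than the "large" node.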

Activity

Greg Logan
May 3, 2019, 9:30 PM

From the pull request (https://github.com/opencast/opencast/pull/856):

This pull request changes the job-dispatch logic from lowest current load first to lowest load factor first. For instance, in a situation with three nodes with maximum loads of 1, 2, and 4, the dispatching order looks like this:

Node   | Max load
Node 1 | 1
Node 2 | 2
Node 3 | 4
Jobs running | Node 1 load | Node 1 factor | Node 2 load | Node 2 factor | Node 3 load | Node 3 factor | Notes
0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
1 | 1 | 1.0 | 0 | 0.0 | 0 | 0.0 | Already suboptimal: Node 1 is maxed out; this job should have gone to Node 3
2 | 1 | 1.0 | 1 | 0.5 | 0 | 0.0 | This job should have gone to Node 3 as well
3 | 1 | 1.0 | 1 | 0.5 | 1 | 0.25 | Finally!
4 | 1 | 1.0 | 2 | 1.0 | 1 | 0.25 | This is the opposite of what we want; Node 3 is basically idle!
5 | 1 | 1.0 | 2 | 1.0 | 2 | 0.5 | From here on the jobs go to Node 3, but only because everyone else is already maxed out!
6 | 1 | 1.0 | 2 | 1.0 | 3 | 0.75 |
7 | 1 | 1.0 | 2 | 1.0 | 4 | 1.0 |
The new dispatch order looks like this:

Jobs running | Node 1 load | Node 1 factor | Node 2 load | Node 2 factor | Node 3 load | Node 3 factor | Notes
0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
1 | 0 | 0.0 | 0 | 0.0 | 1 | 0.25 | In the case of a tie, dispatch to the node with the largest maximum load
2 | 0 | 0.0 | 1 | 0.5 | 1 | 0.25 | Dispatching to the lowest load factor, breaking the tie by max load
3 | 1 | 1.0 | 1 | 0.5 | 1 | 0.25 |
4 | 1 | 1.0 | 1 | 0.5 | 2 | 0.5 | The largest node is being hit hardest
5 | 1 | 1.0 | 1 | 0.5 | 3 | 0.75 | Same tie-breaking rules
6 | 1 | 1.0 | 2 | 1.0 | 3 | 0.75 |
7 | 1 | 1.0 | 2 | 1.0 | 4 | 1.0 |
Note that the smaller worker only becomes saturated once 6 jobs are running, versus 4 with the original dispatch logic. This becomes even more obvious as the maximum load scales: the larger a worker is relative to the rest of the cluster, the more jobs it attracts before the system starts dispatching to weaker workers.

Long term, it would be nice to scale things so that especially weak workers are used only as a last resort; however, this approach seems good enough for now, and clusters containing severely underpowered workers are likely to be rare.

Fixed and reviewed

Assignee

Greg Logan

Reporter

Stephen Marquard

Severity

Incorrectly Functioning Without Workaround