
Worker node restarted after an unclean shutdown can start with an incorrect job load

Steps to reproduce:

1. Start a worker node and process some jobs so that the worker is busy.
2. Kill the worker suddenly with an unclean shutdown (kill -9 the Java process, or power-cycle the machine without a soft shutdown/reboot).
3. Start the worker node again.

Actual Results:

The worker node starts up with a non-zero local job load.

Example:

2019-03-28 16:14:56,556 | DEBUG | qtp1562501205-174 | (ServiceRegistryJpaImpl:935) - 174 Adding to load cache: Job 71435344, type org.opencastproject.execute, load 2.0, status RUNNING
2019-03-28 16:14:56,557 | DEBUG | qtp1562501205-174 | (ServiceRegistryJpaImpl:949) - 174 Current host load: 23.5, job load cache size: 1
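
For context, the first log line above shows the load cache being populated at startup from jobs the database still marks as RUNNING. The following is a minimal illustrative sketch (not the actual ServiceRegistryJpaImpl code; the class, record, and method names are hypothetical) of how rebuilding a host's load from persisted RUNNING jobs can yield a non-zero figure after a hard kill, since jobs that were running at crash time are never marked finished:

import java.util.List;

class LoadCacheSketch {

  // Hypothetical job record: id, declared load, and persisted status.
  record Job(long id, float load, String status) {}

  // Sum the load of all jobs the database still reports as RUNNING on this host.
  // Jobs left in RUNNING state by the crash are counted as if they were live,
  // so the freshly started worker reports a non-zero host load (23.5 in the log above).
  static float rebuildHostLoad(List<Job> jobsOnThisHost) {
    float hostLoad = 0f;
    for (Job job : jobsOnThisHost) {
      if ("RUNNING".equals(job.status())) {
        hostLoad += job.load();
      }
    }
    return hostLoad;
  }
}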

The admin node keeps allocating jobs to this worker (because it believes the worker has a low job load), but the worker node keeps declining them, for example:

2019-04-03 12:20:03,660 | DEBUG | qtp1562501205-50361 | (AbstractJobProducer:193) - 50361 Declining job 71655516 of type org.opencastproject.composer with load 3 because load of 24.5 would exceed this node's limit of 24.

As a result, the worker node no longer processes any jobs for the cluster (or processes only small jobs, far below its capacity).
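
The decline message above reflects a simple capacity check. Below is a hedged sketch (not the actual AbstractJobProducer code; names are illustrative) of that decision, using the numbers from the log: accepting a job of load 3 on top of a reported host load of 21.5 would push the node past its limit of 24, so the job is declined even though the worker is actually idle:

class JobAcceptSketch {

  // Decline when the current host load plus the job's load would exceed the node's limit.
  static boolean shouldDecline(float currentHostLoad, float jobLoad, float maxLoad) {
    return currentHostLoad + jobLoad > maxLoad;
  }

  public static void main(String[] args) {
    // Values taken from the log excerpt: 21.5 (stale) + 3 (requested) = 24.5 > 24 (limit).
    float currentHostLoad = 21.5f;
    float jobLoad = 3f;
    float maxLoad = 24f;
    System.out.printf("Decline: %b (load of %.1f would exceed limit of %.1f)%n",
        shouldDecline(currentHostLoad, jobLoad, maxLoad),
        currentHostLoad + jobLoad, maxLoad);
  }
}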

Expected Results:

The worker node should start with a job load of 0.

Workaround (if any):

Restart the worker node (clean shutdown and restart).

Status

Assignee

Greg Logan

Reporter

Stephen Marquard

Severity

Incorrectly Functioning With Workaround

Tags (folksonomy)

None

Components

Fix versions

Affects versions

6.4

Priority

Major