Worker node restarted after an unclean shutdown can start with an incorrect job load

Steps to reproduce:

1. Start a worker node and process some jobs so that the worker is busy.
2. Kill the worker suddenly in an unclean shutdown (kill -9 the java process, or power-cycle the machine without a soft shutdown/reboot).
3. Start the worker node again.

Actual Results:

The worker node starts up with a local job load figure that is non-zero.

Example:

2019-03-28 16:14:56,556 | DEBUG | qtp1562501205-174 | (ServiceRegistryJpaImpl:935) - 174 Adding to load cache: Job 71435344, type org.opencastproject.execute, load 2.0, status RUNNING
2019-03-28 16:14:56,557 | DEBUG | qtp1562501205-174 | (ServiceRegistryJpaImpl:949) - 174 Current host load: 23.5, job load cache size: 1
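
For context, a minimal sketch of the bookkeeping these log lines suggest (all names here are hypothetical, not the actual ServiceRegistryJpaImpl code): the host load appears to be seeded from whatever the database reports at activation, and jobs that died with the killed process are still marked RUNNING there, so the baseline is stale before the first new job is even added to the cache.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; names and structure are assumptions, not the
// real ServiceRegistryJpaImpl. It illustrates how a stale baseline read at
// activation (21.5 here) plus one cached job of load 2.0 can produce the
// "host load 23.5, cache size 1" figures in the log above.
class HostLoadSketch {
  private float localSystemLoad;
  private final Map<Long, Float> loadCache = new HashMap<>();

  // At activation the load is read back from the database. After a
  // kill -9, jobs that died with the process are still RUNNING there.
  void activate(float loadReportedByDatabase) {
    localSystemLoad = loadReportedByDatabase; // stale, non-zero
  }

  // "Adding to load cache" in the log: each newly dispatched job is
  // counted on top of the stale baseline.
  void addToLoadCache(long jobId, float load) {
    loadCache.put(jobId, load);
    localSystemLoad += load;
  }

  float currentHostLoad() {
    return localSystemLoad;
  }

  public static void main(String[] args) {
    HostLoadSketch host = new HostLoadSketch();
    host.activate(21.5f);                 // phantom load from killed jobs
    host.addToLoadCache(71435344L, 2.0f); // the job from the log
    System.out.println(host.currentHostLoad()); // 23.5, cache size 1
  }
}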

The admin node keeps allocating jobs to this worker (because the admin node thinks the worker has a low job load), but the worker node keeps declining them, for example:

2019-04-03 12:20:03,660 | DEBUG | qtp1562501205-50361 | (AbstractJobProducer:193) - 50361 Declining job 71655516 of type org.opencastproject.composer with load 3 because load of 24.5 would exceed this node's limit of 24.

As a result, the worker node no longer processes any jobs for the cluster (or perhaps processes only small jobs, far under its capacity).
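
The decline in the log above is straightforward arithmetic; here is a minimal sketch of the check it implies (hypothetical names, not the actual AbstractJobProducer code):

// Hypothetical sketch of the accept/decline check implied by the
// AbstractJobProducer log line above; names are assumptions. With a stale
// current load of 21.5, a composer job of load 3 pushes the total to
// 24.5, over the node's limit of 24, so the job is declined.
final class DeclineCheckSketch {
  static boolean acceptsJob(float currentLoad, float jobLoad, float maxLoad) {
    return currentLoad + jobLoad <= maxLoad;
  }

  public static void main(String[] args) {
    float staleLoad = 21.5f; // phantom load left over from the killed jobs
    float jobLoad = 3.0f;    // job 71655516 from the log
    float maxLoad = 24.0f;   // this node's limit
    System.out.println(acceptsJob(staleLoad, jobLoad, maxLoad)); // false
  }
}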

Expected Results:

The worker node should start with a job load figure of 0.

Workaround (if any):

Restart the worker node (clean shutdown and restart).

Activity

Greg Logan
April 11, 2019, 5:58 PM

It makes sense - from the SR's point of view, there's no signal that the job has died, so the worker's load is (from that point of view, correctly) non-zero. To resolve this inconsistency we would need to constantly monitor the workers to ensure they're still working, or wait until the SR realises the job(s) have been uncleanly killed and redispatches them.

The mitigation steps make sense, so I've merged the relevant PR.

Stephen Marquard
April 3, 2019, 12:01 PM

Any ideas here?

Stephen Marquard
April 3, 2019, 11:22 AM

When the ServiceRegistry is activated, the current system load comes from the database, but the jobs contributing to that load should already have been cancelled by RestServiceTracker.open() > addingService() > registerService() > cleanRunningJobs().

It's not clear then why getHostLoads(emf.createEntityManager()).get(hostName).getLoadFactor() returns a non-zero load.
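
To make that ordering concrete, a self-contained sketch (the stubs stand in for the real Opencast calls; this is not the actual code):

// Hypothetical sketch of the activation ordering described above.
// cleanRunningJobs() should cancel any jobs left RUNNING by the unclean
// shutdown before the host load is read back from the database, so the
// figure read here is expected to be 0; the observed behaviour is that
// it is not, which is the open question.
final class ActivationOrderSketch {
  private float localSystemLoad;

  void activate() {
    cleanRunningJobs();                   // runs first, via the service tracker
    localSystemLoad = loadFromDatabase(); // expected 0 after the cleanup
  }

  // Stub: in Opencast this is reached via RestServiceTracker.open()
  // > addingService() > registerService() > cleanRunningJobs().
  private void cleanRunningJobs() { /* cancel RUNNING jobs on this host */ }

  // Stub for getHostLoads(emf.createEntityManager()).get(hostName).getLoadFactor()
  private float loadFromDatabase() { return 0f; }

  public static void main(String[] args) {
    ActivationOrderSketch sr = new ActivationOrderSketch();
    sr.activate();
    System.out.println(sr.localSystemLoad); // 0.0 expected; the bug says otherwise
  }
}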

Either way, the result of that should be a host load of 0, so it should be safe to just do this:

diff --git a/modules/serviceregistry/src/main/java/org/opencastproject/serviceregistry/impl/ServiceRegistryJpaImpl.java b/modules/serviceregistry/src/main/java/org/opencastproject/serviceregistry/impl/ServiceRegistryJpaImpl.java
index b614a77..fce36df 100644
--- a/modules/serviceregistry/src/main/java/org/opencastproject/serviceregistry/impl/ServiceRegistryJpaImpl.java
+++ b/modules/serviceregistry/src/main/java/org/opencastproject/serviceregistry/impl/ServiceRegistryJpaImpl.java
@@ -355,8 +355,8 @@ public class ServiceRegistryJpaImpl implements ServiceRegistry, ManagedService {
             .getOrElse(DEFAULT_ACCEPT_JOB_LOADS_EXCEEDING);
   }
 
-    localSystemLoad = getHostLoads(emf.createEntityManager()).get(hostName).getLoadFactor();
-    logger.info("Current system load: {}", format("%.1f", localSystemLoad));
+    localSystemLoad = 0;
+    logger.info("Activated");
   }
 
   @Override

Fixed and reviewed

Assignee: Greg Logan
Reporter: Stephen Marquard
Severity: Incorrectly Functioning With Workaround