  MH-13482

A worker node restarted after an unclean shutdown can start with an incorrect job load

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed and reviewed
    • Affects versions: 6.4
    • Fix versions: 7.0
    • Components: Backend Software
    • Labels:
      None
    • Severity:
      Incorrectly Functioning With Workaround
    • Steps to reproduce:

      1. Start the worker node and process some jobs so that the worker is busy.
      2. Kill the worker suddenly with an unclean shutdown (kill -9 the Java process, or power-cycle the machine without a soft shutdown/reboot).
      3. Start the worker node again.
       
       Actual Results:
       
      The worker node starts up with a local job load figure that is non-zero.

      Example:
       
      2019-03-28 16:14:56,556 | DEBUG | qtp1562501205-174 | (ServiceRegistryJpaImpl:935) - 174 Adding to load cache: Job 71435344, type org.opencastproject.execute, load 2.0, status RUNNING
      2019-03-28 16:14:56,557 | DEBUG | qtp1562501205-174 | (ServiceRegistryJpaImpl:949) - 174 Current host load: 23.5, job load cache size: 1
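      For context, here is a minimal sketch of how a startup load-cache rebuild can produce a non-zero figure after an unclean shutdown. This is an illustration of the suspected mechanism only, not the actual ServiceRegistryJpaImpl code; the Job/Status types and method names are assumptions:

      import java.util.List;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      class HostLoadCacheSketch {

        enum Status { QUEUED, RUNNING, FINISHED }

        static class Job {
          final long id;
          final Status status;
          final float load;

          Job(long id, Status status, float load) {
            this.id = id;
            this.status = status;
            this.load = load;
          }
        }

        private final Map<Long, Float> loadCache = new ConcurrentHashMap<>();

        // Rebuild the local load cache from the jobs persisted in the database for this host.
        void rebuildFromDatabase(List<Job> jobsOnThisHost) {
          for (Job job : jobsOnThisHost) {
            // Jobs left in RUNNING state by the unclean shutdown are counted again here,
            // even though no thread on this node is actually processing them any more.
            if (job.status == Status.RUNNING) {
              loadCache.put(job.id, job.load);
            }
          }
        }

        // The "Current host load" figure is the sum of the cached per-job loads,
        // so stale RUNNING jobs inflate it from the moment the node starts.
        float currentHostLoad() {
          float sum = 0f;
          for (float jobLoad : loadCache.values()) {
            sum += jobLoad;
          }
          return sum;
        }
      }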

      The admin node keeps allocating jobs to this worker (because the admin node thinks it has a low job load), but the worker node keeps declining them, for example:

      2019-04-03 12:20:03,660 | DEBUG | qtp1562501205-50361 | (AbstractJobProducer:193) - 50361 Declining job 71655516 of type org.opencastproject.composer with load 3 because load of 24.5 would exceed this node's limit of 24.

      As a result, the worker node no longer processes any jobs for the cluster (or processes only small jobs, far below its capacity).
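      The decline corresponds to a capacity check of roughly this shape (again an illustrative sketch, not the actual AbstractJobProducer code):

      class JobAcceptanceSketch {

        // Accept a new job only if the projected load stays within this node's limit.
        boolean isReadyToAccept(float currentHostLoad, float jobLoad, float maxLoad) {
          float projectedLoad = currentHostLoad + jobLoad;
          // With a stale startup load of 21.5, a job of load 3 projects to 24.5,
          // which exceeds the limit of 24, so the job is declined even though
          // the node is effectively idle.
          return projectedLoad <= maxLoad;
        }
      }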

       Expected Results:
       
       The worker node should start with a job load figure of 0.
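      A minimal sketch of one way to achieve this, assuming the registry can identify the jobs assigned to this host at startup (the types and the persist() call are hypothetical, and this is not necessarily the fix that was merged):

      import java.util.List;

      class OrphanedJobCleanupSketch {

        enum Status { QUEUED, RUNNING, FINISHED }

        static class Job {
          long id;
          Status status;
          float load;
        }

        // Hypothetical persistence call, standing in for whatever the service registry uses.
        void persist(Job job) {
          // write the updated job status back to the database
        }

        // Reset jobs that were still marked RUNNING for this host when it died, so they
        // neither count towards the local load nor stay stuck forever.
        void cleanUpOrphanedJobsOnStartup(List<Job> jobsOnThisHost) {
          for (Job job : jobsOnThisHost) {
            if (job.status == Status.RUNNING) {
              job.status = Status.QUEUED;
              persist(job);
            }
          }
          // With no RUNNING jobs left for this host, the rebuilt load cache sums to 0.
        }
      }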

       Workaround (if any):
       
      Restart the worker node (clean shutdown and restart).


            People

            • Assignee:
              greg_logan Greg Logan
            • Reporter:
              smarquard Stephen Marquard
            • Watchers:
              2

              Dates

              • Created:
              • Updated:
              • Resolved:
