Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed and reviewed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.1
    • Component/s: Backend Software
    • Labels: None
    • Severity: Crash/Hang
    • Steps to reproduce:
      Within a short time of using Opencast, jobs got stuck twice in a row.

      SYMPTOMS
      - No exceptions in log file
      - Service tab in admin UI showed the jobs as being queued

      I restarted the system but got a bunch of optimistic lock exceptions (which is a known issue). Because of this, I am not sure whether the database is still in a healthy state.

      ----
      Breakdown of issue:

      The maximum system load is configured to be *2* (a single node, based on its available processor cores).

      By looking into the {{mh_job}} database table, the following job structure could be determined:

      SQL query: {{SELECT * FROM mh_job WHERE mh_job.status IN (0, 2, 7) AND processor_service IS NOT NULL}}

      2063 START_WORKFLOW (state:running, load:0)
        4235 START_OPERATION (state:running, load:0)
          4241 Publish (state:running, load:1)

      2058 START_WORKFLOW (state:running, load:0)
        4244 START_OPERATION (state:running, load:0)
          4264 Publish (state:running, load:1)
            4273 Distribute (state:queued, load:1)

      From this it is obvious that the system can no longer execute any jobs: the three dispatched jobs with load 1 add up to a current load of 3, which is already above the allowed maximum node load of 2, so no further jobs get dispatched.
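
      For reference, the load currently dispatched to a node can be checked with a query similar to the one above. This is only a sketch: it assumes the load column in {{mh_job}} is named {{job_load}} and reuses the status filter from the query above.

      {code:sql}
      -- Sum the load of all jobs already dispatched to a processor, per node.
      -- Assumption: the load column is named job_load; the status filter
      -- (0, 2, 7) is taken from the query above.
      SELECT processor_service, SUM(job_load) AS current_load
      FROM mh_job
      WHERE status IN (0, 2, 7)
        AND processor_service IS NOT NULL
      GROUP BY processor_service;
      {code}

      For the jobs listed above this returns a load of 3 for the single node.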

      Relieving the system of the deadlock was as simple as setting the load of the affected jobs to 0. Processing continued immediately after the manual change in the database.
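
      For completeness, the manual workaround described above corresponds to something along these lines (the ids are the non-zero-load jobs from the listing above; the {{job_load}} column name is an assumption):

      {code:sql}
      -- Manual workaround: set the load of the affected jobs to 0 so the
      -- dispatcher no longer counts them against the maximum node load.
      -- Assumptions: the load column is named job_load; ids are taken from
      -- the job tree above.
      UPDATE mh_job
      SET job_load = 0
      WHERE id IN (4241, 4264, 4273);
      {code}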

People

    • Assignee: lrohner Lukas Rohner
    • Reporter: lrohner Lukas Rohner
    • Watchers: 2
