Jobs do not always proceed
Steps to reproduce
Within a short time of using Opencast, jobs got stuck twice in a row.
No exceptions in log file
Service tab in admin UI showed the jobs as being queued
I restarted the system but got a bunch of optimistic lock exceptions (which is a known issue). Because of this, I am not sure whether the database is still in a healthy state.
Breakdown of issue:
The maximum system load is configured to be 2 (1 node, based on available processor cores)
By looking into the mh_job database table, the following job structure could be determined:
SQL query: SELECT * FROM mh_job WHERE mh_job.status IN (0, 2, 7) AND processor_service IS NOT NULL
2063 START_WORKFLOW (state:running, load:0)
4235 START_OPERATION (state:running, load:0)
4241 Publish (state:running, load:1)
2058 START_WORKFLOW (state:running, load:0)
4244 START_OPERATION (state:running, load:0)
4264 Publish (state:running, load:1)
4273 Distribute (state:queued, load:1)
From this, it becomes obvious that the system can no longer execute any jobs. The current load (3) already exceeds the maximum allowed node load (2), so no further jobs are dispatched.
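To make the arithmetic explicit, here is a minimal Python sketch that sums the loads from the job listing above and compares the result against the configured maximum. The Job class and the can_dispatch rule are hypothetical simplifications for illustration; they are not Opencast's actual dispatcher code.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    operation: str
    state: str   # "queued" or "running", as shown in the listing
    load: float


# Jobs returned by the SQL query above (processor_service IS NOT NULL).
JOBS = [
    Job(2063, "START_WORKFLOW", "running", 0),
    Job(4235, "START_OPERATION", "running", 0),
    Job(4241, "Publish", "running", 1),
    Job(2058, "START_WORKFLOW", "running", 0),
    Job(4244, "START_OPERATION", "running", 0),
    Job(4264, "Publish", "running", 1),
    Job(4273, "Distribute", "queued", 1),
]

MAX_NODE_LOAD = 2  # configured maximum for the single node


def current_load(jobs):
    """Sum the load of every job already assigned to the node."""
    return sum(job.load for job in jobs)


def can_dispatch(jobs, new_job_load, max_load=MAX_NODE_LOAD):
    """Assumed rule: accept a job only if the node stays within max_load."""
    return current_load(jobs) + new_job_load <= max_load


print(current_load(JOBS))      # total load on the node
print(can_dispatch(JOBS, 1))   # another load-1 job
print(can_dispatch(JOBS, 0))   # even a zero-load job is rejected
```

Under this assumed rule the node is already over its limit, so no job can be accepted, regardless of how small its load is.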
Resolving the deadlock was as easy as setting the load of the affected jobs to 0. Processing continued immediately after the manual change in the database.
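The effect of this workaround can be sketched in Python as follows. This is only a model of the manual database change, assuming that the affected jobs are the three jobs with load 1 and that the dispatcher accepts a job only while the total load stays within the maximum; both points are assumptions, not confirmed Opencast behaviour.

```python
MAX_NODE_LOAD = 2

# job id -> load, for the jobs assumed to be affected
loads = {4241: 1, 4264: 1, 4273: 1}


def can_dispatch(job_loads, new_job_load, max_load=MAX_NODE_LOAD):
    """Assumed rule: accept a job only if the node stays within max_load."""
    return sum(job_loads.values()) + new_job_load <= max_load


blocked_before = not can_dispatch(loads, 1)

# Model of the manual change: set the load of the affected jobs to 0.
for job_id in list(loads):
    loads[job_id] = 0

dispatch_after = can_dispatch(loads, 1)

print(blocked_before, dispatch_after)
```

With the loads reset, the total drops to 0 and new jobs fit within the limit again, which matches the observation that processing continued right after the change.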
I created a diagram of how job dispatching works and extended it with the solutions to fix this ticket and MH-11655.