Fail jobs when a max job lifetime is exceeded
We had some jobs that never completed, for an unknown reason. The admin node showed them as running, but the worker node they were assigned to was up and idle (as shown by CPU use and load average).
It should be possible to specify a maximum job lifetime (e.g. 24 hours) after which a running job is failed and dispatched again.
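To make the request concrete, here is a minimal sketch of the check a periodic sweep could run. It is not taken from the existing code base: the `Job` class, its fields, the status strings, and `fail_expired_jobs` are all hypothetical names chosen for illustration, and the 24-hour default is only the example value from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Job:
    """Hypothetical stand-in for a job record; not the real job model."""
    id: int
    status: str             # e.g. "RUNNING", "FAILED", "FINISHED" (assumed names)
    date_started: datetime  # when the job entered the RUNNING state


# Assumed default; the request is for this to be configurable.
MAX_JOB_LIFETIME = timedelta(hours=24)


def fail_expired_jobs(
    jobs: Iterable[Job],
    now: datetime,
    max_lifetime: timedelta = MAX_JOB_LIFETIME,
) -> List[Job]:
    """Mark RUNNING jobs older than max_lifetime as FAILED.

    Returns the jobs that were failed so the caller can dispatch them again.
    """
    expired = []
    for job in jobs:
        if job.status == "RUNNING" and now - job.date_started > max_lifetime:
            job.status = "FAILED"
            expired.append(job)
    return expired
```

The function only flags and returns the expired jobs; how they are dispatched again is left to the caller, since that depends on the dispatcher's own retry logic.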
Lately we've seen jobs that are DISPATCHED but don't run on the target worker node. Restarting the worker node resolves the issue.
Stephen, our site is also interested in this feature.
Background: Our site has been resetting such jobs manually: we extract the workflow XML, change the workflow status from RUNNING to PAUSED, change the current operation from RUNNING (or whatever state it is stuck in) to INSTANTIATED, re-post the workflow, and then resume it. We have been OK doing this manually because it gives us time to investigate the bug. But manually editing the XML text is error-prone and does not scale to full-time production.
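For anyone doing the same manual reset, the XML edit itself could be scripted roughly as below. This is only a sketch under assumptions: it assumes the workflow root element and each `operation` element carry their status in a `state` attribute and that the document has no XML namespace, neither of which is confirmed here, so adjust the names to match your actual workflow XML. Extracting, re-posting, and resuming the workflow are deliberately left out.

```python
import xml.etree.ElementTree as ET


def reset_stuck_workflow(xml_text, stuck_states=frozenset({"RUNNING"})):
    """Return workflow XML with the workflow paused and its stuck operation reset.

    Assumed layout (hypothetical): <workflow state="..."> containing
    <operation state="..."> elements, with no XML namespace.
    """
    root = ET.fromstring(xml_text)
    if root.get("state") != "RUNNING":
        raise ValueError("workflow is not RUNNING; refusing to modify it")

    # Step 1: workflow status RUNNING -> PAUSED.
    root.set("state", "PAUSED")

    # Step 2: the first operation found in a stuck state is treated as the
    # current operation and set back to INSTANTIATED.
    for op in root.iter("operation"):
        if op.get("state") in stuck_states:
            op.set("state", "INSTANTIATED")
            break
    else:
        raise ValueError("no operation found in a stuck state")

    return ET.tostring(root, encoding="unicode")
```

Refusing to touch a workflow that is not RUNNING, or one with no stuck operation, is a deliberate choice: it guards against the kind of slip that makes hand-editing the XML error-prone.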