Fail jobs when a max job lifetime is exceeded

Description

We had some jobs that never completed, for an unknown reason. On the admin node they showed as running, while the worker node to which they were assigned was up but idle (as shown by CPU use / load average).

It should be possible to specify a maximum job lifetime (e.g. 24 hours) after which a running job is failed and dispatched again.
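
As a rough illustration of the requested behaviour, the check could look something like the sketch below. It is not taken from the existing code base: the `Job` record, its field names, the `MAX_JOB_LIFETIME` setting and the `fail_expired_jobs` helper are all hypothetical, and how a failed job actually gets dispatched again is left to the caller.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    job_id: int
    status: str            # e.g. "RUNNING", "DISPATCHED", "FAILED"
    started_at: datetime   # when the job entered the RUNNING state


# Assumed configuration value: the maximum time a job may stay RUNNING.
MAX_JOB_LIFETIME = timedelta(hours=24)


def fail_expired_jobs(jobs: Iterable[Job], now: datetime,
                      max_lifetime: timedelta = MAX_JOB_LIFETIME) -> List[int]:
    """Mark RUNNING jobs older than max_lifetime as FAILED.

    Returns the ids of the jobs that were failed, so the caller can
    dispatch them again.
    """
    failed_ids = []
    for job in jobs:
        if job.status == "RUNNING" and now - job.started_at > max_lifetime:
            job.status = "FAILED"
            failed_ids.append(job.job_id)
    return failed_ids
```

A periodic task on the admin node could call a function like this and hand the returned ids back to the dispatcher. Jobs in other states are left alone in this sketch.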

Activity

Stephen Marquard
March 31, 2017, 12:40 PM

Lately we've seen jobs that are DISPATCHED but don't run on the target worker node. Restarting the worker node resolves the issue.

Former user
April 13, 2015, 2:03 PM

Stephen, our site is also interested in this feature.

Background: Our site has been manually resetting such jobs by extracting the workflow XML, changing the workflow status from RUNNING to PAUSED, changing the current operation from RUNNING (or whatever stuck state it is in) to INSTANTIATED, re-posting the workflow, and then resuming it. We have been OK doing it manually because it gives us time to investigate the bug. But manually manipulating the XML text is error-prone and does not scale to full-time production.
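
The XML edit in that manual procedure could be scripted along these lines to reduce hand-editing mistakes. This is only a sketch: it assumes a simplified, namespace-free layout in which the root element and each `operation` element carry a `state` attribute, which may not match the real workflow schema, and the `reset_stuck_workflow` helper is hypothetical. Extracting, re-posting and resuming the workflow are not shown.

```python
import xml.etree.ElementTree as ET


def reset_stuck_workflow(xml_text: str, stuck_states=("RUNNING",)) -> str:
    """Return workflow XML with the stuck workflow and operation reset.

    Assumes (hypothetically) that the root element and each <operation>
    element carry a 'state' attribute and that no XML namespaces are used.
    """
    root = ET.fromstring(xml_text)

    # Workflow status: RUNNING -> PAUSED.
    if root.get("state") == "RUNNING":
        root.set("state", "PAUSED")

    # Current operation: the first one found in a stuck state -> INSTANTIATED.
    for operation in root.iter("operation"):
        if operation.get("state") in stuck_states:
            operation.set("state", "INSTANTIATED")
            break

    return ET.tostring(root, encoding="unicode")
```

Only the first stuck operation is changed, on the assumption that it is the current one. Any operations already completed keep their state.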

Assignee

Unassigned

Reporter

Stephen Marquard