Fail jobs when a max job lifetime is exceeded

Description

We had some jobs that never completed, for an unknown reason. On the admin node they showed as running, but the worker node to which they were assigned was up and idle (as shown by CPU use / load average).

It should be possible to specify a maximum job lifetime (e.g. 24 hours) after which a running job is failed and dispatched again.
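As a rough illustration of the requested behaviour, the check could look something like the sketch below. It is not taken from the actual job dispatcher: the `Job` class, its status strings, and `expire_jobs` are hypothetical names invented for this example, and it assumes each running job records when it started.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; not the real service registry model."""
    job_id: int
    status: str  # assumed values: "QUEUED", "RUNNING", "FAILED"
    started_at: Optional[datetime] = None


def expire_jobs(
    jobs: List[Job], now: datetime, max_lifetime: timedelta
) -> Tuple[List[Job], List[Job]]:
    """Fail RUNNING jobs older than max_lifetime.

    Returns (updated_jobs, retries), where retries are fresh QUEUED
    copies of the failed jobs, ready to be dispatched again.
    """
    updated: List[Job] = []
    retries: List[Job] = []
    for job in jobs:
        too_old = (
            job.status == "RUNNING"
            and job.started_at is not None
            and now - job.started_at > max_lifetime
        )
        if too_old:
            # Mark the stuck job as failed and queue a copy for re-dispatch.
            updated.append(replace(job, status="FAILED"))
            retries.append(replace(job, status="QUEUED", started_at=None))
        else:
            updated.append(job)
    return updated, retries
```

With a 24-hour limit, a call such as `expire_jobs(jobs, datetime.now(), timedelta(hours=24))` would fail only the running jobs that started more than a day ago. Whether a retry should reuse the job ID or get a new one is left open here.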

Activity

Former user
April 13, 2015, 2:03 PM

Stephen, our site is also interested in this feature.

Background: Our site has been manually resetting such jobs by extracting the workflow XML, changing the workflow status from RUNNING to PAUSED, changing the current operation from RUNNING (or whatever state it is stuck in) to INSTANTIATED, re-posting the workflow, and then resuming it. Doing this manually has been OK because it gives us time to investigate the bug. However, manually editing the XML text is error-prone and does not scale to full-time production.
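The XML-editing part of that manual procedure could be scripted roughly as follows, to avoid hand-editing mistakes. This is only a sketch under stated assumptions: it assumes a simplified, namespace-free document in which the workflow and each operation carry a `state` attribute, which may not match the real workflow XML. The function name `reset_stuck_workflow` is made up for illustration. Extracting, re-posting, and resuming the workflow are not shown.

```python
import xml.etree.ElementTree as ET


def reset_stuck_workflow(xml_text: str, stuck_states=("RUNNING",)) -> str:
    """Pause the workflow and reset its first stuck operation.

    Assumes a simplified layout (hypothetical, no XML namespaces):
    <workflow state="..."><operations><operation state="..."/>...
    """
    root = ET.fromstring(xml_text)

    # Workflow status: RUNNING -> PAUSED
    if root.get("state") == "RUNNING":
        root.set("state", "PAUSED")

    # Current operation: first one found in a stuck state -> INSTANTIATED
    for op in root.iter("operation"):
        if op.get("state") in stuck_states:
            op.set("state", "INSTANTIATED")
            break

    return ET.tostring(root, encoding="unicode")
```

Passing a different tuple as `stuck_states` covers the "whatever stuck state" case. The output would still need to be checked against the real schema before re-posting.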

Stephen Marquard
March 31, 2017, 12:40 PM

Lately we've seen jobs that are DISPATCHED but don't run on the target worker node. Restarting the worker node resolves the issue.

Assignee

Unassigned

Reporter

Stephen Marquard

Criticality

None

Tags (folksonomy)

None

Components

Affects versions
