I want to deploy nodes to cloud autoscaling groups to control my resource usage. Autoscaling starts and stops instances according to metrics or scheduling.
Stopping instances is problematic as they may still be running jobs. We need therefore some sort of mechanism for pausing termination until the job is complete.
A Lifecycle Hook for termination will put the instance into a terminating wait state, which will expire after a heartbeat time out (max 2hrs). The instance can emit a heartbeat to reset the timeout if jobs are still running. Once jobs are finished the instance can signal that termination can complete.
There are a couple of ways that this can be accomplished:
1. The instance ping AWS to get it's lifecycle state looking for terminate:wait
2. A CloudWatch alarm can trigger a Lambda function on the terminate:wait event. The Lambda should then PUT the Opencast node into terminating state.
The Opencast node if told it needs to terminate, should tell put itself into maintenance to stop receiving jobs from admin. Then periodically check whether its jobs are complete, if not emit a heartbeat, else signal that terminating the instance can continue.