Steps to reproduce:
1. After the video-editor UI has been used and the workflow has been continued, the workflow sometimes fails - mostly on distributed systems. We have not found a reliable way to reproduce the issue, which suggests a race condition somewhere.
2. The logs (which we enhanced with additional status information) show that the problem occurs in the ServiceRegistry.updateInternal method, where an OptimisticLockException is thrown. That means that while a job is being persisted to the database, its status cannot be updated because the job's "version" number was increased elsewhere while the process performing the update was still working.
3. A more detailed analysis showed that the job that needs the update is always the workflow job itself when it is resumed. It has already spawned the new PROCESS_SMIL job, but at a certain point it is set to START_WORKFLOW with status RUNNING twice. If the time difference between the two START_WORKFLOW updates becomes too small, they seem to fail.
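The failure mode in step 2 can be reproduced in isolation. Below is a minimal sketch (plain Java with a hypothetical Job class, not Opencast or JPA code) of how optimistic locking rejects a stale writer: two writers read version N, the first one's update bumps the version, and the second one's update fails - which is where JPA would throw the OptimisticLockException:

```java
import java.util.concurrent.atomic.AtomicInteger;

class Job {
    // Mirrors JPA's @Version column: bumped on every successful update.
    final AtomicInteger version = new AtomicInteger(0);
    volatile String status = "START_WORKFLOW";

    // Succeeds only if the caller still holds the current version;
    // a stale version is rejected (JPA: OptimisticLockException).
    boolean update(int expectedVersion, String newStatus) {
        if (version.compareAndSet(expectedVersion, expectedVersion + 1)) {
            status = newStatus;
            return true;
        }
        return false; // version moved on underneath us
    }
}

public class OptimisticLockDemo {
    public static void main(String[] args) {
        Job job = new Job();
        int v = job.version.get();                 // both writers read version 0
        boolean first = job.update(v, "RUNNING");  // succeeds, bumps version to 1
        boolean second = job.update(v, "RUNNING"); // fails: stale version 0
        System.out.println(first + " " + second);  // true false
    }
}
```

This matches the symptom: the second of the two near-simultaneous START_WORKFLOW/RUNNING updates loses the version race.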
While debugging this I noticed that WorkflowServiceImpl.update(), which calls ServiceRegistry.updateJob(), which in turn calls ServiceRegistry.updateInternal(), uses MultiResourceLock.synchronize() to ensure that only one thread at a time can run update() on a specific workflow.
I currently cannot tell whether MultiResourceLock.synchronize() works at all. In a way I doubt it, as WorkflowRestService.resume() (the replaceAndresume REST endpoint) uses MultiResourceLock.synchronize() and then calls WorkflowServiceImpl.update(), which also uses MultiResourceLock.synchronize(). From my point of view, if MultiResourceLock.synchronize() really worked, this nesting should be a deadlock.
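Whether the nested synchronize() calls deadlock depends on whether the underlying lock is reentrant - I have not verified which primitive MultiResourceLock uses internally, so this is only a sketch of the two possible behaviors using standard JDK primitives (not Opencast code). A reentrant lock lets the same thread re-acquire it, so the REST-endpoint-then-update() nesting would be harmless; a non-reentrant one (e.g. a Semaphore-backed per-key lock) would block the thread against itself:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.ReentrantLock;

public class ReentrancyDemo {
    public static void main(String[] args) {
        // Reentrant case: the same thread may acquire the lock again,
        // so nested synchronize() calls would NOT deadlock.
        ReentrantLock reentrant = new ReentrantLock();
        reentrant.lock();
        boolean nestedAcquireOk = reentrant.tryLock(); // true: hold count is now 2

        // Non-reentrant case: a second acquire from the holding thread
        // would block forever; tryAcquire shows it cannot proceed.
        Semaphore nonReentrant = new Semaphore(1);
        nonReentrant.acquireUninterruptibly();
        boolean nestedAcquireFails = !nonReentrant.tryAcquire(); // true: would deadlock

        System.out.println(nestedAcquireOk + " " + nestedAcquireFails); // true true
    }
}
```

If MultiResourceLock turns out to be non-reentrant and the code path above does not deadlock, that would support the suspicion that synchronize() is not actually locking anything, which in turn would explain how two concurrent updates reach updateInternal() at the same time.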