Race condition leads to FileAlreadyExistsException and FileNotFoundException

Steps to reproduce

Under high load, workflow operations sometimes fail with either a
FileAlreadyExistsException or a FileNotFoundException, typically for episode.xml.

Investigation (including adding some additional detailed logging) shows this to arise from a race condition between the workflow operation and index receiver threads.

Symptoms of the same underlying issue:

Symptom 1 - paths don't match warning:

2017-05-24 08:56:46,300 | WARN | pool-89-thread-1 | (WorkspaceImpl:277) - The working file repository and workspace paths don't match. Looking up http://media.uct.ac.za/files/mediapackage/f4c8df18-3fb2-40aa-84a1-c494854c1f28/catalog-0/episode.xml at /data/opencast/work/shared/files/mediapackage/f4c8df18-3fb2-40aa-84a1-c494854c1f28/catalog-0/episode.xml failed

Symptom 2 - file not found:

2017-05-24 08:56:46,393 | DEBUG | pool-63-thread-422 | (ServiceRegistryJpaImpl:872) - Updating Job {id:58706694, operation:START_OPERATION, status:FINISHED}
2017-05-24 08:56:46,456 | ERROR | pool-89-thread-1 | (DublinCoreUtil:115) - Unable to load metadata from catalog 'http://media.uct.ac.za/files/mediapackage/f4c8df18-3fb2-40aa-84a1-c494854c1f28/catalog-0/episode.xml': java.io.FileNotFoundException: /data/opencast/work/shared/workspace/mediapackage/f4c8df18-3fb2-40aa-84a1-c494854c1f28/catalog-0/episode.xml (No such file or directory)
2017-05-24 08:56:46,457 | ERROR | pool-89-thread-1 | (BaseMessageReceiverImpl$MessageWatcher:148) - Problem while getting org.opencastproject.index.service.message.WorkflowMessageReceiverImpl message events java.io.FileNotFoundException: /data/opencast/work/shared/workspace/mediapackage/f4c8df18-3fb2-40aa-84a1-c494854c1f28/catalog-0/episode.xml (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.opencastproject.metadata.dublincore.DublinCoreUtil.loadDublinCore(DublinCoreUtil.java:112)
at org.opencastproject.metadata.dublincore.DublinCoreUtil$1.apply(DublinCoreUtil.java:98)
at org.opencastproject.metadata.dublincore.DublinCoreUtil$1.apply(DublinCoreUtil.java:95)
at org.opencastproject.util.data.Option$1.fmap(Option.java:208)
at org.opencastproject.util.data.Option.map(Option.java:57)
at org.opencastproject.metadata.dublincore.DublinCoreUtil.loadDublinCore(DublinCoreUtil.java:95)
at org.opencastproject.metadata.dublincore.DublinCoreUtil.loadEpisodeDublinCore(DublinCoreUtil.java:76)
at org.opencastproject.index.service.message.WorkflowMessageReceiverImpl.execute(WorkflowMessageReceiverImpl.java:83)

Symptom 3 - file already exists:

2017-05-30 11:56:27,212 | ERROR | pool-63-thread-159 | (SeriesWorkflowOperationHandler:236) - Unable to update episode catalog isPartOf field: java.nio.file.FileAlreadyExistsException: /data/opencast/work/shared/workspace/mediapackage/912313e3-4ef6-4d82-95c6-19bee09c5e0b/catalog-0/episode.xml
java.nio.file.FileAlreadyExistsException: /data/opencast/work/shared/workspace/mediapackage/912313e3-4ef6-4d82-95c6-19bee09c5e0b/catalog-0/episode.xml

The problem appears to be in WorkspaceImpl:

public File get(final URI uri) throws NotFoundException, IOException

which will copy (hardlink) a file from the WFR to the workspace when needed.

This method implements insufficient locking (too late) and so is vulnerable to concurrency issues when it is called at nearly the same time for the same URI by both a workflow operation thread and an index receiver thread.

Assignee

Stephen Marquard

Reporter

Stephen Marquard

Severity

Incorrectly Functioning Without Workaround

Tags (folksonomy)

None

Components

Fix versions

Affects versions

Priority

Major
Configure