Engage Server Locks Up Due to Deadlock in Jetty When Run for Extended Periods

Steps to reproduce:
1. Run a distributed system (one admin, two workers, and one engage node) for multiple days.
2. Occasionally, the engage server will lock up.

Actual Results:
Every few days, the engage server locks up and becomes unresponsive.

Expected Results:
The engage server runs without locking up.

Workaround (if any):
Restart the server.


Stephen Marquard
April 20, 2016, 7:04 PM

The updated org.apache.felix.http.bundle-2.2.0-opencast20160413.jar has been running in our UCT 1.6.x production system for about a week, so far without problems, and without a repeat of the deadlock / lockup.

Stephen Marquard
April 14, 2016, 8:39 AM

Here is the process for building a replacement jar, including the fix for this issue and MH-9610:

svn co http://svn.apache.org/repos/asf/felix/releases/org.apache.felix.http-2.2.0/
cd org.apache.felix.http-2.2.0
patch -p1 < configurable-http-timeout.patch
patch -p0 < jetty-version.diff
mvn versions:set -DnewVersion=2.2.0-opencast20160413
mvn clean install

Resulting jar is in ./bundle/target/org.apache.felix.http.bundle-2.2.0-opencast20160413.jar

Stephen Marquard
April 13, 2016, 7:48 PM

Testing the replacement jar org.apache.felix.http.bundle-2.2.0-opencast20160413.jar, which replaces org.apache.felix.http.bundle-2.2.0-opencast.jar in /opt/matterhorn/lib/ext/. To install it, clear the Felix cache and update the reference in etc/system.properties (under felix.auto.start.1).
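
For anyone repeating the swap, the steps above could be scripted roughly as follows. This is only a sketch: the function name swap_http_bundle is made up for illustration, and the Felix cache directory is passed in as an argument because its location is an assumption that varies between installs. Stop Matterhorn before running it.

```shell
#!/bin/sh
# Sketch: swap the Felix HTTP bundle jar, update system.properties,
# and clear the Felix cache. swap_http_bundle is a hypothetical helper.
#   $1 = Matterhorn home (e.g. /opt/matterhorn)
#   $2 = Felix cache directory (ASSUMPTION: location differs per install)
#   $3 = path to the newly built jar
swap_http_bundle() {
    mh_home=$1
    felix_cache=$2
    new_jar=$3
    old_name=org.apache.felix.http.bundle-2.2.0-opencast.jar
    new_name=org.apache.felix.http.bundle-2.2.0-opencast20160413.jar

    # Install the new jar, then remove the one it replaces
    cp "$new_jar" "$mh_home/lib/ext/$new_name" || return 1
    rm -f "$mh_home/lib/ext/$old_name"

    # Point the felix.auto.start.1 entry at the new jar (a .bak copy is kept)
    sed -i.bak "s|$old_name|$new_name|" "$mh_home/etc/system.properties" || return 1

    # Clear the Felix bundle cache so the new jar is loaded on restart;
    # ${felix_cache:?} aborts if the argument is empty, so this cannot become "rm -rf /*"
    rm -rf "${felix_cache:?}"/*
}
```

Example call, with the cache path as a placeholder to be replaced by the real one: swap_http_bundle /opt/matterhorn /path/to/felix-cache ./bundle/target/org.apache.felix.http.bundle-2.2.0-opencast20160413.jar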

This version includes the fix for and upgrades the jetty version from 6.1.24 to 6.1.26 for the deadlock fix.

Stephen Marquard
April 13, 2016, 4:00 PM

Thread discussing the resolution to this is here:


Stephen Marquard
June 15, 2015, 2:11 PM

We seem to have had a similar issue on our 1.6.x admin/engage node (happened twice in about 6 months).

Oddly, it only affected HTTP connections, which hung; HTTPS connections to the same server were fine (both are proxied through Apache httpd, though restarting httpd did not help).

Attaching thread dump.
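
For others hitting the same lockup, a thread dump can be captured and checked along these lines. This is a sketch, not the procedure used for the attached dump: the helper names capture_dump and has_deadlock are made up, and it assumes a JDK with jstack on the PATH. Note that the JVM only reports monitor cycles it can detect itself, so a hang without a "Java-level deadlock" section in the dump does not rule out a lockup of another kind.

```shell
#!/bin/sh
# Sketch: capture a thread dump from a running JVM and check it for a
# JVM-detected deadlock. Both helper names are hypothetical.

# Write a thread dump for the given PID to a file.
# jstack -l adds lock ownership details to each thread entry.
capture_dump() {
    pid=$1
    out=$2
    jstack -l "$pid" > "$out"
}

# Succeed if the dump contains the JVM's deadlock report, which starts
# with a line such as "Found one Java-level deadlock:".
has_deadlock() {
    grep -q 'Java-level deadlock' "$1"
}
```

Typical use would be capture_dump <pid> engage-dump.txt followed by has_deadlock engage-dump.txt && echo "deadlock reported", where <pid> is the process ID of the engage server's JVM.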

