Matterhorn Capture Agent JVM Crashes - libavcodec.so.52

Steps to reproduce

We've had a few capture agent JVM crashes, with the logs pointing to libavcodec.so.52.

Activity

Adam McKenzie
May 16, 2012, 7:24 PM

I have successfully run 16 captures of 1.5 hours each, 1 minute apart, with the buffer size reduced on arts 143. I am going to re-enable zipping and ingesting to make sure that the problem is solved. Note that you shouldn't schedule captures 1 minute apart: closing a capture and writing the appropriate metadata sometimes takes longer than a minute, which causes the next capture to fail to start in 5 seconds.

Christopher Brooks
May 16, 2012, 7:30 PM

Have you verified the length of files?

Can you run a set of 3hr captures as well? The longer the capture the more likely a buffer overrun is going to happen.

Adam McKenzie
May 17, 2012, 6:28 AM

Arts 143ca crashed with:

# SIGSEGV (0xb) at pc=0x00007f1e2afde9ec, pid=1466, tid=139767424026368
#
# JRE version: 6.0_20-b20
# Java VM: OpenJDK 64-Bit Server VM (19.0-b09 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea6 1.9.13
# Distribution: Ubuntu 10.10, package 6b20-1.9.13-0ubuntu1~10.10.1
# Problematic frame:
# C [libnet.so+0x39ec]
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
# https://bugs.launchpad.net/ubuntu/+source/openjdk-6/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

--------------- T H R E A D ---------------

Current thread (0x00007f1e242a7800): JavaThread "6d437cc0-8c82-43ee-9523-c08001c98cf4_Worker-5" [_thread_in_native, id=1534, stack(0x00007f1e23990000,0x00007f1e23a91000)]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=128 (), si_addr=0x0000000000000000

This crash occurred in the libnet library rather than in a gstreamer library. I wonder if we are chasing the wrong goose. The quartz threads and the other threads running on the capture agent need more analysis.
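Because the faulting library differs between crashes, it may help to tally the problematic frame across every crash log. The following is a minimal sketch, assuming the JVM left its usual hs_err_pid<pid>.log files behind; the function name problematic_frame is illustrative and not part of Matterhorn.

```shell
# Print the library+offset named under "Problematic frame:" in a JVM crash log.
# problematic_frame is a hypothetical helper name.
problematic_frame() {
  sed -n '/Problematic frame:/{
n
s/^[^[]*\[\([^]]*\)\].*$/\1/p
q
}' "$1"
}

# Example: count how often each frame shows up in the current directory.
# for f in hs_err_pid*.log; do problematic_frame "$f"; done | sort | uniq -c
```

If most logs name a gstreamer or libavcodec frame and only a few name libnet.so, that would suggest a shared cause that corrupts memory rather than a bug in any single library.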

Adam McKenzie
May 22, 2012, 5:56 PM

This comment will hopefully contain all of the steps needed to replicate the changes I made to configuration, code and logging while trying to diagnose this problem. If you run these tests, note that in rare cases the crash occurs not during a capture but while the agent is idle. The crashes seem more likely when the captures are close together (1.5 hour captures, 1 minute between). For a typical test I schedule 4 captures of 1.5 hours each with 1 minute between them. It usually fails on the 2nd or 3rd capture.
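To make the typical test schedule concrete, this small sketch prints the start offset of each capture in minutes from the first start. The helper name capture_offsets is an assumption for illustration only.

```shell
# Print the start offset (minutes after the first start) of each capture.
# Usage: capture_offsets COUNT LENGTH_MIN GAP_MIN
capture_offsets() {
  count=$1 length=$2 gap=$3
  i=0
  while [ "$i" -lt "$count" ]; do
    echo $(( i * (length + gap) ))
    i=$(( i + 1 ))
  done
}

# 4 captures of 90 minutes with 1 minute between them
capture_offsets 4 90 1
```

For the schedule above the captures start at 0, 91, 182 and 273 minutes, so the last one ends 363 minutes after the first one starts.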

Test 1: Sun jdk vs. Openjdk
The capture agents I was using were configured for Sun java, so I installed openjdk using the package manager and switched the JVM to openjdk by running:
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config jar
I got the same results from Sun java as I did with openjdk.
If you need to install sun java:
1. Get Java:
wget http://download.oracle.com/otn-pub/java/jdk/7/jdk-7-linux-x64.tar.gz

2. Uncompress it:
tar -xvf jdk-7u1-linux-x64.tar.gz

3. Move to a new location:
sudo mkdir /usr/lib/jvm
sudo mv jdk1.7.0_01/ /usr/lib/jvm/java-7-oracle/

4. Install the binaries:
sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/java-7-oracle/bin/javac 1
sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/java-7-oracle/bin/java 1
sudo update-alternatives --install /usr/bin/jar jar /usr/lib/jvm/java-7-oracle/bin/jar 1

5. Set the correct version of java:
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config jar

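The file and directory names in steps 2 and 3 come from a 7u1 download, so adjust them to match the archive you actually fetched. If you want to use 1.6 instead of 1.7, change the download location and the paths accordingly.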
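After switching alternatives it is worth confirming which JVM is actually active before starting a test run. A minimal sketch, assuming the usual version banners ("OpenJDK" for openjdk, "HotSpot" for Sun java); jvm_flavour is a hypothetical helper name.

```shell
# Classify the output of `java -version`, read from stdin.
# Assumes the banner mentions "OpenJDK" or "HotSpot".
jvm_flavour() {
  out=$(cat)
  case $out in
    *OpenJDK*) echo openjdk ;;
    *HotSpot*) echo sun ;;
    *) echo unknown ;;
  esac
}

# java -version writes to stderr, hence the redirect:
# java -version 2>&1 | jvm_flavour
```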

Test 2: Epiphan_VGA2USB vs. V4LSRC
In the configuration file found at /opt/matterhorn/felix/conf/services/org.opencastproject.capture.impl.ConfigurationManager.properties change the line:
capture.device.Epiphan_VGA2USB.type=EPIPHAN_VGA2USB
to:
capture.device.Epiphan_VGA2USB.type=V4LSRC
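If you flip this setting back and forth between runs, a small helper can make the edit repeatable. This is a sketch only: set_property is a hypothetical function, it keeps a .bak copy of the file, and it assumes the key and value contain no "|" characters.

```shell
# Replace the value of KEY in a Java properties FILE, keeping FILE.bak.
# Usage: set_property FILE KEY VALUE
set_property() {
  file=$1 key=$2 value=$3
  # Escape the dots in the key so they match literally in the sed pattern.
  key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
  cp "$file" "$file.bak" &&
  sed "s|^$key_re=.*|$key=$value|" "$file.bak" > "$file"
}

# Example (run with write access to the configuration file):
# set_property /opt/matterhorn/felix/conf/services/org.opencastproject.capture.impl.ConfigurationManager.properties \
#   capture.device.Epiphan_VGA2USB.type V4LSRC
```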

Test 3: mpeg2 vs. x264
In the configuration file found at /opt/matterhorn/felix/conf/services/org.opencastproject.capture.impl.ConfigurationManager.properties add the following lines (the file has no existing equivalents to replace):
capture.device.camera.codec=x264enc
capture.device.camera.container=mp4mux
capture.device.camera.bitrate=2048

capture.device.vga.codec=x264enc
capture.device.vga.container=mp4mux
capture.device.vga.bitrate=2048
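Since these lines are added rather than changed, running the edit twice would leave duplicates. The sketch below appends a property only when no line for that key exists yet; add_property_if_missing is a hypothetical helper name.

```shell
# Append KEY=VALUE to FILE unless a line starting with KEY= already exists.
# Usage: add_property_if_missing FILE KEY VALUE
add_property_if_missing() {
  file=$1 key=$2 value=$3
  # index(...) == 1 is a literal prefix match, so dots in KEY are not wildcards.
  if awk -v k="$key" 'index($0, k "=") == 1 { found = 1 } END { exit !found }' "$file"; then
    return 0
  fi
  printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Example, for both devices:
# props=/opt/matterhorn/felix/conf/services/org.opencastproject.capture.impl.ConfigurationManager.properties
# for dev in camera vga; do
#   add_property_if_missing "$props" "capture.device.$dev.codec" x264enc
#   add_property_if_missing "$props" "capture.device.$dev.container" mp4mux
#   add_property_if_missing "$props" "capture.device.$dev.bitrate" 2048
# done
```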

Test 4: Increasing GST_DEBUG levels
I was hoping that increasing the GST_DEBUG level would give hints as to which gstreamer elements might be causing the issue. I used GST_DEBUG levels 2 and 3; levels 4 and 5 are too noisy for the time it takes to reproduce the crash (about 1GB of log per 10 minutes, and they slow the process down to the point where it won't capture). There didn't seem to be any connection with gstreamer activity: the gstreamer log activity stopped once the start of the capture had completed. To increase the gst debug level run:
export GST_DEBUG=3
Then execute felix manually using the shell script at /opt/matterhorn/felix/bin/start_matterhorn.sh so that you can see the logs; otherwise you will need to change the redirects in the init script so that it dumps the output for you.
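When running felix by hand it can be convenient to keep a copy of the output while still watching it. A minimal sketch: run_with_gst_debug is a hypothetical wrapper, and the log path in the example is an assumption. At the rate reported above for levels 4 and 5 (1GB per 10 minutes), a single 1.5 hour capture would write roughly 9GB, so check free disk space first.

```shell
# Run a command with GST_DEBUG set, copying stdout and stderr to LOGFILE.
# Usage: run_with_gst_debug LEVEL LOGFILE COMMAND [ARGS...]
# Note: the exit status is that of tee, not of the command.
run_with_gst_debug() {
  level=$1 logfile=$2
  shift 2
  GST_DEBUG=$level "$@" 2>&1 | tee "$logfile"
}

# Example (hypothetical log location):
# run_with_gst_debug 3 /tmp/matterhorn-gst.log /opt/matterhorn/felix/bin/start_matterhorn.sh
```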

Test 5: Setting LD_PRELOAD to include libjsig.so
Added:
LD_PRELOAD=/usr/lib/jvm/java-6-openjdk/jre/lib/amd64/libjsig.so
to the /etc/init.d/matterhorn script, alongside the rest of the environment variables.
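To confirm that the preload took effect, you can check whether the running JVM has the library mapped. This is a Linux-only sketch that reads /proc/PID/maps; has_mapped_lib is a hypothetical helper, and finding the JVM with pgrep -f felix in the example is an assumption about how the process appears in the process list.

```shell
# Succeed if process PID has a mapping whose line contains NAME.
# Linux only: reads /proc/PID/maps.
has_mapped_lib() {
  pid=$1 name=$2
  grep -q -F -- "$name" "/proc/$pid/maps" 2>/dev/null
}

# Example:
# has_mapped_lib "$(pgrep -f felix | head -n 1)" libjsig.so && echo "libjsig is loaded"
```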

Test 6: Setting Gstreamer buffers from 512MB to 64MB
Change the lines:
capture.device.camera.buffer.bytes=536870912
capture.device.vga.buffer.bytes=536870912
capture.device.audio.buffer.bytes=536870912
To:
capture.device.camera.buffer.bytes=67108864
capture.device.vga.buffer.bytes=67108864
capture.device.audio.buffer.bytes=67108864
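The byte values above are simply 512 and 64 mebibytes. If you want to try other sizes, a sketch like this prints the property lines for a given size; mib_to_bytes is an illustrative helper name.

```shell
# Convert mebibytes to bytes for the buffer.bytes properties.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 512   # 536870912
mib_to_bytes 64    # 67108864

# Print the three property lines for a 64MB buffer.
for dev in camera vga audio; do
  echo "capture.device.$dev.buffer.bytes=$(mib_to_bytes 64)"
done
```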

Test 7: Disabled zipping and ingestion of new captures (old captures might still be zipped and attempted to ingest)
Apply the following patch file to your 1.3 source files (can be found in /opt/matterhorn/capture-agent/matterhorn-source/)

Index: modules/matterhorn-capture-agent-impl/src/main/java/org/opencastproject/capture/impl/jobs/StopCaptureJob.java
===================================================================
--- modules/matterhorn-capture-agent-impl/src/main/java/org/opencastproject/capture/impl/jobs/StopCaptureJob.java (revision 12179)
+++ modules/matterhorn-capture-agent-impl/src/main/java/org/opencastproject/capture/impl/jobs/StopCaptureJob.java (working copy)
@@ -79,7 +79,7 @@
 trigger.getJobDataMap().put(JobParameters.SCHEDULER, sched);

 // Schedule the serializeJob
-sched.scheduleJob(job, trigger);
+// sched.scheduleJob(job, trigger);

 logger.info("stopCaptureJob complete");

@@ -94,9 +94,6 @@
 e.printStackTrace();
 }

-} catch (SchedulerException e) {
-logger.error("Couldn't schedule task: {}", e);
-e.printStackTrace();
 } catch (Exception e) {
 logger.error("Unexpected exception: {}", e);
 e.printStackTrace();
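One way to apply a patch like this safely is to dry-run it first and only apply it if that succeeds. A minimal sketch: apply_patch_checked is a hypothetical helper, the patch file name in the example is an assumption, and -p0 is used because the paths in the diff are relative to the source root. Pass the patch file as an absolute path, since the function changes directory.

```shell
# Dry-run PATCH_FILE against SRC_DIR and apply it only if the dry run succeeds.
# Usage: apply_patch_checked SRC_DIR PATCH_FILE   (PATCH_FILE must be absolute)
apply_patch_checked() {
  src_dir=$1 patch_file=$2
  ( cd "$src_dir" &&
    patch -p0 --dry-run < "$patch_file" >/dev/null &&
    patch -p0 < "$patch_file" )
}

# Example (hypothetical patch file name):
# apply_patch_checked /opt/matterhorn/capture-agent/matterhorn-source /tmp/disable-ingest.patch
```

The capture agent module then has to be rebuilt and redeployed for the change to take effect.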

Test 8: Attached gdb to a jvm that crashed
sudo apt-get install \
  gstreamer0.10-plugins-bad-multiverse-dbg \
  gstreamer0.10-plugins-ugly-dbg \
  gstreamer0.10-plugins-base-dbg \
  gstreamer0.10-ffmpeg-dbg \
  gstreamer0.10-plugins-bad-dbg \
  gstreamer0.10-plugins-good-dbg

Then follow the instructions at the page below to attach gdb to the java program:
https://wiki.ubuntu.com/Backtrace
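For a quick snapshot of all threads in a running JVM, gdb can also be driven non-interactively. This is a sketch under assumptions: dump_backtrace is a hypothetical helper, it captures a live process rather than waiting for the crash, and attaching normally requires root. For catching the crash itself, the interactive procedure on the wiki page is the one to follow.

```shell
# Write a backtrace of every thread in process PID to OUTFILE.
# Returns 2 on a malformed PID and 127 if gdb is not installed.
dump_backtrace() {
  pid=$1 outfile=$2
  case $pid in
    ''|*[!0-9]*) echo "usage: dump_backtrace PID OUTFILE" >&2; return 2 ;;
  esac
  command -v gdb >/dev/null 2>&1 || { echo "gdb not found" >&2; return 127; }
  gdb -p "$pid" -batch -ex "thread apply all bt" > "$outfile" 2>&1
}

# Example (hypothetical output path):
# sudo sh -c '. ./dump_backtrace.sh; dump_backtrace 1466 /tmp/matterhorn-bt.txt'
```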

Adam McKenzie
June 14, 2012, 10:59 PM

Fixed by turning frame dropping back on in the epiphan pipeline. Details are in MH-8851.

Assignee: Adam McKenzie
Reporter: Jonathan Felder
Severity: Crash/Hang
Tags (folksonomy): None
Components:
Affects versions:
Priority: Major