CAs develop high load average over time

Steps to reproduce

Three of our 15 capture agents (CAs) were observed to develop a high load average and high CPU use when idle. This eventually caused a capture failure on one agent.

Symptoms: constant load average of 1 or greater when idle (not capturing); the highest observed load average was 6. CPU usage was constantly at 5% or higher.

htop showed threads that were mostly in the "D" state (uninterruptible sleep, typically waiting on I/O). Running lsof showed that the matterhorn process was holding /dev/video0 open, often several times over, e.g.

root@lesliesocsci-2a:~# lsof | grep matterhorn | grep CHR
java 1859 matterhorn mem CHR 81,0 8641 /dev/video0
java 1859 matterhorn 0r CHR 1,3 0t0 4812 /dev/null
java 1859 matterhorn 1w CHR 1,3 0t0 4812 /dev/null
java 1859 matterhorn 90r CHR 1,8 0t0 4816 /dev/random
java 1859 matterhorn 91r CHR 1,9 0t0 4817 /dev/urandom
java 1859 matterhorn 105u CHR 81,0 0t0 8641 /dev/video0
java 1859 matterhorn 110r CHR 1,8 0t0 4816 /dev/random
java 1859 matterhorn 151u CHR 81,0 0t0 8641 /dev/video0
java 1859 matterhorn 179u CHR 81,0 0t0 8641 /dev/video0
java 1859 matterhorn 192u CHR 81,0 0t0 8641 /dev/video0
java 1859 matterhorn 195u CHR 81,0 0t0 8641 /dev/video0
java 1859 matterhorn 198u CHR 81,0 0t0 8641 /dev/video0
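
To track how the number of open handles grows over time, the lsof output can be tallied per device. The following is a minimal sketch, not part of Matterhorn: the function name is hypothetical, and it assumes the column layout shown in the listing above (command first, FD in the fourth column, type in the fifth, device path last). Memory mappings ("mem" rows) are skipped so that only real file descriptors are counted.

```python
from collections import Counter


def count_open_descriptors(lsof_output, command="java"):
    """Count open file descriptors per character device in lsof output.

    Assumes the column layout seen in this report:
    COMMAND PID USER FD TYPE DEVICE [SIZE/OFF] NODE NAME
    """
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] != command:
            continue
        if fields[4] != "CHR":   # only character devices
            continue
        if fields[3] == "mem":   # memory mapping, not a file descriptor
            continue
        counts[fields[-1]] += 1  # last column is the device path
    return counts


# A few rows copied from the listing above.
sample = """\
java 1859 matterhorn mem CHR 81,0 8641 /dev/video0
java 1859 matterhorn 0r CHR 1,3 0t0 4812 /dev/null
java 1859 matterhorn 105u CHR 81,0 0t0 8641 /dev/video0
java 1859 matterhorn 151u CHR 81,0 0t0 8641 /dev/video0
"""

for device, n in count_open_descriptors(sample).most_common():
    print(device, n)
```

Running this periodically against `lsof -p <pid>` output and logging the /dev/video0 count would show whether descriptors accumulate between captures.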

ps -eLf showed threads that had high total CPU use time.
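
The busiest threads can be picked out of the ps -eLf output mechanically. The sketch below is an illustration only: the function names are hypothetical, it assumes the usual procps column order (UID PID PPID LWP C NLWP STIME TTY TIME CMD), and the sample rows are invented for the example rather than taken from an affected CA.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME value ([DD-]HH:MM:SS) to seconds."""
    days, _, clock = time_field.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def busiest_threads(ps_output, pid, top=5):
    """Return (cpu_seconds, thread_id) pairs for one process, busiest first."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 9)
        if len(fields) < 10 or fields[1] != str(pid):
            continue
        rows.append((cpu_seconds(fields[8]), int(fields[3])))  # TIME, LWP
    return sorted(rows, reverse=True)[:top]


# Invented sample rows, for illustration only.
sample_ps = """\
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
demo      1000     1  1000  0    3 Jan01 ?        00:00:02 java -jar demo.jar
demo      1000     1  1001  4    3 Jan01 ?        01:10:00 java -jar demo.jar
demo      1000     1  1002  0    3 Jan01 ?        00:00:30 java -jar demo.jar
"""

for seconds, lwp in busiest_threads(sample_ps, 1000, top=2):
    print(lwp, seconds)
```

The thread IDs (LWP column) returned this way are the values that can be passed to strace -p.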

strace on one of these threads showed repeated ioctl() calls issuing VIDIOCMCAPTURE and VIDIOCSYNC, which are Video4Linux capture requests (around 3 or 4 per second), e.g.:

root@lesliesocsci-2a:~# strace -p 3892

Process 3892 attached - interrupt to quit
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE, 0x97e7e48) = 0
ioctl(105, VIDIOCSYNC, 0x4c8f7fac) = 480
ioctl(105, VIDIOCMCAPTURE^C <unfinished ...>
Process 3892 detached

However, after installing the "thread dumper" bundle from Sling:

http://repo1.maven.org/maven2/org/apache/sling/org.apache.sling.extensions.threaddump/0.2.2/org.apache.sling.extensions.threaddump-0.2.2.jar

into the Felix console and examining the list of running threads, the threads generating the load are NOT shown. This can be seen by comparing the total number of threads reported by the thread dump with the number reported by ps -eLf. On an unaffected CA the two figures differ by 2 threads, whereas on the affected CA above they differed by 8, suggesting that the 6 problematic threads are somehow orphaned but still running and consuming CPU. jstack did not show these threads either, and its output was less complete than that of the Sling thread dumper.
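
The thread-count comparison can be scripted so that it is easy to repeat on each CA. This is a rough sketch under stated assumptions: the function names are hypothetical, the OS count is read from /proc/<pid>/task (Linux only), and the dump is assumed to be in the HotSpot jstack style where each thread header line begins with a double quote. The Sling thread dumper's output may be formatted differently and would need its own counting rule. The sample dump is invented for the example.

```python
import os


def os_thread_count(pid):
    """Number of kernel threads for a process, from /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/task"))


def jvm_thread_count(thread_dump):
    """Count thread header lines in a jstack-style dump.

    Assumption: every thread entry starts with a quoted thread name.
    """
    return sum(1 for line in thread_dump.splitlines() if line.startswith('"'))


def unaccounted_threads(pid, thread_dump):
    """Threads the kernel sees that the JVM dump does not list."""
    return os_thread_count(pid) - jvm_thread_count(thread_dump)


# Invented dump fragment, for illustration only.
sample_dump = '''\
"main" prio=10 tid=0x0001 nid=0x743 runnable
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x0002 nid=0x744 waiting on condition
   java.lang.Thread.State: RUNNABLE
'''

print(jvm_thread_count(sample_dump))
```

Based on the figures above, a difference of about 2 would be the baseline to compare against; a larger and growing difference would point to the orphaned threads described here.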

In the affected CAs, /dev/video0 is the Epiphan VGA2USB frame grabber.

We were not able to find out anything further about the cause of this issue.

Workaround (if any):

Stop and start the Matterhorn service.

Status

Assignee

Adam McKenzie

Reporter

Stephen Marquard

Severity

Incorrectly Functioning With Workaround

Tags (folksonomy)

None

Components

Affects versions

1.3

Priority

Major