Series index query performs badly on systems with many series

Description

When the Admin UI Events tab is loaded, the filter data is loaded as well. This takes a long time (8+ seconds) on systems with many series defined. The same poor performance occurs when requesting the event metadata.json.

Both screenshots show that the requests to filter.json and metadata.json need 8+ seconds for a response. The test system has 1000 series defined.
Internally, the SeriesListProvider is called for a list of all series in the system. This call takes most of the time.

I've attached a bash script that creates 1000 series for testing. Here you can find a script that creates a large range of series for testing in a more comfortable way.

Activity

Stephen Marquard
July 11, 2017, 9:39 AM

The performance impact of this is so significant that I think this PR should be re-submitted against 3.x.

Even though it's not clear why the original code performed better in 3.x than 2.3, the proposed solution is clearly more efficient.

Stephen Marquard
July 11, 2017, 10:01 AM
Edited

The slowest operation seems to be this:

DublinCoreCatalog item = parseDublinCore((String) doc.get(SolrFields.XML_KEY));

which on our dev server is about 4ms per execution. parseDublinCore() is doing a conversion from String to InputStream which lower down is converted back to String, which seems wasteful.

Sven Stauber
July 11, 2017, 10:35 AM

The related PR was declined a while ago, but I can confirm that Opencast 3.0 is almost unusable without this patch if an adopter has a large number of series (500+). That's the reason why we had to fix it.
We spent some time investigating what caused this issue in 3.x, but we couldn't identify the commit that introduced it. A binary search through the commits indicated that it was introduced very early in the development phase of Opencast 3.0, likely somewhere in Q3 2016 (the Karaf 4 upgrade would be a candidate). At some point, however, we had to stop the forensics. In the end, the PR does solve the problem.

Feel free to bring this up at a technical meeting. At the time the PR was declined, I told people that Opencast 3.x would not work without that patch when having many series... I'm not keen on doing that again.

In case the community wants this patch, we will, of course, re-submit it against 3.x.

Stephen Marquard
July 11, 2017, 11:09 AM
Edited

So to be fair, Waldemar declined the PR so it was withdrawn rather than rejected. I will bring it up at the technical meeting.

From what I can see from trace logging, almost all of the time is being taken by XML parsing. I've tried removing the String > InputStream > String conversion and that didn't make any difference.

DublinCoreXmlFormat.java:

private DublinCoreCatalog readImpl(InputSource in)
        throws ParserConfigurationException, SAXException, IOException {
    final SAXParserFactory factory = SAXParserFactory.newInstance();
    // no DTD
    factory.setValidating(false);
    // namespaces!
    factory.setNamespaceAware(true);
    // read document
    factory.newSAXParser().parse(in, this);
    return dc;
}

In the above code, these 3 lines take 2ms total:

final SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(false);
factory.setNamespaceAware(true);

and this line is another 2ms:

factory.newSAXParser().parse(in, this);
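
For illustration only: since about half of the measured time goes into creating and configuring the SAXParserFactory on every call, one option is to build the factory once and reuse it for all catalogs. The sketch below is a minimal, standalone example of that idea and is not the change made in the PR. The class name SharedSaxFactoryDemo and the method countElements are hypothetical, and the element-counting handler merely stands in for the real catalog handler. It does not address the 2ms spent in the parse call itself.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper, not part of Opencast: builds the SAX factory once and reuses it. */
class SharedSaxFactoryDemo {

  // Created and configured once at class load instead of once per parsed catalog.
  private static final SAXParserFactory FACTORY = createFactory();

  private static SAXParserFactory createFactory() {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(false);    // no DTD
    factory.setNamespaceAware(true); // namespaces
    return factory;
  }

  // SAXParserFactory is not guaranteed to be thread-safe, so parser creation is guarded.
  private static SAXParser newParser() throws ParserConfigurationException, SAXException {
    synchronized (FACTORY) {
      return FACTORY.newSAXParser();
    }
  }

  /** Counts the elements in an XML string, reading it directly through a StringReader. */
  static int countElements(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    final int[] count = {0};
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes attrs) {
        count[0]++;
      }
    };
    newParser().parse(new InputSource(new StringReader(xml)), handler);
    return count[0];
  }

  public static void main(String[] args) throws Exception {
    String xml = "<dublincore xmlns:dcterms=\"http://purl.org/dc/terms/\">"
        + "<dcterms:title>Series 1</dcterms:title></dublincore>";
    long start = System.nanoTime();
    for (int i = 0; i < 1000; i++) {
      countElements(xml);
    }
    System.out.printf("1000 parses: %d ms%n", (System.nanoTime() - start) / 1_000_000);
  }
}
```

Running main gives a rough figure for 1000 parses with a shared factory. Comparing it against a variant that calls SAXParserFactory.newInstance() inside the loop would show whether factory creation is the dominant cost in a given environment.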

Stephen Marquard
July 11, 2017, 12:23 PM

I created for the underlying issue.

Fixed and reviewed

Assignee

Waldemar Smirnow

Reporter

Waldemar Smirnow

Tags (folksonomy)

None

Components

Fix versions

Affects versions

Priority

Minor