DublinCore Catalog XML parsing is slow

Description

This appears to be a regression from 2.x.

The XML parse operation in DublinCoreXmlFormat.readImpl(InputSource in) appears to take about 4ms to run, of which 2ms is:

and another 2ms for:

This is problematic when a lot of XML catalogs are parsed for the same request, as for (parsing XML catalogs for a large number of series).

It seems this was much faster in 2.x, although it's not clear why.

Possible strategies to improve this include reusing the SAXParserFactory rather than creating a new one for each call. Some discussion here:

https://www.ibm.com/developerworks/library/x-perfap2/index.html
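A minimal sketch of the factory-reuse idea, not the actual DublinCoreXmlFormat code: the class name `CachedSaxFactory` and the helper `countElements` are hypothetical and exist only for illustration. It assumes the expensive step is `SAXParserFactory.newInstance()` (the provider lookup), so the factory is created once and only the parser is created per call. Access to the shared factory is synchronized because the JAXP documentation does not guarantee that factories are thread-safe.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class CachedSaxFactory {

    // Created once per class load instead of once per parse call.
    private static final SAXParserFactory FACTORY = createFactory();

    private static SAXParserFactory createFactory() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory;
    }

    // Parsers are still created per call; only the factory is shared.
    static SAXParser newParser() {
        try {
            synchronized (FACTORY) {
                return FACTORY.newSAXParser();
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not create SAX parser", e);
        }
    }

    // Toy stand-in for a catalog read: counts the elements in the document.
    static int countElements(String xml) {
        final int[] count = {0};
        try {
            newParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    count[0]++;
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return count[0];
    }
}
```

Whether this helps in an OSGi container depends on which bundle's class loader resolves the factory, so it should be measured rather than assumed.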

The best approach is to avoid storing commonly used attributes only inside XML blobs.

Activity

Karen Dolan
January 7, 2019, 1:47 AM
Karen Dolan
January 5, 2019, 11:07 PM

Assigning this ticket to myself while I investigate. Hoping it will be something as simple as preventing exports from other poms.

Former user
August 14, 2017, 2:11 PM
Edited

[EDIT] Verified that SAXParserFactory and SAXParser are NOT calling out when being initialized, so this is not the cause of the slowdown. https://xerces.apache.org/xerces2-j/features.html
https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Prevention_Cheat_Sheet#Java

Also verified that the actual XML parsing is not slow; it is acquiring the SAXParserFactory and SAXParser that takes all the time. The actual parse() takes about 0 ms. There are lots of conflicting posts about either using only the Java 8 SAXParserFactory or using only the later versions of xerces. One site even reverted to Java 7 to avoid the SAXParserFactory load issue. I'm testing out several of the suggestions and will post findings.

One very interesting clue showed up today in our logs. I added log lines around the SAXParserFactory and SAXParser init and a separate one around the parse(). The init and parse were v1x-style lightning fast until more buckets and services started loading, then the init became about 5x slower (the parse() was still fast). This implies that the problem is extra SAXParserFactory classes being exported into the OSGi environment from other bundles, not the karaf config. See the attached partial log for the change in SAXParserFactory and SAXParser init time before and after the other bundles load. Also, curious why CatalogUIAdapterConfiguration has duplicate lines.
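The measurement described above can be approximated with something like the following sketch. It is not the instrumentation that was actually added; the class `SaxTiming` and method `time` are hypothetical names, and `System.nanoTime()` stands in for the log lines. It times factory and parser acquisition separately from the parse() call so the two costs can be compared before and after other bundles load.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical timing helper, for illustration only.
final class SaxTiming {

    // Returns {initNanos, parseNanos} for a single parse of the given XML.
    static long[] time(String xml) {
        try {
            long start = System.nanoTime();
            // Init phase: provider lookup plus parser creation.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            long afterInit = System.nanoTime();
            // Parse phase: the actual SAX parse with a no-op handler.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            long afterParse = System.nanoTime();
            return new long[] {afterInit - start, afterParse - afterInit};
        } catch (Exception e) {
            throw new IllegalStateException("Timing run failed", e);
        }
    }
}
```

A single run is noisy (class loading and JIT warm-up dominate the first calls), so several runs on each side of the bundle load would be needed for a fair comparison.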

Fixed and reviewed

Assignee

Karen Dolan

Reporter

Stephen Marquard