Posted to dev@tika.apache.org by "Cristian Vat (JIRA)" <ji...@apache.org> on 2019/03/02 16:54:00 UTC
[jira] [Created] (TIKA-2837) Performance/Stability problem in ToHTMLContentHandler
Cristian Vat created TIKA-2837:
----------------------------------
Summary: Performance/Stability problem in ToHTMLContentHandler
Key: TIKA-2837
URL: https://issues.apache.org/jira/browse/TIKA-2837
Project: Tika
Issue Type: Bug
Reporter: Cristian Vat
I got a StackOverflowError while parsing a large PDF file using
ToHTMLContentHandler. Trace:
{noformat}
java.lang.StackOverflowError: null
at java.base/java.util.HashMap.hash(HashMap.java:339) ~[na:na]
at java.base/java.util.HashMap.get(HashMap.java:552) ~[na:na]
at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:54) ~[tika-core-1.20.jar:1.20]
at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) ~[tika-core-1.20.jar:1.20]
....about 1000 recursive calls...
at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) ~[tika-core-1.20.jar:1.20]
{noformat}
The error occurred in a Spring Boot command-line app that also does other
processing.
I couldn't reproduce it with a standalone example, possibly because a
standalone run doesn't fill up the stack completely.
There was also no error in the standalone Tika app, whether run with the GUI
or from the command line.
PDF File: "10.1007-s00268-016-3727-3.pdf" can be downloaded from
[https://www.researchgate.net/publication/309385633_Safety_of_Nonsteroidal_Anti-inflammatory_Drugs_in_Major_Gastrointestinal_Surgery_A_Prospective_Multicenter_Cohort_Study]
The generated output has 4681 <meta> tags.
The maximum tag depth of the generated (X)HTML is 6.
I then timed parsing with ToHTMLContentHandler versus parsing directly with
ToXMLContentHandler. After a warmup of a few hundred parse calls, the times
were:
- ToHTMLContentHandler: avg 500 ms
- ToXMLContentHandler: avg 80-90 ms
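A measurement of this kind can be approximated with a small harness like the following. It is a sketch only, not the code used for the numbers above: WarmupTiming, averageMillis, and the placeholder workload are hypothetical names, and in practice the task would be one parse call with the handler under test.

```java
public class WarmupTiming {

    /**
     * Runs the task warmupRuns times without timing it, then returns the
     * average wall-clock time in milliseconds over measuredRuns further runs.
     */
    public static double averageMillis(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        // Warmup phase: give the JIT a chance to compile the hot paths.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        // Measured phase.
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / measuredRuns;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption); replace with a single parse call.
        Runnable placeholder = () -> Integer.toString(12345).hashCode();
        double avg = averageMillis(placeholder, 300, 100);
        System.out.println("avg ms per call: " + avg);
    }
}
```

Timing the whole measured loop rather than each call keeps the overhead of System.nanoTime() out of the per-call average.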
Profiling with YourKit showed a hotspot and a very deep stack of
recursive calls to ToXMLContentHandler$ElementInfo.getPrefix(String)
at ToXMLContentHandler.java:58, the same location as in the StackOverflowError.
Checking the code, I found that ToXMLContentHandler.endElement mentions,
and contains a fix for, the similar older issue TIKA-1070:
{code:java}
// Reset the position in the tree, to avoid endless stack overflow
// chains (see TIKA-1070)
currentElement = currentElement.parent;
{code}
But ToHTMLContentHandler.endElement doesn't call super.endElement for
empty elements, including the <meta> tag. So does the chain of
currentElement parents keep growing in this case?
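To illustrate the suspected mechanism, here is a minimal, self-contained model. It is not Tika's code: ParentChainSketch, Node, and the EMPTY set are hypothetical stand-ins for the handler, its per-element bookkeeping node, and EMPTY_ELEMENTS. It assumes each startElement links a new node to the current one and that only a completed endElement steps back to the parent.

```java
import java.util.Set;

public class ParentChainSketch {

    /** Simplified stand-in for the per-element bookkeeping node. */
    static final class Node {
        final Node parent;
        Node(Node parent) { this.parent = parent; }
    }

    // Hypothetical subset of empty (void) HTML elements.
    private static final Set<String> EMPTY = Set.of("meta", "br", "hr", "img");

    private Node current;
    private final boolean popEmptyElements;

    public ParentChainSketch(boolean popEmptyElements) {
        this.popEmptyElements = popEmptyElements;
    }

    public void startElement(String name) {
        current = new Node(current); // every start adds one link to the chain
    }

    public void endElement(String name) {
        if (EMPTY.contains(name) && !popEmptyElements) {
            return; // models an early return that never resets the position
        }
        current = current.parent;
    }

    /** Chain length, counted iteratively so the demo itself cannot overflow. */
    public int depth() {
        int d = 0;
        for (Node n = current; n != null; n = n.parent) {
            d++;
        }
        return d;
    }

    public static int depthAfterMetaTags(int count, boolean popEmptyElements) {
        ParentChainSketch h = new ParentChainSketch(popEmptyElements);
        h.startElement("html");
        h.startElement("head");
        for (int i = 0; i < count; i++) {
            h.startElement("meta");
            h.endElement("meta");
        }
        return h.depth();
    }

    public static void main(String[] args) {
        System.out.println(depthAfterMetaTags(4681, false)); // prints 4683
        System.out.println(depthAfterMetaTags(4681, true));  // prints 2
    }
}
```

In this model, a lookup that recurses from the current node to the root would need one stack frame per <meta> tag seen so far when the pop is skipped, which is consistent with both the slowdown and the overflow.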
I created my own version of ToHTMLContentHandler in which I call
super.endElement inside the EMPTY_ELEMENTS if block. Results:
- no more StackOverflowError in the Spring Boot app
- parse times dropped to those of the XML version, so at least a 5x speed improvement
- output is identical except for an additional "</meta>" closing tag.
Questions:
- Should anybody be using ToHTMLContentHandler instead of
ToXMLContentHandler? I'm not sure of the exact use case, since the information
seems to be the same and unaffected XML and XHTML content handlers exist.
- Is there any way ToHTMLContentHandler could be improved without
emitting the extra "</meta>" closing tag?
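One possible direction for the second question, sketched on a simplified model rather than on Tika's classes: step back in the element chain for empty elements without writing a closing tag. EmptyElementSketch, its Mode enum, and the EMPTY_ELEMENTS set below are hypothetical; whether the real handler can be changed this way depends on which fields it exposes to subclasses.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

public class EmptyElementSketch {

    /** How endElement treats empty elements in this model. */
    public enum Mode { SKIP_POP, DELEGATE, POP_SILENTLY }

    // Hypothetical subset of empty (void) HTML elements.
    private static final Set<String> EMPTY_ELEMENTS = Set.of("meta", "br", "hr", "img");

    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();
    private final Mode mode;

    public EmptyElementSketch(Mode mode) {
        this.mode = mode;
    }

    public void startElement(String name) {
        open.push(name);
        out.append('<').append(name).append('>');
    }

    public void endElement(String name) {
        if (EMPTY_ELEMENTS.contains(name)) {
            if (mode == Mode.SKIP_POP) {
                return;         // behavior described in the report: no pop
            }
            if (mode == Mode.POP_SILENTLY) {
                open.pop();     // reset the position, write nothing
                return;
            }
            // Mode.DELEGATE falls through to the general case below.
        }
        open.pop();
        out.append("</").append(name).append('>');
    }

    public String output() { return out.toString(); }

    public int openCount() { return open.size(); }

    public static EmptyElementSketch run(Mode mode) {
        EmptyElementSketch h = new EmptyElementSketch(mode);
        h.startElement("head");
        h.startElement("meta");
        h.endElement("meta");
        h.endElement("head");
        return h;
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            EmptyElementSketch h = run(m);
            System.out.println(m + ": " + h.output() + " open=" + h.openCount());
        }
        // prints:
        // SKIP_POP: <head><meta></head> open=1
        // DELEGATE: <head><meta></meta></head> open=0
        // POP_SILENTLY: <head><meta></head> open=0
    }
}
```

In the model, POP_SILENTLY produces the same output as the current behavior while leaving no element open, whereas DELEGATE corresponds to the workaround that adds "</meta>".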