Posted to user@tika.apache.org by Cristian Vat <cr...@gmail.com> on 2019/03/02 08:09:02 UTC

tika PDF extraction - ToHTMLContentHandler problems

I'm not sure whether this counts as a bug or a performance issue.

I got a StackOverflowError while parsing a large PDF file using
ToHTMLContentHandler. Trace:
--
java.lang.StackOverflowError: null
    at java.base/java.util.HashMap.hash(HashMap.java:339) ~[na:na]
    at java.base/java.util.HashMap.get(HashMap.java:552) ~[na:na]
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:54) ~[tika-core-1.20.jar:1.20]
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) ~[tika-core-1.20.jar:1.20]
....about 1000 recursive calls...
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) ~[tika-core-1.20.jar:1.20]
--

The error occurred in a Spring Boot command-line app that also does other
processing.
I couldn't reproduce it with a standalone example, possibly because a
standalone run doesn't fill up the stack completely.
There was also no error in the standalone Tika app, whether run with the
GUI or from the command line.

PDF File: "10.1007-s00268-016-3727-3.pdf" (the name is also a DOI, so there
might be more sources/versions)
I wasn't sure if I should attach it here; it can be downloaded from
https://www.researchgate.net/publication/309385633_Safety_of_Nonsteroidal_Anti-inflammatory_Drugs_in_Major_Gastrointestinal_Surgery_A_Prospective_Multicenter_Cohort_Study
Generated output has 4681 <meta> tags
Maximum tag depth of generated (X)HTML is 6

I then timed parsing with ToHTMLContentHandler versus directly with
ToXMLContentHandler. After a warmup of a few hundred parse calls, the
times were:
- ToHTMLContentHandler: avg 500 ms
- ToXMLContentHandler: avg 80-90 ms
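
A minimal sketch of this kind of measurement, assuming a plain wall-clock average after a fixed number of warmup runs. WarmupTimer, averageMillis and the placeholder workload are hypothetical names for illustration, not part of Tika; in a real comparison the Runnable would wrap one parse call with the handler under test.

```java
// Hypothetical helper, not part of Tika: averages wall-clock time per call
// after discarding a number of warmup runs.
class WarmupTimer {

    static double averageMillis(Runnable task, int warmupRuns, int measuredRuns) {
        // Warmup: give the JIT a chance to compile the hot paths first.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / measuredRuns;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption); replace with one parse call
        // using the handler being measured.
        Runnable parseOnce = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        };
        System.out.printf("avg %.3f ms%n", averageMillis(parseOnce, 300, 100));
    }
}
```

A dedicated benchmarking harness would give more reliable numbers; this only mirrors the "warm up, then average" approach described above.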

Profiling with YourKit showed a hotspot and a very deep stack of
recursive calls to ToXMLContentHandler$ElementInfo.getPrefix(String)
at ToXMLContentHandler.java:58, the same frame as in the StackOverflowError.

Checking the code, I found that ToXMLContentHandler.endElement mentions
and fixes an older, similar issue, TIKA-1070:
--
// Reset the position in the tree, to avoid endless stack overflow
// chains (see TIKA-1070)
currentElement = currentElement.parent;
--

But ToHTMLContentHandler.endElement doesn't call super.endElement for
empty elements, which include the <meta> tag. So the chain of
currentElement parents presumably keeps growing in this case?
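
The suspected mechanism can be illustrated with a simplified model. This is not Tika's source: ScopeChainDemo, Element and depthAfterEmptyElements are hypothetical names, and the model only assumes that each start event pushes a new scope whose prefix lookup recurses into its parent, and that the end event is what pops it.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a handler that keeps one scope object per open
// element. Hypothetical model, not Tika's implementation.
class ScopeChainDemo {

    static final class Element {
        final Element parent;
        final Map<String, String> prefixes = new HashMap<>();

        Element(Element parent) {
            this.parent = parent;
        }

        // Recursive lookup: one stack frame per ancestor until a match.
        String getPrefix(String uri) {
            String prefix = prefixes.get(uri);
            if (prefix != null) {
                return prefix;
            }
            return parent != null ? parent.getPrefix(uri) : null;
        }
    }

    private Element current;

    void start() {
        current = new Element(current);
    }

    void end() {
        current = current.parent;
    }

    // Number of scopes currently on the chain (counted iteratively).
    int depth() {
        int d = 0;
        for (Element e = current; e != null; e = e.parent) {
            d++;
        }
        return d;
    }

    // Opens a root element, then emits n empty elements. With popOnEnd
    // false, the end event never reaches end(), as in the suspected bug.
    static int depthAfterEmptyElements(int n, boolean popOnEnd) {
        ScopeChainDemo handler = new ScopeChainDemo();
        handler.start();
        for (int i = 0; i < n; i++) {
            handler.start();
            if (popOnEnd) {
                handler.end();
            }
        }
        return handler.depth();
    }

    public static void main(String[] args) {
        System.out.println(depthAfterEmptyElements(4681, false)); // 4682
        System.out.println(depthAfterEmptyElements(4681, true));  // 1
    }
}
```

In this model, every lookup that misses costs one recursive call per leaked scope, which would be consistent with both the slowdown and the deep getPrefix stack.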

I created my own version of ToHTMLContentHandler that calls
super.endElement inside the EMPTY_ELEMENTS if block. With that change:
- no more StackOverflowError in the Spring Boot app
- parse times dropped to those of the XML version, so at least a 5x speedup
- output is identical except for an additional "</meta>" closing tag

Questions:
- Should I create an issue for this? Should I attach the PDF file to the
issue (I'm not sure about the rights)? Should I just include the entire
email text in the issue?
- Should anybody be using ToHTMLContentHandler instead of
ToXMLContentHandler? I'm not sure of the exact use case, since the
information seems to be the same...
- Is there any way ToHTMLContentHandler could be improved without
emitting the extra "</meta>" closing tag?
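
One possible shape for the last question, as a self-contained sketch: always pop the element scope on the end event, but only write the closing tag for non-empty elements. HtmlWriterSketch, Scope and the contents of its EMPTY_ELEMENTS set are hypothetical and chosen for illustration. Whether ToXMLContentHandler exposes enough to do this from a subclass is not something this sketch verifies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of an HTML-writing handler, not Tika's code.
class HtmlWriterSketch {

    // Illustrative subset of HTML empty (void) elements.
    static final Set<String> EMPTY_ELEMENTS =
            new HashSet<>(Arrays.asList("meta", "br", "hr", "img", "link"));

    private static final class Scope {
        final Scope parent;

        Scope(Scope parent) {
            this.parent = parent;
        }
    }

    private final StringBuilder out = new StringBuilder();
    private Scope current;

    void startElement(String name) {
        current = new Scope(current);
        out.append('<').append(name).append('>');
    }

    void endElement(String name) {
        // Only non-empty elements get a closing tag in the output...
        if (!EMPTY_ELEMENTS.contains(name)) {
            out.append("</").append(name).append('>');
        }
        // ...but the scope is popped in every case, so the chain cannot grow.
        current = current.parent;
    }

    int depth() {
        int d = 0;
        for (Scope s = current; s != null; s = s.parent) {
            d++;
        }
        return d;
    }

    String output() {
        return out.toString();
    }

    public static void main(String[] args) {
        HtmlWriterSketch handler = new HtmlWriterSketch();
        handler.startElement("head");
        for (int i = 0; i < 3; i++) {
            handler.startElement("meta");
            handler.endElement("meta");
        }
        handler.endElement("head");
        System.out.println(handler.output()); // <head><meta><meta><meta></head>
        System.out.println(handler.depth());  // 0
    }
}
```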

Regards,
Cristian Vat

Re: tika PDF extraction - ToHTMLContentHandler problems

Posted by Tim Allison <ta...@apache.org>.
Yes, please do open an issue. Thank you!
