Posted to user@tika.apache.org by "Harvey, Robin" <r....@infinityworks.com> on 2022/04/07 12:33:31 UTC

Partial OCR extractions under memory pressure

Hi,

We've hit an issue with the Tika server recently where large PDF documents
are only partially extracted when the server is under heavy load.  For
example, a 70 page PDF which is normally extracted fine suddenly returns as
just 4 or 5 pages.  We use the X-Tika-PDFOcrStrategy header to force OCR
and we have the timeout set to 600 seconds in the XML configuration file.
When a partial extraction happens, we get a 2xx response as normal, so it's
impossible to tell if the extraction actually worked or not.  By observing
the server logs whilst stress testing the Docker container, I can see that
the following exception is closely correlated with the error.

org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:169) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-server-standard-2.2.1.jar:2.2.1]
...snip...
Caused by: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
at org.apache.tika.parser.pdf.OCR2XHTML.processPage(OCR2XHTML.java:95) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1063) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-2.2.1.jar:2.2.1]
at org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:61) ~[tika-server-standard-2.2.1.jar:2.2.1]

Would you consider this to be a bug?  In my view it would be much better to
get some kind of 5XX HTTP response when this error occurs.

Thanks,
--Robin

Re: Partial OCR extractions under memory pressure

Posted by Tim Allison <ta...@apache.org>.
Yes, I agree, I think.  Which endpoint are you using: /tika or /rmeta?
Which handler: xhtml or text?


The underlying issue is that we catch and hold on to IOExceptions per
page in PDFs.  We report them in the metadata in /rmeta, but those
won't come through in /tika.
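
For /rmeta clients, a minimal sketch of checking the metadata for reported exceptions might look like the following. It assumes the /rmeta response body is a JSON array with one metadata object per document and that exception entries use keys beginning with "X-TIKA:EXCEPTION:" (for example X-TIKA:EXCEPTION:container_exception). The function name and the sample payload are made up for illustration and are not real server output.

```python
import json

# Assumption: every exception-related metadata key starts with this prefix.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def find_tika_exceptions(rmeta_body):
    """Return (document_index, key, first_line) for each exception entry."""
    found = []
    for index, metadata in enumerate(json.loads(rmeta_body)):
        for key, value in metadata.items():
            if not key.startswith(EXCEPTION_PREFIX):
                continue
            # A metadata value may be a single string or a list of strings.
            values = value if isinstance(value, list) else [value]
            for item in values:
                lines = str(item).splitlines()
                found.append((index, key, lines[0] if lines else ""))
    return found


# Hypothetical payload, shaped like an /rmeta response for one document.
sample = json.dumps([
    {
        "Content-Type": "application/pdf",
        "X-TIKA:EXCEPTION:container_exception": (
            "org.apache.tika.exception.TikaException: Unable to extract PDF content\n"
            "\tat org.apache.tika.parser.pdf.OCR2XHTML.process(OCR2XHTML.java:78)"
        ),
    }
])
problems = find_tika_exceptions(sample)
```

A client could treat a non-empty result as a failed extraction even though the HTTP status was 200.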


Re: [External] Re: Partial OCR extractions under memory pressure

Posted by Tim Allison <ta...@apache.org>.
I looked more closely at this and did some testing with our MockParser
throwing an NPE.  I then stumbled across earlier documentation that
re-confirmed my findings:
https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared

In looking more closely at your stacktrace, we are letting that
exception percolate through the PDFParser.  We are not incorrectly
catching it.  The problem is that with any exception in the /tika
endpoint, if the exception happens after a certain amount of data has
been written, then our endpoint returns 200 and starts streaming the
results.  You won't know through the client that there was an
exception...for any exception after a certain amount of data has been
written.  This is true for the timeouts in tesseract and any other NPE
or other exception thrown during the parse.

If you want to guarantee that you see exceptions, you can use the json
output option of the /tika endpoint (send "accept: application/json"
as a header).  The downside to that is that it buffers the extracted
text in memory and then writes it all to json and returns it.  So
there's a tradeoff.

With the json output, I get a 200, but the stacktrace is returned in
the response:

{"X-TIKA:Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mock.MockParser"],"author":"Nikolai
Lobachevsky","X-TIKA:Parsed-By-Full-Set":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.mock.MockParser"],"X-TIKA:EXCEPTION:container_exception":"org.apache.tika.exception.TikaException:
Unexpected RuntimeException from
org.apache.tika.parser.mock.MockParser@785b3ba9\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)\n\tat
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:188)\n\tat
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:152)\n\tat
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:55)\n\tat
org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:347)\n\tat
org.apache.tika.server.core.resource.TikaResource.parseToMetadata(TikaResource.java:598)\n\tat
org.apache.tika.server.core.resource.TikaResource.getJson(TikaResource.java:571)\n\tat
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat
java.lang.reflect.Method.invoke(Method.java:498)\n\tat
org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)\n\tat
org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)\n\tat
org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)\n\tat
org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)\n\tat
org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)\n\tat
org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)\n\tat
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)\n\tat
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)\n\tat
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)\n\tat
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)\n\tat
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:516)\n\tat
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)\n\tat
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)\n\tat
org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)\n\tat
java.lang.Thread.run(Thread.java:748)\nCaused by:
java.lang.NullPointerException: null pointer message\n\tat
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat
java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n\tat
org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:418)\n\tat
org.apache.tika.parser.mock.MockParser.throwIt(MockParser.java:364)\n\tat
org.apache.tika.parser.mock.MockParser.executeAction(MockParser.java:152)\n\tat
org.apache.tika.parser.mock.MockParser.parse(MockParser.java:133)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)\n\t...
41 more\n","X-TIKA:digest:SHA1":"4YJF4N6NTZORGRCH5ANIYKNSSBAHIFHP","X-TIKA:content":"<html
xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta
name=\"X-TIKA:Parsed-By\"
content=\"org.apache.tika.parser.DefaultParser\" />\n<meta
name=\"X-TIKA:Parsed-By\"
content=\"org.apache.tika.parser.mock.MockParser\" />\n<meta
name=\"author\" content=\"Nikolai Lobachevsky\" />\n<meta
name=\"X-TIKA:digest:SHA1\"
content=\"4YJF4N6NTZORGRCH5ANIYKNSSBAHIFHP\" />\n<meta
name=\"X-TIKA:digest:MD5\"
content=\"0ce160383b1fc9add7b82819d6b7bb01\" />\n<meta
name=\"Content-Type\" content=\"application/mock+xml\"
/>\n<title></title>\n</head>\n<body><p>some contentsome contentsome
contentsome contentsome contentsome contentsome contentsome
contentsome contentsome
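
On the client side, the JSON option described above could be used roughly as follows. This is a sketch under assumptions, not a tested client: the base URL, the Content-Type value, and the names build_request, extract_text and TikaParseError are hypothetical. The only details taken from this thread are the "accept: application/json" header, the X-TIKA:content field, and the X-TIKA:EXCEPTION:container_exception field.

```python
import json
import urllib.request


class TikaParseError(Exception):
    """Raised when a 200 response still carries a parse exception."""


def build_request(pdf_bytes, base_url="http://localhost:9998"):
    # base_url and the Content-Type header are assumptions; adjust as needed.
    return urllib.request.Request(
        base_url + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "application/json", "Content-Type": "application/pdf"},
    )


def extract_text(response_body):
    """Return the extracted content, or raise if the JSON reports an exception."""
    metadata = json.loads(response_body)
    errors = [key for key in metadata if key.startswith("X-TIKA:EXCEPTION:")]
    if errors:
        raise TikaParseError("parse reported: " + ", ".join(sorted(errors)))
    return metadata.get("X-TIKA:content", "")
```

The request would be sent with urllib.request.urlopen(build_request(data)) and the body passed to extract_text, so that a 200 carrying a stack trace surfaces as an error instead of being accepted as a complete extraction.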


Re: [External] Re: Partial OCR extractions under memory pressure

Posted by "Harvey, Robin" <r....@infinityworks.com>.
Thanks Tim, that's really interesting and gives us something to work with.


Re: [External] Re: Partial OCR extractions under memory pressure

Posted by Tim Allison <ta...@apache.org>.
Thank you.  This is a tricky one.  That endpoint streams output.  It
doesn't buffer the results and then return results.  That means that
we have to return 200 and start streaming the extracted content.
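
The streaming constraint can be illustrated with a toy model. This is not Tika's code: the page generator, the PageTimeout exception and both response functions are invented for illustration, and the model simplifies by committing the 200 status before any output at all.

```python
class PageTimeout(Exception):
    """Stands in for a per-page failure such as an OCR timeout."""


def parse_pages(total_pages, fail_on_page):
    """Yield text page by page and raise on the chosen failing page."""
    for page in range(1, total_pages + 1):
        if page == fail_on_page:
            raise PageTimeout(f"timeout on page {page}")
        yield f"page {page}\n"


def streamed_response(pages):
    """The status is committed up front; a later failure cannot change it."""
    status, body = 200, []
    try:
        for chunk in pages:
            body.append(chunk)  # treated as already sent to the client
    except PageTimeout:
        pass  # too late for a 5xx; the client just sees a shorter body
    return status, "".join(body)


def buffered_response(pages):
    """All output is held until the parse ends, so a failure can become a 5xx."""
    try:
        body = "".join(pages)
    except PageTimeout as exc:
        return 500, str(exc)
    return 200, body


streamed = streamed_response(parse_pages(70, 5))
buffered = buffered_response(parse_pages(70, 5))
```

In this model the streamed variant returns 200 with only the first four pages, while the buffered variant can report the failure at the cost of holding the whole output in memory.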

That said, I can look at percolating the exception through the
PDFParser through the handler so that you'll get an exception from the
server, as with any other parse exception.

Please open an issue on our JIRA.

Fellow devs, what do you think?


Re: [External] Re: Partial OCR extractions under memory pressure

Posted by "Harvey, Robin" <r....@infinityworks.com>.
The REST endpoint we're using is /rmeta/text, not totally sure which
handler TBH.  The request looks like this:

PUT /rmeta/text HTTP/1.1
Host: localhost:9998
User-Agent: python-requests/2.27.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
X-Tika-PDFOcrStrategy: ocr_only
X-Tika-Skip-Embedded: true
Content-Length: 259385

