Posted to dev@tika.apache.org by "Daniel Coldrick (Jira)" <ji...@apache.org> on 2020/10/22 09:25:00 UTC
[jira] [Commented] (TIKA-2939) Figure out how to allow OCR'ing of large PDFs via tika-server

    [ https://issues.apache.org/jira/browse/TIKA-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218896#comment-17218896 ] 

Daniel Coldrick commented on TIKA-2939:
---------------------------------------

Does anyone know if there is a solution to this?

> Figure out how to allow OCR'ing of large PDFs via tika-server
> -------------------------------------------------------------
>
>                 Key: TIKA-2939
>                 URL: https://issues.apache.org/jira/browse/TIKA-2939
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>            Reporter: Tim Allison
>            Priority: Minor
>
> Tesseract can take quite a bit of time on large PDFs, which can lead to timeouts in jax-rs and the connection closing:
> {noformat}
> Caused by: com.ctc.wstx.exc.WstxIOException: Closed
>         at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:262)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:104)
> Caused by: org.eclipse.jetty.io.EofException: Closed
>         at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination$JettyOutputStream.write(JettyHTTPDestination.java:322)
>         at org.apache.cxf.io.AbstractWrappedOutputStream.write(AbstractWrappedOutputStream.java:51)
>         at com.ctc.wstx.sw.EncodingXmlWriter.flushBuffer(EncodingXmlWriter.java:742)
>         at com.ctc.wstx.sw.EncodingXmlWriter.flush(EncodingXmlWriter.java:176)
>         at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:260)
> {noformat}
> I tried expanding the timeouts on the client side: 
> {noformat}
> // Apache HttpClient 4.x RequestConfig; these setters take milliseconds
> RequestConfig config = RequestConfig.custom()
>         .setConnectTimeout(TIMEOUT * 1000)
>         .setConnectionRequestTimeout(TIMEOUT * 1000)
>         .setSocketTimeout(TIMEOUT * 1000)
>         .build();
> {noformat}
> But this doesn't solve the problem.
> Can we increase the timeout on the server side, and if so, how? Is there a maximum?
> If we can't fix the problem with timeouts, we should figure out a way to let people select only a few pages for OCR so that clients can iterate through large PDFs.
> This issue is different from TIKA-1871 in that the problem isn't chunking the large document to get the file to tika-server; rather, the problem is the amount of time it can take tika-server to run OCR on every page of a large PDF and return the full results.
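
To make the "iterate through large PDFs" idea concrete, here is a minimal sketch of the client-side bookkeeping it would need: splitting a page count into small, inclusive page windows so that each OCR request stays short. This is only an illustration. The class and method names (PageBatches, batches) are hypothetical, and the issue does not describe any existing tika-server option for restricting OCR to a page range, so the request itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits pages 1..totalPages into consecutive windows. */
final class PageBatches {

    /**
     * Returns inclusive, 1-based windows {firstPage, lastPage}, each covering
     * at most batchSize pages.
     */
    static List<int[]> batches(int totalPages, int batchSize) {
        if (totalPages < 0 || batchSize <= 0) {
            throw new IllegalArgumentException(
                    "totalPages must be >= 0 and batchSize must be > 0");
        }
        List<int[]> windows = new ArrayList<>();
        // long loop variable avoids int overflow for very large batch sizes
        for (long first = 1; first <= totalPages; first += batchSize) {
            int last = (int) Math.min(first + batchSize - 1, totalPages);
            windows.add(new int[] {(int) first, last});
        }
        return windows;
    }

    public static void main(String[] args) {
        // Assumption: each window would become one short OCR request,
        // if the server offered a way to select pages.
        for (int[] w : batches(10, 4)) {
            System.out.println("OCR pages " + w[0] + "-" + w[1]);
        }
    }
}
```

With a 10-page document and a batch size of 4, the loop prints the windows 1-4, 5-8 and 9-10. Keeping the windows small bounds the time any single request spends in Tesseract, which is the point of the proposal above.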



--
This message was sent by Atlassian Jira
(v8.3.4#803005)