Posted to user@manifoldcf.apache.org by Furkan KAMACI <fu...@gmail.com> on 2018/12/04 08:58:32 UTC

External Tika Server

Hi,

I am testing the external OCR capabilities of Tika Server with ManifoldCF
2.11. Documents are parsed successfully when I curl them into Tika Server
directly. However, when I try to parse them via ManifoldCF, I get the
following error for *most* of the documents (not all of them):

INFO  meta (application/msword)
WARN  meta: Text extraction failed
org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:531)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
... 44 more
Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
... 50 more
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
... 55 more

How can I solve it?

Kind Regards,
Furkan KAMACI

Re: External Tika Server

Posted by Furkan KAMACI <fu...@gmail.com>.
I use Tika Server 1.19.1.


Re: External Tika Server

Posted by Bisonti Mario <Ma...@vimar.com>.
Hello.
Which Tika Server version are you using?

You could try downloading the latest build from here to check whether it works.

https://builds.apache.org/job/Tika-trunk/lastStableBuild/


From: Furkan KAMACI <fu...@gmail.com>
Sent: Wednesday, December 5, 2018 13:37
To: user@manifoldcf.apache.org
Cc: Rafa Haro <rh...@apache.org>
Subject: Re: External Tika Server

Hi Mario,

Thanks for the answer. I still get an error for a PDF that parses fine via HTTP but fails via ManifoldCF. This is the error:

WARN  meta: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7e76e3f5
               at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
               at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
               at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
               at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
               at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
               at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
               at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
               at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
               at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
               at java.lang.reflect.Method.invoke(Method.java:498)
               at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
               at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
               at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
               at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
               at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
               at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
               at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
               at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
               at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
               at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
               at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
               at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
               at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
               at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
               at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
               at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
               at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
               at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
               at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
               at org.eclipse.jetty.server.Server.handle(Server.java:531)
               at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
               at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
               at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
               at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
               at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
               at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
               at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
               at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
               at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
               at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
               at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
               at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
               at java.lang.Thread.run(Thread.java:748)
Caused by: java.awt.image.RasterFormatException: (y + height) is outside raster
               at sun.awt.image.IntegerInterleavedRaster.createWritableChild(IntegerInterleavedRaster.java:470)
               at sun.awt.image.IntegerInterleavedRaster.createChild(IntegerInterleavedRaster.java:514)
               at sun.java2d.pipe.GeneralCompositePipe.renderPathTile(GeneralCompositePipe.java:106)
               at sun.java2d.pipe.AAShapePipe.renderTiles(AAShapePipe.java:201)
               at sun.java2d.pipe.AAShapePipe.renderPath(AAShapePipe.java:159)
               at sun.java2d.pipe.AAShapePipe.fill(AAShapePipe.java:68)
               at sun.java2d.pipe.PixelToParallelogramConverter.fill(PixelToParallelogramConverter.java:164)
               at sun.java2d.pipe.ValidatePipe.fill(ValidatePipe.java:160)
               at sun.java2d.SunGraphics2D.fill(SunGraphics2D.java:2527)
               at org.apache.pdfbox.rendering.GroupGraphics.fill(GroupGraphics.java:418)
               at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:759)
               at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroRule.process(FillNonZeroRule.java:36)
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
               at org.apache.pdfbox.rendering.PageDrawer.access$1800(PageDrawer.java:112)
               at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1641)
               at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1484)
               at org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroup(PageDrawer.java:1425)
               at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:66)
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
               at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
               at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:254)
               at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:245)
               at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329)
               at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
               at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
               at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
               at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
               at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
               at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
               at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
               at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
               at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
               ... 42 more
INFO  tika (application/pdf)
WARN  No Unicode mapping for arrowhookright (45) in font LSUPIB+CMMI10

On Tue, Dec 4, 2018 at 3:36 PM Bisonti Mario <Ma...@vimar.com> wrote:

On my Tika Server, I added the following startup options:
-spawnChild -taskTimeoutMillis 1000000
to work around the timeout problem.

Mario
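
For readers who want to see where those two options go, here is a minimal sketch that builds the Tika Server start command with them appended. The jar file name and the helper name `tika_server_command` are assumptions for illustration, not something stated in this thread; only `-spawnChild` and `-taskTimeoutMillis 1000000` come from the message above. As far as I know, `-taskTimeoutMillis` only takes effect in spawn-child mode, and it limits how long the server waits on a task, which may be a separate limit from Tesseract's own OCR timeout.

```python
import shlex


def tika_server_command(jar_path, task_timeout_millis=1_000_000):
    """Build the argument list for starting Tika Server in spawn-child mode."""
    return [
        "java", "-jar", jar_path,
        # Run parsing in a child process that the parent can restart.
        "-spawnChild",
        # How long a single task may run, in milliseconds (value from the thread).
        "-taskTimeoutMillis", str(task_timeout_millis),
    ]


# The jar name is an assumption; point this at your own download.
cmd = tika_server_command("tika-server-1.19.1.jar")
print(shlex.join(cmd))
# To actually start the server: subprocess.run(cmd, check=True)
```

Building the command as a list keeps each option and its value as separate arguments, so it can be passed to `subprocess.run` without shell quoting issues.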


From: Furkan KAMACI <fu...@gmail.com>
Sent: Tuesday, December 4, 2018 10:16
To: user@manifoldcf.apache.org; Rafa Haro <rh...@apache.org>
Subject: Re: External Tika Server

Hi Rafa,

I can parse the same document via the HTTP URL of Tika Server. I thought that there may be a timeout parameter within ManifoldCF for communicating with Tika Server :)

Kind Regards,
Furkan KAMACI

On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro <rh...@apache.org> wrote:
Hi Furkan,

You seem to be getting a timeout from Tesseract. This might happen with large documents (too many pages). Maybe there is a configuration parameter for increasing timeouts that you can use on the Tika side.

Rafa
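
One possible way to raise the OCR timeout from the client side is sketched below, as an untested assumption rather than a confirmed fix. It builds a PUT request to Tika Server's `/tika` endpoint and adds an `X-Tika-OCRTimeout` header. I believe Tika Server 1.x maps `X-Tika-OCR...` headers onto `TesseractOCRConfig` setters, with the timeout given in seconds, but please verify this against your version. The helper name, the 600-second value, and the placeholder document bytes are all made up for illustration; port 9998 is the Tika Server default.

```python
import urllib.request


def build_tika_request(server_url, document_bytes, ocr_timeout_seconds=600):
    """Build a PUT request for the /tika endpoint with a longer OCR timeout."""
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/tika",
        data=document_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Assumption: mapped to TesseractOCRConfig's timeout (seconds) on Tika 1.x.
            "X-Tika-OCRTimeout": str(ocr_timeout_seconds),
        },
    )


request = build_tika_request("http://localhost:9998", b"%PDF-1.4 placeholder")
print(request.get_method(), request.full_url)
# To send it to a running server, with a client-side limit above the OCR timeout:
# with urllib.request.urlopen(request, timeout=660) as response:
#     text = response.read().decode("utf-8")
```

Keeping the client-side timeout a little longer than the OCR timeout means the caller should not give up before the server does.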

On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI <fu...@gmail.com> wrote:
Hi,

I am testing the external OCR capabilities of Tika Server with ManifoldCF 2.11. Documents are parsed successfully when I curl them into Tika Server directly. However, when I try to parse them via ManifoldCF, I get the following error for most of the documents (not all of them):

INFO  meta (application/msword)
WARN  meta: Text extraction failed
org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:531)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
... 44 more
Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
... 50 more
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
... 55 more

How can I solve it?

Kind Regards,
Furkan KAMACI

Re: External Tika Server

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Mario,

Thanks for the answer. I still get an error for a PDF that parses fine via
HTTP but fails via ManifoldCF. This is the error:

WARN  meta: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7e76e3f5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:531)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.awt.image.RasterFormatException: (y + height) is outside raster
at sun.awt.image.IntegerInterleavedRaster.createWritableChild(IntegerInterleavedRaster.java:470)
at sun.awt.image.IntegerInterleavedRaster.createChild(IntegerInterleavedRaster.java:514)
at sun.java2d.pipe.GeneralCompositePipe.renderPathTile(GeneralCompositePipe.java:106)
at sun.java2d.pipe.AAShapePipe.renderTiles(AAShapePipe.java:201)
at sun.java2d.pipe.AAShapePipe.renderPath(AAShapePipe.java:159)
at sun.java2d.pipe.AAShapePipe.fill(AAShapePipe.java:68)
at sun.java2d.pipe.PixelToParallelogramConverter.fill(PixelToParallelogramConverter.java:164)
at sun.java2d.pipe.ValidatePipe.fill(ValidatePipe.java:160)
at sun.java2d.SunGraphics2D.fill(SunGraphics2D.java:2527)
at org.apache.pdfbox.rendering.GroupGraphics.fill(GroupGraphics.java:418)
at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:759)
at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroRule.process(FillNonZeroRule.java:36)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.rendering.PageDrawer.access$1800(PageDrawer.java:112)
at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1641)
at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1484)
at org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroup(PageDrawer.java:1425)
at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:66)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:254)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:245)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 42 more
INFO  tika (application/pdf)
WARN  No Unicode mapping for arrowhookright (45) in font LSUPIB+CMMI10


Re: External Tika Server

Posted by Bisonti Mario <Ma...@vimar.com>.
On my Tika Server, I added these startup options to work around the timeout problem:
-spawnChild -taskTimeoutMillis 1000000

Mario
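
For readers who want to see where these options go, here is a minimal Java sketch that assembles the launch command. Only `-spawnChild` and `-taskTimeoutMillis 1000000` come from the message above; the jar name `tika-server.jar` is an assumption (the thread does not name the file), and the class and method names are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class TikaServerLauncher {

    /**
     * Builds the command line for starting Tika Server with the options
     * quoted in the thread. jarPath is a placeholder supplied by the caller.
     */
    static List<String> buildCommand(String jarPath, long taskTimeoutMillis) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-spawnChild");          // option quoted in the thread
        cmd.add("-taskTimeoutMillis");   // option quoted in the thread
        cmd.add(Long.toString(taskTimeoutMillis));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Assumed jar name; pass the real path as the first argument.
        String jar = args.length > 0 ? args[0] : "tika-server.jar";
        List<String> cmd = buildCommand(jar, 1_000_000L);
        System.out.println(String.join(" ", cmd));
        // To actually start the server, uncomment the next line:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Note that these options are set where Tika Server is started, not in ManifoldCF.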



Re: External Tika Server

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Rafa,

I can parse the same document via the HTTP URL of Tika Server. I thought that
there may be a timeout parameter within ManifoldCF for communicating with Tika
Server :)

Kind Regards,
Furkan KAMACI


Re: External Tika Server

Posted by Rafa Haro <rh...@apache.org>.
Hi Furkan,

You seem to be getting a timeout from Tesseract. This can happen with
large documents (many pages). There may be a configuration parameter for
increasing the OCR timeout that you can set on the Tika side.
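
As a rough sketch of what raising the timeout on the Tika side could look
like when calling Tika Server directly: the example below assumes Tika
Server 1.x, where request headers of the form X-Tika-OCR<Property> are, as
far as I know, mapped onto TesseractOCRConfig setters, so X-Tika-OCRTimeout
would set the OCR timeout in seconds. The class name, the localhost:9998
address, and the 300-second value are illustrative assumptions, so please
check the header name against the documentation for the Tika version in use.
Whether ManifoldCF's Tika connector can pass such a header is a separate
question that this sketch does not answer.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

public class TikaOcrTimeoutSketch {

    // Assumption: Tika Server 1.x maps "X-Tika-OCR<Property>" request headers
    // onto TesseractOCRConfig setters, so this header would reach setTimeout(int).
    static final String OCR_TIMEOUT_HEADER = "X-Tika-OCRTimeout";

    // Builds a PUT request for the /tika endpoint with a raised OCR timeout.
    static HttpRequest buildRequest(String baseUrl, byte[] document, int ocrTimeoutSeconds) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .header(OCR_TIMEOUT_HEADER, Integer.toString(ocrTimeoutSeconds))
                // Let the HTTP client wait somewhat longer than the OCR timeout itself.
                .timeout(Duration.ofSeconds(ocrTimeoutSeconds + 30L))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the document to send; 9998 is Tika Server's default port.
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        HttpRequest request = buildRequest("http://localhost:9998", document, 300);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If a document that previously failed with "TesseractOCRParser timeout" now
returns text, the OCR timeout was the limiting factor for that document.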

Rafa
