Posted to user@manifoldcf.apache.org by Furkan KAMACI <fu...@gmail.com> on 2018/12/04 08:58:32 UTC
External Tika Server
Hi,
I am trying to test the external OCR capabilities of Tika Server with ManifoldCF
2.11. Documents are parsed when I send them to Tika Server directly with curl.
However, when I try to parse them via ManifoldCF, I get the following error for
*most* of the documents (not all of them):
INFO meta (application/msword)
WARN meta: Text extraction failed
org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:531)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
... 44 more
Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser timeout
at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
... 50 more
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
... 55 more
How can I solve it?
Kind Regards,
Furkan KAMACI
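One way to narrow this down is to replay the failing request outside ManifoldCF. The stack trace passes through MetadataResource, which serves Tika Server's /meta endpoint, so the sketch below sends a file there with a PUT request. It is only an illustration: the address http://localhost:9998 (Tika Server's default port), the 300-second client timeout, and the helper names are assumptions, not values taken from this thread.

```python
import urllib.request

# Assumption: Tika Server runs locally on its default port 9998.
TIKA_URL = "http://localhost:9998"


def build_meta_request(payload: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request for the /meta endpoint (served by MetadataResource)."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/meta",
        data=payload,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def fetch_metadata(path: str, timeout_seconds: float = 300.0) -> str:
    """Send one file to Tika Server and return the raw response body."""
    with open(path, "rb") as handle:
        request = build_meta_request(handle.read())
    # Generous client-side timeout (an assumed value) so that a slow OCR run
    # is not cut off by the client before the server answers.
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metadata("sample.pdf")` (a hypothetical file name) on a document that fails through ManifoldCF would show whether the server-side OCR timeout also occurs for a plain /meta request.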
Re: External Tika Server
Posted by Furkan KAMACI <fu...@gmail.com>.
I am using Tika Server 1.19.1.
On Wed, Dec 5, 2018 at 4:14 PM Bisonti Mario <Ma...@vimar.com> wrote:
> Hello.
>
> Which version of Tika Server are you using?
Re: External Tika Server
Posted by Bisonti Mario <Ma...@vimar.com>.
Hello.
Which version of Tika Server are you using?
You could try downloading the latest build from here to check whether it works:
https://builds.apache.org/job/Tika-trunk/lastStableBuild/
From: Furkan KAMACI <fu...@gmail.com>
Sent: Wednesday, December 5, 2018 13:37
To: user@manifoldcf.apache.org
Cc: Rafa Haro <rh...@apache.org>
Subject: Re: External Tika Server
Hi Mario,
Thanks for the answer. I still get an error for one PDF: parsing it via HTTP works, but parsing it via ManifoldCF does not. This is the error:
WARN meta: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7e76e3f5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:531)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.awt.image.RasterFormatException: (y + height) is outside raster
at sun.awt.image.IntegerInterleavedRaster.createWritableChild(IntegerInterleavedRaster.java:470)
at sun.awt.image.IntegerInterleavedRaster.createChild(IntegerInterleavedRaster.java:514)
at sun.java2d.pipe.GeneralCompositePipe.renderPathTile(GeneralCompositePipe.java:106)
at sun.java2d.pipe.AAShapePipe.renderTiles(AAShapePipe.java:201)
at sun.java2d.pipe.AAShapePipe.renderPath(AAShapePipe.java:159)
at sun.java2d.pipe.AAShapePipe.fill(AAShapePipe.java:68)
at sun.java2d.pipe.PixelToParallelogramConverter.fill(PixelToParallelogramConverter.java:164)
at sun.java2d.pipe.ValidatePipe.fill(ValidatePipe.java:160)
at sun.java2d.SunGraphics2D.fill(SunGraphics2D.java:2527)
at org.apache.pdfbox.rendering.GroupGraphics.fill(GroupGraphics.java:418)
at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:759)
at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroRule.process(FillNonZeroRule.java:36)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.rendering.PageDrawer.access$1800(PageDrawer.java:112)
at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1641)
at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1484)
at org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroup(PageDrawer.java:1425)
at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:66)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:254)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:245)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 42 more
INFO tika (application/pdf)
WARN No Unicode mapping for arrowhookright (45) in font LSUPIB+CMMI10
On Tue, Dec 4, 2018 at 3:36 PM Bisonti Mario <Ma...@vimar.com> wrote:
In my Tika Server startup command, I added:
-spawnChild -taskTimeoutMillis 1000000
to work around the timeout problem.
Mario
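For illustration, the startup options mentioned above could be assembled into a launch command as in the sketch below. Only the two options and the value 1000000 come from this thread; the jar file name (matching the 1.19.1 version reported in the thread), the `java -jar` invocation, and the helper names are assumptions.

```python
import subprocess
from typing import List


def tika_server_command(jar_path: str, task_timeout_millis: int = 1_000_000) -> List[str]:
    """Build a Tika Server launch command with the options quoted in this thread."""
    return [
        "java", "-jar", jar_path,
        "-spawnChild",
        "-taskTimeoutMillis", str(task_timeout_millis),
    ]


def start_tika_server(jar_path: str) -> subprocess.Popen:
    """Start Tika Server as a background process and return its handle."""
    return subprocess.Popen(tika_server_command(jar_path))
```

For example, `start_tika_server("tika-server-1.19.1.jar")` would launch the server from a jar in the current directory (a hypothetical location).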
From: Furkan KAMACI <fu...@gmail.com>
Sent: Tuesday, December 4, 2018 10:16
To: user@manifoldcf.apache.org; Rafa Haro <rh...@apache.org>
Subject: Re: External Tika Server
Hi Rafa,
I can parse the same document via the HTTP URL of Tika Server. I thought that there might be a timeout parameter within ManifoldCF for communicating with Tika Server :)
Kind Regards,
Furkan KAMACI
On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro <rh...@apache.org> wrote:
Hi Furkan,
You seem to be getting a timeout from Tesseract. This might be happening with large documents (too many pages). Maybe there is a configuration parameter for increasing timeouts that you can use on the Tika side.
Rafa
at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
at org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
... 50 more
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
... 55 more
How can I solve it?
Kind Regards,
Furkan KAMACI
Re: External Tika Server
Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Mario,
Thanks for the answer. I still get an error for one PDF that parses fine
when sent to Tika Server directly over HTTP but fails via ManifoldCF. This
is the error:
WARN meta: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7e76e3f5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
at org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
at org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:531)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.awt.image.RasterFormatException: (y + height) is outside raster
at sun.awt.image.IntegerInterleavedRaster.createWritableChild(IntegerInterleavedRaster.java:470)
at sun.awt.image.IntegerInterleavedRaster.createChild(IntegerInterleavedRaster.java:514)
at sun.java2d.pipe.GeneralCompositePipe.renderPathTile(GeneralCompositePipe.java:106)
at sun.java2d.pipe.AAShapePipe.renderTiles(AAShapePipe.java:201)
at sun.java2d.pipe.AAShapePipe.renderPath(AAShapePipe.java:159)
at sun.java2d.pipe.AAShapePipe.fill(AAShapePipe.java:68)
at sun.java2d.pipe.PixelToParallelogramConverter.fill(PixelToParallelogramConverter.java:164)
at sun.java2d.pipe.ValidatePipe.fill(ValidatePipe.java:160)
at sun.java2d.SunGraphics2D.fill(SunGraphics2D.java:2527)
at org.apache.pdfbox.rendering.GroupGraphics.fill(GroupGraphics.java:418)
at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:759)
at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroRule.process(FillNonZeroRule.java:36)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processTransparencyGroup(PDFStreamEngine.java:238)
at org.apache.pdfbox.rendering.PageDrawer.access$1800(PageDrawer.java:112)
at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1641)
at org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1484)
at org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroup(PageDrawer.java:1425)
at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:66)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:254)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:245)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:329)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 42 more
INFO tika (application/pdf)
WARN No Unicode mapping for arrowhookright (45) in font LSUPIB+CMMI10
On Tue, Dec 4, 2018 at 3:36 PM Bisonti Mario <Ma...@vimar.com> wrote:
> In my tika server, I added:
>
> -spawnChild -taskTimeoutMillis 1000000
>
> To bypass the timeout problem
>
> Mario
R: External Tika Server
Posted by Bisonti Mario <Ma...@vimar.com>.
On my Tika server, I added these startup options:
-spawnChild -taskTimeoutMillis 1000000
to work around the timeout problem.
Mario
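For readers who want to see where those two options would sit, here is a minimal sketch that only assembles the launch command. The jar name tika-server.jar and the class and method names are placeholders assumed for illustration; only the -spawnChild and -taskTimeoutMillis options and the 1000000 value come from the message above.

```java
import java.util.ArrayList;
import java.util.List;

class TikaServerLaunchSketch {

    // Builds the command line for starting Tika Server with a longer task timeout.
    // jarPath is a hypothetical location; adjust it to the actual server jar.
    static List<String> buildCommand(String jarPath, long taskTimeoutMillis) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-spawnChild");          // option quoted in the message above
        cmd.add("-taskTimeoutMillis");   // option quoted in the message above
        cmd.add(Long.toString(taskTimeoutMillis));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("tika-server.jar", 1_000_000L);
        // Print the command instead of running it, so the sketch has no side effects.
        System.out.println(String.join(" ", cmd));
        // To actually start the server one could use:
        // new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

The options go after the jar path because they are arguments to the server itself rather than to the JVM.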
From: Furkan KAMACI <fu...@gmail.com>
Sent: Tuesday, December 4, 2018 10:16
To: user@manifoldcf.apache.org; Rafa Haro <rh...@apache.org>
Subject: Re: External Tika Server
Hi Rafa,
I can parse the same document via the HTTP URL of Tika Server. I thought that there may be a timeout parameter within ManifoldCF for communicating with Tika Server :)
Kind Regards,
Furkan KAMACI
Re: External Tika Server
Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Rafa,
I can parse the same document via the HTTP URL of Tika Server. I thought that
there may be a timeout parameter within ManifoldCF for communicating with Tika
Server :)
Kind Regards,
Furkan KAMACI
On Tue, Dec 4, 2018 at 12:13 PM Rafa Haro <rh...@apache.org> wrote:
> Hi Furkan,
>
> You seem to be getting a Timeout from Tesseract. This might be happening
> with large documents (too many pages). Maybe there is some configuration
> parameter for increasing timeouts that you can use at Tika side
>
> Rafa
Re: External Tika Server
Posted by Rafa Haro <rh...@apache.org>.
Hi Furkan,
You seem to be getting a timeout from Tesseract. This might be happening
with large documents (too many pages). Maybe there is a configuration
parameter for increasing timeouts that you can use on the Tika side.
Rafa
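The innermost cause in the reported trace is a java.util.concurrent.TimeoutException thrown from FutureTask.get, which is what happens when a caller waits on a task with a time limit and the task does not finish in time. The sketch below reproduces just that mechanism with the standard library; the class name, the method name and the sleeping task are stand-ins assumed for illustration and are not Tika's code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class OcrTimeoutSketch {

    // Returns true if a task that needs workMillis does not finish within
    // limitMillis, i.e. FutureTask.get(timeout) throws TimeoutException.
    static boolean timesOut(long workMillis, long limitMillis) throws InterruptedException {
        FutureTask<String> task = new FutureTask<String>(() -> {
            Thread.sleep(workMillis); // stand-in for a slow OCR run on one page
            return "recognized text";
        });
        Thread worker = new Thread(task, "ocr-stand-in");
        worker.setDaemon(true);
        worker.start();
        try {
            task.get(limitMillis, TimeUnit.MILLISECONDS);
            return false; // finished inside the limit
        } catch (TimeoutException e) {
            task.cancel(true); // interrupt the worker, as the work is abandoned
            return true;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("work 500 ms, limit 50 ms, timed out: " + timesOut(500, 50));
        System.out.println("work 50 ms, limit 2000 ms, timed out: " + timesOut(50, 2000));
    }
}
```

Under this reading, the same page can succeed or fail depending only on how long the OCR step takes relative to the configured limit, which would fit the report that most but not all documents fail.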
On Tue, Dec 4, 2018 at 9:58 AM Furkan KAMACI <fu...@gmail.com> wrote:
> Hi,
>
> I try to test external OCR capabilities of Tika Server with ManifoldCF
> 2.11. Documents are parsed when I curl documents into Tika Server directly.
> However, when I try to parse them via Tika Server I get that error at
> *most* of the documents (not all of them):
>
> INFO meta (application/msword)
> WARN meta: Text extraction failed
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:402)
> at
> org.apache.tika.server.resource.MetadataResource.parseMetadata(MetadataResource.java:126)
> at
> org.apache.tika.server.resource.MetadataResource.getMetadata(MetadataResource.java:60)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
> at
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
> at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
> at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
> at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.eclipse.jetty.server.Server.handle(Server.java:531)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
> at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
> at
> org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to end a page
> at
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:428)
> at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:162)
> at
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
> at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> at
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> ... 44 more
> Caused by: org.apache.tika.exception.TikaException: TesseractOCRParser
> timeout
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:562)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:434)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:338)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parseInline(TesseractOCRParser.java:310)
> at
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.doOCROnCurrentPage(AbstractPDF2XHTML.java:337)
> at
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.endPage(AbstractPDF2XHTML.java:418)
> ... 50 more
> Caused by: java.util.concurrent.TimeoutException
> at java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:551)
> ... 55 more
>
> How can I solve it?
>
> Kind Regards,
> Furkan KAMACI
>