Posted to solr-user@lucene.apache.org by Jon Dragt <jd...@fhsws.com> on 2012/03/23 17:14:00 UTC

Unexpected Tika Exception extracting text from a PDF file.

Howdy Folks,

I'm stumped and hope somebody can give me some clues on how to work around
this occasional error I'm getting.

I've got a .NET console program that uses SolrNet to scour certain folders at
certain times, extract text from PDF files, and index them. It succeeds on
the majority of the files, but it fails on several test files. Though I'm new
to this environment, I gather that SolrNet calls on Solr (v. 3.5.0) to do
this, which in turn calls on the Tika library (v. 0.10), which calls on the
PDFBox library (v. 1.6.0).

To isolate the problem, I took SolrNet and .NET out of the equation
and switched to a Linux console. I downloaded pdfbox-app-1.6.0.jar and
executed:

java -jar pdfbox-app-1.6.0.jar ExtractText -console a.pdf

Everything worked fine.

I then moved up to Tika. I downloaded tika-app-0.10.jar and executed:

java -jar tika-app-0.10.jar -t a.pdf

Again, everything worked fine.

I then executed:

curl 'http://localhost:8993/solr/MyCore/update/extract?map.content=text&commit=true' -F file=@a.pdf

It failed with the following output. (Note: the above command works fine
with other PDF files, but fails on these few.)

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:213)
    ... 22 more
Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDFont.getEncodingFromFont(PDFont.java:832)
    at org.apache.pdfbox.pdmodel.font.PDFont.determineEncoding(PDFont.java:293)
    at org.apache.pdfbox.pdmodel.font.PDFont.&lt;init&gt;(PDFont.java:178)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.&lt;init&gt;(PDSimpleFont.java:79)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.&lt;init&gt;(PDType1Font.java:139)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:109)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:441)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:365)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:321)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:241)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:90)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 25 more
</title>
</head>

<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /solr/karaoke/update/extract. Reason:
<pre>    org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:213)
    ... 22 more
Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDFont.getEncodingFromFont(PDFont.java:832)
    at org.apache.pdfbox.pdmodel.font.PDFont.determineEncoding(PDFont.java:293)
    at org.apache.pdfbox.pdmodel.font.PDFont.&lt;init&gt;(PDFont.java:178)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.&lt;init&gt;(PDSimpleFont.java:79)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.&lt;init&gt;(PDType1Font.java:139)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:109)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:441)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:365)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:321)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:241)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:90)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 25 more
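
For anyone skimming a trace this long, the part that matters is the last "Caused by:" entry and the first frame under it. Below is a minimal Python sketch that pulls those two pieces out of a pasted trace. The function name root_cause and the SAMPLE text are made up for illustration; SAMPLE is a shortened excerpt of the trace above and assumes the wrapped lines have been joined back together.

```python
import html
import re

# Matches "Caused by: some.package.SomeException[: message]"
CAUSE = re.compile(r"^Caused by: ([\w.$]+)")
# Matches "at some.package.Class.method(File.java:123)"
FRAME = re.compile(r"^at ([\w.$<>]+)\(([^)]*)\)")


def root_cause(trace):
    """Return (exception_class, top_frame) for the last 'Caused by:' entry.

    Returns (None, None) when the text contains no 'Caused by:' line.
    """
    exception, top_frame = None, None
    # The trace above came out of an HTML page, so undo entities like &lt;init&gt;.
    for raw in html.unescape(trace).splitlines():
        line = raw.strip()
        cause = CAUSE.match(line)
        if cause:
            # A newer cause replaces the previous one; reset its top frame.
            exception, top_frame = cause.group(1), None
            continue
        frame = FRAME.match(line)
        if frame and exception and top_frame is None:
            top_frame = f"{frame.group(1)} ({frame.group(2)})"
    return exception, top_frame


# Hypothetical sample: a shortened, unwrapped excerpt of the trace in this post.
SAMPLE = """\
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@58c5f8
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDFont.getEncodingFromFont(PDFont.java:832)
    at org.apache.pdfbox.pdmodel.font.PDFont.determineEncoding(PDFont.java:293)
"""

if __name__ == "__main__":
    print(root_cause(SAMPLE))
```

On the sample, this reports java.lang.NullPointerException raised in PDFont.getEncodingFromFont, which matches the innermost cause in the trace: the failure originates in PDFBox's font handling, and Tika and Solr only wrap it.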

 

Can anybody explain what's going on here and how I can get around this
problem?

Jon