Posted to dev@tika.apache.org by rohanpatil <ro...@gmail.com> on 2010/07/07 13:01:30 UTC

Tika 0.7 And Solr

Hi,

I am using the Solr distribution provided by Lucid Imagination. It bundles
Tika 0.5, which uses PDFBox 0.8, and it has problems extracting content from
large (>200 KB) version 1.5 PDFs.

I saw that PDFBox 1.x resolves this issue.
When I upgraded the extraction jars, I got the following errors.

Jul 7, 2010 2:38:56 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
	at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1108)
	at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:573)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
	at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
	at java.net.URLClassLoader$1.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at java.lang.ClassLoader.loadClassInternal(Unknown Source)
	... 28 more

Using Tika 0.5, I get the following error:

Jul 7, 2010 2:39:21 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 22
Jul 7, 2010 2:39:21 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@d81b4
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
	at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@d81b4
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
	... 18 more
Caused by: org.apache.pdfbox.exceptions.WrappedIOException
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
	... 21 more
Caused by: java.util.NoSuchElementException
	at java.util.AbstractList$Itr.next(Unknown Source)
	at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
	at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
	... 25 more

I am in a critical stage of my project! Any help will be greatly
appreciated! 
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tika-0-7-And-Solr-tp948840p948840.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

Re: Tika 0.7 And Solr

Posted by rohanpatil <ro...@gmail.com>.
Hi,

I got it working. The jar was already present in the dependency folder. The
problem was that I didn't know the jars have to be copied to the webapp's
WEB-INF/lib directory. Somebody should update the docs.
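
For anyone hitting the same thing, here is a rough sketch of how one might check
which jars in a lib directory actually contain the missing class. The class name
JarScan, the method jarsContaining, and the default "WEB-INF/lib" path are
made up for illustration; the real path depends on how the Solr webapp is
deployed.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Solr, Tika, or PDFBox.
final class JarScan {

    /** Returns the file names of the jars in libDir that contain the given class. */
    public static List<String> jarsContaining(File libDir, String className) throws IOException {
        // A class a.b.C is stored in a jar as the entry a/b/C.class
        String entry = className.replace('.', '/') + ".class";
        List<String> hits = new ArrayList<>();
        File[] jars = libDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars == null) {
            return hits; // libDir is not a readable directory
        }
        for (File f : jars) {
            try (JarFile jar = new JarFile(f)) {
                if (jar.getEntry(entry) != null) {
                    hits.add(f.getName());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument is the webapp's WEB-INF/lib directory.
        File libDir = new File(args.length > 0 ? args[0] : "WEB-INF/lib");
        System.out.println(jarsContaining(libDir,
                "org.bouncycastle.jce.provider.BouncyCastleProvider"));
    }
}
```

An empty list would suggest that the BouncyCastle jar never made it into that
directory.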

Thanks anyways!!
Rohan :-)
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tika-0-7-And-Solr-tp948840p949661.html

Re: Tika 0.7 And Solr

Posted by Ken Krugler <kk...@transpac.com>.
Hi Rohan,

On Jul 7, 2010, at 4:01am, rohanpatil wrote:

> I am using Solr provided by lucidimagination and it has tika 0.5 and uses
> pdfbox 0.8.
> And it has problems extracting content from large(>200kb) v1.5 PDFs.
>
> I saw that pdfbox 1.x resolves this issue.
> When i upgraded the extraction jars i got the following errors.
>
> Jul 7, 2010 2:38:56 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.NoClassDefFoundError:
> org/bouncycastle/jce/provider/BouncyCastleProvider

Back in January I'd run into the same issue:

> I believe the issue is that the PDFBox pom.xml declares the  
> dependency on the missing BouncyCastleProvider jar as "optional".
>
>    <dependency>
>      <groupId>bouncycastle</groupId>
>      <artifactId>bcprov-jdk14</artifactId>
>      <version>136</version>
>      <optional>true</optional>
>    </dependency>
>
> As explained in the Maven documentation, this means that Tika needs  
> to explicitly include the jar:
>
> http://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html
>
> I see a few other optional dependencies in the PDFBox pom.xml, but  
> perhaps the only one that's really critical is the above.
>
> Let me know if anybody else has input on this, otherwise I'll file  
> an issue and fix it.

To fix it, you could manually install the bcprov-jdk14 jar.
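
Once the jar is in place, a quick way to confirm that the class is visible is
to try loading it. This is only a minimal sketch: ClasspathCheck and describe
are hypothetical names, and it checks the class loader of whatever JVM runs
it, which is not necessarily the webapp class loader Solr uses inside Tomcat.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Solr, Tika, or PDFBox.
final class ClasspathCheck {

    /** Reports whether className is visible to the given loader, and where it was loaded from. */
    public static String describe(String className, ClassLoader loader) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class<?> c = Class.forName(className, false, loader);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself have no code source
            String where = (src == null || src.getLocation() == null)
                    ? "the JDK runtime" : src.getLocation().toString();
            return "found " + className + " in " + where;
        } catch (ClassNotFoundException | LinkageError e) {
            return "missing " + className + " (" + e.getClass().getSimpleName() + ")";
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        System.out.println(describe("org.bouncycastle.jce.provider.BouncyCastleProvider", loader));
        System.out.println(describe("org.apache.pdfbox.pdmodel.PDDocument", loader));
    }
}
```

Run with the same jars on the classpath as the webapp, a "missing" line for
BouncyCastleProvider would correspond to the NoClassDefFoundError quoted above.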

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g