Posted to user@nutch.apache.org by Peter van Dijk <ph...@hotmail.com> on 2010/06/20 15:20:33 UTC

some pdf files fail

After using Nutch for a while, I found that some PDF files can't be indexed:

java.lang.NoClassDefFoundError: javax/media/jai/PlanarImage

I've already set up the PDF plugin so that the jai_core.jar and jai_codec.jar files are in place, and the respective lines in parse-pdf/plugin.xml are uncommented. However, some PDF files continue to fail, while others work fine.
 		 	   		  

Re: some pdf files fail

Posted by Gora Mohanty <go...@srijan.in>.
On Sun, 20 Jun 2010 17:22:12 +0200
reinhard schwab <re...@aon.at> wrote:

> if you read the README.txt file, you will read
> 
> Apache Nutch README
> 
> Important note: Due to licensing issues we cannot provide two
> libraries that are normally provided with PDFBox (jai_core.jar,
> jai_codec.jar), the parser library we use for parsing PDF files.
> If you encounter unexpected problems when
> working with PDF files please
[...]

I think he is saying that he followed those instructions, and some
PDFs still fail.

Peter, are you still getting the error message about
java.lang.NoClassDefFoundError: javax/media/jai/PlanarImage
or are the PDFs failing for some other reason? I know that the
instructions for the Java Advanced Imaging API worked for us.
If JAI is still failing, are you sure that the JAI .jar files
are in the right place? Where did you put them?
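One quick way to narrow this down is to check whether the class named in the error can be located on a given classpath at all. The following is only a minimal sketch: the class name JaiCheck is made up for illustration, and it tests a plain JVM classpath, not the class loader Nutch uses for its plugins, so a positive result here does not prove that the parse-pdf plugin sees the jars.

```java
// Hypothetical helper (not part of Nutch or PDFBox): reports whether a class
// can be located by the class loader that loaded this helper.
class JaiCheck {

    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate and load the class,
            // do not run its static initializers.
            Class.forName(className, false, JaiCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but could not be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the reported NoClassDefFoundError.
        String name = "javax.media.jai.PlanarImage";
        System.out.println(name + " available: " + isClassAvailable(name));
    }
}
```

Compile it and run it with the two jars on the classpath, for example `java -cp jai_core.jar:jai_codec.jar:. JaiCheck` (use `;` as the separator on Windows). If it reports false, the jars themselves are the problem; if it reports true, the jars are fine and the question is whether the plugin is picking them up.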

Regards,
Gora

Re: some pdf files fail

Posted by reinhard schwab <re...@aon.at>.
if you read the README.txt file, you will read

Apache Nutch README

Important note: Due to licensing issues we cannot provide two libraries that
are normally provided with PDFBox (jai_core.jar, jai_codec.jar), the parser
library we use for parsing PDF files. If you encounter unexpected problems
when working with PDF files please

1. Download the two missing libraries from:
   http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

2. Put them in the directory src/plugin/parse-pdf/lib
3. Follow the instructions in the file src/plugin/parse-pdf/plugin.xml
4. Rebuild Nutch.
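As a sanity check on step 2, the presence of the two jars can be verified with a few lines of Java. This is a rough sketch under assumptions: the class name PdfLibCheck is invented for illustration, the default path is the source-tree directory from the README, and the directory that a built Nutch actually loads plugins from may be a different one, so it is worth pointing the check at that directory as well.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Nutch): lists which expected jar files
// are missing from a given directory.
class PdfLibCheck {

    static List<String> missingJars(Path libDir, List<String> jarNames) {
        List<String> missing = new ArrayList<>();
        for (String name : jarNames) {
            // A jar counts as present only if it exists as a regular file.
            if (!Files.isRegularFile(libDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Default is the directory named in the README; pass another
        // directory as the first argument to check it instead.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "src/plugin/parse-pdf/lib");
        List<String> missing =
                missingJars(libDir, Arrays.asList("jai_core.jar", "jai_codec.jar"));
        System.out.println(missing.isEmpty()
                ? "Both jars found in " + libDir
                : "Missing in " + libDir + ": " + missing);
    }
}
```

Run it from the Nutch source root, or pass the directory to inspect as the first argument. It only confirms that the files exist; it does not check that plugin.xml references them or that the rebuild copied them.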


Peter van Dijk wrote:
> After using Nutch for a while, I found that some PDF files can't be indexed:
>
> java.lang.NoClassDefFoundError: javax/media/jai/PlanarImage
>
> I've already set up the PDF plugin so that the jai_core.jar and jai_codec.jar files are in place, and the respective lines in parse-pdf/plugin.xml are uncommented. However, some PDF files continue to fail, while others work fine.