Posted to user@nutch.apache.org by Sævaldur Arnar Gunnarsson <ad...@hugsmidjan.is> on 2007/05/18 05:09:33 UTC

parser not found for contentType=application/pdf

Hi, I'm evaluating Nutch as a search platform for a large Icelandic
website.
The website has quite a large collection of Adobe Acrobat documents
(PDF) stored on a Lotus Domino server.

I run Nutch with:
./bin/nutch crawl example-domain/url-list.txt -dir example-domain/index/ -depth 9999 -topN 9999

Out of 3,072 PDF documents fetched by Nutch, 1,687 returned the
following error:
Error parsing:
http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf

With best regards,
-- 
Sævaldur Arnar Gunnarsson
System Administrator | RHCE

Hugsmiðja ehf.
Snorrabraut 56 | 105 Reykjavík
S: 550 0900 | G: 659 0007

Re: parser not found for contentType=application/pdf

Posted by Dennis Kubes <nu...@dragonflymc.com>.
The nutch-default.xml file contains the configuration property
plugin.includes. Copy that property into nutch-site.xml and change
parse-(text|html|js) to parse-(text|html|js|pdf). This enables the
PDF parser plugin.
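
As a rough sketch, the override in nutch-site.xml might look like the
fragment below. Only the parse-(text|html|js|pdf) part comes from this
thread. The rest of the value is an assumption about what a typical
nutch-default.xml of this era contains, so copy the actual value from
your own nutch-default.xml rather than from here. The property goes
inside the existing root element of nutch-site.xml.

```xml
<!-- Override of plugin.includes, placed inside the root element of nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- ASSUMPTION: everything except "|pdf" is a guess at the default value.
       Start from the value in your own nutch-default.xml and add only "|pdf"
       to the parse-(...) group. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Settings in nutch-site.xml take precedence over nutch-default.xml, so
the default file itself can stay untouched.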

Dennis Kubes
