Posted to dev@nutch.apache.org by Diane Palla <pa...@shu.edu> on 2005/08/31 21:39:47 UTC

Fw: PDF support? Does crawl parse PDF files? How do I get it to work?

Does Nutch have a way to parse PDF files, that is, files with the 
"application/pdf" content type?

I noticed a plugin variable setting in default.properties:

plugin.pdf=org.apache.nutch.parse.pdf*

I never changed this file.

Is that the right value?

I am using Nutch 0.7.

What do I have to do to make Nutch parse PDF files?

When I do the crawl, I get this error with application/pdf files:

050831 145126 fetch okay, but can't parse <mainurl>/research/126900/126969/126969.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf


If this is not possible yet, in which future version of Nutch do the 
developers expect parsing of application/pdf files to be available?


Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
palladia@shu.edu