Posted to dev@nutch.apache.org by Diane Palla <pa...@shu.edu> on 2005/08/31 21:39:47 UTC
Fw: PDF support? Does crawl parse pdf files? How do I get it to work?
Does Nutch have a way to parse pdf files, that is, files with the
"application/pdf" content type?
I noticed a plugin variable setting in default.properties:
plugin.pdf=org.apache.nutch.parse.pdf*
I never changed this file.
Is that the right value?
I am using Nutch 0.7.
What do I have to do to make it parse pdf files?
When I do the crawl, I get this error with application/pdf files:
050831 145126 fetch okay, but can't parse
<mainurl>/research/126900/126969/126969.pdf, reason: failed(2,203):
Content-Type not text/html: application/pdf
If it's not possible, in which future version of Nutch do the developers
expect to support parsing of application/pdf files?
Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
palladia@shu.edu