Posted to user@nutch.apache.org by Johannes Söllner <jo...@gmx.net> on 2005/09/17 00:27:47 UTC
pdf parsing
Hi,
Can anybody tell me how to activate the pdf parser shipped with nutch?
I keep getting the following log messages:
050915 234524 parsing: /home/jcs/workfield/CeMM/software/nutch-0.7/plugins/parse-text/plugin.xml
050915 234524 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
050915 234524 not including: /home/jcs/workfield/CeMM/software/nutch-0.7/plugins/parse-ext
050915 234524 not including: /home/jcs/workfield/CeMM/software/nutch-0.7/plugins/parse-pdf
050915 234524 not including: /home/jcs/workfield/CeMM/software/nutch-0.7/plugins/parse-rss
And consequently
050915 234531 fetch okay, but can't parse http:// [...] /961.pdf, reason:
failed(2,203): Content-Type not text/html: application/pdf
plugins/parse-pdf/plugin.xml seems to be O.K. and PDFBox is in place.
Where does nutch decide whether to load or skip a plugin?
regards, jc
AW: pdf parsing
Posted by johannes soellner <jo...@gmx.net>.
O.K., sorry, I missed this thread (reply)
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00794.html
I can't test it today, but I think this will work for me as well, as the issue
seems to be the same (module activation).
It might be worthwhile to add a hint to the log message, like
"050915 234531 fetch okay, but can't parse [...], reason:
failed(2,203): Content-Type not text/html: application/pdf. Have you
activated content type application/pdf in conf/nutch-default.xml?"
At least as long as module auto-loading is not available. That would help a
nutch newbie like me ;-)
Thanks, Johannes
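
For later readers: as far as I understand it, the load-or-skip decision comes from the plugin.includes property in conf/nutch-default.xml (which can be overridden in conf/nutch-site.xml). Its value is a regular expression matched against plugin directory names, and anything that does not match is logged as "not including". Below is a minimal Python sketch of that check, not Nutch's actual code. The includes value is an illustrative assumption rather than the verified 0.7 default, and the helper names are hypothetical.

```python
import re

# Illustrative value only (an assumption); check plugin.includes in your own
# conf/nutch-default.xml or conf/nutch-site.xml for the real setting.
ASSUMED_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(text|html)"
    r"|index-basic|query-(basic|site|url)"
)

# Plugin directory names taken from the log excerpt in this thread.
PLUGINS = ["parse-text", "parse-ext", "parse-pdf", "parse-rss"]


def is_included(plugin_id, includes):
    """Hypothetical helper: True if the whole plugin id matches the regex."""
    return re.fullmatch(includes, plugin_id) is not None


def report(includes):
    """Map each plugin id to whether it would be included."""
    return {plugin: is_included(plugin, includes) for plugin in PLUGINS}


# With the assumed value, only parse-text passes the filter.
before = report(ASSUMED_INCLUDES)

# Adding "pdf" to the parse-(...) alternation lets parse-pdf through.
with_pdf = ASSUMED_INCLUDES.replace("parse-(text|html)", "parse-(text|html|pdf)")
after = report(with_pdf)

for plugin in PLUGINS:
    print(plugin, before[plugin], after[plugin])
```

So the likely fix is to extend the parse-(...) group of plugin.includes with pdf, preferably in conf/nutch-site.xml so that the defaults stay untouched.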