Posted to user@nutch.apache.org by Johannes Söllner <jo...@gmx.net> on 2005/09/17 00:27:47 UTC

pdf parsing

Hi,

Can anybody tell me how to activate the PDF parser shipped with Nutch?

I keep getting these messages:

050915 234524 parsing: /home/jcs/workfield/CeMM/software/nutch-0.7/plugins/parse-text/plugin.xml
050915 234524 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
050915 234524 not including: /home/jcs/workfield/CeMM/software/nutch-0.7/plugins/parse-ext
050915 234524 not including: /home/jcs/workfield/CeMM/software/nutch-0.7/plugins/parse-pdf
050915 234524 not including: /home/jcs/workfield/CeMM/software/nutch-0.7/plugins/parse-rss

And consequently:

050915 234531 fetch okay, but can't parse http:// [...] /961.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf

plugins/parse-pdf/plugin.xml seems to be O.K. and PDFBox is in place.

Where does Nutch decide whether to load or skip a plugin?

regards, jc





AW: pdf parsing

Posted by johannes soellner <jo...@gmx.net>.
O.K., sorry, I missed this thread (reply)
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00794.html

Can't test it today, but I think this will work for me as well, as the issue
seems to be the same (module activation).
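
For readers hitting the same "not including" lines: as far as I recall, Nutch selects plugins by matching each plugin's directory name against a regular expression held in a configuration property (I believe it is called plugin.includes, set in conf/nutch-default.xml and overridden in conf/nutch-site.xml; please treat both the property name and the values below as assumptions and check your own config). Nutch itself is Java; the Python sketch below only illustrates the whole-name regex matching, using the plugin names from the log and two hypothetical pattern values.

```python
import re


def included_plugins(plugin_ids, includes_pattern):
    """Return the plugin ids whose whole name matches the include pattern."""
    regex = re.compile(includes_pattern)
    return [pid for pid in plugin_ids if regex.fullmatch(pid)]


# Plugin directory names taken from the log output in the original post.
plugins = ["parse-text", "parse-ext", "parse-pdf", "parse-rss"]

# Hypothetical include value that would be consistent with the log:
# only the text and html parsers are named.
before = included_plugins(plugins, r"parse-(text|html)")

# Hypothetical adjusted value that additionally names the pdf parser.
after = included_plugins(plugins, r"parse-(text|html|pdf)")

print(before)
print(after)
```

With the first pattern only parse-text is selected, which is what the log shows; adding pdf to the alternation selects parse-pdf as well. In a real setup the pattern would also have to cover the non-parser plugins already in use.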

It might be worthwhile to add a helpful hint to the log message, such as
"050915 234531 fetch okay, but can't parse [...], reason:
failed(2,203): Content-Type not text/html: application/pdf. Have you
activated content type application/pdf in conf/nutch-default.xml?"

At least as long as module auto-loading is not available. That would help a
nutch newbie like me ;-)

Thanks, Johannes
