Posted to user@nutch.apache.org by Edward Quick <ed...@hotmail.com> on 2008/09/05 12:09:25 UTC
error parsing Microsoft documents
Hi,
This error appears several times in my logs:
Error parsing: http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/ewr/$FILE/ewr.doc: failed(2,0): Can't be handled as Microsoft document. java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
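As a side check, the two numbers in that message can be turned back into bytes. This is only a rough sketch: it assumes (my assumption, not something the log states) that the parser reads the first 8 bytes of the file as a little-endian signed 64-bit integer, and the helper name signature_bytes is made up for illustration.

```python
import struct


def signature_bytes(value: int) -> bytes:
    """Pack a signed 64-bit integer into its 8 little-endian bytes.

    Assumption: the header signature in the error message is a
    little-endian signed long taken from the start of the file.
    """
    return struct.pack("<q", value)


# Both values are copied from the error message above.
expected = signature_bytes(-2226271756974174256)
read = signature_bytes(7015536635646467195)

print("expected:", expected.hex())  # hex dump of the signature the parser wants
print("read:    ", read)            # raw bytes actually found in the file
```

Under that assumption, the expected value comes out as d0cf11e0a1b11ae1 and the value read comes out as the ASCII text {\rtf1\a, which hints that this particular file may start like an RTF document rather than a Word binary.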
I searched the mailing list and found the following post
http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200610.mbox/%3C6911914.post@talk.nabble.com%3E
which states:
The reason for failure means that you can't parse these files using the
lib-parsems plugins, because they use a "fast save" format, which is not
supported.
Your only option is to use some other external parser through parse-ext
plugin.
Does that mean that if I take parse-msword out of nutch-site.xml and replace it with parse-ext, it should work? Or (as I suspect) is it a bit more complicated than that?
Thanks for your help.
Ed.