Posted to user@nutch.apache.org by Edward Quick <ed...@hotmail.com> on 2008/09/05 12:09:25 UTC

error parsing Microsoft documents

Hi,

My logs report the following error several times:

Error parsing: http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/ewr/$FILE/ewr.doc: failed(2,0): Can't be handled as Microsoft document. java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
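For reference, the two numbers in that message can be turned back into the bytes they came from. The sketch below is only an illustration and assumes the parser reads the first eight bytes of the file as a little-endian 64-bit value; the class and method names (SignatureDecoder, toHex, toAscii) are made up for this example and are not part of Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SignatureDecoder {

    // Lay the 64-bit value out as the 8 bytes it was read from.
    // ASSUMPTION: the header was read in little-endian byte order.
    public static byte[] toLittleEndianBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    // Space-separated upper-case hex dump of the 8 bytes.
    public static String toHex(long value) {
        StringBuilder sb = new StringBuilder();
        for (byte b : toLittleEndianBytes(value)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // The same 8 bytes interpreted as ASCII text.
    public static String toAscii(long value) {
        return new String(toLittleEndianBytes(value), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        long expected = -2226271756974174256L; // value from the log message
        long read = 7015536635646467195L;      // value from the log message

        System.out.println("expected: " + toHex(expected));
        System.out.println("read:     " + toHex(read) + "  " + toAscii(read));
    }
}
```

Under that assumption, the expected value decodes to D0 CF 11 E0 A1 B1 1A E1 and the value actually read decodes to the ASCII text {\rtf1\a. That may mean the file is RTF saved with a .doc extension rather than a binary Word document, though I have not confirmed this against the file itself.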

I searched the mailing list and found the following post:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200610.mbox/%3C6911914.post@talk.nabble.com%3E

which states:

The reason for failure means that you can't parse these files using the 
lib-parsems plugins, because they use a "fast save" format, which is not 
supported.

Your only option is to use some other external parser through parse-ext 
plugin.



Does that mean that if I take parse-msword out of nutch-site.xml and replace it with parse-ext, it should work? Or (as I suspect) is it a bit more complicated than that?

Thanks for your help.

Ed.
