Posted to user@nutch.apache.org by Omar <or...@yahoo.com> on 2006/10/04 20:08:53 UTC
Inconsistent behaviour while parsing pdf/word/ppt files
Hi,
I'm seeing inconsistent behaviour when parsing PDF/Word/PPT files. Parsing
and indexing work fine for most files but fail for a few. Here is an excerpt
from the logs:
-- crawler log
Error parsing: http://mysite/applications.pdf: failed(2,202): Content
truncated at 11974 bytes. Parser can't handle incomplete pdf file.
-- hadoop log
2006-10-02 17:49:30,187 WARN fetcher.Fetcher - Error parsing:
http://mysite/test.doc: failed(2,202): Content truncated at 11981 bytes.
Parser can't handle incomplete file.
I have set the content limit to "-1" in nutch-site.xml so that content is not
truncated, but it doesn't seem to take effect. Has anybody seen something
similar? Do I have to delete the property from nutch-default.xml as well,
just in case?
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
</description>
</property>
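Note that the URLs in the log excerpts above are http:// URLs, whereas
file.content.limit governs content fetched via the file: protocol. A minimal
sketch of the analogous override for HTTP fetches, assuming the
http.content.limit property from nutch-default.xml is the one that applies
here (not confirmed for this setup; the description text is illustrative):
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>Assumed HTTP counterpart of file.content.limit: the length
limit for content downloaded over http, in bytes. A negative value is meant
to disable truncation.
</description>
</property>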
Finally, I have a separate engine that indexes the same documents, so I don't
think the issue is with the web server.
Thanks for any help.
Omar
--
View this message in context: http://www.nabble.com/Inconsistent-behaviour-while-parsing-pdf-word-ppt-files-tf2384012.html#a6644990