Posted to user@nutch.apache.org by Omar <or...@yahoo.com> on 2006/10/04 20:08:53 UTC

Inconsistent behaviour while parsing pdf/word/ppt files

Hi,

I'm seeing inconsistent behaviour when parsing pdf/word/ppt files. Parsing
and indexing work fine for most files, but fail for a few. Here is an
excerpt from the logs:

-- crawler log

Error parsing: http://mysite/applications.pdf: failed(2,202): Content
truncated at 11974 bytes. Parser can't handle incomplete pdf file.

-- hadoop log

2006-10-02 17:49:30,187 WARN  fetcher.Fetcher - Error parsing:
http://mysite/test.doc: failed(2,202): Content truncated at 11981 bytes.
Parser can't handle incomplete file.

I do have the content limit set to "-1" in nutch-site.xml so that content
isn't truncated, but it doesn't seem to take effect. Has anybody seen
something similar? Do I have to delete the property from nutch-default.xml
as well, just in case?

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  </description>
</property>
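
Since the failing URLs in the logs above are http:// rather than file://
URLs, the limit for the HTTP protocol may be the one that applies. A minimal
sketch of the corresponding override in nutch-site.xml, assuming the stock
http.content.limit property from nutch-default.xml is the relevant one; this
is a guess at the cause, not a confirmed fix:

<property>
  <!-- Assumption: the documents are fetched over HTTP, so this limit
       applies instead of file.content.limit. -->
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for content downloaded over HTTP, in bytes.
  A negative value is meant to disable truncation.
  </description>
</property>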

Finally, I do have a separate engine indexing the same documents, so I don't
think it is an issue with the webserver.

Thanks for any help.

Omar