Posted to user@nutch.apache.org by Tomi NA <he...@gmail.com> on 2006/09/02 11:19:08 UTC

Re: Could anyone teache me how to index the title or content of PDF?

On 9/1/06, Frank Huang <ka...@yahoo.com.tw> wrote:

> But when I execute ./nutch crawl, it shows messages like "fetch okay,
> but can't parse http://(omit...).pdf", reason: failed <omit..> content
> truncated at 70709 bytes. Parser can't handle incomplete pdf file.

Haven't had time to go through the complete code (not sure I'd
understand it, anyway), but this looks like you need to set
file.content.limit to, say, 16777216. If you're crawling over http
rather than intranet shares, the property you need to set is
http.content.limit.
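
As a rough sketch of what that might look like (assuming the overrides
go inside the root element of conf/nutch-site.xml, the usual place for
local settings; the file choice is an assumption, and 16777216 bytes is
16 MiB):

<property>
  <name>file.content.limit</name>
  <value>16777216</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>16777216</value>
</property>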

Hope it helps.

t.n.a.

Re: Could anyone teache me how to index the title or content of PDF?

Posted by King Kong <ch...@hotmail.com>.



Frank Huang wrote:
> 
> Thanks for your help.
> 
> I am crawling over http and set http.content.limit as follows in
> nutch-default:
> <property>
>   <name>http.content.limit</name>
>   <value>16777216</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
> 
> but it still shows the same error:
> fetch okay, but can't parse http://(omit...).pdf, reason: failed
> <omit..> content truncated at 70709 bytes. Parser can't handle
> incomplete pdf file.
> 
> What did I do wrong? Thanks.
> 
> 


You must set http.content.limit to -1. As the property description
says, a negative value means no truncation at all.
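
A minimal sketch of that setting, assuming it is placed inside the root
element of conf/nutch-site.xml so that it overrides the value in
nutch-default (the file choice is an assumption, not something
confirmed in this thread):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

With no size cap, very large downloads are kept whole, so this trades
memory and bandwidth for complete PDF files.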

-- 
View this message in context: http://www.nabble.com/Could-anyone-teache-me-how-to-index--the-title-or-content-of-PDF--tf2203822.html#a6120073
Sent from the Nutch - User forum at Nabble.com.

Re: Could anyone teache me how to index the title or content of PDF?

Posted by Frank Huang <ka...@yahoo.com.tw>.

Thanks for your help.

I am crawling over http and set http.content.limit as follows in
nutch-default:
<property>
  <name>http.content.limit</name>
  <value>16777216</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

but it still shows the same error:
fetch okay, but can't parse http://(omit...).pdf, reason: failed
<omit..> content truncated at 70709 bytes. Parser can't handle
incomplete pdf file.

What did I do wrong? Thanks.

Tomi NA wrote:
> 
> On 9/1/06, Frank Huang <ka...@yahoo.com.tw> wrote:
> 
>> But when I execute ./nutch crawl, it shows messages like "fetch okay,
>> but can't parse http://(omit...).pdf", reason: failed <omit..> content
>> truncated at 70709 bytes. Parser can't handle incomplete pdf file.
> 
> Haven't had time to go through the complete code (not sure I'd
> understand it, anyway), but this looks like you need to set
> file.content.limit to, say, 16777216. If you're crawling over http
> rather than intranet shares, the property you need to set is
> http.content.limit.
> 
> Hope it helps.
> 
> 
> t.n.a.
> 
> 
