Posted to dev@nutch.apache.org by Renxia Wang <re...@usc.edu> on 2015/02/15 19:10:33 UTC

Does Limiting (ftp|http).content.limit Affect Parsing and Deduplication?

Hi all,

I am running Nutch on my own laptop, and I'd like to set a limit via the
(ftp|http).content.limit properties so that the crawl does not spend a
long time downloading huge files and possibly run into Java heap space
issues. However, I wonder whether partially downloading files (especially
compressed files like zip, rar, etc.) can break the parsing and
deduplication steps, since the downloaded file is incomplete?
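
For reference, this is roughly what I plan to put in conf/nutch-site.xml
(the property names are from nutch-default.xml; the 1 MB value is just my
guess at a sane cap, not a recommended default):

  <property>
    <name>http.content.limit</name>
    <!-- truncate anything fetched over HTTP at ~1 MB; a negative value
         disables truncation entirely -->
    <value>1048576</value>
  </property>
  <property>
    <name>ftp.content.limit</name>
    <!-- same cap for FTP fetches -->
    <value>1048576</value>
  </property>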
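
To make the deduplication part of my question concrete: as far as I
understand, Nutch's default MD5Signature hashes the raw fetched bytes, so
a fetch truncated at http.content.limit will hash differently from a
complete fetch of the same file. A toy Java sketch of what I mean (plain
java.security.MessageDigest, not Nutch code; the class name is mine):

  import java.security.MessageDigest;
  import java.util.Arrays;

  public class TruncationVsSignature {
      public static void main(String[] args) throws Exception {
          byte[] fullFile = new byte[2_000_000];   // pretend 2 MB download
          Arrays.fill(fullFile, (byte) 'x');

          int limit = 1_048_576;                   // http.content.limit = 1 MB
          byte[] truncated = Arrays.copyOf(fullFile, limit);

          // MessageDigest.digest() resets the digest, so it can be reused.
          MessageDigest md5 = MessageDigest.getInstance("MD5");
          byte[] sigFull = md5.digest(fullFile);
          byte[] sigTruncated = md5.digest(truncated);

          // Different bytes in -> different signature out: a truncated copy
          // will not dedup against a complete copy of the same document.
          System.out.println("full vs truncated equal? "
                  + Arrays.equals(sigFull, sigTruncated));

          // But two fetches cut at the same limit hash identically, so
          // duplicates among truncated pages are still detected.
          byte[] sigTruncated2 = md5.digest(Arrays.copyOf(fullFile, limit));
          System.out.println("truncated vs truncated equal? "
                  + Arrays.equals(sigTruncated, sigTruncated2));
      }
  }

For zip files in particular, truncation is likely fatal for parsing too,
since the central directory that indexes the archive sits at its end.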

Thanks,

Renxia

Re: Does Limiting (ftp|http).content.limit Affect Parsing and Deduplication?

Posted by Siddharth Mahendra Dasani <sd...@usc.edu>.
Hey, my crawler is throwing a java.io.IOException after about 40-50
minutes of crawling. Were you guys facing this issue?

On Sun, Feb 15, 2015 at 10:10 AM, Renxia Wang <re...@usc.edu> wrote:

> Hi all,
>
> I am running Nutch on my own laptop, and I'd like to set a limit via the
> (ftp|http).content.limit properties so that the crawl does not spend a
> long time downloading huge files and possibly run into Java heap space
> issues. However, I wonder whether partially downloading files (especially
> compressed files like zip, rar, etc.) can break the parsing and
> deduplication steps, since the downloaded file is incomplete?
>
> Thanks,
>
> Renxia
>