Posted to user@nutch.apache.org by Manu Reddy <ma...@gmail.com> on 2012/12/14 18:08:30 UTC

Re: Best practices for running Nutch

Could you tell us what problems can arise if we run the fetcher for a long
time? Can you point us to somewhere we can find more information on
this?

Thanks in advance.
Manu




On Mon, Nov 19, 2012 at 8:08 PM, kiran chitturi
<ch...@gmail.com> wrote:

> Hi
>
> > I'm not sure this applies to you because I don't know what you mean by
> > `running crawler`; never run the fetcher for longer than an hour or so.
> >
>
> Thank you for the reply. I have seen the fetcher run for longer, 3-5
> hours or more.
>
> Based on both of your suggestions, I will see if I can switch my
> database and use the crawl script instead of the crawl class.
>
> I will start a new script and see whether it changes the
> performance.
>
> Thank you,
> Kiran.
>
> --
> Kiran Chitturi
>



-- 
Manu Reddy,
mobile no: 091-8143405225,
Hyderabad.

RE: Best practices for running Nutch

Posted by Markus Jelsma <ma...@openindex.io>.
A long-running fetcher:
- lets any memory leak accumulate until disaster strikes;
- loses more records if it terminates;
- can run even longer, because there are more records to shuffle in MapReduce;
- is harder to debug, because log files can become too large to grep through easily;
- is more likely to have a hanging fetcher thread;
- ...

The fetcher.timelimit and throughput settings are your friends. There is no problem in terminating a fetcher gracefully via these settings; just update and generate again, and unfetched URLs will be fetched eventually.

The downside is that you have to update and index more often and therefore read the crawldb more often.
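
As a rough illustration of the time limit and throughput settings mentioned above, here is a minimal sketch of a nutch-site.xml fragment. It assumes the property names fetcher.timelimit.mins and fetcher.throughput.threshold.pages as they appear in nutch-default.xml for Nutch 1.x; check the nutch-default.xml shipped with your version before relying on them. The values (60 minutes, 1 page per second) are illustrative assumptions, not recommendations from this thread.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch; values are illustrative assumptions) -->
<configuration>
  <!-- Stop fetching gracefully after this many minutes.
       60 mirrors the "an hour or so" rule of thumb quoted in this thread. -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>60</value>
  </property>

  <!-- Stop the fetcher when throughput drops below this many pages per second.
       The value 1 is a placeholder; tune it for your crawl. -->
  <property>
    <name>fetcher.throughput.threshold.pages</name>
    <value>1</value>
  </property>
</configuration>
```

With limits like these in place, a fetch that stops early simply leaves some URLs unfetched; they are picked up by a later generate/fetch/update cycle.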
 