Posted to user@nutch.apache.org by Mr Hadoop <mr...@gmail.com> on 2009/12/04 13:10:47 UTC

Can nutch pause, stop and start where it left off?

I am just starting to learn Nutch. One thing I would like to know: can
Nutch pause, stop, and then resume indexing a site on an incremental, daily
basis? My concern is that Nutch will behave like a hog, crawling
everything with huge bandwidth consumption and pissing off a lot of site
owners.

Can some experts shed some light on this?

Re: Can nutch pause, stop and start where it left off?

Posted by MilleBii <mi...@gmail.com>.
Nutch behaves ...
So by default it will not fetch more than 1 URL every 5 s (the setting is
changeable) from a given host (identified by name or IP, depending on the
Nutch conf file).
So you will actually find the opposite: it is very slow for a single
site. Speed comes when you fetch several sites in parallel.
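
To put rough numbers on that, here is a small Python sketch (not Nutch code) of a per-host politeness delay. It assumes one request at a time per host, hosts fetched in parallel, a 5 s delay, and zero download time. The function name estimate_crawl_seconds and the example host names are made up for illustration. In the Nutch releases I know, the delay is the fetcher.server.delay property, which can be overridden in nutch-site.xml, but check nutch-default.xml for your version.

```python
from collections import Counter
from urllib.parse import urlparse


def estimate_crawl_seconds(urls, delay_s=5.0):
    """Rough lower bound on crawl time under a per-host politeness delay.

    Assumes one request at a time per host, hosts fetched in parallel,
    and zero download time (simplifications, not Nutch's real scheduler).
    """
    per_host = Counter(urlparse(u).hostname for u in urls)
    if not per_host:
        return 0.0
    # Each host needs (n - 1) waits between its n requests; the busiest
    # host decides the total because the others are fetched meanwhile.
    return (max(per_host.values()) - 1) * delay_s


# 100 URLs on a single (hypothetical) host
one_site = [f"http://a.example/page{i}" for i in range(100)]
# 100 URLs spread evenly over 10 (hypothetical) hosts
ten_sites = [f"http://site{i % 10}.example/page{i}" for i in range(100)]

print(estimate_crawl_seconds(one_site))   # 495.0
print(estimate_crawl_seconds(ten_sites))  # 45.0
```

Same number of URLs, but the single-site crawl takes about eleven times longer in this model, which is why one site on its own feels so slow.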




-- 
-MilleBii-

Re: Can nutch pause, stop and start where it left off?

Posted by Jesse Hires <jh...@gmail.com>.
Use the -topN flag to grab only a small number of URLs.
I also believe there is a setting you can put in nutch-site.xml that
can be used to slow down how many URLs you grab over time.
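
On stopping and picking up where it left off: as far as I understand it, Nutch keeps its crawl state on disk in the crawldb and works in rounds (generate, fetch, updatedb), so -topN caps how much is fetched per round and the next round continues from the saved state. The toy Python model below only illustrates that idea. It is not Nutch code: CrawlDb, inject, generate, and update here are hypothetical names, the JSON file is not Nutch's storage format, and real selection also depends on scores and fetch intervals, which this sketch ignores.

```python
import json
import tempfile
from pathlib import Path


class CrawlDb:
    """Toy persistent crawl state (hypothetical, not Nutch's format)."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload the URL -> status map if an earlier run saved one.
        self.status = json.loads(self.path.read_text()) if self.path.exists() else {}

    def _save(self):
        self.path.write_text(json.dumps(self.status))

    def inject(self, urls):
        """Add new URLs without touching ones already known."""
        for url in urls:
            self.status.setdefault(url, "unfetched")
        self._save()

    def generate(self, top_n):
        """Pick at most top_n unfetched URLs for this round."""
        pending = [u for u, s in self.status.items() if s == "unfetched"]
        return pending[:top_n]

    def update(self, fetched):
        """Record the URLs fetched in this round and persist the state."""
        for url in fetched:
            self.status[url] = "fetched"
        self._save()


urls = [f"http://a.example/page{i}" for i in range(5)]
db_file = Path(tempfile.mkdtemp()) / "crawldb.json"

# Day 1: inject seeds and fetch a small batch, then stop.
db = CrawlDb(db_file)
db.inject(urls)
first_batch = db.generate(top_n=2)
db.update(first_batch)

# Day 2: a fresh process reloads the state and continues with the rest.
db = CrawlDb(db_file)
second_batch = db.generate(top_n=2)
print(first_batch, second_batch)
```

The second batch starts at page2 rather than page0, because the saved state already marks the first two URLs as fetched.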

Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com


