Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/02/04 02:42:22 UTC

Crawling and re-crawling huge sites

Hi all,

 I am crawling a really huge site, and the crawl has been running for
almost 5 days now and it's still going.

 So will I be unable to see any results until this crawl ends? How can I
get results while the crawl is still running?

 Also, in this case how do I configure re-crawls? What would be an optimal
re-crawl interval?

Thanks,
Abi

Re: Crawling and re-crawling huge sites

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi folks,

 Thanks for your help. I will try these and get back if I have more
questions.

Regards,
Gokul

On Sat, Feb 5, 2011 at 4:17 AM, Charan K <ch...@gmail.com> wrote:

> Hi Abhishek,
>
>  You need to limit the size of each crawl cycle, say to 2 million URLs per
> fetch. Check the topN param for generate.
>
>  By default a URL becomes eligible for re-crawl after 30 days, but this is
> configurable.
>
>  You can run a continuous crawl script that fetches 2 million URLs per
> cycle. You can purge old segments after a period, since most of their URLs
> will have been re-crawled by then.
>  Hope it helps
>
>  Thanks
>  Charan
>

Re: Crawling and re-crawling huge sites

Posted by Charan K <ch...@gmail.com>.
Hi Abhishek,

  You need to limit the size of each crawl cycle, say to 2 million URLs per fetch. Check the topN param for generate.

  By default a URL becomes eligible for re-crawl after 30 days, but this is configurable.
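
  To put a number on that interval, here is a minimal sketch that converts days to seconds. It assumes the interval is set through the db.fetch.interval.default property in conf/nutch-site.xml and that the value is given in seconds; the property name and both helper functions are assumptions rather than anything stated in this thread, so check conf/nutch-default.xml for your Nutch version before relying on them.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Nutch): convert a re-crawl interval in
# days to seconds and print a matching <property> block for nutch-site.xml.
# Assumption: the property is db.fetch.interval.default, with a value in seconds.

interval_seconds() {
    # $1 = re-crawl interval in days
    echo $(( $1 * 24 * 60 * 60 ))
}

interval_property() {
    # $1 = re-crawl interval in days
    printf '<property>\n  <name>db.fetch.interval.default</name>\n  <value>%s</value>\n</property>\n' \
        "$(interval_seconds "$1")"
}
```

  Running `interval_seconds 30` prints 2592000, which corresponds to the 30-day default, and `interval_property 7` prints a block for a weekly re-crawl. What interval is optimal depends on how often the site actually changes.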
 
  You can run a continuous crawl script that fetches 2 million URLs per cycle. You can purge old segments after a period, since most of their URLs will have been re-crawled by then.
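
  One such cycle could be sketched roughly as below. This is only an outline under assumptions: the crawl/crawldb and crawl/segments paths, the variable names, and the crawl_cycle function are illustrative, and the separate parse step is only needed when the fetcher is not already configured to parse. Purging old segments is not shown.

```shell
#!/bin/sh
# Sketch of one generate/fetch/updatedb cycle capped by -topN.
# Assumptions: paths, variable names and the function name are hypothetical;
# adjust them to your own installation.

NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
TOPN="${TOPN:-2000000}"

crawl_cycle() {
    # Select at most TOPN URLs that are due for fetching into a new segment.
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" || return 1

    # Segment directories are named by timestamp, so the last one is the newest.
    segment=$(ls -d "$SEGMENTS"/2* | tail -1)

    "$NUTCH" fetch "$segment" || return 1

    # Only parse separately if the fetcher did not parse during the fetch.
    if [ "${PARSE_SEPARATELY:-no}" = "yes" ]; then
        "$NUTCH" parse "$segment" || return 1
    fi

    # Merge the fetch results back into the crawldb so the next cycle
    # sees the new page statuses and newly discovered links.
    "$NUTCH" updatedb "$CRAWLDB" "$segment" || return 1
}
```

  Calling crawl_cycle repeatedly, for example from a loop or a cron job, gives you a crawldb that is updated after every cycle, so you do not have to wait for the whole site to finish before looking at results.
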
 Hope it helps

 Thanks
 Charan

On Feb 4, 2011, at 6:32 AM, Amine BENHAMZA <am...@gmail.com> wrote:

> Hi,
> 
> I saw this page on the wiki; maybe it could help you:
> http://wiki.apache.org/nutch/MonitoringNutchCrawls
> 
> Good Luck
> 
> Amine.
> 

Re: Crawling and re-crawling huge sites

Posted by Amine BENHAMZA <am...@gmail.com>.
Hi,

I saw this page on the wiki; maybe it could help you:
http://wiki.apache.org/nutch/MonitoringNutchCrawls

Good Luck

Amine.

On 4 February 2011 15:02, .: Abhishek :. <ab...@gmail.com> wrote:

> Hi all,
>
>  Any help on this would be highly appreciated. I am still stuck :(
>
> Thanks,
> Abi
>

Re: Crawling and re-crawling huge sites

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi all,

 Any help on this would be highly appreciated. I am still stuck :(

Thanks,
Abi

On Fri, Feb 4, 2011 at 8:42 AM, .: Abhishek :. <ab...@gmail.com> wrote:

> Hi all,
>
>  I am crawling a really huge site, and the crawl has been running for
> almost 5 days now and it's still going.
>
>  So will I be unable to see any results until this crawl ends? How can I
> get results while the crawl is still running?
>
>  Also, in this case how do I configure re-crawls? What would be an optimal
> re-crawl interval?
>
> Thanks,
> Abi
>