Posted to user@nutch.apache.org by Andrei Hajdukewycz <ah...@mozilla.com> on 2006/09/04 22:42:42 UTC

Recrawling

Hi,
I've crawled a site of roughly 30,000-40,000 pages using the
bin/nutch crawl command, which went quite smoothly. Now,
however, I'm trying to recrawl it using the script at
http://wiki.apache.org/nutch/IntranetRecrawl?action=show .
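
For reference, the initial crawl was just the stock one-shot command,
roughly (paths from memory, so treat it as a sketch):

  # One-shot crawl: repeats generate/fetch/updatedb "depth" times and
  # indexes the result. urls/ holds the seed URL list.
  bin/nutch crawl urls -dir crawl -depth 5 -threads 3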

When I run the recrawl, though, I generally end up fetching
80-100k pages instead of 30-40k, with many pages fetched more
than once.

I assume this is due to the number of generate+fetch cycles I'm
running, which is 5. I'm looking for advice on settings to tune
this so I end up with fewer duplicate fetches but still full
coverage of the site.

"depth" as per the script is set to 5, topN unspecified, 31
days added to force refetch of everything.
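
For clarity, the guts of that wiki script boil down to a loop roughly
like this (paraphrased, so the exact version on the wiki may differ):

  depth=5
  adddays=31
  for ((i=0; i < depth; i++))
  do
    # -adddays shifts "now" forward so pages look due for refetch
    bin/nutch generate crawl/crawldb crawl/segments -adddays $adddays
    # fetch the segment generate just created (the newest one)
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    # fold the fetched pages and new links back into the crawldb
    bin/nutch updatedb crawl/crawldb $segment
  done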

My relevant settings in nutch-site.xml are as follows:
db.ignore.internal.links = false,
db.ignore.external.links = true,
fetcher.server.delay = 1.0,
fetcher.threads.fetch = 3,
fetcher.threads.per.host = 3,
db.default.fetch.interval = 1
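
For anyone comparing against their own setup, that last one lives in
nutch-site.xml like this (the value is in days, if I'm reading the
docs right):

  <property>
    <name>db.default.fetch.interval</name>
    <value>1</value>
    <description>Default number of days between re-fetches of a page.</description>
  </property>

Writing it out, I wonder if that combination is part of the problem:
with the interval at 1 day and adddays at 31, anything fetched in an
earlier generate/fetch cycle already looks due again at the next
generate, so repeat fetches within a single recrawl run would be
expected.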

Any help would be most appreciated!
Andrei

Re: Recrawling

Posted by Raghavendra Prabhu <rr...@gmail.com>.
I am not sure, but I think this would be the reason:

When you crawl the site the first time with the specified depth, the
other URLs are discovered along the way.

But the second time you crawl, those URLs are already in the crawldb,
and the depth applies relative to all of them.

In both cases the depth is the same, but since depth controls how far
the crawl goes from each URL, the second crawl fetches more because
many URLs are already there to start from.


Regards,
Prabhu

Re: Recrawling

Posted by ytthet <ye...@gmail.com>.
Folks,

Have you found any solutions?

I am facing the same issue.

Thanks,

YT. Thet

Tomi N/A wrote:
> 
> On 9/6/06, Andrei Hajdukewycz <ah...@mozilla.com> wrote:
>> Another problem I've noticed is that it seems the db grows *rapidly* with
>> each successive recrawl. Mine started at 379MB, and it seems to increase
>> by roughly 350MB every time I run a recrawl, despite there not being
>> anywhere near that many additional pages.
>>
>> This seems like a pretty severe problem, honestly; obviously there's
>> a lot of duplicated data in the segments.
> 
> I have the same problem: my index grew from 1.5GB after the original
> crawl to over 5GB(!) after the recrawl...from the looks of it, I might
> as well crawl anew every time. :\
> 
> t.n.a.
> 
> 
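
P.S. The closest thing to a workaround I can think of is merging each
run's segments into one and deleting the old ones, since the stale
segments are what pile up on disk. An untested sketch, assuming the
Nutch 0.8 mergesegs tool:

  # Merge all per-cycle segments into a single new segment, then
  # swap it in place of the old ones (back everything up first!).
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
  rm -rf crawl/segments
  mv crawl/MERGEDsegments crawl/segments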
