Posted to user@nutch.apache.org by Somnath Banerjee <so...@gmail.com> on 2007/04/30 17:12:03 UTC

Crawling fixed set of urls (newbie question)

Hi,

    I thought I had a very simple requirement: I just want to crawl a fixed
set of 2.3M URLs. Following the tutorial, I injected the URLs into the crawl
db, generated a fetch list and started fetching. After 5 days I found it had
fetched 3M pages and fetching was still going on. I stopped the process, and
now, looking at past posts in this group, I realize that I have lost 5 days
of crawling.

    Why did it fetch more pages than were in the fetch list? Is it because I
left the value of "db.max.outlinks.per.page" at 100? Also, in the crawl
command I didn't specify the "depth" parameter. Can somebody please help me
understand the process? In case this has already been discussed, please
point me to the appropriate post.

    From this mailing list, what I gathered is that I should generate small
fetch lists and merge the fetched contents. Since my URL set is fixed, I
don't want Nutch to discover new URLs. My understanding is that "./bin/nutch
updatedb" will discover new URLs, and the next time I run "./bin/nutch
generate" it will add those discovered URLs to the fetch list. Given that I
just want to crawl my fixed list of URLs, what is the best way to do that?

Thanks in advance,
-Som
PS: I'm using nutch-0.9, in case that is relevant.
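
For reference, here is a minimal sketch of the cycle described above, assuming
the stock Nutch 0.9 command-line tools; the paths (urls/, crawl/crawldb,
crawl/segments) and the -topN value are placeholders, not taken from the
original post:

    # inject the fixed seed list into the crawl db
    ./bin/nutch inject crawl/crawldb urls/

    # generate a fetch list; -topN caps the number of URLs in the segment
    ./bin/nutch generate crawl/crawldb crawl/segments -topN 100000

    # fetch the newest segment
    s=`ls -d crawl/segments/* | tail -1`
    ./bin/nutch fetch $s

    # fold the fetch results back into the crawl db
    ./bin/nutch updatedb crawl/crawldb $s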

Re: Crawling fixed set of urls (newbie question)

Posted by Somnath Banerjee <so...@gmail.com>.
I can see that it has only one segment (the one created by ./bin/nutch
generate). Any reason why it is crawling more pages than are given in the
fetchlist?

Thanks,
Somnath
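
One way to double-check the counts is with the standard reader tools (a rough
sketch; crawl/crawldb is a placeholder path and <segment_dir> stands for
whichever segment was generated):

    # overall crawldb statistics: total, fetched and unfetched URL counts
    ./bin/nutch readdb crawl/crawldb -stats

    # per-segment summary, including generated and fetched counts
    ./bin/nutch readseg -list crawl/segments/<segment_dir>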

On 5/1/07, qi wu <ch...@gmail.com> wrote:
>
> How many segments were generated during your crawl?
> If you have more than one segment, then some newly parsed outlinks from the
> pages might be appended to the crawldb.
> To prevent this, you can try updatedb with the option "-noAdditions" in
> nutch91.

Re: Crawling fixed set of urls (newbie question)

Posted by qi wu <ch...@gmail.com>.
How many segments were generated during your crawl?
If you have more than one segment, then some newly parsed outlinks from the pages might be appended to the crawldb.
To prevent this, you can try updatedb with the option "-noAdditions" in nutch91.
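
A minimal sketch of this suggestion, assuming a Nutch build whose updatedb
supports the -noAdditions flag; the paths and <segment> name are placeholders:

    # update the crawl db from the fetched segment, but do not add any
    # newly discovered outlink URLs to the db
    ./bin/nutch updatedb crawl/crawldb crawl/segments/<segment> -noAdditions

As far as I know, later releases also expose this behaviour as the
db.update.additions.allowed property, which can be set to false in
nutch-site.xml if the command-line flag is not available in your build.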
