Posted to user@nutch.apache.org by qi wu <ch...@gmail.com> on 2007/05/01 04:51:32 UTC

Re: Crawling fixed set of urls (newbie question)

How many segments were generated during your crawl?
If you have more than one segment, then some newly parsed outlinks from the pages might have been appended to the crawldb.
To prevent this, you can try updatedb with the "-noAdditions" option in nutch91.
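
For reference, here is a minimal sketch of a single fetch round that keeps the crawldb limited to the injected urls. The paths (urls/, crawl/crawldb, crawl/segments) are only example locations, and it assumes the "-noAdditions" option mentioned above is available in your build:

    # inject the fixed url list once
    ./bin/nutch inject crawl/crawldb urls/

    # generate a fetch list from the crawldb
    ./bin/nutch generate crawl/crawldb crawl/segments

    # pick up the segment that generate just created
    segment=`ls -d crawl/segments/* | tail -1`

    # fetch the segment
    ./bin/nutch fetch $segment

    # update the crawldb with fetch status only; -noAdditions keeps
    # newly discovered outlinks from being added
    ./bin/nutch updatedb crawl/crawldb $segment -noAdditions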

----- Original Message ----- 
From: "Somnath Banerjee" <so...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Monday, April 30, 2007 11:12 PM
Subject: Crawling fixed set of urls (newbie question)


> Hi,
> 
>    I thought I had a very simple requirement. I just want to crawl a fixed
> set of 2.3M urls. Following the tutorial I injected the urls into the crawl
> db, generated a fetch list and started fetching. After 5 days I found it had
> fetched 3M pages and fetching was still going on. I stopped the process and,
> now that I have looked at past posts in this group, I realize that I lost 5
> days of crawl.
> 
>    Why did it fetch more pages than were in the fetch list? Is it because I
> left the value of "db.max.outlinks.per.page" at 100? Also, in the crawl
> command I didn't specify the "depth" parameter. Can somebody please help me
> understand the process? In case this has already been discussed, please
> point me to the appropriate post if possible.
> 
>    From this mailing list I gathered that I should generate small sets of
> fetch lists and merge the fetched contents. Since my url set is fixed, I
> don't want nutch to discover new urls. My understanding is that "./bin/nutch
> updatedb" will discover new urls, and the next time I run "./bin/nutch
> generate" it will add those discovered urls to the fetch list. Given that I
> just want to crawl my fixed list of urls, what is the best way to do that?
> 
> Thanks in advance,
> -Som
> PS: I'm using nutch-0.9 in case that is required
>
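
On the "small sets of fetch lists" idea quoted above, the generate command can cap the size of each fetch list with -topN. A minimal sketch, using the same example paths as before and an arbitrary batch size:

    # generate a fetch list limited to the top 100000 urls by score
    ./bin/nutch generate crawl/crawldb crawl/segments -topN 100000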

Re: Crawling fixed set of urls (newbie question)

Posted by Somnath Banerjee <so...@gmail.com>.
I can see there is only one segment (the one created by ./bin/nutch
generate). Any reason why it is crawling more pages than are in the
fetchlist?
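
One hedged way to sanity-check the numbers (assuming the crawldb is at crawl/crawldb) is to compare the crawldb statistics against the size of the injected list:

    # print the total url count and a per-status breakdown
    ./bin/nutch readdb crawl/crawldb -stats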

Thanks,
Somnath

On 5/1/07, qi wu <ch...@gmail.com> wrote:
>
> How many segments were generated during your crawl?
> If you have more than one segment, then some newly parsed outlinks from the
> pages might have been appended to the crawldb.
> To prevent this, you can try updatedb with the "-noAdditions" option in
> nutch91.