Posted to user@nutch.apache.org by David Philip <da...@gmail.com> on 2012/12/21 13:29:30 UTC
Difference in params - depth and topN
Hello All,
There is a site that has 5 URLs in total.
- When this site is crawled with depth 3, all 5 URLs are crawled in
one shot.
- When it is crawled twice with depth 1 and topN 5, only one URL is
crawled the first time and the remaining 4 are crawled the second time.
- With depth 1 and topN 3, it took 3 runs to crawl all 5 URLs.
I don't understand what these 2 parameters mean. Can anyone explain briefly
or point me to a URL that explains this? Below are the list of URLs and the
readdb stats.
*stats:*
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 5
status 2 (db_fetched): 5
CrawlDb statistics: done
*URLS : *
http://liveforyou.blogspot.in/
http://liveforyou.blogspot.in/2012/12/blogging.html
http://liveforyou.blogspot.in/2011/09/test.html
http://liveforyou.blogspot.in/2012_12_01_archive.html
http://liveforyou.blogspot.in/2011_09_01_archive.html
Re: Difference in params - depth and topN
Posted by David Philip <da...@gmail.com>.
Hi,
Thank you for the reply.
-David
On Mon, Dec 24, 2012 at 4:18 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Hi
>
> -----Original message-----
> > From:David Philip <da...@gmail.com>
> > Sent: Mon 24-Dec-2012 09:50
> > To: user@nutch.apache.org
> > Subject: Re: Difference in params - depth and topN
> >
> > Hi Markus,
> > What is the default value for topN when it is not passed on the
> > command line? I mean passing only the depth param and no topN (bin/nutch
> > crawl urls -dir crawl -depth 3).
>
> There is no default; if topN is not specified, the generator selects all
> URLs that are eligible for fetch.
>
> >
> > Also, if depth is the number of crawl cycles, can you briefly explain the
> > logic by which all 5 URLs are crawled when the depth param passed is 3
> > (-depth 3)?
>
> There can be multiple reasons:
> - not all outlinks are correct
> - a limit on the number of URLs per host or domain
> - a transient error
>
> >
> > Thanks
> > David.
> >
> > On Fri, Dec 21, 2012 at 6:25 PM, Markus Jelsma
> > <ma...@openindex.io>wrote:
> >
> > > Hi - Depth means how many crawl cycles are executes and topN means how
> > > many URL's per cycle are selected.
> > >
> > > -----Original message-----
> > > > From:David Philip <da...@gmail.com>
> > > > Sent: Fri 21-Dec-2012 13:50
> > > > To: user@nutch.apache.org
> > > > Subject: Difference in params - depth and topN
> > > >
> > > > Hello All,
> > > >
> > > > There is a site that has total 5 URLS.
> > > >
> > > >
> > > > - When this site is crawled with input param for depth 3 , all 5
> sites
> > > > are crawled in one shot.
> > > >
> > > > - And when it is crawled with params : depth 1 topN 5 TWO times,
> > > for
> > > > this first time only one URL is crawled and second time rest 4 are
> > > crawled.
> > > >
> > > > - And when params: depth 1 topN 3 - after 3 times it crawled all
> the
> > > 5
> > > > sites.
> > > >
> > > > Didn't understand what does these 2 parameters mean. Can anyone
> brief or
> > > > redirect to url that explains this? Below are the list of url and
> readdb
> > > > stats.
> > > >
> > > > *stats:*
> > > > Statistics for CrawlDb: crawl/crawldb
> > > > TOTAL urls: 5
> > > > status 2 (db_fetched): 5
> > > > CrawlDb statistics: done
> > > >
> > > > *URLS : *
> > > > http://liveforyou.blogspot.in/
> > > > http://liveforyou.blogspot.in/2012/12/blogging.html
> > > > http://liveforyou.blogspot.in/2011/09/test.html
> > > > http://liveforyou.blogspot.in/2012_12_01_archive.html
> > > > http://liveforyou.blogspot.in/2011_09_01_archive.html
> > > >
> > >
> >
>
RE: Difference in params - depth and topN
Posted by Markus Jelsma <ma...@openindex.io>.
Hi
-----Original message-----
> From:David Philip <da...@gmail.com>
> Sent: Mon 24-Dec-2012 09:50
> To: user@nutch.apache.org
> Subject: Re: Difference in params - depth and topN
>
> Hi Markus,
> What is the default value for topN when it is not passed on the
> command line? I mean passing only the depth param and no topN (bin/nutch
> crawl urls -dir crawl -depth 3).
There is no default; if topN is not specified, the generator selects all URLs that are eligible for fetch.
>
> Also, if depth is the number of crawl cycles, can you briefly explain the
> logic by which all 5 URLs are crawled when the depth param passed is 3
> (-depth 3)?
There can be multiple reasons:
- not all outlinks are correct
- a limit on the number of URLs per host or domain
- a transient error
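As a rough illustration of how selection can shrink a batch, here is a minimal Python sketch. It is not Nutch code: the function `select_for_fetch`, the `max_per_host` argument and the scores are all hypothetical, and it only assumes that candidates are ranked by score, that topN caps the batch when given, and that an optional per-host cap may skip further URLs from the same host.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse


def select_for_fetch(
    candidates: List[Tuple[str, float]],
    top_n: Optional[int] = None,
    max_per_host: Optional[int] = None,
) -> List[str]:
    """Pick URLs for one cycle (toy model, not the real generator)."""
    # Highest score first; scores here are made-up illustration values.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    per_host: Dict[str, int] = defaultdict(int)
    selected: List[str] = []
    for url, _score in ranked:
        # With no topN, nothing caps the batch size.
        if top_n is not None and len(selected) >= top_n:
            break
        host = urlparse(url).netloc
        # Hypothetical per-host cap: skip URLs once the host is "full".
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        selected.append(url)
    return selected


candidates = [
    ("http://liveforyou.blogspot.in/2012/12/blogging.html", 0.4),
    ("http://liveforyou.blogspot.in/2011/09/test.html", 0.3),
    ("http://liveforyou.blogspot.in/2012_12_01_archive.html", 0.2),
    ("http://liveforyou.blogspot.in/2011_09_01_archive.html", 0.1),
]
print(len(select_for_fetch(candidates)))                  # 4
print(len(select_for_fetch(candidates, top_n=3)))         # 3
print(len(select_for_fetch(candidates, max_per_host=2)))  # 2
```

In this model a URL that is skipped in one cycle simply stays eligible for a later one, which is one way a small site can need more than one cycle.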
Re: Difference in params - depth and topN
Posted by David Philip <da...@gmail.com>.
Hi Markus,
What is the default value for topN when it is not passed on the
command line? I mean passing only the depth param and no topN (bin/nutch
crawl urls -dir crawl -depth 3).
Also, if depth is the number of crawl cycles, can you briefly explain the
logic by which all 5 URLs are crawled when the depth param passed is 3
(-depth 3)?
Thanks
David.
On Fri, Dec 21, 2012 at 6:25 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Hi - depth means how many crawl cycles are executed, and topN means how
> many URLs are selected per cycle.
RE: Difference in params - depth and topN
Posted by Markus Jelsma <ma...@openindex.io>.
Hi - depth means how many crawl cycles are executed, and topN means how many URLs are selected per cycle.
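
To make that definition concrete, here is a small Python simulation. It is a toy model, not Nutch code. It assumes that each cycle selects at most topN not-yet-fetched URLs, fetches them, and adds their outlinks to the crawl db for later cycles. The link graph (home page linking to the four other pages), the `CrawlDb` class and the `crawl` function are hypothetical, and URLs are taken in discovery order rather than by score.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical link graph: the home page links to the four other pages.
LINKS: Dict[str, List[str]] = {
    "/": [
        "/2012/12/blogging.html",
        "/2011/09/test.html",
        "/2012_12_01_archive.html",
        "/2011_09_01_archive.html",
    ],
}


class CrawlDb:
    """Toy crawl db that persists between crawl invocations."""

    def __init__(self, seeds: Iterable[str]) -> None:
        self.unfetched: List[str] = list(seeds)
        self.fetched: Set[str] = set()


def crawl(db: CrawlDb, depth: int, top_n: Optional[int] = None) -> int:
    """Run up to `depth` cycles; return how many URLs were fetched."""
    total = 0
    for _ in range(depth):
        # Generate: take every eligible URL, or only the first top_n.
        batch = list(db.unfetched if top_n is None else db.unfetched[:top_n])
        if not batch:
            break  # nothing left to fetch
        # Fetch and update: outlinks become eligible in the NEXT cycle.
        for url in batch:
            db.unfetched.remove(url)
            db.fetched.add(url)
            for out in LINKS.get(url, []):
                if out not in db.fetched and out not in db.unfetched:
                    db.unfetched.append(out)
        total += len(batch)
    return total


db = CrawlDb(["/"])
print(crawl(db, depth=3))                                # 5

db = CrawlDb(["/"])
print([crawl(db, depth=1, top_n=5) for _ in range(2)])   # [1, 4]

db = CrawlDb(["/"])
print([crawl(db, depth=1, top_n=3) for _ in range(3)])   # [1, 3, 1]
```

Under these assumptions the model matches the three observations in the question: the first cycle can only fetch the seed URL because the other pages are not known until it has been parsed, and topN 3 then splits the remaining four URLs across two more cycles.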