Posted to user@nutch.apache.org by David Philip <da...@gmail.com> on 2012/12/21 13:29:30 UTC

Difference in params - depth and topN

Hello All,

   There is a site that has a total of 5 URLs.


   - When this site is crawled with depth 3, all 5 URLs
   are crawled in one shot.

   - When it is crawled with depth 1 and topN 5, TWO times, only
   one URL is crawled the first time and the remaining 4 the second time.

   - With depth 1 and topN 3, it took 3 runs to crawl all 5
   URLs.

I don't understand what these 2 parameters mean. Can anyone briefly explain or
point me to a URL that explains this? Below are the list of URLs and the readdb
stats.

*stats:*
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 5
status 2 (db_fetched): 5
CrawlDb statistics: done

*URLS : *
http://liveforyou.blogspot.in/
http://liveforyou.blogspot.in/2012/12/blogging.html
http://liveforyou.blogspot.in/2011/09/test.html
http://liveforyou.blogspot.in/2012_12_01_archive.html
http://liveforyou.blogspot.in/2011_09_01_archive.html

Re: Difference in params - depth and topN

Posted by David Philip <da...@gmail.com>.

Hi,

  Thank you for the reply.

-David

On Mon, Dec 24, 2012 at 4:18 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hi
>
> -----Original message-----
> > From:David Philip <da...@gmail.com>
> > Sent: Mon 24-Dec-2012 09:50
> > To: user@nutch.apache.org
> > Subject: Re: Difference in params - depth and topN
> >
> > Hi Markus,
> >     What is the default value for topN when it is not passed on the
> > command line? I mean simply passing the depth param and no topN - (bin/nutch
> > crawl urls -dir crawl -depth 3)
>
> There is no default; if it is not specified, the generator will select all
> URLs that are eligible for fetch.
>
> >
> > Also, if depth is the number of crawl cycles, can you please explain the
> > logic behind crawling all 5 URLs when the depth param passed is 3
> > (-depth 3)?
>
> There can be multiple reasons:
> - not all outlinks are correct
> - a limit on the number of URLs per host or domain
> - a transient error
>
> >
> > Thanks
> > David.
> >
> > On Fri, Dec 21, 2012 at 6:25 PM, Markus Jelsma
> > <ma...@openindex.io> wrote:
> >
> > > Hi - Depth means how many crawl cycles are executed and topN means how
> > > many URLs per cycle are selected.

RE: Difference in params - depth and topN

Posted by Markus Jelsma <ma...@openindex.io>.

Hi
 
-----Original message-----
> From:David Philip <da...@gmail.com>
> Sent: Mon 24-Dec-2012 09:50
> To: user@nutch.apache.org
> Subject: Re: Difference in params - depth and topN
> 
> Hi Markus,
>     What is the default value for topN when it is not passed on the
> command line? I mean simply passing the depth param and no topN - (bin/nutch
> crawl urls -dir crawl -depth 3)

There is no default; if it is not specified, the generator will select all URLs that are eligible for fetch.
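
In other words, topN is just an optional cap on the generator's ranked list. A minimal Python sketch of that idea (this is not Nutch's actual generator; the function name, URLs and scores are invented for the example):

```python
def select_for_fetch(scored_urls, top_n=None):
    """Return URLs ordered by score (highest first), limited to top_n if given."""
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    return [url for url, _score in ranked]


# Hypothetical URLs and scores, for illustration only.
eligible = [("a.html", 0.2), ("b.html", 0.9), ("c.html", 0.5), ("d.html", 0.7)]

print(select_for_fetch(eligible))           # no topN: all four URLs are selected
print(select_for_fetch(eligible, top_n=2))  # topN 2: only the two best-scoring URLs
```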

> 
> Also, if depth is the number of crawl cycles, can you please explain the
> logic behind crawling all 5 URLs when the depth param passed is 3 (-depth
> 3)?

There can be multiple reasons:
- not all outlinks are correct
- a limit on the number of URLs per host or domain
- a transient error
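
As an illustration of the second reason, a per-host cap can keep eligible URLs out of a cycle even when topN would allow them. A minimal Python sketch, assuming a hypothetical `max_per_host` setting and a made-up helper `cap_per_host` (this is not how Nutch implements it):

```python
from collections import Counter
from urllib.parse import urlsplit


def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs for each host, preserving input order."""
    seen = Counter()
    kept = []
    for url in urls:
        host = urlsplit(url).hostname
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


# The four non-seed URLs from the question all share one host.
urls = [
    "http://liveforyou.blogspot.in/2012/12/blogging.html",
    "http://liveforyou.blogspot.in/2011/09/test.html",
    "http://liveforyou.blogspot.in/2012_12_01_archive.html",
    "http://liveforyou.blogspot.in/2011_09_01_archive.html",
]

# With a cap of 2 per host, the other two URLs wait for a later cycle.
print(cap_per_host(urls, max_per_host=2))
```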

> 
> Thanks
> David.
> 
> On Fri, Dec 21, 2012 at 6:25 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
>
> > Hi - Depth means how many crawl cycles are executed and topN means how
> > many URLs per cycle are selected.

Re: Difference in params - depth and topN

Posted by David Philip <da...@gmail.com>.

Hi Markus,
    What is the default value for topN when it is not passed on the
command line? I mean simply passing the depth param and no topN - (bin/nutch
crawl urls -dir crawl -depth 3)

Also, if depth is the number of crawl cycles, can you please explain the
logic behind crawling all 5 URLs when the depth param passed is 3 (-depth
3)?

Thanks
David.

On Fri, Dec 21, 2012 at 6:25 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hi - Depth means how many crawl cycles are executed and topN means how
> many URLs per cycle are selected.

RE: Difference in params - depth and topN

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - Depth means how many crawl cycles are executed and topN means how many URLs per cycle are selected.
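
To make this concrete, here is a rough Python sketch of the generate/fetch/update loop (not Nutch code). It assumes the home page links to the four other pages, which the thread does not confirm, and it picks URLs in insertion order, whereas the real generator ranks them by score. `LINKS`, `crawl` and `top_n` are names made up for the illustration.

```python
SEED = "http://liveforyou.blogspot.in/"

# Assumed link graph: the home page links to the four other pages.
LINKS = {
    SEED: [
        "http://liveforyou.blogspot.in/2012/12/blogging.html",
        "http://liveforyou.blogspot.in/2011/09/test.html",
        "http://liveforyou.blogspot.in/2012_12_01_archive.html",
        "http://liveforyou.blogspot.in/2011_09_01_archive.html",
    ],
}


def crawl(crawldb, depth, top_n=None):
    """Run up to `depth` cycles; return the number of URLs fetched in each cycle.

    crawldb maps URL -> True if already fetched, False otherwise, and is
    updated in place so that it persists between runs.
    """
    fetched_per_cycle = []
    for _ in range(depth):
        # Generate: select unfetched URLs, at most top_n of them.
        eligible = [url for url, fetched in crawldb.items() if not fetched]
        batch = eligible if top_n is None else eligible[:top_n]
        if not batch:
            break  # nothing left to fetch, stop early
        # Fetch and update: mark as fetched and add newly discovered outlinks.
        for url in batch:
            crawldb[url] = True
            for outlink in LINKS.get(url, []):
                crawldb.setdefault(outlink, False)
        fetched_per_cycle.append(len(batch))
    return fetched_per_cycle


# One run with depth 3 and no topN.
db = {SEED: False}
one_run = crawl(db, depth=3)

# Two runs with depth 1 and topN 5 against the same crawldb.
db = {SEED: False}
two_runs = [crawl(db, depth=1, top_n=5) for _ in range(2)]

# Three runs with depth 1 and topN 3 against the same crawldb.
db = {SEED: False}
three_runs = [crawl(db, depth=1, top_n=3) for _ in range(3)]

print(one_run, two_runs, three_runs)
```

Under these assumptions the single depth 3 run fetches 1 URL and then 4, the two depth 1 topN 5 runs fetch 1 and then 4, and the three depth 1 topN 3 runs fetch 1, 3 and 1. The first cycle can only ever fetch the seed, because the other URLs are not known until the seed has been parsed.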
 
