Posted to user@nutch.apache.org by Gaurav Bagga <gb...@gmail.com> on 2011/08/30 22:12:40 UTC

Regarding Decrease in number of domains in readdb -stats -sort

I am using Nutch 1.0, and after every updatedb I take the stats with the
-sort parameter, which gives detailed statistics on the domains and
their counts (number of URLs for each domain in the crawldb).
But I see that a variable number of domains do not make it into
the next round of statistics.

Example:
Suppose a domain shows up in 4 rounds of crawling (judging by the readdb
-stats -sort output) but then disappears from the following rounds.
Or a domain is there for the first two rounds, disappears from the
stats for the next few rounds, and then reappears.

Is it possible that domains are removed from the crawldb and/or
added again later?

Regards
Gaurav

Re: Regarding Decrease in number of domains in readdb -stats -sort

Posted by Markus Jelsma <ma...@openindex.io>.

> I mean that if N domains show up in the DB in one cycle,
> only N - x domains are left in the next cycle.
> The number of domains in the crawldb sometimes decreases.
> 

That should not be possible at all. Perhaps the output of stats is
incomplete or is being misinterpreted.

> 
> The same goes for the number of fetched URLs.

That is possible, but the numbers should add up. A fetched URL can later
become a 404 (db_gone) or get a not-modified status.
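
A minimal Python sketch of why the fetched count can drop while the totals
still add up. The URLs and statuses below are invented purely for
illustration; the status names are only meant to resemble what the crawldb
stats report, so treat them as assumptions:

```python
from collections import Counter

# Hypothetical crawldb contents after two cycles: URL -> status.
cycle_1 = {
    "http://a.example/1": "db_fetched",
    "http://a.example/2": "db_fetched",
    "http://b.example/1": "db_fetched",
    "http://b.example/2": "db_unfetched",
}
cycle_2 = {
    "http://a.example/1": "db_notmodified",  # refetched, unchanged
    "http://a.example/2": "db_gone",         # refetched, now a 404
    "http://b.example/1": "db_fetched",
    "http://b.example/2": "db_fetched",      # fetched for the first time
}

def status_counts(crawldb):
    """Count how many URLs carry each status."""
    return Counter(crawldb.values())

counts_1 = status_counts(cycle_1)
counts_2 = status_counts(cycle_2)

print("cycle 1:", dict(counts_1), "total:", sum(counts_1.values()))
print("cycle 2:", dict(counts_2), "total:", sum(counts_2.values()))
```

In this toy data the db_fetched count goes down between cycles, yet the sum
over all statuses stays the same because no URL has left the DB.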

> My understanding is that after every crawl cycle the number of fetched
> URLs should keep increasing, i.e. the number is the cumulative total of the
> previous cycle and this cycle. But it decreases as well.

Please try the domain statistics tool; you may also want to run readdb -dump
between cycles and compare the dumps. URLs change status over time: they can
become a 404, not modified, or a redirect.

You may also want to limit the number of URLs (e.g. 10 or 20) in a fetch
cycle so you have only a few URLs to compare between dumps. Check the changed
status of those few URLs.
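
Comparing two dumps could be sketched in Python roughly as follows. This
ASSUMES the text dump starts each record with a "<url><TAB>Version: <n>" line
followed by a "Status: <n> (<name>)" line; check this against an actual dump
and adjust the parsing if the layout differs. The function names and the
sample records are hypothetical:

```python
import re

def parse_dump(text):
    """Return {url: status} from crawldb dump text in the ASSUMED layout."""
    statuses = {}
    url = None
    for line in text.splitlines():
        if "\tVersion:" in line:
            # Assumed record header: "<url>\tVersion: <n>"
            url = line.split("\t", 1)[0]
        elif url is not None and line.startswith("Status:"):
            # Assumed status line: "Status: <n> (<name>)"
            match = re.search(r"\((\w+)\)", line)
            statuses[url] = match.group(1) if match else line[7:].strip()
            url = None
    return statuses

def diff_dumps(before, after):
    """Compare two {url: status} maps taken from consecutive cycles."""
    changed = {
        url: (before[url], after[url])
        for url in before.keys() & after.keys()
        if before[url] != after[url]
    }
    missing = sorted(before.keys() - after.keys())  # in the old dump only
    added = sorted(after.keys() - before.keys())    # in the new dump only
    return changed, missing, added

# Made-up sample records standing in for two real dump files.
sample_before = (
    "http://a.example/\tVersion: 7\nStatus: 2 (db_fetched)\n\n"
    "http://b.example/\tVersion: 7\nStatus: 1 (db_unfetched)\n"
)
sample_after = (
    "http://a.example/\tVersion: 7\nStatus: 3 (db_gone)\n\n"
    "http://b.example/\tVersion: 7\nStatus: 2 (db_fetched)\n"
)

changed, missing, added = diff_dumps(
    parse_dump(sample_before), parse_dump(sample_after)
)
print("changed:", changed)
print("missing:", missing)
print("added:", added)
```

If "missing" is ever non-empty for real dumps, URLs really did leave the DB;
otherwise a drop in the fetched count is explained by the entries in
"changed".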


> 
> I don't know if this is possible.
> 
> Gaurav
> 
> On Tue, Aug 30, 2011 at 1:24 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
> > Hi
> > 
> > > I am using Nutch 1.0, and after every updatedb I take the stats with
> > > the -sort parameter, which gives detailed statistics on the domains
> > > and their counts (number of URLs for each domain in the crawldb).
> > > But I see that a variable number of domains do not make it into
> > > the next round of statistics.
> > 
> > Do I understand correctly that you have N domains in the DB, but
> > not all N domains have incremented counts after a crawl cycle?
> > 
> > > Example:
> > > Suppose a domain shows up in 4 rounds of crawling (judging by the
> > > readdb -stats -sort output) but then disappears from the following
> > > rounds. Or a domain is there for the first two rounds, disappears
> > > from the stats for the next few rounds, and then reappears.
> > 
> > Disappear from stats? I am not sure how readdb writes its stats, but
> > you may want to try the domainstatistics tool (in more recent Nutch
> > versions). That tool can write a complete list of domains and the
> > number of URLs per domain.
> > 
> > > Is it possible that domains are removed from the crawldb and/or
> > > added again later?
> > > 
> > > Regards
> > > Gaurav

Re: Regarding Decrease in number of domains in readdb -stats -sort

Posted by gaurav bagga <ga...@gmail.com>.

I mean that if N domains show up in the DB in one cycle,
only N - x domains are left in the next cycle.
The number of domains in the crawldb sometimes decreases.

The same goes for the number of fetched URLs.
My understanding is that after every crawl cycle the number of fetched URLs
should keep increasing, i.e. the number is the cumulative total of the
previous cycle and this cycle. But it decreases as well.

I don't know if this is possible.

Gaurav

On Tue, Aug 30, 2011 at 1:24 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hi
>
> > I am using Nutch 1.0, and after every updatedb I take the stats with the
> > -sort parameter, which gives detailed statistics on the domains and
> > their counts (number of URLs for each domain in the crawldb).
> > But I see that a variable number of domains do not make it into
> > the next round of statistics.
>
> Do I understand correctly that you have N domains in the DB, but
> not all N domains have incremented counts after a crawl cycle?
>
> > Example:
> > Suppose a domain shows up in 4 rounds of crawling (judging by the readdb
> > -stats -sort output) but then disappears from the following rounds.
> > Or a domain is there for the first two rounds, disappears from the
> > stats for the next few rounds, and then reappears.
>
> Disappear from stats? I am not sure how readdb writes its stats, but you
> may want to try the domainstatistics tool (in more recent Nutch versions).
> That tool can write a complete list of domains and the number of URLs per
> domain.
>
> > Is it possible that domains are removed from the crawldb and/or
> > added again later?
> >
> > Regards
> > Gaurav

Re: Regarding Decrease in number of domains in readdb -stats -sort

Posted by Markus Jelsma <ma...@openindex.io>.

Hi

> I am using Nutch 1.0, and after every updatedb I take the stats with the
> -sort parameter, which gives detailed statistics on the domains and
> their counts (number of URLs for each domain in the crawldb).
> But I see that a variable number of domains do not make it into
> the next round of statistics.
>

Do I understand correctly that you have N domains in the DB, but
not all N domains have incremented counts after a crawl cycle?

> Example:
> Suppose a domain shows up in 4 rounds of crawling (judging by the readdb
> -stats -sort output) but then disappears from the following rounds.
> Or a domain is there for the first two rounds, disappears from the
> stats for the next few rounds, and then reappears.

Disappear from stats? I am not sure how readdb writes its stats, but you may
want to try the domainstatistics tool (in more recent Nutch versions). That
tool can write a complete list of domains and the number of URLs per domain.
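
For a rough cross-check without that tool, a per-host count can be sketched
in Python. This is not how domainstatistics is implemented; it counts by
host name as a stand-in for "domain", and the URL lists and function names
are hypothetical (in practice the URLs would come from a crawldb dump):

```python
from collections import Counter
from urllib.parse import urlsplit

def urls_per_host(urls):
    """Count URLs per host name (used here as a stand-in for 'domain')."""
    return Counter(urlsplit(url).hostname for url in urls)

def vanished_hosts(previous_round, current_round):
    """Hosts counted in the previous round but absent from the current one."""
    return sorted(set(previous_round) - set(current_round))

# Made-up URL lists standing in for the crawldb after two rounds.
round_1 = urls_per_host([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
round_2 = urls_per_host([
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
])

print("round 1:", dict(round_1))
print("round 2:", dict(round_2))
print("vanished:", vanished_hosts(round_1, round_2))
```

Any host reported as vanished can then be looked up in the dumps of both
rounds to see whether its URLs are really gone or only missing from the
stats output.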

> 
> Is it possible that domains are removed from the crawldb and/or
> added again later?
> 
> Regards
> Gaurav