Posted to user@nutch.apache.org by bhavin pandya <bv...@gmail.com> on 2009/12/15 08:10:43 UTC
Why readdb and readseg shows different figures?
Hi,
I am using Nutch 1.0.
As a simple exercise I crawled a single domain, and after that I
ran both the readdb and readseg commands.
They show different figures. Which one should I go by? Did
something go wrong while crawling?
Here is the output of both commands.
OUTPUT FROM READDB:
----------------------------------------
CrawlDb statistics start: crawled/crawldb
Statistics for CrawlDb: crawled/crawldb
TOTAL urls: 84178
retry 0: 84175
retry 1: 3
min score: 0.0
avg score: 7.1693314E-5
max score: 1.2
status 1 (db_unfetched): 80475
status 2 (db_fetched): 3634
status 3 (db_gone): 8
status 4 (db_redir_temp): 29
status 5 (db_redir_perm): 32
CrawlDb statistics: done
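As a quick sanity check on the readdb output above, the per-status counts can be added up and compared with the TOTAL urls line. This is a minimal sketch in plain Python with the figures copied from the output; the variable names are illustrative only and nothing here calls Nutch itself.

```python
# Figures copied from the readdb output above.
total_urls = 84178
status_counts = {
    "db_unfetched": 80475,
    "db_fetched": 3634,
    "db_gone": 8,
    "db_redir_temp": 29,
    "db_redir_perm": 32,
}

# Each URL in the CrawlDb has exactly one status, so the counts should add up.
status_sum = sum(status_counts.values())
print(status_sum == total_urls)  # True

# Share of known URLs currently marked as fetched.
fetched_share = status_counts["db_fetched"] / total_urls
print(f"{fetched_share:.1%}")  # 4.3%
```

The statuses account for every URL exactly once, which fits readdb reporting one current state per URL rather than one entry per fetch.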
OUTPUT FROM READSEG:
-------------------------------------------
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20091212212627  1          2009-12-12T21:28:29  2009-12-12T21:28:29  1        1
20091212212951  81         2009-12-12T21:32:20  2009-12-12T21:32:54  105      80
20091212213347  3691       2009-12-12T21:36:13  2009-12-12T22:16:39  3738     3621
20091212222210  84178      2009-12-12T22:24:30  2009-12-13T11:08:28  85189    81806
20091213151344  84178      2009-12-13T15:16:37  2009-12-14T05:50:45  85195    81824
Thanks.
Bhavin
Re: Why readdb and readseg shows different figures?
Posted by bhavin pandya <bv...@gmail.com>.
Hi,
Thanks for your prompt reply.
But according to readdb there are 3634 fetched pages.
>> status 1 (db_unfetched): 80475
>> status 2 (db_fetched): 3634
According to readseg, though, if I add up the figures for all segments it
comes to much more (1 + 81 + 3691 + 84178 + 84178).
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20091212212627  1          2009-12-12T21:28:29  2009-12-12T21:28:29  1        1
20091212212951  81         2009-12-12T21:32:20  2009-12-12T21:32:54  105      80
20091212213347  3691       2009-12-12T21:36:13  2009-12-12T22:16:39  3738     3621
20091212222210  84178      2009-12-12T22:24:30  2009-12-13T11:08:28  85189    81806
20091213151344  84178      2009-12-13T15:16:37  2009-12-14T05:50:45  85195    81824
I don't understand: does the last figure in readseg (81824) show the count
for that particular segment (20091213151344), or the total parsed pages
across all segments?
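For reference, the sums in question can be reproduced with a short script. This is only a sketch: the figures are copied from the readseg listing above, the tuple layout is made up for illustration, and each row is treated as one segment's own counts, as the quoted reply below describes. Note that 1 + 81 + 3691 + 84178 + 84178 adds up the GENERATED column, not FETCHED or PARSED.

```python
# (name, generated, fetched, parsed), copied from the readseg listing above.
segments = [
    ("20091212212627", 1, 1, 1),
    ("20091212212951", 81, 105, 80),
    ("20091212213347", 3691, 3738, 3621),
    ("20091212222210", 84178, 85189, 81806),
    ("20091213151344", 84178, 85195, 81824),
]

# Column totals across all five segments.
generated_total = sum(seg[1] for seg in segments)
fetched_total = sum(seg[2] for seg in segments)
parsed_total = sum(seg[3] for seg in segments)

print(generated_total, fetched_total, parsed_total)  # 172129 174228 167332

# For comparison, readdb reported 3634 URLs with status db_fetched.
db_fetched = 3634
print(fetched_total - db_fetched)  # 170594
```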
Thanks
-Bhavin
On Tue, Dec 15, 2009 at 1:20 PM, MilleBii <mi...@gmail.com> wrote:
> Everything seems right.
> Both stats are useful; which one matters depends on what you are looking for.
>
> readdb gives you global stats, whereas readseg reports on each segment, i.e.
> each fetch/parse run.
> --
> -MilleBii-
>
--
- Bhavin
Re: Why readdb and readseg shows different figures?
Posted by MilleBii <mi...@gmail.com>.
Everything seems right.
Both stats are useful; which one matters depends on what you are looking for.
readdb gives you global stats, whereas readseg reports on each segment, i.e.
each fetch/parse run.
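A toy illustration of that difference, and not Nutch's actual data structures: the crawldb is modelled as one status per URL, while each segment is simply the list of URLs fetched in that run. The URLs, the dict/list layout and the `update_db` helper are all hypothetical. It shows one way per-run counts can add up to more than the global count, namely when the same URL is fetched in more than one run.

```python
from collections import Counter

# Hypothetical toy model: one entry per URL, like a global status table.
crawldb = {}

# Hypothetical toy model: one list of fetched URLs per fetch run (segment).
segments = {
    "seg1": ["http://example.com/a", "http://example.com/b"],
    "seg2": ["http://example.com/a", "http://example.com/c"],  # /a fetched again
}

def update_db(db, fetched_urls):
    """Mark each URL as fetched; a refetch overwrites the entry, it does not add one."""
    for url in fetched_urls:
        db[url] = "db_fetched"

for urls in segments.values():
    update_db(crawldb, urls)

per_segment = {name: len(urls) for name, urls in segments.items()}
global_stats = Counter(crawldb.values())

print(per_segment)                  # {'seg1': 2, 'seg2': 2}
print(sum(per_segment.values()))    # 4
print(global_stats["db_fetched"])   # 3
```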
--
-MilleBii-