Posted to user@nutch.apache.org by bhavin pandya <bv...@gmail.com> on 2009/12/15 08:10:43 UTC

Why do readdb and readseg show different figures?

Hi,

I am using Nutch 1.0.

As a simple exercise, I crawled a single domain and then ran both the
readdb and readseg commands.
They show different figures. Which one should I consider? Did
something go wrong while crawling?

Here is the output of both commands.

OUTPUT FROM READDB:
----------------------------------------
CrawlDb statistics start: crawled/crawldb
Statistics for CrawlDb: crawled/crawldb
TOTAL urls:     84178
retry 0:        84175
retry 1:        3
min score:      0.0
avg score:      7.1693314E-5
max score:      1.2
status 1 (db_unfetched):        80475
status 2 (db_fetched):  3634
status 3 (db_gone):     8
status 4 (db_redir_temp):       29
status 5 (db_redir_perm):       32
CrawlDb statistics: done


OUTPUT FROM READSEG:
-------------------------------------------
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20091212212627  1          2009-12-12T21:28:29  2009-12-12T21:28:29  1        1
20091212212951  81         2009-12-12T21:32:20  2009-12-12T21:32:54  105      80
20091212213347  3691       2009-12-12T21:36:13  2009-12-12T22:16:39  3738     3621
20091212222210  84178      2009-12-12T22:24:30  2009-12-13T11:08:28  85189    81806
20091213151344  84178      2009-12-13T15:16:37  2009-12-14T05:50:45  85195    81824


Thanks.
Bhavin

Re: Why do readdb and readseg show different figures?

Posted by bhavin pandya <bv...@gmail.com>.
Hi,
Thanks for your prompt reply.

But according to readdb, there are 3634 fetched pages.

>> status 1 (db_unfetched):        80475
>> status 2 (db_fetched):  3634

According to readseg, however, if I add up the fetched/parsed pages for all
segments, it comes to much more (1 + 81 + 3691 + 84178 + 84178).

NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20091212212627  1          2009-12-12T21:28:29  2009-12-12T21:28:29  1        1
20091212212951  81         2009-12-12T21:32:20  2009-12-12T21:32:54  105      80
20091212213347  3691       2009-12-12T21:36:13  2009-12-12T22:16:39  3738     3621
20091212222210  84178      2009-12-12T22:24:30  2009-12-13T11:08:28  85189    81806
20091213151344  84178      2009-12-13T15:16:37  2009-12-14T05:50:45  85195    81824

I don't understand: does the last figure in readseg (81824) show the count
for that particular segment (20091213151344), or the total parsed pages
across all segments?
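
To make the arithmetic concrete, here is a small Python sketch that totals each readseg column from the listing above. The SegmentRow type and the column_totals helper are hypothetical names used only for illustration; they are not part of Nutch. It assumes each row is a per-segment count (as the reply quoted below suggests), so a URL fetched in more than one segment would be counted once per segment in these sums. Whether that happened here cannot be determined from the listing alone.

```python
from typing import NamedTuple


class SegmentRow(NamedTuple):
    """One row of the readseg listing (timestamps omitted)."""
    name: str
    generated: int
    fetched: int
    parsed: int


# Figures copied from the readseg listing in this message.
SEGMENTS = [
    SegmentRow("20091212212627", 1, 1, 1),
    SegmentRow("20091212212951", 81, 105, 80),
    SegmentRow("20091212213347", 3691, 3738, 3621),
    SegmentRow("20091212222210", 84178, 85189, 81806),
    SegmentRow("20091213151344", 84178, 85195, 81824),
]


def column_totals(rows):
    """Sum the GENERATED, FETCHED and PARSED columns across all rows."""
    return {
        "generated": sum(r.generated for r in rows),
        "fetched": sum(r.fetched for r in rows),
        "parsed": sum(r.parsed for r in rows),
    }


totals = column_totals(SEGMENTS)
print(totals)
```

The sum written out in this message (1 + 81 + 3691 + 84178 + 84178) is the GENERATED column, which is the "generated" entry in the result.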

Thanks
-Bhavin


On Tue, Dec 15, 2009 at 1:20 PM, MilleBii <mi...@gmail.com> wrote:
> Everything seems right.
> Both stats are interesting and it all depends on what you are looking for.
>
> Readdb gives you global stats, whereas readseg reports on each segment, i.e.
> each fetch/parse run.
>



-- 
- Bhavin

Re: Why do readdb and readseg show different figures?

Posted by MilleBii <mi...@gmail.com>.
Everything seems right.
Both stats are interesting and it all depends on what you are looking for.

Readdb gives you global stats, whereas readseg reports on each segment, i.e.
each fetch/parse run.
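
As a rough illustration of the "global stats" side, the readdb status counts quoted in this thread can be checked against the TOTAL urls line. The Python sketch below is only an illustration: the parse_readdb_stats function and its regular expressions are assumptions written for the output pasted in this thread, not part of Nutch.

```python
import re

# Lines copied from the readdb output quoted in this thread.
READDB_OUTPUT = """\
TOTAL urls:     84178
status 1 (db_unfetched):        80475
status 2 (db_fetched):  3634
status 3 (db_gone):     8
status 4 (db_redir_temp):       29
status 5 (db_redir_perm):       32
"""


def parse_readdb_stats(text):
    """Return (total, {status_name: count}) parsed from readdb-style text."""
    total = None
    statuses = {}
    for line in text.splitlines():
        m = re.match(r"TOTAL urls:\s+(\d+)", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.match(r"status \d+ \((\w+)\):\s+(\d+)", line)
        if m:
            statuses[m.group(1)] = int(m.group(2))
    return total, statuses


total, statuses = parse_readdb_stats(READDB_OUTPUT)
status_sum = sum(statuses.values())
print(total, status_sum, total == status_sum)
```

For these figures the status counts add up to the TOTAL urls value, so the CrawlDb statistics are internally consistent.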



-- 
-MilleBii-