Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/07/28 15:53:09 UTC

[jira] [Created] (NUTCH-1071) Crawldb update to total counts per status

Crawldb update to total counts per status
-----------------------------------------

                 Key: NUTCH-1071
                 URL: https://issues.apache.org/jira/browse/NUTCH-1071
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.4
            Reporter: Julien Nioche
            Assignee: Julien Nioche
            Priority: Trivial
             Fix For: 1.4


The reduce phase of the crawldb update outputs all the entries that will be found in the updated crawldb. We can use the counters to summarise the number of URLs per status, which is a bit like the readdb -stats functionality except that it does not require an additional step. 
This is a useful way of monitoring the progress of a crawl using the Hadoop JobTracker UI.
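
As a rough illustration of the idea (not the actual patch), the sketch below tallies one count per status name for every record the reduce phase emits. It deliberately avoids Hadoop and Nutch classes: a TreeMap stands in for the counter group, and the class name, method name and sample statuses are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stand-alone sketch: a TreeMap plays the role of a Hadoop counter group.
// No Hadoop or Nutch classes are used; all names here are hypothetical.
final class StatusCounterSketch {

    // Add one to the counter named after the status of each emitted record,
    // as a reducer could do once per entry written to the updated crawldb.
    static Map<String, Long> tally(List<String> emittedStatuses) {
        Map<String, Long> counters = new TreeMap<>();
        for (String status : emittedStatuses) {
            counters.merge(status, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        // Hypothetical statuses of four records emitted by the reduce phase.
        List<String> emitted =
            Arrays.asList("db_fetched", "db_unfetched", "db_fetched", "db_gone");
        tally(emitted).forEach(
            (status, count) -> System.out.println(status + "\t" + count));
    }
}
```

In a real job the tally would live in the job's counters rather than in a local map, which is what makes it visible in the JobTracker UI while the job runs.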

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: (NUTCH-1071) Crawldb update to total counts per status

Posted by Markus Jelsma <ma...@openindex.io>.
On Friday 29 July 2011 17:06:15 Julien Nioche wrote:
> Markus,
> 
> Have just committed a change to CrawlDBReducer (rev 1152254)
> 
> see line 155
> ->         reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(*old*.getStatus())).increment(1);
> 
> was using the wrong object :-(

Don't be sad, it's finally the weekend! Oh, and by the way, it's fixed as well!
Both readdb and updatedb show identical numbers.

Thanks! 

> 
> Would you mind giving it a try?
> 
> Thanks
> 
> Julien

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: (NUTCH-1071) Crawldb update to total counts per status

Posted by Julien Nioche <li...@gmail.com>.
Markus,

Have just committed a change to CrawlDBReducer (rev 1152254)

see line 155
->         reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(*old*.getStatus())).increment(1);

was using the wrong object :-(

Would you mind giving it a try?

Thanks

Julien
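
To illustrate why counting on the wrong object matters, here is a stand-alone sketch. It assumes, hypothetically, that each reduce call sees a status from before the update and a status that is written out; the Entry class, its fields and the sample values are invented for the illustration and are not Nutch code. The thread does not show the corrected line, so the sketch only demonstrates the general effect: both tallies have the same total but a different per-status breakdown.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Nutch.
final class WrongObjectSketch {

    // One crawldb entry as seen in a reduce call (assumed shape).
    static final class Entry {
        final String oldStatus;     // status before the update
        final String resultStatus;  // status written to the updated crawldb

        Entry(String oldStatus, String resultStatus) {
            this.oldStatus = oldStatus;
            this.resultStatus = resultStatus;
        }
    }

    // Count entries per status, reading either the old or the result status.
    static Map<String, Long> countBy(Entry[] entries, boolean useResult) {
        Map<String, Long> counters = new TreeMap<>();
        for (Entry e : entries) {
            String status = useResult ? e.resultStatus : e.oldStatus;
            counters.merge(status, 1L, Long::sum);
        }
        return counters;
    }

    // Three made-up entries: two change status during the update, one does not.
    static Entry[] sample() {
        return new Entry[] {
            new Entry("db_unfetched", "db_fetched"),
            new Entry("db_unfetched", "db_gone"),
            new Entry("db_fetched", "db_fetched")
        };
    }

    public static void main(String[] args) {
        System.out.println("by old status:    " + countBy(sample(), false));
        System.out.println("by result status: " + countBy(sample(), true));
    }
}
```

Either way three entries are counted, which is why a mix-up of this kind can go unnoticed if only the total is checked.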

Re: (NUTCH-1071) Crawldb update to total counts per status

Posted by Markus Jelsma <ma...@openindex.io>.
Another cycle is complete; the results are incremental but similar. The
TOTAL_URLS of stats is reduce_output_records + 1, which seems all right.

I can't find any pattern in the numbers.

On Friday 29 July 2011 11:43:19 Julien Nioche wrote:
> Hi Markus,
> 
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb?
> Could it be a problem with readdb -stats?  And why are we seeing an
> 'unknown' status in the crawldb update?
> 
> That's definitely interesting
> 
> Julien


Re: (NUTCH-1071) Crawldb update to total counts per status

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

A lot of indexers ran between readdb and updatedb, plus one domainstats
job. Nothing that could interfere with the numbers afaik. I'm running a
complete cycle again and will do a readdb again after the update and recheck.

I've no idea what the problem could be. I don't know whether to trust the readdb or
updatedb stats, and there's no way to check which output is correct, or is there?

Thanks

> Hi Markus,
> 
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb?
> Could it be a problem with readdb -stats?  And why are we seeing an
> 'unknown' status in the crawldb update?
> 
> That's definitely interesting
> 
> Julien

Re: (NUTCH-1071) Crawldb update to total counts per status

Posted by Julien Nioche <li...@gmail.com>.
Hi Markus,

Can't really think of a reason why they could differ. You called 'readdb
-stats' right after the crawldb update?
Could it be a problem with readdb -stats?  And why are we seeing an
'unknown' status in the crawldb update?

That's definitely interesting

Julien

On 29 July 2011 10:23, Markus Jelsma <ma...@openindex.io> wrote:

> Hi Julien,
>
> Can you explain the following? I've got here some output from a readdb
> -stats
> job and the output of the most recent crawldb update job. They differ a
> lot!
>
> update:
> db_redir_temp   0       1,036,840       1,036,840
> db_redir_perm   0       1,195,539       1,195,539
> unknown         0       2,315   2,315
> db_unfetched    0       16,909,397      16,909,397
> db_notmodified  0       1,264,001       1,264,001
> db_gone         0       955,701         955,701
> db_fetched      0       19,545,591      19,545,591
>
> stats:
> TOTAL urls: 40909384
> status 1 (db_unfetched):    26788643
> status 2 (db_fetched):      12345476
> status 3 (db_gone): 763463
> status 4 (db_redir_temp):   461511
> status 5 (db_redir_perm):   431595
> status 6 (db_notmodified):  118696
>
> Thanks



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: (NUTCH-1071) Crawldb update to total counts per status

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Julien,

Can you explain the following? I've got here some output from a readdb -stats 
job and the output of the most recent crawldb update job. They differ a lot!

update:
counter          map   reduce       total
db_redir_temp    0     1,036,840    1,036,840
db_redir_perm    0     1,195,539    1,195,539
unknown          0     2,315        2,315
db_unfetched     0     16,909,397   16,909,397
db_notmodified   0     1,264,001    1,264,001
db_gone          0     955,701      955,701
db_fetched       0     19,545,591   19,545,591

stats:
TOTAL urls:                 40909384
status 1 (db_unfetched):    26788643
status 2 (db_fetched):      12345476
status 3 (db_gone):         763463
status 4 (db_redir_temp):   461511
status 5 (db_redir_perm):   431595
status 6 (db_notmodified):  118696

Thanks
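
A quick check on the figures above: summing each listing shows that both come to 40,909,384, the TOTAL urls value, so the two jobs agree on the number of entries and differ only in how they are split across statuses. The sketch below redoes that arithmetic; the class and method names are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Re-adds the numbers posted in this message; names are hypothetical.
final class CompareTotals {

    // Sum all per-status counts in one listing.
    static long sum(Map<String, Long> counts) {
        long total = 0;
        for (long value : counts.values()) {
            total += value;
        }
        return total;
    }

    // Counter values reported by the crawldb update job.
    static Map<String, Long> updateCounters() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("db_redir_temp", 1_036_840L);
        m.put("db_redir_perm", 1_195_539L);
        m.put("unknown", 2_315L);
        m.put("db_unfetched", 16_909_397L);
        m.put("db_notmodified", 1_264_001L);
        m.put("db_gone", 955_701L);
        m.put("db_fetched", 19_545_591L);
        return m;
    }

    // Per-status values reported by readdb -stats.
    static Map<String, Long> readdbStats() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("db_unfetched", 26_788_643L);
        m.put("db_fetched", 12_345_476L);
        m.put("db_gone", 763_463L);
        m.put("db_redir_temp", 461_511L);
        m.put("db_redir_perm", 431_595L);
        m.put("db_notmodified", 118_696L);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Long> update = updateCounters();
        Map<String, Long> stats = readdbStats();
        System.out.println("updatedb total: " + sum(update)); // 40909384
        System.out.println("readdb total:   " + sum(stats));  // 40909384
        // Per-status difference (updatedb minus readdb); "unknown" has no
        // readdb counterpart, so it is compared against zero.
        for (Map.Entry<String, Long> e : update.entrySet()) {
            long diff = e.getValue() - stats.getOrDefault(e.getKey(), 0L);
            System.out.println(e.getKey() + "\t" + diff);
        }
    }
}
```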


[jira] [Commented] (NUTCH-1071) Crawldb update to total counts per status

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072358#comment-13072358 ] 

Markus Jelsma commented on NUTCH-1071:
--------------------------------------

Great work! Very useful indeed.

> Crawldb update to total counts per status
> -----------------------------------------
>
>                 Key: NUTCH-1071
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1071
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Trivial
>             Fix For: 1.4
>
>         Attachments: NUTCH-1071.patch
>
>
> The reduce phase of the crawldb update outputs all the entries that will be found in the updated crawldb. We can use the counters to summarise the number of URLs per status, which is a bit like the readdb -stats functionality except that it does not require an additional step. 
> This is a useful way of monitoring the progress of a crawl using the Hadoop JobTracker UI.


[jira] [Closed] (NUTCH-1071) Crawldb update to total counts per status

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-1071.
--------------------------------



[jira] [Resolved] (NUTCH-1071) Crawldb update to total counts per status

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1071.
----------------------------------

    Resolution: Fixed

Committed revision 1151852.



[jira] [Updated] (NUTCH-1071) Crawldb update to total counts per status

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1071:
---------------------------------

    Attachment: NUTCH-1071.patch
