Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/07/28 15:53:09 UTC
[jira] [Created] (NUTCH-1071) Crawldb update to total counts per status
Crawldb update to total counts per status
-----------------------------------------
Key: NUTCH-1071
URL: https://issues.apache.org/jira/browse/NUTCH-1071
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Trivial
Fix For: 1.4
The reduce phase of the crawldb update outputs all the entries that will be found in the updated crawldb. We can use the counters to summarise the number of URLs per status, which is a bit like the readdb -stats functionality except that it does not require an additional step.
This is a useful way of monitoring the progress of a crawl using the Hadoop JobTracker UI.
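As a rough illustration of the idea, the sketch below tallies one counter per status name while entries are emitted, so the per-status totals fall out of the same pass that writes the crawldb. It is a stand-in only: a plain map replaces Hadoop's counters, and the class and method names (StatusTallySketch, tally) are hypothetical. The real reducer, as quoted later in this thread, calls reporter.getCounter("CrawlDB status", ...).increment(1).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the reduce phase; a map plays the role of the Hadoop counter group. */
final class StatusTallySketch {

    /** Counts how many emitted entries carry each status name. */
    static Map<String, Long> tally(List<String> emittedStatuses) {
        Map<String, Long> counters = new TreeMap<>();
        for (String status : emittedStatuses) {
            // A real job would write the entry to the output here,
            // then increment the counter named after its status.
            counters.merge(status, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        Map<String, Long> counters =
            tally(List.of("db_fetched", "db_unfetched", "db_fetched", "db_gone"));
        // One line per status, similar in spirit to a counter table in a job UI.
        counters.forEach((name, count) -> System.out.println(name + "\t" + count));
    }
}
```

Because the counts are gathered while the entries are written, no separate statistics job is needed to see them.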
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: (NUTCH-1071) Crawldb update to total counts per status
Posted by Markus Jelsma <ma...@openindex.io>.
On Friday 29 July 2011 17:06:15 Julien Nioche wrote:
> Markus,
>
> Have just committed a change to CrawlDBReducer (rev 1152254)
>
> see line 155
> -> reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(*old*.getStatus())).increment(1);
>
> was using the wrong object :-(
Don't be sad, it's finally the weekend! Oh, and by the way, it's fixed as well!
Both readdb and updatedb show identical numbers.
Thanks!
>
> Would you mind giving it a try?
>
> Thanks
>
> Julien
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: (NUTCH-1071) Crawldb update to total counts per status
Posted by Julien Nioche <li...@gmail.com>.
Markus,
Have just committed a change to CrawlDBReducer (rev 1152254)
see line 155
-> reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(*old*.getStatus())).increment(1);
was using the wrong object :-(
Would you mind giving it a try?
Thanks
Julien
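The message does not spell out which object the counter should have read, so the following is only an illustrative sketch with hypothetical names (WrongObjectSketch, Update, tallyWritten, tallyPrevious), not the Nutch code. It assumes each URL has a previously stored status (absent for new URLs) and a status that is written to the updated crawldb, and shows that tallying one instead of the other keeps the same total but changes the per-status split.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: counting the status of the wrong object changes the distribution. */
final class WrongObjectSketch {

    /** Hypothetical pair of statuses for one URL; previousStatus is null for a new URL. */
    static final class Update {
        final String previousStatus;
        final String writtenStatus;

        Update(String previousStatus, String writtenStatus) {
            this.previousStatus = previousStatus;
            this.writtenStatus = writtenStatus;
        }
    }

    /** Tallies the status of the entry that is actually written out. */
    static Map<String, Long> tallyWritten(List<Update> updates) {
        Map<String, Long> counters = new TreeMap<>();
        for (Update u : updates) {
            counters.merge(u.writtenStatus, 1L, Long::sum);
        }
        return counters;
    }

    /** Tallies the previously stored status; a missing one is counted as "unknown". */
    static Map<String, Long> tallyPrevious(List<Update> updates) {
        Map<String, Long> counters = new TreeMap<>();
        for (Update u : updates) {
            String name = (u.previousStatus == null) ? "unknown" : u.previousStatus;
            counters.merge(name, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        List<Update> updates = List.of(
            new Update("db_unfetched", "db_fetched"),
            new Update("db_unfetched", "db_gone"),
            new Update(null, "db_unfetched"),
            new Update("db_fetched", "db_fetched"));
        System.out.println("written:  " + tallyWritten(updates));
        System.out.println("previous: " + tallyPrevious(updates));
    }
}
```

Both tallies cover the same four URLs, but only one of them describes the crawldb that was just written.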
Re: (NUTCH-1071) Crawldb update to total counts per status
Posted by Markus Jelsma <ma...@openindex.io>.
Another cycle is complete, the results are incremental but similar. The
TOTAL_URLS of stats is reduce_output_records + 1, seems alright.
Can't find any order in the numbers.
On Friday 29 July 2011 11:43:19 Julien Nioche wrote:
> Hi Markus,
>
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb update?
> Could it be a problem with readdb -stats? And why are we seeing an
> 'unknown' status in the crawldb update?
>
> That's definitely interesting
>
> Julien
>
> On 29 July 2011 10:23, Markus Jelsma <ma...@openindex.io> wrote:
> > Hi Julien,
> >
> > Can you explain the following? I've got here some output from a readdb -stats
> > job and the output of the most recent crawldb update job. They differ a lot!
> >
> > update:
> > db_redir_temp 0 1,036,840 1,036,840
> > db_redir_perm 0 1,195,539 1,195,539
> > unknown 0 2,315 2,315
> > db_unfetched 0 16,909,397 16,909,397
> > db_notmodified 0 1,264,001 1,264,001
> > db_gone 0 955,701 955,701
> > db_fetched 0 19,545,591 19,545,591
> >
> > stats:
> > TOTAL urls: 40909384
> > status 1 (db_unfetched): 26788643
> > status 2 (db_fetched): 12345476
> > status 3 (db_gone): 763463
> > status 4 (db_redir_temp): 461511
> > status 5 (db_redir_perm): 431595
> > status 6 (db_notmodified): 118696
> >
> > Thanks
Re: (NUTCH-1071) Crawldb update to total counts per status
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
A lot of indexers and one domainstats job ran between readdb and updatedb.
Nothing that could interfere with the numbers, afaik. I'm running a complete
cycle again and will do a readdb again after the update and recheck.
I've no idea what the problem could be. I don't know whether to trust the readdb
or the updatedb stats. Is there any way to check which output is correct?
Thanks
> Hi Markus,
>
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb update?
> Could it be a problem with readdb -stats? And why are we seeing an
> 'unknown' status in the crawldb update?
>
> That's definitely interesting
>
> Julien
>
> On 29 July 2011 10:23, Markus Jelsma <ma...@openindex.io> wrote:
> > Hi Julien,
> >
> > Can you explain the following? I've got here some output from a readdb -stats
> > job and the output of the most recent crawldb update job. They differ a lot!
> >
> > update:
> > db_redir_temp 0 1,036,840 1,036,840
> > db_redir_perm 0 1,195,539 1,195,539
> > unknown 0 2,315 2,315
> > db_unfetched 0 16,909,397 16,909,397
> > db_notmodified 0 1,264,001 1,264,001
> > db_gone 0 955,701 955,701
> > db_fetched 0 19,545,591 19,545,591
> >
> > stats:
> > TOTAL urls: 40909384
> > status 1 (db_unfetched): 26788643
> > status 2 (db_fetched): 12345476
> > status 3 (db_gone): 763463
> > status 4 (db_redir_temp): 461511
> > status 5 (db_redir_perm): 431595
> > status 6 (db_notmodified): 118696
> >
> > Thanks
Re: (NUTCH-1071) Crawldb update to total counts per status
Posted by Julien Nioche <li...@gmail.com>.
Hi Markus,
Can't really think of a reason why they could differ. You called 'readdb
-stats' right after the crawldb update?
Could it be a problem with readdb -stats? And why are we seeing an
'unknown' status in the crawldb update?
That's definitely interesting
Julien
On 29 July 2011 10:23, Markus Jelsma <ma...@openindex.io> wrote:
> Hi Julien,
>
> Can you explain the following? I've got here some output from a readdb -stats
> job and the output of the most recent crawldb update job. They differ a lot!
>
> update:
> db_redir_temp 0 1,036,840 1,036,840
> db_redir_perm 0 1,195,539 1,195,539
> unknown 0 2,315 2,315
> db_unfetched 0 16,909,397 16,909,397
> db_notmodified 0 1,264,001 1,264,001
> db_gone 0 955,701 955,701
> db_fetched 0 19,545,591 19,545,591
>
> stats:
> TOTAL urls: 40909384
> status 1 (db_unfetched): 26788643
> status 2 (db_fetched): 12345476
> status 3 (db_gone): 763463
> status 4 (db_redir_temp): 461511
> status 5 (db_redir_perm): 431595
> status 6 (db_notmodified): 118696
>
> Thanks
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: (NUTCH-1071) Crawldb update to total counts per status
Posted by Markus Jelsma <ma...@openindex.io>.
Hi Julien,
Can you explain the following? I've got here some output from a readdb -stats
job and the output of the most recent crawldb update job. They differ a lot!
update:
db_redir_temp     0   1,036,840   1,036,840
db_redir_perm     0   1,195,539   1,195,539
unknown           0       2,315       2,315
db_unfetched      0  16,909,397  16,909,397
db_notmodified    0   1,264,001   1,264,001
db_gone           0     955,701     955,701
db_fetched        0  19,545,591  19,545,591
stats:
TOTAL urls: 40909384
status 1 (db_unfetched): 26788643
status 2 (db_fetched): 12345476
status 3 (db_gone): 763463
status 4 (db_redir_temp): 461511
status 5 (db_redir_perm): 431595
status 6 (db_notmodified): 118696
Thanks
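A quick arithmetic check on the two listings above may help narrow the problem down. The sketch below (class and field names are hypothetical) simply adds up each listing. Both sums equal the "TOTAL urls" figure, so the two jobs agree on how many URLs there are and differ only in how those URLs are split across statuses.

```java
/** Sums the per-status figures quoted in the message above. */
final class CounterTotalsSketch {

    // Order: db_unfetched, db_fetched, db_gone, db_redir_temp,
    //        db_redir_perm, db_notmodified, unknown
    static final long[] UPDATE_COUNTERS = {
        16_909_397L, 19_545_591L, 955_701L, 1_036_840L,
        1_195_539L, 1_264_001L, 2_315L
    };

    // readdb -stats reports no "unknown" status, so that slot is 0.
    static final long[] READDB_STATS = {
        26_788_643L, 12_345_476L, 763_463L, 461_511L,
        431_595L, 118_696L, 0L
    };

    // The "TOTAL urls" line from readdb -stats.
    static final long READDB_TOTAL = 40_909_384L;

    /** Adds up all values in the array. */
    static long sum(long[] values) {
        long total = 0L;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("update counters: " + sum(UPDATE_COUNTERS)); // 40909384
        System.out.println("readdb stats:    " + sum(READDB_STATS));    // 40909384
        System.out.println("readdb TOTAL:    " + READDB_TOTAL);         // 40909384
    }
}
```

Matching totals with a different distribution point toward the status being read for each counter rather than toward missing or extra entries.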
[jira] [Commented] (NUTCH-1071) Crawldb update to total counts per status
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072358#comment-13072358 ]
Markus Jelsma commented on NUTCH-1071:
--------------------------------------
Great work! Very useful indeed.
[jira] [Closed] (NUTCH-1071) Crawldb update to total counts per status
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche closed NUTCH-1071.
--------------------------------
[jira] [Resolved] (NUTCH-1071) Crawldb update to total counts per status
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche resolved NUTCH-1071.
----------------------------------
Resolution: Fixed
Committed revision 1151852.
[jira] [Updated] (NUTCH-1071) Crawldb update to total counts per status
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-1071:
---------------------------------
Attachment: NUTCH-1071.patch