Posted to common-dev@hadoop.apache.org by "stack@archive.org" <st...@archive.org> on 2006/04/29 03:29:21 UTC

CrawlDbReducer and the lone STATUS_SIGNATURE record

CrawlDbReducer#reduce doesn't have a switch case for 
CrawlDatum.STATUS_SIGNATURE, so we fall into the default block 
(line #121), which throws a RuntimeException. As a result, my update db 
job never succeeds.

This has just recently started happening.

With logging enabled, I see that what usually happens is that a 
CrawlDatum with a STATUS_SIGNATURE status comes through first and is set 
as 'highest' (line #49), but then the next record through takes over the 
'highest' role because its status is higher, usually 'fetch_success' or 
'linked' in my case.
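
To make the usual case concrete, here is a minimal sketch of that 
"highest status wins" selection. It is not the actual CrawlDbReducer 
code: the class name, the method name, and the numeric status values are 
placeholders assumed for illustration. The only property taken from the 
description above is that the signature status ranks below 
'fetch_success' and 'linked'.

```java
import java.util.Arrays;
import java.util.List;

public class HighestStatusSketch {
    // Placeholder values, assumed for illustration only; the real
    // constants are defined in CrawlDatum.
    static final int STATUS_SIGNATURE = 0;
    static final int STATUS_LINKED = 4;
    static final int STATUS_FETCH_SUCCESS = 5;

    /**
     * Returns the highest status among the records grouped under one key.
     * The first record starts out as 'highest' and is replaced whenever a
     * later record carries a higher status.
     */
    static int highest(List<Integer> statuses) {
        int highest = statuses.get(0);
        for (int status : statuses) {
            if (status > highest) {
                highest = status;
            }
        }
        return highest;
    }

    public static void main(String[] args) {
        // Usual case: the signature record is followed by a fetch_success.
        System.out.println(highest(Arrays.asList(STATUS_SIGNATURE, STATUS_FETCH_SUCCESS)));
        // Problem case: a lone signature record stays 'highest'.
        System.out.println(highest(Arrays.asList(STATUS_SIGNATURE)));
    }
}
```

In the second call nothing ever displaces the signature record, so its 
status is what reaches the switch.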

But for reasons not clear to me, I'll sometimes get a lone CrawlDatum 
with a status of STATUS_SIGNATURE (did a map output lose a record?) with 
no following 'fetch_success' or 'linked' CrawlDatum.

This probably shouldn't fail the job.

Attached is a patch that logs a warning and keeps going, but it's 
probably not the right solution.
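
The attachment itself is not reproduced in this archive. The following 
is only a rough sketch of the warn-and-continue idea, not the patch: the 
class name, the handle method, its boolean return, and the status values 
are all hypothetical, and java.util.logging stands in for whatever 
logger the real class uses.

```java
import java.util.logging.Logger;

public class LoneSignatureSketch {
    private static final Logger LOG =
        Logger.getLogger(LoneSignatureSketch.class.getName());

    // Placeholder values, assumed for illustration only.
    static final int STATUS_SIGNATURE = 0;
    static final int STATUS_LINKED = 4;
    static final int STATUS_FETCH_SUCCESS = 5;

    /**
     * Decides what to do with the 'highest' record for a key.
     * Returns true if a record should be emitted, false if the key is
     * skipped. Unknown statuses still fail loudly.
     */
    static boolean handle(String url, int highestStatus) {
        switch (highestStatus) {
            case STATUS_FETCH_SUCCESS:
            case STATUS_LINKED:
                return true;
            case STATUS_SIGNATURE:
                // A lone signature record: warn and keep going instead
                // of failing the whole job.
                LOG.warning("Lone STATUS_SIGNATURE record for " + url + ", skipping");
                return false;
            default:
                throw new RuntimeException("Unknown status: " + highestStatus);
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("http://example.com/", STATUS_SIGNATURE));
    }
}
```

Keeping the default branch as an exception means only the one known odd 
case is tolerated; anything else unexpected would still stop the job.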

Thanks,
St.Ack



Re: CrawlDbReducer and the lone STATUS_SIGNATURE record

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Stack wrote:
> Andrzej Bialecki wrote:
>> (redirected to nutch-dev)
> Pardon me.  I intended to send to nutch-dev, not hadoop-dev.
>> ...
>> How weird, This Should Never Happen(tm) ... ;) Lost map output should 
>> show up in logs, or perhaps even should've killed the job, isn't that 
>> so? 
> Yes.  I'd  have thought.

Patch applied. Please keep an eye on the log messages; if they reappear, 
we should try to determine their cause.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: CrawlDbReducer and the lone STATUS_SIGNATURE record

Posted by Michael Stack <st...@archive.org>.
Andrzej Bialecki wrote:
> (redirected to nutch-dev)
Pardon me.  I intended to send to nutch-dev, not hadoop-dev.
> ...
> How weird, This Should Never Happen(tm) ... ;) Lost map output should 
> show up in logs, or perhaps even should've killed the job, isn't that so? 
Yes.  I'd  have thought.

> I'll apply your patch for now, but we need to keep an eye on this.
Grand.
St.Ack

Re: CrawlDbReducer and the lone STATUS_SIGNATURE record

Posted by Andrzej Bialecki <ab...@getopt.org>.
(redirected to nutch-dev)

stack@archive.org wrote:
> CrawlDbReducer#reduce doesn't have a switch case for 
> CrawlDatum.STATUS_SIGNATURE so we fall into the default (line #121) 
> block which throws a RuntimeException.   This causes my update db job 
> to never succeed.
>
> This has just recently started happening.
>
> Enabling logging I see that what usually happens is that a CrawlDatum 
> with a STATUS_SIGNATURE status comes through first and is set to be 
> 'highest' (line #49) but then the next record through takes over the 
> 'highest' role because its status is higher, usually 'fetch_success' 
> or 'linked' in my case.
>
> But for reasons not clear to me, I'll sometimes have a lone CrawlDatum 
> with a status of STATUS_SIGNATURE (A mapout lost a record?) with no 
> following 'fetch_success' or 'linked' CrawlDatum. 
> This probably shouldn't fail the job.
>
> Attached is a patch that logs a warning and keeps going but probably 
> not the right soln.

How weird, This Should Never Happen(tm) ... ;) Lost map output should 
show up in logs, or perhaps even should've killed the job, isn't that 
so? I'll apply your patch for now, but we need to keep an eye on this.

-- 
Best regards,
Andrzej Bialecki     <><