Posted to user@hbase.apache.org by Venkatesh <vr...@aol.com> on 2011/05/13 04:21:10 UTC

mapreduce job failure

 Hi
Using hbase-0.20.6

The mapreduce job started failing in the map phase (it uses an HBase table as input to the mapper). It ran fine for a week or so, starting with empty tables.

task tracker log:


Task attempt_201105121141_0002_m_000452_0 failed to report status for 600 seconds. Killing

 
Region server log:

2011-05-12 18:27:39,919 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -7857209327501974146 lease expired

2011-05-12 18:28:29,716 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:org.apache.hadoop.hbase.UnknownScannerException: Name: -7857209327501974146
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1880)
        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)


2011-05-12 18:28:29,897 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020, call next(-7857209327501974146, 1) from .....:35202: error: org.apache.hadoop.hbase.UnknownScannerException: Name: -7857209327501974146
org.apache.hadoop.hbase.UnknownScannerException: Name: -7857209327501974146
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1880)
        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

I don't see any errors on the datanodes.

Appreciate any help
thanks
v



Re: mapreduce job failure

Posted by Venkatesh <vr...@aol.com>.
thanks J-D as always

Re: mapreduce job failure

Posted by Jean-Daniel Cryans <jd...@apache.org>.
400 regions a day is way too much; also, in 0.20.6 there's a high risk
of collision once you get into the tens of thousands of regions. But that's
most probably not your current issue.

That HDFS message, 99% of the time, means that the region server went
into a GC pause and, by the time it came back, the master had already
moved the regions away. It should be pretty obvious in the logs.

As to why the tasks get killed, it's probably related. And since you
are running such an old release you're exposed to data loss; if that
happens on the .META. table, then you lose metadata about the regions.

To help with GC issues, I suggest you read the multi-part blog post
from Todd: http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
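
If GC logging isn't already on, turning it on in hbase-env.sh makes those
pauses easy to confirm. A minimal sketch (the log path is just a placeholder,
and the CMS flag is only a starting point, not a tuned setting):

    # hbase-env.sh: turn on GC logging for the HBase daemons
    export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
        -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
        -Xloggc:/var/log/hbase/gc-hbase.log"

Long pauses in that log lining up with the "lease expired" messages would
confirm the GC theory.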

J-D

On Mon, May 16, 2011 at 2:08 PM, Venkatesh <vr...@aol.com> wrote:
> Thanks J-D
>
> Using hbase-0.20.6, 49 node cluster
>
> The map-reduce job involves a full table scan (region size: 4 GB).
> The job runs great for 1 week.
> It starts failing after a week of data accumulation (about 3000 regions).
>
>  About 400 regions get created per day.
>
> Can you suggest any tunables at the HBase level or the HDFS level?
>
> Also, I have one more issue: when region servers die, I see the errors below (any suggestion here is helpful as well).
>
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase_data_one_110425/.../compaction.dir/249610074/4534752250560182124 File does not exist. Holder DFSClient_-398073404 does not have any open files.
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323)
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
>        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

Re: mapreduce job failure

Posted by Venkatesh <vr...@aol.com>.
Thanks J-D

Using hbase-0.20.6, 49 node cluster

The map-reduce job involves a full table scan (region size: 4 GB).
The job runs great for 1 week.
It starts failing after a week of data accumulation (about 3000 regions).

 About 400 regions get created per day.

Can you suggest any tunables at the HBase level or the HDFS level?

Also, I have one more issue: when region servers die, I see the errors below (any suggestion here is helpful as well).

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase_data_one_110425/.../compaction.dir/249610074/4534752250560182124 File does not exist. Holder DFSClient_-398073404 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

 

Re: mapreduce job failure

Posted by Jean-Daniel Cryans <jd...@apache.org>.
All that means is that the task stayed in map() for 10 minutes,
blocked on "something".

If you were scanning an HBase table and didn't fetch a new row within 1
minute, then the scanner lease would expire. That's orthogonal, though.
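
One thing that helps on that front is to keep scanner caching low, so the
client goes back to the region server often and the lease keeps getting
renewed. A rough sketch of the job setup (table, class names and the job name
are just placeholders; this assumes the org.apache.hadoop.hbase.mapreduce API
that ships with 0.20):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;

    public class ScanJobSetup {
      public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        Job job = new Job(conf, "full-table-scan");
        Scan scan = new Scan();
        // Fetch few rows per next() call so the scanner lease is renewed
        // often, even when each row takes a while to process in map().
        scan.setCaching(1);
        TableMapReduceUtil.initTableMapperJob("my_table", scan,
            MyMapper.class, ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

(MyMapper stands in for whatever your mapper class is.) If each row is quick
and only some rows block, a larger caching value is fine; the point is just
not to exceed the lease between next() calls.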

You need to figure out what you're blocking on; add logging and try to
jstack your Child processes, for example.
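
And while you're digging, having the mapper report progress around any slow
step keeps the task tracker from killing it at the 600-second mark. A minimal
sketch (again just a placeholder mapper, assuming the new-API TableMapper):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;

    public class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // ... whatever slow per-row work you do ...

        // Tell the task tracker we're still alive so it doesn't kill the
        // attempt for failing to report status.
        context.progress();
        context.setStatus("processed another row");

        context.write(row, value);
      }
    }

Note that progress() only keeps the task alive; it does nothing for the
scanner lease, which is the other half of what you're seeing.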

J-D

On Thu, May 12, 2011 at 7:21 PM, Venkatesh <vr...@aol.com> wrote:
>
>  Hi
> Using hbase-0.20.6
>
> The mapreduce job started failing in the map phase (it uses an HBase table as input to the mapper). It ran fine for a week or so, starting with empty tables.
>
> task tracker log:
>
>
> Task attempt_201105121141_0002_m_000452_0 failed to report status for 600 seconds. Killing
>
>
> Region server log:
>
> 2011-05-12 18:27:39,919 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -7857209327501974146 lease expired
>
> 2011-05-12 18:28:29,716 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:org.apache.hadoop.hbase.UnknownScannerException: Name: -7857209327501974146
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1880)
>        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
>
> 2011-05-12 18:28:29,897 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020, call next(-7857209327501974146, 1) from .....:35202: error: org.apache.hadoop.hbase.UnknownScannerException: Name: -7857209327501974146
> org.apache.hadoop.hbase.UnknownScannerException: Name: -7857209327501974146
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1880)
>        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
> I don't see any errors on the datanodes.
>
> Appreciate any help
> thanks
> v
>
>
>