Posted to user@hbase.apache.org by 최우용 <oo...@gmail.com> on 2012/07/11 10:23:52 UTC

Mapred job failing with LeaseException

Hi,

I'm running a cluster of a few hundred servers with Cloudera's CDH3u4
HBase+Hadoop,
and I'm having trouble with what I think is a simple map job that uses
an HBase table as its input.
My mapper code is org.apache.hadoop.hbase.mapreduce.Export with a few
SingleColumnValueFilters (i.e. a FilterList) added to the Scan object.
The job seems to progress without any trouble at first, but after
about 5~7 minutes, when a little over 50% of the map tasks have completed,
I suddenly see a lot of LeaseExceptions and the job ultimately fails.

Here's the stack trace I see on my failed tasks:

org.apache.hadoop.hbase.regionserver.LeaseException:
org.apache.hadoop.hbase.regionserver.LeaseException: lease
'7595201038414594449' does not exist
    at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1881)
    at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    ...

I had a somewhat similar problem when I was scanning a particular
region using a ResultScanner in a single-threaded manner with the same
filters mentioned above,
but I assumed it wouldn't be a problem in mapred since it's more
resilient to single-task errors.

I tried row caching with Scan.setCaching() and lowered the
mapred.tasktracker.map.tasks.maximum property in hopes of reducing the
total load on the region servers, but nothing worked.
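
(For the record, the scan-side tuning I mean looks roughly like this;
it's only a sketch, and the caching value is an example rather than
what I actually used:)

Scan s = new Scan();
// Fetch more rows per scanner.next() RPC; 100 is only an example value.
s.setCaching(100);
// Note: mapred.tasktracker.map.tasks.maximum is a TaskTracker-side setting
// (mapred-site.xml on each node, read when the TaskTracker starts), so
// lowering it means changing the node configs, not the job configuration.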

Could this be a filter performance problem preventing the region servers
from responding before the lease expires?
Or maybe a long run of rows doesn't match my filter list and the lease
expires before the scan finally hits one that does.

I'm kind of new to Hadoop map-reduce and HBase, so any pointers would
be very much appreciated.
Thanks.

Re: Mapred job failing with LeaseException

Posted by Ooh Rong <oo...@gmail.com>.
That's exactly why I am so confused!
I can't think of anything in my code that would take more than 60 seconds
and block consecutive next() calls. My "client" prints some progress info
to standard out, and my "map task" (probably) just writes SequenceFiles
to HDFS.
(I didn't actually write the map task; I just modified the Export class
from the hbase.mapreduce package by adding some filters to the Scan object
that is passed to TableMapReduceUtil.initTableMapperJob().)
But I'll definitely look into this, just to make sure.
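
(For context, the wiring looks roughly like this; it's only a sketch -
"Exporter" stands for the mapper inside the Export class I modified, and
tableName/job/filterList are placeholders:)

Scan scan = new Scan();
// filterList is the SingleColumnValueFilter list shown in the <Before> snippet below.
scan.setFilter(filterList);
// Export's mapper passes each Result straight through; the job's output
// format then writes the Results out as SequenceFiles on HDFS.
TableMapReduceUtil.initTableMapperJob(tableName, scan, Exporter.class,
        ImmutableBytesWritable.class, Result.class, job);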

Yesterday I did some more testing. I got rid of the Filters that I had
added to the Export class and moved this "filtering" functionality into
map().
Logically this is exactly the same code as the version with the Filters,
except that the filtering takes place in a different process, i.e. the
region server vs. the map task.

Here's some code:
<Before>
public static Job createSubmittableJob(Configuration conf, String[] args) throws IOException {
    ...
    // Server-side filtering: only export rows whose three flag columns are all zero.
    List<Filter> filters = new ArrayList<Filter>();
    filters.add(new SingleColumnValueFilter(CF_1, FLAG_1, CompareOp.EQUAL, LONG_ZERO));
    filters.add(new SingleColumnValueFilter(CF_1, FLAG_2, CompareOp.EQUAL, LONG_ZERO));
    filters.add(new SingleColumnValueFilter(CF_1, FLAG_3, CompareOp.EQUAL, LONG_ZERO));
    s.setFilter(new FilterList(Operator.MUST_PASS_ALL, filters));
    ...
<After>
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException {
    // Same check as the FilterList above, but done inside the map task
    // instead of on the region server.
    if (Bytes.equals(value.getValue(CF_1, FLAG_1), LONG_ZERO) &&
            Bytes.equals(value.getValue(CF_1, FLAG_2), LONG_ZERO) &&
            Bytes.equals(value.getValue(CF_1, FLAG_3), LONG_ZERO)) {
        ...
    }
}

Now the fun part: this time the job finished successfully! There were
some failed tasks (only 0.09%, with speculative execution turned off),
but there were no LeaseExceptions.
About 23% of these failed tasks showed
org.apache.hadoop.hbase.client.ScannerTimeoutException, and about 17%
failed to report status and were killed. The rest, which makes up 60% of
the failed tasks, were connection problems (e.g. connection reset by
peer, broken pipe) to the name node, which I think is understandable.

Any comments are welcome.
Thanks,


On Thu, Jul 12, 2012 at 8:22 AM, Suraj Varma <sv...@gmail.com> wrote:
> The reason you get LeaseExceptions is that the time between two
> scanner.next() calls exceeded your hbase.regionserver.lease.period
> setting, which defaults to 60s. Whether it is your "client" or your
> "map task", if it opens a Scan against HBase, scanner.next() should
> continue to be invoked within this lease period - otherwise the client
> is considered dead and the lease is expired. When this "dead" client
> comes back and tries to do a scanner.next(), it gets a LeaseException.
>
> There are several threads on this ... so - google for "hbase scanner
> leaseexception" and such. See:
> http://mail-archives.apache.org/mod_mbox/hbase-user/200903.mbox/%3Cfa03480d0903110823l5678e8dem353f345483799c5@mail.gmail.com%3E
>  http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/10225
>
> Are you doing some processing between two scanner.next() calls that
> sometimes takes over 60s?
> --Suraj

Re: Mapred job failing with LeaseException

Posted by Suraj Varma <sv...@gmail.com>.
The reason you get LeaseExceptions is that the time between two
scanner.next() calls exceeded your hbase.regionserver.lease.period
setting, which defaults to 60s. Whether it is your "client" or your
"map task", if it opens a Scan against HBase, scanner.next() should
continue to be invoked within this lease period - otherwise the client
is considered dead and the lease is expired. When this "dead" client
comes back and tries to do a scanner.next(), it gets a LeaseException.
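
If the time between next() calls genuinely can't be kept under the lease
period, you can also raise the period itself. A sketch of hbase-site.xml
on the region servers (120000 ms is just an example value; the region
servers need a restart to pick it up, and as far as I remember the client
side reads the same property for its scanner timeout, so keep both
configs consistent):

<property>
  <name>hbase.regionserver.lease.period</name>
  <!-- milliseconds; 120000 (two minutes) is only an example value -->
  <value>120000</value>
</property>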

There are several threads on this ... so - google for "hbase scanner
leaseexception" and such. See:
http://mail-archives.apache.org/mod_mbox/hbase-user/200903.mbox/%3Cfa03480d0903110823l5678e8dem353f345483799c5@mail.gmail.com%3E
 http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/10225

Are you doing some processing between two scanner.next() calls that
sometimes takes over 60s?
--Suraj

