Posted to user@hbase.apache.org by Eran Kutner <er...@gigya.com> on 2011/10/18 12:28:30 UTC

Lease does not exist exceptions

Hi,
I'm having a problem when running map/reduce on a table with about 500
regions.
The MR job shows this kind of exception:
11/10/18 06:03:39 INFO mapred.JobClient: Task Id : attempt_201110030100_0086_m_000062_0, Status : FAILED
org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '-334679770697295011' does not exist
        at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1845)
        at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:83)
        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:1)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1019)
        at org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1151)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:149)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:142)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)

The HBase logs are full of these:
2011-10-18 06:07:01,425 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.regionserver.LeaseException: lease '3475143032285946374' does not exist
        at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1845)
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)


and the datanode logs have a few of these (seemingly a lot fewer than the
HBase errors):
2011-10-18 06:16:42,550 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.104.4:50010, storageID=DS-15546166-10.1.104.4-50010-1298985607414, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.104.4:50010 remote=/10.1.104.1:57232]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:214)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:114)

I've increased all the relevant limits I know of (which were high to begin
with), so now I have 64K file descriptors and dfs.datanode.max.xcievers is
8192. I've restarted everything in the cluster to make sure all the processes
picked up the new configuration, but I still get those errors. They always
begin when the map phase is around 12-14%, and eventually the job fails at
~50%.
Running random scans against the same HBase table while the job is running
seems to work fine.
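
For context, the failing tasks read the table through TableInputFormat, as the
stack trace above shows. A minimal sketch of such a scan-backed job, assuming
the 0.90-era API (the table name, mapper and job name below are placeholders,
not the actual code from this job):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSketch {

  // Identity mapper, only here to keep the sketch self-contained.
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(row, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-500-region-table");
    job.setJarByClass(ScanJobSketch.class);

    Scan scan = new Scan();
    scan.setCaching(1000);      // rows fetched per scanner.next() RPC
    scan.setCacheBlocks(false); // MR scans shouldn't churn the block cache

    TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The scanner that TableRecordReaderImpl opens from this Scan is what holds the
region server lease that the LeaseException above refers to.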

I'm using hadoop 0.20.2+923.97-1 from CDH3 and hbase 0.90.4 compiled from
the branch code a while ago.

Is there any other setting I'm missing, or any other idea of what could be
causing this?

Thanks.

-eran

Re: Lease does not exist exceptions

Posted by Doug Meil <do...@explorysmedical.com>.
I'll add something in the docs.





Re: Lease does not exist exceptions

Posted by Lucian Iordache <lu...@gmail.com>.
Yep, that did not work entirely.

I had a job to run on 1000 regions, and the caching was 200. The job crashed
with a lot of ClosedChannelExceptions + LeaseExceptions.

Set the caching to 10 ==> the same.
Set the caching to 1 ==> ~600 successfully completed tasks, but still a lot
of them crashed ==> job crashed.
Set hbase.rpc.timeout to 240000 (which is the lease timeout on the region
server) ==> the job completed successfully, without any failed attempts.

The problem is that we have some very large regions (2GB), and some of them
hold very little data, which is why it can take more than 60 seconds to get
even the first row. As Daniel said, the documentation for the region server
lease timeout and for hbase.rpc.timeout should mention being careful when
modifying them, because you can run into problems like in our case.
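
A minimal sketch of those two changes on the client side, assuming the 0.90-era
API (the helper below is hypothetical, shown only to illustrate the property
name and the Scan setting; hbase.regionserver.lease.period itself is read by
the region servers from their own hbase-site.xml and needs a restart to take
effect):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Scan;

public class TimeoutTuningSketch {
  /** Raise the client RPC timeout to the lease period and keep caching small. */
  public static void apply(Configuration conf, Scan scan) {
    conf.setLong("hbase.rpc.timeout", 240000L); // client-side timeout, matches the 240s lease
    scan.setCaching(1);                         // one row per scanner.next() call
  }
}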

Regards,
Lucian

On Wed, Oct 26, 2011 at 7:53 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Did you try setting the scanner caching down like I mentioned?
>
> J-D
>
> On Wed, Oct 26, 2011 at 8:48 AM, Lucian Iordache
> <lu...@gmail.com> wrote:
> > Problem solved. It was like I said, the server took more than the
> > hbase.rpc.timeout to run the call and the client closed the connection.
> >
> > Best Regards,
> > Lucian
> >
> > On Tue, Oct 25, 2011 at 11:15 AM, Lucian Iordache <
> > lucian.george.iordache@gmail.com> wrote:
> >
> >> Yes, I will try to see the SocketTimeoutException after putting log on
> >> debug, because, like it says here
> >> https://issues.apache.org/jira/browse/HBASE-3154 , this is logged on
> debug
> >> on the client side.
> >>
> >> Regards,
> >> Lucian
> >>
> >>
> >> On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans <
> jdcryans@apache.org>wrote:
> >>
> >>> So you should see the SocketTimeoutException in your *client* logs (in
> >>> your case, mappers), not LeaseException. At this point yes you're
> >>> going to timeout, but if you spend so much time cycling on the server
> >>> side then you shouldn't set a high caching configuration on your
> >>> scanner as IO isn't your bottle neck.
> >>>
> >>> J-D
> >>>
> >>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> >>> <lu...@gmail.com> wrote:
> >>> > Hi,
> >>> >
> >>> > The servers have been restarted (I have had this configuration for more
> >>> > than a month, so this is not the problem).
> >>> > About the stack traces, they show exactly the same thing: a lot of
> >>> > ClosedChannelExceptions and LeaseExceptions.
> >>> >
> >>> > But I found something that could be the problem: hbase.rpc.timeout. This
> >>> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> >>> > could happen the following way:
> >>> > - the mapper makes a scanner.next call to the region server
> >>> > - the region server needs more than 60 seconds to execute it (I use
> >>> > multiple filters, and it could take a lot of time)
> >>> > - the scan client gets the timeout and cuts the connection
> >>> > - the region server tries to send the results to the client ==>
> >>> > ClosedChannelException
> >>> >
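
A sketch of the kind of filtered scan that triggers the sequence above (family,
qualifier and value are placeholders): with heavy filtering, the region server
may walk a lot of data before it finds enough matching rows to fill one next()
response, so a single call can exceed hbase.rpc.timeout even though the client
and its scanner lease are perfectly healthy.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScanSketch {
  public static Scan build() {
    Scan scan = new Scan();
    // Only rows whose cf:q column equals "rare-value" are returned.
    scan.setFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("q"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("rare-value")));
    // With caching at 200, next() does not return until 200 matching rows have
    // been found (or the region is exhausted), which can take longer than the
    // default 60s hbase.rpc.timeout on large, sparsely matching regions.
    scan.setCaching(200);
    return scan;
  }
}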
> >>> > I will get a deeper look into it tomorrow. If you have other
> >>> suggestions,
> >>> > please let me know!
> >>> >
> >>> > Thanks,
> >>> > Lucian
> >>> >
> >>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <
> >>> jdcryans@apache.org>wrote:
> >>> >
> >>> >> Did you restart the region servers after changing the config?
> >>> >>
> >>> >> Are you sure it's the same exception/stack trace?
> >>> >>
> >>> >> J-D
> >>> >>
> >>> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> >>> >> <lu...@gmail.com> wrote:
> >>> >> > Hi all,
> >>> >> >
> >>> >> > I have exactly the same problem that Eran had.
> >>> >> > But there is something I don't understand: in my case, I have set the
> >>> >> > lease time to 240000 (4 minutes). But most of the map tasks that are
> >>> >> > failing run about 2 minutes. How is it possible to get a LeaseException
> >>> >> > if the task runs less than the configured time for a lease?
> >>> >> >
> >>> >> > Regards,
> >>> >> > Lucian Iordache
> >>> >> >
> >>> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com>
> >>> wrote:
> >>> >> >
> >>> >> >> Perfect! Thanks.
> >>> >> >>
> >>> >> >> -eran
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <
> >>> jdcryans@apache.org
> >>> >> >> >wrote:
> >>> >> >>
> >>> >> >> > hbase.regionserver.lease.period
> >>> >> >> >
> >>> >> >> > Set it bigger than 60000.
> >>> >> >> >
> >>> >> >> > J-D
> >>> >> >> >
> >>> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com>
> >>> wrote:
> >>> >> >> > >
> >>> >> >> > > Thanks J-D!
> >>> >> >> > > Since my main table is expected to continue growing I guess
> at
> >>> some
> >>> >> >> point
> >>> >> >> > > even setting the cache size to 1 will not be enough. Is there
> a
> >>> way
> >>> >> to
> >>> >> >> > > configure the lease timeout?
> >>> >> >> > >
> >>> >> >> > > -eran
> >>> >> >> > >
> >>> >> >> > >
> >>> >> >> > >
> >>> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
> >>> >> jdcryans@apache.org
> >>> >> >> > >wrote:
> >>> >> >> > >
> >>> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <
> eran@gigya.com
> >>> >
> >>> >> >> wrote:
> >>> >> >> > > >
> >>> >> >> > > > > Hi J-D,
> >>> >> >> > > > > Thanks for the detailed explanation.
> >>> >> >> > > > > So if I understand correctly the lease we're talking
> about
> >>> is a
> >>> >> >> > scanner
> >>> >> >> > > > > lease and the timeout is between two scanner calls,
> correct?
> >>> I
> >>> >> >> think
> >>> >> >> > that
> >>> >> >> > > > > make sense because I now realize that jobs that fail
> (some
> >>> jobs
> >>> >> >> > continued
> >>> >> >> > > > > to
> >>> >> >> > > > > fail even after reducing the number of map tasks as Stack
> >>> >> >> suggested)
> >>> >> >> > use
> >>> >> >> > > > > filters to fetch relatively few rows out of a very large
> >>> table,
> >>> >> so
> >>> >> >> > they
> >>> >> >> > > > > could be spending a lot of time on the region server
> >>> scanning
> >>> >> rows
> >>> >> >> > until
> >>> >> >> > > > it
> >>> >> >> > > > > reached my setCaching value which was 1000. Setting the
> >>> caching
> >>> >> >> value
> >>> >> >> > to
> >>> >> >> > > > 1
> >>> >> >> > > > > seem to allow these job to complete.
> >>> >> >> > > > > I think it has to be the above, since my rows are small,
> >>> with
> >>> >> just
> >>> >> >> a
> >>> >> >> > few
> >>> >> >> > > > > columns and processing them is very quick.
> >>> >> >> > > > >
> >>> >> >> > > >
> >>> >> >> > > > Excellent!
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > > >
> >>> >> >> > > > > However, there are still a couple ofw thing I don't
> >>> understand:
> >>> >> >> > > > > 1. What is the difference between setCaching and
> setBatch?
> >>> >> >> > > > >
> >>> >> >> > > >
> >>> >> >> > > > * Set the maximum number of values to return for each call
> to
> >>> >> next()
> >>> >> >> > > >
> >>> >> >> > > > VS
> >>> >> >> > > >
> >>> >> >> > > > * Set the number of rows for caching that will be passed to
> >>> >> scanners.
> >>> >> >> > > >
> >>> >> >> > > > The former is useful if you have rows with millions of
> columns
> >>> and
> >>> >> >> you
> >>> >> >> > > > could
> >>> >> >> > > > setBatch to get only 1000 of them at a time. You could call
> >>> that
> >>> >> >> > intra-row
> >>> >> >> > > > scanning.
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > > > 2. Examining the region server logs more closely than I
> did
> >>> >> >> yesterday
> >>> >> >> > I
> >>> >> >> > > > see
> >>> >> >> > > > > a log of ClosedChannelExceptions in addition to the
> expired
> >>> >> leases
> >>> >> >> > (but
> >>> >> >> > > > no
> >>> >> >> > > > > UnknownScannerException), is that expected? You can see
> an
> >>> >> excerpt
> >>> >> >> of
> >>> >> >> > the
> >>> >> >> > > > > log from one of the region servers here:
> >>> >> >> > http://pastebin.com/NLcZTzsY
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > > It means that when the server got to process that client
> >>> request
> >>> >> and
> >>> >> >> > > > started
> >>> >> >> > > > reading from the socket, the client was already gone.
> Killing
> >>> a
> >>> >> >> client
> >>> >> >> > does
> >>> >> >> > > > that (or killing a MR that scans), so does
> >>> SocketTimeoutException.
> >>> >> >> This
> >>> >> >> > > > should probably go in the book. We should also print
> something
> >>> >> nicer
> >>> >> >> :)
> >>> >> >> > > >
> >>> >> >> > > > J-D
> >>> >> >> > > >
> >>> >> >> >
> >>> >> >>
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
> >>
> >>
> >> --
> >> Numai bine,
> >> Lucian
> >>
> >
> >
> >
> > --
> > Numai bine,
> > Lucian
> >
>



-- 
Numai bine,
Lucian
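
To make the setCaching/setBatch distinction from the exchange quoted above
concrete, a sketch against the 0.90-era client API (table name and values are
placeholders): setCaching controls how many rows are shipped per scanner.next()
round trip, while setBatch caps how many columns of a single very wide row come
back in each Result.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CachingVsBatchSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    Scan scan = new Scan();
    scan.setCaching(100); // rows buffered per scanner.next() round trip
    scan.setBatch(1000);  // max columns per Result: intra-row scanning for wide rows

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}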

Re: Lease does not exist exceptions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Did you try setting the scanner caching down like I mentioned?

J-D

On Wed, Oct 26, 2011 at 8:48 AM, Lucian Iordache
<lu...@gmail.com> wrote:
> Problem solved. It was like I said, the server took more than the
> hbase.rpc.timeout to run the call and the client closed the connection.
>
> Best Regards,
> Lucian
>

Re: Lease does not exist exceptions

Posted by Lucian Iordache <lu...@gmail.com>.
Problem solved. It was as I said: the server took more than hbase.rpc.timeout
to run the call, so the client closed the connection.

Best Regards,
Lucian

On Tue, Oct 25, 2011 at 11:15 AM, Lucian Iordache <
lucian.george.iordache@gmail.com> wrote:

> Yes, I will try to see the SocketTimeoutException after putting log on
> debug, because, like it says here
> https://issues.apache.org/jira/browse/HBASE-3154 , this is logged on debug
> on the client side.
>
> Regards,
> Lucian
>
>



-- 
Numai bine,
Lucian

Re: Lease does not exist exceptions

Posted by Daniel Iancu <da...@1and1.ro>.
I would also suggest documenting that if you increase
hbase.regionserver.lease.period you must also increase hbase.rpc.timeout to an
equal or greater value, or else the client holding a valid lease will lose it
because its RPC call timed out. At least, that seems to be the problem in our
case: we went for a lease period of 3 minutes to allow slow-responding scans to
finish, but the RPC connection timed out after 1 minute (the default value),
hence that error.
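
A sketch of that constraint as a quick check, assuming the client reads the
same hbase-site.xml as the region servers (hbase.regionserver.lease.period is
really a server-side setting, so the value seen from the client is only
meaningful under that assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LeaseVsRpcTimeoutCheck {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    long leasePeriod = conf.getLong("hbase.regionserver.lease.period", 60000L);
    long rpcTimeout  = conf.getLong("hbase.rpc.timeout", 60000L);
    if (rpcTimeout < leasePeriod) {
      System.err.println("hbase.rpc.timeout (" + rpcTimeout + " ms) is lower than "
          + "hbase.regionserver.lease.period (" + leasePeriod + " ms); a slow "
          + "scanner.next() can time out client-side while its lease is still valid.");
    }
  }
}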

Daniel

On 10/26/2011 07:04 PM, Stack wrote:
> On Wed, Oct 26, 2011 at 8:59 AM, Lucian Iordache
> <lu...@gmail.com>  wrote:
>> Hello,
>>
>> I would suggest logging the exception produced by the hbase.rpc.timeout on
>> the client side on WARN, not debug like it is right now.
>>
> That makes sens.  Mind making an issue (and adding a patch if inclined)?
>
> The above seems more painful than it need be figuring what was going on.
>
> Thanks boss,
> St.Ack

-- 
Daniel Iancu
Java Developer,Web Components Romania
1&1 Internet Development srl.
18 Mircea Eliade St
Sect 1, Bucharest
RO Bucharest, 012015
www.1and1.ro
Phone:+40-031-223-9081




Re: Lease does not exist exceptions

Posted by Lucian Iordache <lu...@gmail.com>.
Ok, I will add an issue for that + a patch probably tomorrow.

Regards,
Lucian

On Wed, Oct 26, 2011 at 7:04 PM, Stack <st...@duboce.net> wrote:

> On Wed, Oct 26, 2011 at 8:59 AM, Lucian Iordache
> <lu...@gmail.com> wrote:
> > Hello,
> >
> > I would suggest logging the exception produced by the hbase.rpc.timeout
> on
> > the client side on WARN, not debug like it is right now.
> >
>
> That makes sens.  Mind making an issue (and adding a patch if inclined)?
>
> The above seems more painful than it need be figuring what was going on.
>
> Thanks boss,
> St.Ack
>

Re: Lease does not exist exceptions

Posted by Stack <st...@duboce.net>.
On Wed, Oct 26, 2011 at 8:59 AM, Lucian Iordache
<lu...@gmail.com> wrote:
> Hello,
>
> I would suggest logging the exception produced by the hbase.rpc.timeout on
> the client side on WARN, not debug like it is right now.
>

That makes sense.  Mind making an issue (and adding a patch if inclined)?

The above seems more painful than it needed to be to figure out what was going on.

Thanks boss,
St.Ack

Re: Lease does not exist exceptions

Posted by Lucian Iordache <lu...@gmail.com>.
Hello,

I would suggest logging the exception produced by the hbase.rpc.timeout on
the client side at WARN, not DEBUG as it is right now.

Regards,
Lucian

On Wed, Oct 26, 2011 at 6:51 PM, Stack <st...@duboce.net> wrote:

> What would you suggest we do to improve the messages we emit around
> here making it more clear whats going on?
>
> St.Ack
>

Re: Lease does not exist exceptions

Posted by Stack <st...@duboce.net>.
What would you suggest we do to improve the messages we emit around
here, making it clearer what's going on?

St.Ack

On Tue, Oct 25, 2011 at 1:15 AM, Lucian Iordache
<lu...@gmail.com> wrote:
> Yes, I will try to see the SocketTimeoutException after putting log on
> debug, because, like it says here
> https://issues.apache.org/jira/browse/HBASE-3154 , this is logged on debug
> on the client side.
>
> Regards,
> Lucian
>

Re: Lease does not exist exceptions

Posted by Lucian Iordache <lu...@gmail.com>.
Yes, I will try to see the SocketTimeoutException after putting log on
debug, because, like it says here
https://issues.apache.org/jira/browse/HBASE-3154 , this is logged on debug
on the client side.
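
A sketch of turning that client-side DEBUG logging on programmatically,
assuming the relevant logger lives under the org.apache.hadoop.hbase.client
package (the equivalent log4j.properties line is shown in the comment):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class ClientDebugLogging {
  public static void enable() {
    // Same effect as: log4j.logger.org.apache.hadoop.hbase.client=DEBUG
    Logger.getLogger("org.apache.hadoop.hbase.client").setLevel(Level.DEBUG);
  }
}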

Regards,
Lucian

On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> So you should see the SocketTimeoutException in your *client* logs (in
> your case, mappers), not LeaseException. At this point yes you're
> going to timeout, but if you spend so much time cycling on the server
> side then you shouldn't set a high caching configuration on your
> scanner as IO isn't your bottle neck.
>
> J-D
>
> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> <lu...@gmail.com> wrote:
> > Hi,
> >
> > The servers have been restarted (I have this configuration for more than
> a
> > month, so this is not the problem).
> > About the stack traces, they show exactly the same, a lot of
> > ClosedChannelConnections and LeaseExceptions.
> >
> > But I found something that could be the problem: hbase.rpc.timeout . This
> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> > could happen the next way:
> > - the mapper makes a scanner.next call to the region server
> > - the region servers needs more than 60 seconds to execute it (I use
> > multiple filters, and it could take a lot of time)
> > - the scan client gets the timeout and cuts the connection
> > - the region server tries to send the results to the client ==>
> > ClosedChannelConnection
> >
> > I will get a deeper look into it tomorrow. If you have other suggestions,
> > please let me know!
> >
> > Thanks,
> > Lucian
> >
> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> Did you restart the region servers after changing the config?
> >>
> >> Are you sure it's the same exception/stack trace?
> >>
> >> J-D
> >>
> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> >> <lu...@gmail.com> wrote:
> >> > Hi all,
> >> >
> >> > I have exactly the same problem that Eran had.
> >> > But there is something I don't understand: in my case, I have set the
> >> lease
> >> > time to 240000 (4 minutes). But most of the map tasks that are failing
> >> run
> >> > about 2 minutes. How is it possible to get a LeaseException if the
> task
> >> runs
> >> > less than the configured time for a lease?
> >> >
> >> > Regards,
> >> > Lucian Iordache
> >> >
> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
> >> >
> >> >> Perfect! Thanks.
> >> >>
> >> >> -eran
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <
> jdcryans@apache.org
> >> >> >wrote:
> >> >>
> >> >> > hbase.regionserver.lease.period
> >> >> >
> >> >> > Set it bigger than 60000.
> >> >> >
> >> >> > J-D
> >> >> >
> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com>
> wrote:
> >> >> > >
> >> >> > > Thanks J-D!
> >> >> > > Since my main table is expected to continue growing I guess at
> some
> >> >> point
> >> >> > > even setting the cache size to 1 will not be enough. Is there a
> way
> >> to
> >> >> > > configure the lease timeout?
> >> >> > >
> >> >> > > -eran
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
> >> jdcryans@apache.org
> >> >> > >wrote:
> >> >> > >
> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
> >> >> wrote:
> >> >> > > >
> >> >> > > > > Hi J-D,
> >> >> > > > > Thanks for the detailed explanation.
> >> >> > > > > So if I understand correctly the lease we're talking about is
> a
> >> >> > scanner
> >> >> > > > > lease and the timeout is between two scanner calls, correct?
> I
> >> >> think
> >> >> > that
> >> >> > > > > make sense because I now realize that jobs that fail (some
> jobs
> >> >> > continued
> >> >> > > > > to
> >> >> > > > > fail even after reducing the number of map tasks as Stack
> >> >> suggested)
> >> >> > use
> >> >> > > > > filters to fetch relatively few rows out of a very large
> table,
> >> so
> >> >> > they
> >> >> > > > > could be spending a lot of time on the region server scanning
> >> rows
> >> >> > until
> >> >> > > > it
> >> >> > > > > reached my setCaching value which was 1000. Setting the
> caching
> >> >> value
> >> >> > to
> >> >> > > > 1
> >> >> > > > > seem to allow these job to complete.
> >> >> > > > > I think it has to be the above, since my rows are small, with
> >> just
> >> >> a
> >> >> > few
> >> >> > > > > columns and processing them is very quick.
> >> >> > > > >
> >> >> > > >
> >> >> > > > Excellent!
> >> >> > > >
> >> >> > > >
> >> >> > > > >
>> >> >> > > > > However, there are still a couple of things I don't
> understand:
> >> >> > > > > 1. What is the difference between setCaching and setBatch?
> >> >> > > > >
> >> >> > > >
> >> >> > > > * Set the maximum number of values to return for each call to
> >> next()
> >> >> > > >
> >> >> > > > VS
> >> >> > > >
> >> >> > > > * Set the number of rows for caching that will be passed to
> >> scanners.
> >> >> > > >
> >> >> > > > The former is useful if you have rows with millions of columns
> and
> >> >> you
> >> >> > > > could
> >> >> > > > setBatch to get only 1000 of them at a time. You could call
> that
> >> >> > intra-row
> >> >> > > > scanning.
> >> >> > > >
> >> >> > > >
> >> >> > > > > 2. Examining the region server logs more closely than I did
> >> >> yesterday
> >> >> > I
> >> >> > > > see
> >> >> > > > > a lot of ClosedChannelExceptions in addition to the expired
> >> leases
> >> >> > (but
> >> >> > > > no
> >> >> > > > > UnknownScannerException), is that expected? You can see an
> >> excerpt
> >> >> of
> >> >> > the
> >> >> > > > > log from one of the region servers here:
> >> >> > http://pastebin.com/NLcZTzsY
> >> >> > > >
> >> >> > > >
> >> >> > > > It means that when the server got to process that client
> request
> >> and
> >> >> > > > started
> >> >> > > > reading from the socket, the client was already gone. Killing a
> >> >> client
> >> >> > does
> >> >> > > > that (or killing a MR that scans), so does
> SocketTimeoutException.
> >> >> This
> >> >> > > > should probably go in the book. We should also print something
> >> nicer
> >> >> :)
> >> >> > > >
> >> >> > > > J-D
> >> >> > > >
> >> >> >
> >> >>
> >> >
> >>
> >
>



-- 
All the best,
Lucian

Re: Lease does not exist exceptions

Posted by Harsh J <ha...@cloudera.com>.
Hi Igal,

I seem to have missed that mail in my search. Thanks for pointing it
out - you are right about that. I commented on the JIRA; it is a nice
improvement.

On Thu, Sep 20, 2012 at 7:52 PM, Igal Shilman <ig...@wix.com> wrote:
> Hi,
> Do you mind taking a look at HBASE-6071 ?
>
> It was submitted as a result of this mail (back at May)
> http://mail-archives.apache.org/mod_mbox/hbase-user/201205.mbox/%3CCAFebPXBq9V9BVdzRTNr-MB3a1Lz78SZj6gvP6On0b%2Bajt9StAg%40mail.gmail.com%3E
>
> I've recently submitted logs that (I think) confirms this theory.
>
> Thanks,
> Igal.
>
> On Thu, Sep 20, 2012 at 4:55 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hi Daniel,
>>
>> That sounds fine to do (easier a solution, my brain's gotten complex today
>> ha).
>>
>> We should classify the two types of error in the docs for users the
>> way you have here, to indicate what the issue is in each of the error
>> cases - UnknownScannerException and LeaseException. Mind filing a
>> JIRA? :)
>>
>> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <da...@1and1.ro>
>> wrote:
>> > Thaaank you! I was waiting for this email for months. I've read all the
>> > posts regarding lease timeouts and see that people usually have them for
>> 2
>> > reasons. One, the normal case where the client app does not process the
>> row
>> > fast enough so they get UnknownScannerException and some had the issue
>> below
>> > and get LeaseException instead.
>> >
>> > How about using a try/catch for the
>> >
>> > // Remove lease while its being processed in server; protects against
>> case
>> >       // where processing of request takes > lease expiration time.
>> >       lease = this.leases.removeLease(scannerName);
>> >
>> > and re-throw an IllegalStateException or log a warning message because a
>> > client with and active scanner but no lease does not seem to be in the
>> right
>> > state?
>> >
>> > Just an idea but you know  better.
>> > Daniel
>> >
>> > On 09/20/2012 03:42 PM, Harsh J wrote:
>> >
>> > Hi,
>> >
>> > I hit this today and got down to investigate it and one of my
>> > colleagues discovered this thread. Since I got some more clues, I
>> > thought I'll bump up this thread for good.
>> >
>> > Lucian almost got the issue here. The thing we missed thinking about
>> > is the client retry. The client of HBaseRPC seems to silently retry on
>> > timeouts. So if you apply Lucian's theory below and apply that a
>> > client retry calls next(ID, Rows) yet again, you can construct this
>> > issue:
>> >
>> > - Client calls next(ID, Rows) first time.
>> > - RS receives the handler-sent request, removes lease (to not expire
>> > it during next() call) and begins work.
>> > - RS#next hangs during work (for whatever reason we can assume - large
>> > values or locks or whatever)
>> > - Client times out after a minute, retries (due to default nature).
>> > Retry seems to be silent though?
>> > - New next(ID, Rows) call is invoked. Scanner still exists so no
>> > UnknownScanner is thrown. But when next() tries to remove lease, we
>> > get thrown LeaseException (and the client gets this immediately and
>> > dies) as the other parallel handler has the lease object already
>> > removed and held in its stuck state.
>> > - A few secs/mins later, the original next() unfreezes, adds back
>> > lease to the queue, tries to write back response, runs into
>> > ClosedChannelException as the client had already thrown its original
>> > socket away. End of clients.
>> > - Lease-period expiry later, the lease is now formally removed without
>> > any hitches.
>> >
>> > Ideally, to prevent this, the rpc.timeout must be > lease period as
>> > was pointed out. Since in that case, we'd have waited for X units more
>> > for the original next() to unblock and continue itself and not have
>> > retried. That is how this is avoided, unintentionally, but can still
>> > happen if the next() still takes very long.
>> >
>> > I haven't seen a LeaseException in any other case so far, so maybe we
>> > can improve that exception's message to indicate whats going on in
>> > simpler terms so clients can reconfigure to fix themselves?
>> >
>> > Also we could add in some measures to prevent next()-duping, as that
>> > is never bound to work given the lease-required system. Perhaps when
>> > the next() stores the removed lease, we can store it somewhere global
>> > (like ActiveLeases or summat) and deny next() duping if their
>> > requested lease is already in ActiveLeases? Just ends up giving a
>> > better message, not a solution.
>> >
>> > Hope this helps others who've run into the same issue.
>> >
>> > On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
>> > <jd...@apache.org> wrote:
>> >
>> > So you should see the SocketTimeoutException in your *client* logs (in
>> > your case, mappers), not LeaseException. At this point yes you're
>> > going to timeout, but if you spend so much time cycling on the server
>> > side then you shouldn't set a high caching configuration on your
>> > scanner as IO isn't your bottle neck.
>> >
>> > J-D
>> >
>> > On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>> > <lu...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > The servers have been restarted (I have this configuration for more than
>> a
>> > month, so this is not the problem).
>> > About the stack traces, they show exactly the same, a lot of
>> > ClosedChannelConnections and LeaseExceptions.
>> >
>> > But I found something that could be the problem: hbase.rpc.timeout . This
>> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>> > could happen the next way:
>> > - the mapper makes a scanner.next call to the region server
>> > - the region servers needs more than 60 seconds to execute it (I use
>> > multiple filters, and it could take a lot of time)
>> > - the scan client gets the timeout and cuts the connection
>> > - the region server tries to send the results to the client ==>
>> > ClosedChannelConnection
>> >
>> > I will get a deeper look into it tomorrow. If you have other suggestions,
>> > please let me know!
>> >
>> > Thanks,
>> > Lucian
>> >
>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
>> > <jd...@apache.org>wrote:
>> >
>> > Did you restart the region servers after changing the config?
>> >
>> > Are you sure it's the same exception/stack trace?
>> >
>> > J-D
>> >
>> > On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>> > <lu...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I have exactly the same problem that Eran had.
>> > But there is something I don't understand: in my case, I have set the
>> >
>> > lease
>> >
>> > time to 240000 (4 minutes). But most of the map tasks that are failing
>> >
>> > run
>> >
>> > about 2 minutes. How is it possible to get a LeaseException if the task
>> >
>> > runs
>> >
>> > less than the configured time for a lease?
>> >
>> > Regards,
>> > Lucian Iordache
>> >
>> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>> >
>> > Perfect! Thanks.
>> >
>> > -eran
>> >
>> >
>> >
>> > On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
>> >
>> > wrote:
>> >
>> > hbase.regionserver.lease.period
>> >
>> > Set it bigger than 60000.
>> >
>> > J-D
>> >
>> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>> >
>> > Thanks J-D!
>> > Since my main table is expected to continue growing I guess at some
>> >
>> > point
>> >
>> > even setting the cache size to 1 will not be enough. Is there a way
>> >
>> > to
>> >
>> > configure the lease timeout?
>> >
>> > -eran
>> >
>> >
>> >
>> > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
>> >
>> > jdcryans@apache.org
>> >
>> > wrote:
>> >
>> > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
>> >
>> > wrote:
>> >
>> > Hi J-D,
>> > Thanks for the detailed explanation.
>> > So if I understand correctly the lease we're talking about is a
>> >
>> > scanner
>> >
>> > lease and the timeout is between two scanner calls, correct? I
>> >
>> > think
>> >
>> > that
>> >
>> > make sense because I now realize that jobs that fail (some jobs
>> >
>> > continued
>> >
>> > to
>> > fail even after reducing the number of map tasks as Stack
>> >
>> > suggested)
>> >
>> > use
>> >
>> > filters to fetch relatively few rows out of a very large table,
>> >
>> > so
>> >
>> > they
>> >
>> > could be spending a lot of time on the region server scanning
>> >
>> > rows
>> >
>> > until
>> >
>> > it
>> >
>> > reached my setCaching value which was 1000. Setting the caching
>> >
>> > value
>> >
>> > to
>> >
>> > 1
>> >
>> > seem to allow these job to complete.
>> > I think it has to be the above, since my rows are small, with
>> >
>> > just
>> >
>> > a
>> >
>> > few
>> >
>> > columns and processing them is very quick.
>> >
>> > Excellent!
>> >
>> >
>> > However, there are still a couple of things I don't understand:
>> > 1. What is the difference between setCaching and setBatch?
>> >
>> > * Set the maximum number of values to return for each call to
>> >
>> > next()
>> >
>> > VS
>> >
>> > * Set the number of rows for caching that will be passed to
>> >
>> > scanners.
>> >
>> > The former is useful if you have rows with millions of columns and
>> >
>> > you
>> >
>> > could
>> > setBatch to get only 1000 of them at a time. You could call that
>> >
>> > intra-row
>> >
>> > scanning.
>> >
>> >
>> > 2. Examining the region server logs more closely than I did
>> >
>> > yesterday
>> >
>> > I
>> >
>> > see
>> >
>> > a lot of ClosedChannelExceptions in addition to the expired
>> >
>> > leases
>> >
>> > (but
>> >
>> > no
>> >
>> > UnknownScannerException), is that expected? You can see an
>> >
>> > excerpt
>> >
>> > of
>> >
>> > the
>> >
>> > log from one of the region servers here:
>> >
>> > http://pastebin.com/NLcZTzsY
>> >
>> > It means that when the server got to process that client request
>> >
>> > and
>> >
>> > started
>> > reading from the socket, the client was already gone. Killing a
>> >
>> > client
>> >
>> > does
>> >
>> > that (or killing a MR that scans), so does SocketTimeoutException.
>> >
>> > This
>> >
>> > should probably go in the book. We should also print something
>> >
>> > nicer
>> >
>> > :)
>> >
>> > J-D
>> >
>> >
>> >
>> > --
>> > Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J

Re: Lease does not exist exceptions

Posted by Igal Shilman <ig...@wix.com>.
Hi,
Do you mind taking a look at HBASE-6071?

It was submitted as a result of this mail (back in May):
http://mail-archives.apache.org/mod_mbox/hbase-user/201205.mbox/%3CCAFebPXBq9V9BVdzRTNr-MB3a1Lz78SZj6gvP6On0b%2Bajt9StAg%40mail.gmail.com%3E

I've recently submitted logs that (I think) confirm this theory.

Thanks,
Igal.

On Thu, Sep 20, 2012 at 4:55 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Daniel,
>
> That sounds fine to do (easier a solution, my brain's gotten complex today
> ha).
>
> We should classify the two types of error in the docs for users the
> way you have here, to indicate what the issue is in each of the error
> cases - UnknownScannerException and LeaseException. Mind filing a
> JIRA? :)
>
> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <da...@1and1.ro>
> wrote:
> > Thaaank you! I was waiting for this email for months. I've read all the
> > posts regarding lease timeouts and see that people usually have them for
> 2
> > reasons. One, the normal case where the client app does not process the
> row
> > fast enough so they get UnknownScannerException and some had the issue
> below
> > and get LeaseException instead.
> >
> > How about using a try/catch for the
> >
> > // Remove lease while its being processed in server; protects against
> case
> >       // where processing of request takes > lease expiration time.
> >       lease = this.leases.removeLease(scannerName);
> >
> > and re-throw an IllegalStateException or log a warning message because a
> > client with and active scanner but no lease does not seem to be in the
> right
> > state?
> >
> > Just an idea but you know  better.
> > Daniel
> >
> > On 09/20/2012 03:42 PM, Harsh J wrote:
> >
> > Hi,
> >
> > I hit this today and got down to investigate it and one of my
> > colleagues discovered this thread. Since I got some more clues, I
> > thought I'll bump up this thread for good.
> >
> > Lucian almost got the issue here. The thing we missed thinking about
> > is the client retry. The client of HBaseRPC seems to silently retry on
> > timeouts. So if you apply Lucian's theory below and apply that a
> > client retry calls next(ID, Rows) yet again, you can construct this
> > issue:
> >
> > - Client calls next(ID, Rows) first time.
> > - RS receives the handler-sent request, removes lease (to not expire
> > it during next() call) and begins work.
> > - RS#next hangs during work (for whatever reason we can assume - large
> > values or locks or whatever)
> > - Client times out after a minute, retries (due to default nature).
> > Retry seems to be silent though?
> > - New next(ID, Rows) call is invoked. Scanner still exists so no
> > UnknownScanner is thrown. But when next() tries to remove lease, we
> > get thrown LeaseException (and the client gets this immediately and
> > dies) as the other parallel handler has the lease object already
> > removed and held in its stuck state.
> > - A few secs/mins later, the original next() unfreezes, adds back
> > lease to the queue, tries to write back response, runs into
> > ClosedChannelException as the client had already thrown its original
> > socket away. End of clients.
> > - Lease-period expiry later, the lease is now formally removed without
> > any hitches.
> >
> > Ideally, to prevent this, the rpc.timeout must be > lease period as
> > was pointed out. Since in that case, we'd have waited for X units more
> > for the original next() to unblock and continue itself and not have
> > retried. That is how this is avoided, unintentionally, but can still
> > happen if the next() still takes very long.
> >
> > I haven't seen a LeaseException in any other case so far, so maybe we
> > can improve that exception's message to indicate whats going on in
> > simpler terms so clients can reconfigure to fix themselves?
> >
> > Also we could add in some measures to prevent next()-duping, as that
> > is never bound to work given the lease-required system. Perhaps when
> > the next() stores the removed lease, we can store it somewhere global
> > (like ActiveLeases or summat) and deny next() duping if their
> > requested lease is already in ActiveLeases? Just ends up giving a
> > better message, not a solution.
> >
> > Hope this helps others who've run into the same issue.
> >
> > On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
> > <jd...@apache.org> wrote:
> >
> > So you should see the SocketTimeoutException in your *client* logs (in
> > your case, mappers), not LeaseException. At this point yes you're
> > going to timeout, but if you spend so much time cycling on the server
> > side then you shouldn't set a high caching configuration on your
> > scanner as IO isn't your bottle neck.
> >
> > J-D
> >
> > On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> > <lu...@gmail.com> wrote:
> >
> > Hi,
> >
> > The servers have been restarted (I have this configuration for more than
> a
> > month, so this is not the problem).
> > About the stack traces, they show exactly the same, a lot of
> > ClosedChannelConnections and LeaseExceptions.
> >
> > But I found something that could be the problem: hbase.rpc.timeout . This
> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> > could happen the next way:
> > - the mapper makes a scanner.next call to the region server
> > - the region servers needs more than 60 seconds to execute it (I use
> > multiple filters, and it could take a lot of time)
> > - the scan client gets the timeout and cuts the connection
> > - the region server tries to send the results to the client ==>
> > ClosedChannelConnection
> >
> > I will get a deeper look into it tomorrow. If you have other suggestions,
> > please let me know!
> >
> > Thanks,
> > Lucian
> >
> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
> > <jd...@apache.org>wrote:
> >
> > Did you restart the region servers after changing the config?
> >
> > Are you sure it's the same exception/stack trace?
> >
> > J-D
> >
> > On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> > <lu...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I have exactly the same problem that Eran had.
> > But there is something I don't understand: in my case, I have set the
> >
> > lease
> >
> > time to 240000 (4 minutes). But most of the map tasks that are failing
> >
> > run
> >
> > about 2 minutes. How is it possible to get a LeaseException if the task
> >
> > runs
> >
> > less than the configured time for a lease?
> >
> > Regards,
> > Lucian Iordache
> >
> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
> >
> > Perfect! Thanks.
> >
> > -eran
> >
> >
> >
> > On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
> >
> > wrote:
> >
> > hbase.regionserver.lease.period
> >
> > Set it bigger than 60000.
> >
> > J-D
> >
> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
> >
> > Thanks J-D!
> > Since my main table is expected to continue growing I guess at some
> >
> > point
> >
> > even setting the cache size to 1 will not be enough. Is there a way
> >
> > to
> >
> > configure the lease timeout?
> >
> > -eran
> >
> >
> >
> > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
> >
> > jdcryans@apache.org
> >
> > wrote:
> >
> > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
> >
> > wrote:
> >
> > Hi J-D,
> > Thanks for the detailed explanation.
> > So if I understand correctly the lease we're talking about is a
> >
> > scanner
> >
> > lease and the timeout is between two scanner calls, correct? I
> >
> > think
> >
> > that
> >
> > make sense because I now realize that jobs that fail (some jobs
> >
> > continued
> >
> > to
> > fail even after reducing the number of map tasks as Stack
> >
> > suggested)
> >
> > use
> >
> > filters to fetch relatively few rows out of a very large table,
> >
> > so
> >
> > they
> >
> > could be spending a lot of time on the region server scanning
> >
> > rows
> >
> > until
> >
> > it
> >
> > reached my setCaching value which was 1000. Setting the caching
> >
> > value
> >
> > to
> >
> > 1
> >
> > seem to allow these job to complete.
> > I think it has to be the above, since my rows are small, with
> >
> > just
> >
> > a
> >
> > few
> >
> > columns and processing them is very quick.
> >
> > Excellent!
> >
> >
> > However, there are still a couple of things I don't understand:
> > 1. What is the difference between setCaching and setBatch?
> >
> > * Set the maximum number of values to return for each call to
> >
> > next()
> >
> > VS
> >
> > * Set the number of rows for caching that will be passed to
> >
> > scanners.
> >
> > The former is useful if you have rows with millions of columns and
> >
> > you
> >
> > could
> > setBatch to get only 1000 of them at a time. You could call that
> >
> > intra-row
> >
> > scanning.
> >
> >
> > 2. Examining the region server logs more closely than I did
> >
> > yesterday
> >
> > I
> >
> > see
> >
> > a lot of ClosedChannelExceptions in addition to the expired
> >
> > leases
> >
> > (but
> >
> > no
> >
> > UnknownScannerException), is that expected? You can see an
> >
> > excerpt
> >
> > of
> >
> > the
> >
> > log from one of the region servers here:
> >
> > http://pastebin.com/NLcZTzsY
> >
> > It means that when the server got to process that client request
> >
> > and
> >
> > started
> > reading from the socket, the client was already gone. Killing a
> >
> > client
> >
> > does
> >
> > that (or killing a MR that scans), so does SocketTimeoutException.
> >
> > This
> >
> > should probably go in the book. We should also print something
> >
> > nicer
> >
> > :)
> >
> > J-D
> >
> >
> >
> > --
> > Harsh J
> >
> >
>
>
>
> --
> Harsh J
>

Re: Lease does not exist exceptions

Posted by Daniel Iancu <da...@1and1.ro>.
Here it is:

https://issues.apache.org/jira/browse/HBASE-6856


On 09/21/2012 08:42 PM, Harsh J wrote:
> Daniel,
>
> Nice follow up! We could add some notes around these to the doc as
> well. Please do post back a JIRA link once you've filed it.
>
> On Thu, Sep 20, 2012 at 8:15 PM, Daniel Iancu <da...@1and1.ro> wrote:
>> Hi Harsh
>>
>> I've forget to mention that LE happens in the context of a slow internal
>> scanner. So maybe the problem is elsewhere (bad schema design, hardware
>> issue etc) and this is just a consequence (or symptom)  but still it is a
>> problem. In our case the scan was very slow when we were using filters (even
>> the standard ones). When we dropped the filters and transfered the entire
>> row (I know it is not recommended) we haven't see this exception anymore and
>> the overall performance of MR jobs improved since any LE was crashing the
>> task attempt.
>>
>> OK, I'll a open a JIRA for this.
>>
>> Regards,
>> Daniel
>>
>>
>>
>>
>>
>> On 09/20/2012 04:55 PM, Harsh J wrote:
>>> Hi Daniel,
>>>
>>> That sounds fine to do (easier a solution, my brain's gotten complex today
>>> ha).
>>>
>>> We should classify the two types of error in the docs for users the
>>> way you have here, to indicate what the issue is in each of the error
>>> cases - UnknownScannerException and LeaseException. Mind filing a
>>> JIRA? :)
>>>
>>> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <da...@1and1.ro>
>>> wrote:
>>>> Thaaank you! I was waiting for this email for months. I've read all the
>>>> posts regarding lease timeouts and see that people usually have them for
>>>> 2
>>>> reasons. One, the normal case where the client app does not process the
>>>> row
>>>> fast enough so they get UnknownScannerException and some had the issue
>>>> below
>>>> and get LeaseException instead.
>>>>
>>>> How about using a try/catch for the
>>>>
>>>> // Remove lease while its being processed in server; protects against
>>>> case
>>>>         // where processing of request takes > lease expiration time.
>>>>         lease = this.leases.removeLease(scannerName);
>>>>
>>>> and re-throw an IllegalStateException or log a warning message because a
>>>> client with and active scanner but no lease does not seem to be in the
>>>> right
>>>> state?
>>>>
>>>> Just an idea but you know  better.
>>>> Daniel
>>>>
>>>> On 09/20/2012 03:42 PM, Harsh J wrote:
>>>>
>>>> Hi,
>>>>
>>>> I hit this today and got down to investigate it and one of my
>>>> colleagues discovered this thread. Since I got some more clues, I
>>>> thought I'll bump up this thread for good.
>>>>
>>>> Lucian almost got the issue here. The thing we missed thinking about
>>>> is the client retry. The client of HBaseRPC seems to silently retry on
>>>> timeouts. So if you apply Lucian's theory below and apply that a
>>>> client retry calls next(ID, Rows) yet again, you can construct this
>>>> issue:
>>>>
>>>> - Client calls next(ID, Rows) first time.
>>>> - RS receives the handler-sent request, removes lease (to not expire
>>>> it during next() call) and begins work.
>>>> - RS#next hangs during work (for whatever reason we can assume - large
>>>> values or locks or whatever)
>>>> - Client times out after a minute, retries (due to default nature).
>>>> Retry seems to be silent though?
>>>> - New next(ID, Rows) call is invoked. Scanner still exists so no
>>>> UnknownScanner is thrown. But when next() tries to remove lease, we
>>>> get thrown LeaseException (and the client gets this immediately and
>>>> dies) as the other parallel handler has the lease object already
>>>> removed and held in its stuck state.
>>>> - A few secs/mins later, the original next() unfreezes, adds back
>>>> lease to the queue, tries to write back response, runs into
>>>> ClosedChannelException as the client had already thrown its original
>>>> socket away. End of clients.
>>>> - Lease-period expiry later, the lease is now formally removed without
>>>> any hitches.
>>>>
>>>> Ideally, to prevent this, the rpc.timeout must be > lease period as
>>>> was pointed out. Since in that case, we'd have waited for X units more
>>>> for the original next() to unblock and continue itself and not have
>>>> retried. That is how this is avoided, unintentionally, but can still
>>>> happen if the next() still takes very long.
>>>>
>>>> I haven't seen a LeaseException in any other case so far, so maybe we
>>>> can improve that exception's message to indicate whats going on in
>>>> simpler terms so clients can reconfigure to fix themselves?
>>>>
>>>> Also we could add in some measures to prevent next()-duping, as that
>>>> is never bound to work given the lease-required system. Perhaps when
>>>> the next() stores the removed lease, we can store it somewhere global
>>>> (like ActiveLeases or summat) and deny next() duping if their
>>>> requested lease is already in ActiveLeases? Just ends up giving a
>>>> better message, not a solution.
>>>>
>>>> Hope this helps others who've run into the same issue.
>>>>
>>>> On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
>>>> <jd...@apache.org> wrote:
>>>>
>>>> So you should see the SocketTimeoutException in your *client* logs (in
>>>> your case, mappers), not LeaseException. At this point yes you're
>>>> going to timeout, but if you spend so much time cycling on the server
>>>> side then you shouldn't set a high caching configuration on your
>>>> scanner as IO isn't your bottle neck.
>>>>
>>>> J-D
>>>>
>>>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>>>> <lu...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> The servers have been restarted (I have this configuration for more than
>>>> a
>>>> month, so this is not the problem).
>>>> About the stack traces, they show exactly the same, a lot of
>>>> ClosedChannelConnections and LeaseExceptions.
>>>>
>>>> But I found something that could be the problem: hbase.rpc.timeout . This
>>>> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>>>> could happen the next way:
>>>> - the mapper makes a scanner.next call to the region server
>>>> - the region servers needs more than 60 seconds to execute it (I use
>>>> multiple filters, and it could take a lot of time)
>>>> - the scan client gets the timeout and cuts the connection
>>>> - the region server tries to send the results to the client ==>
>>>> ClosedChannelConnection
>>>>
>>>> I will get a deeper look into it tomorrow. If you have other suggestions,
>>>> please let me know!
>>>>
>>>> Thanks,
>>>> Lucian
>>>>
>>>> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
>>>> <jd...@apache.org>wrote:
>>>>
>>>> Did you restart the region servers after changing the config?
>>>>
>>>> Are you sure it's the same exception/stack trace?
>>>>
>>>> J-D
>>>>
>>>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>>>> <lu...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I have exactly the same problem that Eran had.
>>>> But there is something I don't understand: in my case, I have set the
>>>>
>>>> lease
>>>>
>>>> time to 240000 (4 minutes). But most of the map tasks that are failing
>>>>
>>>> run
>>>>
>>>> about 2 minutes. How is it possible to get a LeaseException if the task
>>>>
>>>> runs
>>>>
>>>> less than the configured time for a lease?
>>>>
>>>> Regards,
>>>> Lucian Iordache
>>>>
>>>> On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>>>>
>>>> Perfect! Thanks.
>>>>
>>>> -eran
>>>>
>>>>
>>>>
>>>> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
>>>>
>>>> wrote:
>>>>
>>>> hbase.regionserver.lease.period
>>>>
>>>> Set it bigger than 60000.
>>>>
>>>> J-D
>>>>
>>>> On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>>>>
>>>> Thanks J-D!
>>>> Since my main table is expected to continue growing I guess at some
>>>>
>>>> point
>>>>
>>>> even setting the cache size to 1 will not be enough. Is there a way
>>>>
>>>> to
>>>>
>>>> configure the lease timeout?
>>>>
>>>> -eran
>>>>
>>>>
>>>>
>>>> On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
>>>>
>>>> jdcryans@apache.org
>>>>
>>>> wrote:
>>>>
>>>> On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
>>>>
>>>> wrote:
>>>>
>>>> Hi J-D,
>>>> Thanks for the detailed explanation.
>>>> So if I understand correctly the lease we're talking about is a
>>>>
>>>> scanner
>>>>
>>>> lease and the timeout is between two scanner calls, correct? I
>>>>
>>>> think
>>>>
>>>> that
>>>>
>>>> make sense because I now realize that jobs that fail (some jobs
>>>>
>>>> continued
>>>>
>>>> to
>>>> fail even after reducing the number of map tasks as Stack
>>>>
>>>> suggested)
>>>>
>>>> use
>>>>
>>>> filters to fetch relatively few rows out of a very large table,
>>>>
>>>> so
>>>>
>>>> they
>>>>
>>>> could be spending a lot of time on the region server scanning
>>>>
>>>> rows
>>>>
>>>> until
>>>>
>>>> it
>>>>
>>>> reached my setCaching value which was 1000. Setting the caching
>>>>
>>>> value
>>>>
>>>> to
>>>>
>>>> 1
>>>>
>>>> seem to allow these job to complete.
>>>> I think it has to be the above, since my rows are small, with
>>>>
>>>> just
>>>>
>>>> a
>>>>
>>>> few
>>>>
>>>> columns and processing them is very quick.
>>>>
>>>> Excellent!
>>>>
>>>>
>>> However, there are still a couple of things I don't understand:
>>>> 1. What is the difference between setCaching and setBatch?
>>>>
>>>> * Set the maximum number of values to return for each call to
>>>>
>>>> next()
>>>>
>>>> VS
>>>>
>>>> * Set the number of rows for caching that will be passed to
>>>>
>>>> scanners.
>>>>
>>>> The former is useful if you have rows with millions of columns and
>>>>
>>>> you
>>>>
>>>> could
>>>> setBatch to get only 1000 of them at a time. You could call that
>>>>
>>>> intra-row
>>>>
>>>> scanning.
>>>>
>>>>
>>>> 2. Examining the region server logs more closely than I did
>>>>
>>>> yesterday
>>>>
>>>> I
>>>>
>>>> see
>>>>
>>> a lot of ClosedChannelExceptions in addition to the expired
>>>>
>>>> leases
>>>>
>>>> (but
>>>>
>>>> no
>>>>
>>>> UnknownScannerException), is that expected? You can see an
>>>>
>>>> excerpt
>>>>
>>>> of
>>>>
>>>> the
>>>>
>>>> log from one of the region servers here:
>>>>
>>>> http://pastebin.com/NLcZTzsY
>>>>
>>>> It means that when the server got to process that client request
>>>>
>>>> and
>>>>
>>>> started
>>>> reading from the socket, the client was already gone. Killing a
>>>>
>>>> client
>>>>
>>>> does
>>>>
>>>> that (or killing a MR that scans), so does SocketTimeoutException.
>>>>
>>>> This
>>>>
>>>> should probably go in the book. We should also print something
>>>>
>>>> nicer
>>>>
>>>> :)
>>>>
>>>> J-D
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>>
>>>
>>> --
>>> Harsh J
>>
>
>
> --
> Harsh J


Re: Lease does not exist exceptions

Posted by Harsh J <ha...@cloudera.com>.
Daniel,

Nice follow-up! We could add some notes about these cases to the docs as
well. Please do post back a JIRA link once you've filed it.

On Thu, Sep 20, 2012 at 8:15 PM, Daniel Iancu <da...@1and1.ro> wrote:
> Hi Harsh
>
> I've forget to mention that LE happens in the context of a slow internal
> scanner. So maybe the problem is elsewhere (bad schema design, hardware
> issue etc) and this is just a consequence (or symptom)  but still it is a
> problem. In our case the scan was very slow when we were using filters (even
> the standard ones). When we dropped the filters and transfered the entire
> row (I know it is not recommended) we haven't see this exception anymore and
> the overall performance of MR jobs improved since any LE was crashing the
> task attempt.
>
> OK, I'll a open a JIRA for this.
>
> Regards,
> Daniel
>
>
>
>
>
> On 09/20/2012 04:55 PM, Harsh J wrote:
>>
>> Hi Daniel,
>>
>> That sounds fine to do (easier a solution, my brain's gotten complex today
>> ha).
>>
>> We should classify the two types of error in the docs for users the
>> way you have here, to indicate what the issue is in each of the error
>> cases - UnknownScannerException and LeaseException. Mind filing a
>> JIRA? :)
>>
>> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <da...@1and1.ro>
>> wrote:
>>>
>>> Thaaank you! I was waiting for this email for months. I've read all the
>>> posts regarding lease timeouts and see that people usually have them for
>>> 2
>>> reasons. One, the normal case where the client app does not process the
>>> row
>>> fast enough so they get UnknownScannerException and some had the issue
>>> below
>>> and get LeaseException instead.
>>>
>>> How about using a try/catch for the
>>>
>>> // Remove lease while its being processed in server; protects against
>>> case
>>>        // where processing of request takes > lease expiration time.
>>>        lease = this.leases.removeLease(scannerName);
>>>
>>> and re-throw an IllegalStateException or log a warning message because a
>>> client with and active scanner but no lease does not seem to be in the
>>> right
>>> state?
>>>
>>> Just an idea but you know  better.
>>> Daniel
>>>
>>> On 09/20/2012 03:42 PM, Harsh J wrote:
>>>
>>> Hi,
>>>
>>> I hit this today and got down to investigate it and one of my
>>> colleagues discovered this thread. Since I got some more clues, I
>>> thought I'll bump up this thread for good.
>>>
>>> Lucian almost got the issue here. The thing we missed thinking about
>>> is the client retry. The client of HBaseRPC seems to silently retry on
>>> timeouts. So if you apply Lucian's theory below and apply that a
>>> client retry calls next(ID, Rows) yet again, you can construct this
>>> issue:
>>>
>>> - Client calls next(ID, Rows) first time.
>>> - RS receives the handler-sent request, removes lease (to not expire
>>> it during next() call) and begins work.
>>> - RS#next hangs during work (for whatever reason we can assume - large
>>> values or locks or whatever)
>>> - Client times out after a minute, retries (due to default nature).
>>> Retry seems to be silent though?
>>> - New next(ID, Rows) call is invoked. Scanner still exists so no
>>> UnknownScanner is thrown. But when next() tries to remove lease, we
>>> get thrown LeaseException (and the client gets this immediately and
>>> dies) as the other parallel handler has the lease object already
>>> removed and held in its stuck state.
>>> - A few secs/mins later, the original next() unfreezes, adds back
>>> lease to the queue, tries to write back response, runs into
>>> ClosedChannelException as the client had already thrown its original
>>> socket away. End of clients.
>>> - Lease-period expiry later, the lease is now formally removed without
>>> any hitches.
>>>
>>> Ideally, to prevent this, the rpc.timeout must be > lease period as
>>> was pointed out. Since in that case, we'd have waited for X units more
>>> for the original next() to unblock and continue itself and not have
>>> retried. That is how this is avoided, unintentionally, but can still
>>> happen if the next() still takes very long.
>>>
>>> I haven't seen a LeaseException in any other case so far, so maybe we
>>> can improve that exception's message to indicate whats going on in
>>> simpler terms so clients can reconfigure to fix themselves?
>>>
>>> Also we could add in some measures to prevent next()-duping, as that
>>> is never bound to work given the lease-required system. Perhaps when
>>> the next() stores the removed lease, we can store it somewhere global
>>> (like ActiveLeases or summat) and deny next() duping if their
>>> requested lease is already in ActiveLeases? Just ends up giving a
>>> better message, not a solution.
>>>
>>> Hope this helps others who've run into the same issue.
>>>
>>> On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
>>> <jd...@apache.org> wrote:
>>>
>>> So you should see the SocketTimeoutException in your *client* logs (in
>>> your case, mappers), not LeaseException. At this point yes you're
>>> going to timeout, but if you spend so much time cycling on the server
>>> side then you shouldn't set a high caching configuration on your
>>> scanner as IO isn't your bottle neck.
>>>
>>> J-D
>>>
>>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>>> <lu...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> The servers have been restarted (I have this configuration for more than
>>> a
>>> month, so this is not the problem).
>>> About the stack traces, they show exactly the same, a lot of
>>> ClosedChannelConnections and LeaseExceptions.
>>>
>>> But I found something that could be the problem: hbase.rpc.timeout . This
>>> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>>> could happen the next way:
>>> - the mapper makes a scanner.next call to the region server
>>> - the region servers needs more than 60 seconds to execute it (I use
>>> multiple filters, and it could take a lot of time)
>>> - the scan client gets the timeout and cuts the connection
>>> - the region server tries to send the results to the client ==>
>>> ClosedChannelConnection
>>>
>>> I will get a deeper look into it tomorrow. If you have other suggestions,
>>> please let me know!
>>>
>>> Thanks,
>>> Lucian
>>>
>>> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
>>> <jd...@apache.org>wrote:
>>>
>>> Did you restart the region servers after changing the config?
>>>
>>> Are you sure it's the same exception/stack trace?
>>>
>>> J-D
>>>
>>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>>> <lu...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I have exactly the same problem that Eran had.
>>> But there is something I don't understand: in my case, I have set the
>>>
>>> lease
>>>
>>> time to 240000 (4 minutes). But most of the map tasks that are failing
>>>
>>> run
>>>
>>> about 2 minutes. How is it possible to get a LeaseException if the task
>>>
>>> runs
>>>
>>> less than the configured time for a lease?
>>>
>>> Regards,
>>> Lucian Iordache
>>>
>>> On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>>>
>>> Perfect! Thanks.
>>>
>>> -eran
>>>
>>>
>>>
>>> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
>>>
>>> wrote:
>>>
>>> hbase.regionserver.lease.period
>>>
>>> Set it bigger than 60000.
>>>
>>> J-D
>>>
>>> On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>>>
>>> Thanks J-D!
>>> Since my main table is expected to continue growing I guess at some
>>>
>>> point
>>>
>>> even setting the cache size to 1 will not be enough. Is there a way
>>>
>>> to
>>>
>>> configure the lease timeout?
>>>
>>> -eran
>>>
>>>
>>>
>>> On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
>>>
>>> jdcryans@apache.org
>>>
>>> wrote:
>>>
>>> On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
>>>
>>> wrote:
>>>
>>> Hi J-D,
>>> Thanks for the detailed explanation.
>>> So if I understand correctly the lease we're talking about is a
>>>
>>> scanner
>>>
>>> lease and the timeout is between two scanner calls, correct? I
>>>
>>> think
>>>
>>> that
>>>
>>> make sense because I now realize that jobs that fail (some jobs
>>>
>>> continued
>>>
>>> to
>>> fail even after reducing the number of map tasks as Stack
>>>
>>> suggested)
>>>
>>> use
>>>
>>> filters to fetch relatively few rows out of a very large table,
>>>
>>> so
>>>
>>> they
>>>
>>> could be spending a lot of time on the region server scanning
>>>
>>> rows
>>>
>>> until
>>>
>>> it
>>>
>>> reached my setCaching value which was 1000. Setting the caching
>>>
>>> value
>>>
>>> to
>>>
>>> 1
>>>
>>> seem to allow these job to complete.
>>> I think it has to be the above, since my rows are small, with
>>>
>>> just
>>>
>>> a
>>>
>>> few
>>>
>>> columns and processing them is very quick.
>>>
>>> Excellent!
>>>
>>>
>>> However, there are still a couple of things I don't understand:
>>> 1. What is the difference between setCaching and setBatch?
>>>
>>> * Set the maximum number of values to return for each call to
>>>
>>> next()
>>>
>>> VS
>>>
>>> * Set the number of rows for caching that will be passed to
>>>
>>> scanners.
>>>
>>> The former is useful if you have rows with millions of columns and
>>>
>>> you
>>>
>>> could
>>> setBatch to get only 1000 of them at a time. You could call that
>>>
>>> intra-row
>>>
>>> scanning.
>>>
>>>
>>> 2. Examining the region server logs more closely than I did
>>>
>>> yesterday
>>>
>>> I
>>>
>>> see
>>>
>>> a lot of ClosedChannelExceptions in addition to the expired
>>>
>>> leases
>>>
>>> (but
>>>
>>> no
>>>
>>> UnknownScannerException), is that expected? You can see an
>>>
>>> excerpt
>>>
>>> of
>>>
>>> the
>>>
>>> log from one of the region servers here:
>>>
>>> http://pastebin.com/NLcZTzsY
>>>
>>> It means that when the server got to process that client request
>>>
>>> and
>>>
>>> started
>>> reading from the socket, the client was already gone. Killing a
>>>
>>> client
>>>
>>> does
>>>
>>> that (or killing a MR that scans), so does SocketTimeoutException.
>>>
>>> This
>>>
>>> should probably go in the book. We should also print something
>>>
>>> nicer
>>>
>>> :)
>>>
>>> J-D
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: Lease does not exist exceptions

Posted by Daniel Iancu <da...@1and1.ro>.
Hi Harsh

I forgot to mention that the LeaseException happens in the context of a slow
internal scanner. So maybe the problem is elsewhere (bad schema design, a
hardware issue, etc.) and this is just a consequence (or symptom), but it is
still a problem. In our case the scan was very slow when we were using
filters (even the standard ones). When we dropped the filters and
transferred the entire row (I know that is not recommended), we did not see
this exception anymore, and the overall performance of the MR jobs improved,
since every LeaseException was crashing its task attempt.

OK, I'll open a JIRA for this.

Regards,
Daniel
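
(For reference, a short sketch of the scan setup this thread converges on,
written against the HBase mapreduce API of that era; the table name, job
name and mapper are hypothetical placeholders, and the caching/batch values
are arbitrary examples rather than recommendations.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SlowFilterScanJob {

  // Hypothetical mapper; a real job would do its per-row work here.
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "slow-filter-scan");
    job.setJarByClass(SlowFilterScanJob.class);

    Scan scan = new Scan();
    // Keep caching modest when server-side filtering makes each next() call
    // slow, so a single RPC stays well under the scanner lease period.
    scan.setCaching(100);
    // Cap the number of columns returned per Result for very wide rows
    // (the "intra-row scanning" mentioned earlier in the thread).
    scan.setBatch(1000);
    // Usual advice for full-table MR scans: don't pollute the block cache.
    scan.setCacheBlocks(false);

    TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}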




On 09/20/2012 04:55 PM, Harsh J wrote:
> Hi Daniel,
>
> That sounds fine to do (easier a solution, my brain's gotten complex today ha).
>
> We should classify the two types of error in the docs for users the
> way you have here, to indicate what the issue is in each of the error
> cases - UnknownScannerException and LeaseException. Mind filing a
> JIRA? :)
>
> On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <da...@1and1.ro> wrote:
>> Thaaank you! I was waiting for this email for months. I've read all the
>> posts regarding lease timeouts and see that people usually have them for 2
>> reasons. One, the normal case where the client app does not process the row
>> fast enough so they get UnknownScannerException and some had the issue below
>> and get LeaseException instead.
>>
>> How about using a try/catch for the
>>
>> // Remove lease while its being processed in server; protects against case
>>        // where processing of request takes > lease expiration time.
>>        lease = this.leases.removeLease(scannerName);
>>
>> and re-throw an IllegalStateException or log a warning message because a
>> client with and active scanner but no lease does not seem to be in the right
>> state?
>>
>> Just an idea but you know  better.
>> Daniel
>>
>> On 09/20/2012 03:42 PM, Harsh J wrote:
>>
>> Hi,
>>
>> I hit this today and got down to investigate it and one of my
>> colleagues discovered this thread. Since I got some more clues, I
>> thought I'll bump up this thread for good.
>>
>> Lucian almost got the issue here. The thing we missed thinking about
>> is the client retry. The client of HBaseRPC seems to silently retry on
>> timeouts. So if you apply Lucian's theory below and apply that a
>> client retry calls next(ID, Rows) yet again, you can construct this
>> issue:
>>
>> - Client calls next(ID, Rows) first time.
>> - RS receives the handler-sent request, removes lease (to not expire
>> it during next() call) and begins work.
>> - RS#next hangs during work (for whatever reason we can assume - large
>> values or locks or whatever)
>> - Client times out after a minute, retries (due to default nature).
>> Retry seems to be silent though?
>> - New next(ID, Rows) call is invoked. Scanner still exists so no
>> UnknownScanner is thrown. But when next() tries to remove lease, we
>> get thrown LeaseException (and the client gets this immediately and
>> dies) as the other parallel handler has the lease object already
>> removed and held in its stuck state.
>> - A few secs/mins later, the original next() unfreezes, adds back
>> lease to the queue, tries to write back response, runs into
>> ClosedChannelException as the client had already thrown its original
>> socket away. End of clients.
>> - Lease-period expiry later, the lease is now formally removed without
>> any hitches.
>>
>> Ideally, to prevent this, the rpc.timeout must be > lease period as
>> was pointed out. Since in that case, we'd have waited for X units more
>> for the original next() to unblock and continue itself and not have
>> retried. That is how this is avoided, unintentionally, but can still
>> happen if the next() still takes very long.
>>
>> I haven't seen a LeaseException in any other case so far, so maybe we
>> can improve that exception's message to indicate whats going on in
>> simpler terms so clients can reconfigure to fix themselves?
>>
>> Also we could add in some measures to prevent next()-duping, as that
>> is never bound to work given the lease-required system. Perhaps when
>> the next() stores the removed lease, we can store it somewhere global
>> (like ActiveLeases or summat) and deny next() duping if their
>> requested lease is already in ActiveLeases? Just ends up giving a
>> better message, not a solution.
>>
>> Hope this helps others who've run into the same issue.
>>
>> On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
>> <jd...@apache.org> wrote:
>>
>> So you should see the SocketTimeoutException in your *client* logs (in
>> your case, mappers), not LeaseException. At this point yes you're
>> going to timeout, but if you spend so much time cycling on the server
>> side then you shouldn't set a high caching configuration on your
>> scanner as IO isn't your bottle neck.
>>
>> J-D
>>
>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>> <lu...@gmail.com> wrote:
>>
>> Hi,
>>
>> The servers have been restarted (I have this configuration for more than a
>> month, so this is not the problem).
>> About the stack traces, they show exactly the same, a lot of
>> ClosedChannelConnections and LeaseExceptions.
>>
>> But I found something that could be the problem: hbase.rpc.timeout . This
>> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>> could happen the next way:
>> - the mapper makes a scanner.next call to the region server
>> - the region servers needs more than 60 seconds to execute it (I use
>> multiple filters, and it could take a lot of time)
>> - the scan client gets the timeout and cuts the connection
>> - the region server tries to send the results to the client ==>
>> ClosedChannelConnection
>>
>> I will get a deeper look into it tomorrow. If you have other suggestions,
>> please let me know!
>>
>> Thanks,
>> Lucian
>>
>> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
>> <jd...@apache.org>wrote:
>>
>> Did you restart the region servers after changing the config?
>>
>> Are you sure it's the same exception/stack trace?
>>
>> J-D
>>
>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>> <lu...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I have exactly the same problem that Eran had.
>> But there is something I don't understand: in my case, I have set the
>>
>> lease
>>
>> time to 240000 (4 minutes). But most of the map tasks that are failing
>>
>> run
>>
>> about 2 minutes. How is it possible to get a LeaseException if the task
>>
>> runs
>>
>> less than the configured time for a lease?
>>
>> Regards,
>> Lucian Iordache
>>
>> On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>>
>> Perfect! Thanks.
>>
>> -eran
>>
>>
>>
>> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
>>
>> wrote:
>>
>> hbase.regionserver.lease.period
>>
>> Set it bigger than 60000.
>>
>> J-D
>>
>> On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>>
>> Thanks J-D!
>> Since my main table is expected to continue growing I guess at some
>>
>> point
>>
>> even setting the cache size to 1 will not be enough. Is there a way
>>
>> to
>>
>> configure the lease timeout?
>>
>> -eran
>>
>>
>>
>> On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
>>
>> jdcryans@apache.org
>>
>> wrote:
>>
>> On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
>>
>> wrote:
>>
>> Hi J-D,
>> Thanks for the detailed explanation.
>> So if I understand correctly the lease we're talking about is a
>>
>> scanner
>>
>> lease and the timeout is between two scanner calls, correct? I
>>
>> think
>>
>> that
>>
>> make sense because I now realize that jobs that fail (some jobs
>>
>> continued
>>
>> to
>> fail even after reducing the number of map tasks as Stack
>>
>> suggested)
>>
>> use
>>
>> filters to fetch relatively few rows out of a very large table,
>>
>> so
>>
>> they
>>
>> could be spending a lot of time on the region server scanning
>>
>> rows
>>
>> until
>>
>> it
>>
>> reached my setCaching value which was 1000. Setting the caching
>>
>> value
>>
>> to
>>
>> 1
>>
>> seem to allow these job to complete.
>> I think it has to be the above, since my rows are small, with
>>
>> just
>>
>> a
>>
>> few
>>
>> columns and processing them is very quick.
>>
>> Excellent!
>>
>>
>> However, there are still a couple of things I don't understand:
>> 1. What is the difference between setCaching and setBatch?
>>
>> * Set the maximum number of values to return for each call to
>>
>> next()
>>
>> VS
>>
>> * Set the number of rows for caching that will be passed to
>>
>> scanners.
>>
>> The former is useful if you have rows with millions of columns and
>>
>> you
>>
>> could
>> setBatch to get only 1000 of them at a time. You could call that
>>
>> intra-row
>>
>> scanning.
>>
>>
>> 2. Examining the region server logs more closely than I did
>>
>> yesterday
>>
>> I
>>
>> see
>>
>> a lot of ClosedChannelExceptions in addition to the expired
>>
>> leases
>>
>> (but
>>
>> no
>>
>> UnknownScannerException), is that expected? You can see an
>>
>> excerpt
>>
>> of
>>
>> the
>>
>> log from one of the region servers here:
>>
>> http://pastebin.com/NLcZTzsY
>>
>> It means that when the server got to process that client request
>>
>> and
>>
>> started
>> reading from the socket, the client was already gone. Killing a
>>
>> client
>>
>> does
>>
>> that (or killing a MR that scans), so does SocketTimeoutException.
>>
>> This
>>
>> should probably go in the book. We should also print something
>>
>> nicer
>>
>> :)
>>
>> J-D
>>
>>
>>
>> --
>> Harsh J
>>
>>
>
>
> --
> Harsh J


Re: Lease does not exist exceptions

Posted by Harsh J <ha...@cloudera.com>.
Hi Daniel,

That sounds fine to do (and it's the easier solution; my brain's gotten complex today, ha).

We should classify the two types of error in the docs for users, the way
you have done here, to indicate what the issue is in each of the error
cases - UnknownScannerException and LeaseException. Mind filing a
JIRA? :)
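
(A sketch of the timeout relationship discussed in this thread, with
arbitrary example values; hbase.regionserver.lease.period is read by the
region servers and hbase.rpc.timeout by the client, so in practice both
would live in the respective hbase-site.xml files rather than be set in
code like this.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LeaseTimeoutConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Scanner lease period: how long a region server keeps a scanner alive
    // between next() calls before expiring it.
    conf.setInt("hbase.regionserver.lease.period", 240000); // 4 minutes
    // Client RPC timeout: keep it larger than the lease period so a slow
    // next() is not silently retried while the original handler still holds
    // the removed lease, which is what surfaces as a LeaseException.
    conf.setInt("hbase.rpc.timeout", 300000); // 5 minutes
    System.out.println("lease.period=" + conf.get("hbase.regionserver.lease.period")
        + " rpc.timeout=" + conf.get("hbase.rpc.timeout"));
  }
}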

On Thu, Sep 20, 2012 at 7:21 PM, Daniel Iancu <da...@1and1.ro> wrote:
> Thaaank you! I was waiting for this email for months. I've read all the
> posts regarding lease timeouts and see that people usually have them for 2
> reasons. One, the normal case where the client app does not process the row
> fast enough so they get UnknownScannerException and some had the issue below
> and get LeaseException instead.
>
> How about using a try/catch for the
>
> // Remove lease while its being processed in server; protects against case
>       // where processing of request takes > lease expiration time.
>       lease = this.leases.removeLease(scannerName);
>
> and re-throw an IllegalStateException or log a warning message because a
> client with and active scanner but no lease does not seem to be in the right
> state?
>
> Just an idea but you know  better.
> Daniel
>
> On 09/20/2012 03:42 PM, Harsh J wrote:
>
> Hi,
>
> I hit this today and got down to investigate it and one of my
> colleagues discovered this thread. Since I got some more clues, I
> thought I'll bump up this thread for good.
>
> Lucian almost got the issue here. The thing we missed thinking about
> is the client retry. The client of HBaseRPC seems to silently retry on
> timeouts. So if you apply Lucian's theory below and apply that a
> client retry calls next(ID, Rows) yet again, you can construct this
> issue:
>
> - Client calls next(ID, Rows) first time.
> - RS receives the handler-sent request, removes lease (to not expire
> it during next() call) and begins work.
> - RS#next hangs during work (for whatever reason we can assume - large
> values or locks or whatever)
> - Client times out after a minute, retries (due to default nature).
> Retry seems to be silent though?
> - New next(ID, Rows) call is invoked. Scanner still exists so no
> UnknownScanner is thrown. But when next() tries to remove lease, we
> get thrown LeaseException (and the client gets this immediately and
> dies) as the other parallel handler has the lease object already
> removed and held in its stuck state.
> - A few secs/mins later, the original next() unfreezes, adds back
> lease to the queue, tries to write back response, runs into
> ClosedChannelException as the client had already thrown its original
> socket away. End of clients.
> - Lease-period expiry later, the lease is now formally removed without
> any hitches.
>
> Ideally, to prevent this, the rpc.timeout must be > lease period as
> was pointed out. Since in that case, we'd have waited for X units more
> for the original next() to unblock and continue itself and not have
> retried. That is how this is avoided, unintentionally, but can still
> happen if the next() still takes very long.
>
> I haven't seen a LeaseException in any other case so far, so maybe we
> can improve that exception's message to indicate what's going on in
> simpler terms so clients can reconfigure to fix themselves?
>
> Also we could add in some measures to prevent next()-duping, as that
> is never bound to work given the lease-required system. Perhaps when
> the next() stores the removed lease, we can store it somewhere global
> (like ActiveLeases or summat) and deny next() duping if their
> requested lease is already in ActiveLeases? Just ends up giving a
> better message, not a solution.
>
> Hope this helps others who've run into the same issue.
>
> On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
> <jd...@apache.org> wrote:
>
> So you should see the SocketTimeoutException in your *client* logs (in
> your case, mappers), not LeaseException. At this point yes you're
> going to timeout, but if you spend so much time cycling on the server
> side then you shouldn't set a high caching configuration on your
> scanner as IO isn't your bottle neck.
>
> J-D
>
> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> <lu...@gmail.com> wrote:
>
> Hi,
>
> The servers have been restarted (I have this configuration for more than a
> month, so this is not the problem).
> About the stack traces, they show exactly the same, a lot of
> ClosedChannelConnections and LeaseExceptions.
>
> But I found something that could be the problem: hbase.rpc.timeout . This
> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> could happen the next way:
> - the mapper makes a scanner.next call to the region server
> - the region servers needs more than 60 seconds to execute it (I use
> multiple filters, and it could take a lot of time)
> - the scan client gets the timeout and cuts the connection
> - the region server tries to send the results to the client ==>
> ClosedChannelConnection
>
> I will get a deeper look into it tomorrow. If you have other suggestions,
> please let me know!
>
> Thanks,
> Lucian
>
> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
> <jd...@apache.org>wrote:
>
> Did you restart the region servers after changing the config?
>
> Are you sure it's the same exception/stack trace?
>
> J-D
>
> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> <lu...@gmail.com> wrote:
>
> Hi all,
>
> I have exactly the same problem that Eran had.
> But there is something I don't understand: in my case, I have set the
> lease time to 240000 (4 minutes). But most of the map tasks that are
> failing run about 2 minutes. How is it possible to get a LeaseException
> if the task runs less than the configured time for a lease?
>
> Regards,
> Lucian Iordache
>
> On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>
> Perfect! Thanks.
>
> -eran
>
>
>
> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
> hbase.regionserver.lease.period
>
> Set it bigger than 60000.
>
> J-D
>
> On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>
> Thanks J-D!
> Since my main table is expected to continue growing I guess at some
> point even setting the cache size to 1 will not be enough. Is there a
> way to configure the lease timeout?
>
> -eran
>
>
>
> On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
> On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com> wrote:
>
> Hi J-D,
> Thanks for the detailed explanation.
> So if I understand correctly the lease we're talking about is a scanner
> lease and the timeout is between two scanner calls, correct? I think
> that makes sense because I now realize that jobs that fail (some jobs
> continued to fail even after reducing the number of map tasks as Stack
> suggested) use filters to fetch relatively few rows out of a very large
> table, so they could be spending a lot of time on the region server
> scanning rows until it reached my setCaching value which was 1000.
> Setting the caching value to 1 seems to allow these jobs to complete.
> I think it has to be the above, since my rows are small, with just a
> few columns and processing them is very quick.
>
> Excellent!
>
> However, there are still a couple of things I don't understand:
> 1. What is the difference between setCaching and setBatch?
>
> * Set the maximum number of values to return for each call to next()
>
> VS
>
> * Set the number of rows for caching that will be passed to scanners.
>
> The former is useful if you have rows with millions of columns and you
> could setBatch to get only 1000 of them at a time. You could call that
> intra-row scanning.
>
> 2. Examining the region server logs more closely than I did yesterday I
> see a lot of ClosedChannelExceptions in addition to the expired leases
> (but no UnknownScannerException), is that expected? You can see an
> excerpt of the log from one of the region servers here:
> http://pastebin.com/NLcZTzsY
>
> It means that when the server got to process that client request and
> started reading from the socket, the client was already gone. Killing a
> client does that (or killing a MR that scans), so does
> SocketTimeoutException. This should probably go in the book. We should
> also print something nicer :)
>
> J-D
>
>
>
> --
> Harsh J
>
>



-- 
Harsh J

Re: Lease does not exist exceptions

Posted by Daniel Iancu <da...@1and1.ro>.
Thaaank you! I was waiting for this email for months. I've read all the
posts regarding lease timeouts and see that people usually hit them for
two reasons. One is the normal case, where the client app does not process
the rows fast enough and gets UnknownScannerException; others had the
issue below and get LeaseException instead.

How about using a try/catch for the

// Remove lease while its being processed in server; protects against case
       // where processing of request takes > lease expiration time.
       lease = this.leases.removeLease(scannerName);

and re-throw an IllegalStateException or log a warning message, because a
client with an active scanner but no lease does not seem to be in the
right state?

Just an idea, but you know better.
Daniel
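
For what it's worth, here is a rough sketch of that idea; this is not actual
HBase code, just an illustration built around the call quoted above:

// Sketch only: wrap the lease removal so a missing lease surfaces as a
// clearer error than a bare LeaseException.
try {
  // Remove lease while its being processed in server; protects against case
  // where processing of request takes > lease expiration time.
  lease = this.leases.removeLease(scannerName);
} catch (LeaseException le) {
  // The scanner is still registered but its lease is already gone, most
  // likely because a timed-out client retried next() while the first call
  // is still running.
  LOG.warn("Scanner " + scannerName + " is open but has no lease", le);
  throw new IllegalStateException(
      "Active scanner " + scannerName + " has no lease", le);
}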

On 09/20/2012 03:42 PM, Harsh J wrote:
> Hi,
>
> I hit this today and got down to investigate it and one of my
> colleagues discovered this thread. Since I got some more clues, I
> thought I'll bump up this thread for good.
>
> Lucian almost got the issue here. The thing we missed thinking about
> is the client retry. The client of HBaseRPC seems to silently retry on
> timeouts. So if you apply Lucian's theory below and apply that a
> client retry calls next(ID, Rows) yet again, you can construct this
> issue:
>
> - Client calls next(ID, Rows) first time.
> - RS receives the handler-sent request, removes lease (to not expire
> it during next() call) and begins work.
> - RS#next hangs during work (for whatever reason we can assume - large
> values or locks or whatever)
> - Client times out after a minute, retries (due to default nature).
> Retry seems to be silent though?
> - New next(ID, Rows) call is invoked. Scanner still exists so no
> UnknownScanner is thrown. But when next() tries to remove lease, we
> get thrown LeaseException (and the client gets this immediately and
> dies) as the other parallel handler has the lease object already
> removed and held in its stuck state.
> - A few secs/mins later, the original next() unfreezes, adds back
> lease to the queue, tries to write back response, runs into
> ClosedChannelException as the client had already thrown its original
> socket away. End of clients.
> - Lease-period expiry later, the lease is now formally removed without
> any hitches.
>
> Ideally, to prevent this, the rpc.timeout must be > lease period as
> was pointed out. Since in that case, we'd have waited for X units more
> for the original next() to unblock and continue itself and not have
> retried. That is how this is avoided, unintentionally, but can still
> happen if the next() still takes very long.
>
> I haven't seen a LeaseException in any other case so far, so maybe we
> can improve that exception's message to indicate what's going on in
> simpler terms so clients can reconfigure to fix themselves?
>
> Also we could add in some measures to prevent next()-duping, as that
> is never bound to work given the lease-required system. Perhaps when
> the next() stores the removed lease, we can store it somewhere global
> (like ActiveLeases or summat) and deny next() duping if their
> requested lease is already in ActiveLeases? Just ends up giving a
> better message, not a solution.
>
> Hope this helps others who've run into the same issue.
>
> On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
> <jd...@apache.org> wrote:
>> So you should see the SocketTimeoutException in your *client* logs (in
>> your case, mappers), not LeaseException. At this point yes you're
>> going to timeout, but if you spend so much time cycling on the server
>> side then you shouldn't set a high caching configuration on your
>> scanner as IO isn't your bottle neck.
>>
>> J-D
>>
>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>> <lu...@gmail.com> wrote:
>>> Hi,
>>>
>>> The servers have been restarted (I have this configuration for more than a
>>> month, so this is not the problem).
>>> About the stack traces, they show exactly the same, a lot of
>>> ClosedChannelConnections and LeaseExceptions.
>>>
>>> But I found something that could be the problem: hbase.rpc.timeout . This
>>> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>>> could happen the next way:
>>> - the mapper makes a scanner.next call to the region server
>>> - the region servers needs more than 60 seconds to execute it (I use
>>> multiple filters, and it could take a lot of time)
>>> - the scan client gets the timeout and cuts the connection
>>> - the region server tries to send the results to the client ==>
>>> ClosedChannelConnection
>>>
>>> I will get a deeper look into it tomorrow. If you have other suggestions,
>>> please let me know!
>>>
>>> Thanks,
>>> Lucian
>>>
>>> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>>>
>>>> Did you restart the region servers after changing the config?
>>>>
>>>> Are you sure it's the same exception/stack trace?
>>>>
>>>> J-D
>>>>
>>>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>>>> <lu...@gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> I have exactly the same problem that Eran had.
>>>>> But there is something I don't understand: in my case, I have set the
>>>> lease
>>>>> time to 240000 (4 minutes). But most of the map tasks that are failing
>>>> run
>>>>> about 2 minutes. How is it possible to get a LeaseException if the task
>>>> runs
>>>>> less than the configured time for a lease?
>>>>>
>>>>> Regards,
>>>>> Lucian Iordache
>>>>>
>>>>> On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>>>>>
>>>>>> Perfect! Thanks.
>>>>>>
>>>>>> -eran
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
>>>>>>> wrote:
>>>>>>> hbase.regionserver.lease.period
>>>>>>>
>>>>>>> Set it bigger than 60000.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>>>>>>>> Thanks J-D!
>>>>>>>> Since my main table is expected to continue growing I guess at some
>>>>>> point
>>>>>>>> even setting the cache size to 1 will not be enough. Is there a way
>>>> to
>>>>>>>> configure the lease timeout?
>>>>>>>>
>>>>>>>> -eran
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
>>>> jdcryans@apache.org
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
>>>>>> wrote:
>>>>>>>>>> Hi J-D,
>>>>>>>>>> Thanks for the detailed explanation.
>>>>>>>>>> So if I understand correctly the lease we're talking about is a
>>>>>>> scanner
>>>>>>>>>> lease and the timeout is between two scanner calls, correct? I
>>>>>> think
>>>>>>> that
>>>>>>>>>> make sense because I now realize that jobs that fail (some jobs
>>>>>>> continued
>>>>>>>>>> to
>>>>>>>>>> fail even after reducing the number of map tasks as Stack
>>>>>> suggested)
>>>>>>> use
>>>>>>>>>> filters to fetch relatively few rows out of a very large table,
>>>> so
>>>>>>> they
>>>>>>>>>> could be spending a lot of time on the region server scanning
>>>> rows
>>>>>>> until
>>>>>>>>> it
>>>>>>>>>> reached my setCaching value which was 1000. Setting the caching
>>>>>> value
>>>>>>> to
>>>>>>>>> 1
>>>>>>>>>> seem to allow these job to complete.
>>>>>>>>>> I think it has to be the above, since my rows are small, with
>>>> just
>>>>>> a
>>>>>>> few
>>>>>>>>>> columns and processing them is very quick.
>>>>>>>>>>
>>>>>>>>> Excellent!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> However, there are still a couple ofw thing I don't understand:
>>>>>>>>>> 1. What is the difference between setCaching and setBatch?
>>>>>>>>>>
>>>>>>>>> * Set the maximum number of values to return for each call to
>>>> next()
>>>>>>>>> VS
>>>>>>>>>
>>>>>>>>> * Set the number of rows for caching that will be passed to
>>>> scanners.
>>>>>>>>> The former is useful if you have rows with millions of columns and
>>>>>> you
>>>>>>>>> could
>>>>>>>>> setBatch to get only 1000 of them at a time. You could call that
>>>>>>> intra-row
>>>>>>>>> scanning.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2. Examining the region server logs more closely than I did
>>>>>> yesterday
>>>>>>> I
>>>>>>>>> see
>>>>>>>>>> a log of ClosedChannelExceptions in addition to the expired
>>>> leases
>>>>>>> (but
>>>>>>>>> no
>>>>>>>>>> UnknownScannerException), is that expected? You can see an
>>>> excerpt
>>>>>> of
>>>>>>> the
>>>>>>>>>> log from one of the region servers here:
>>>>>>> http://pastebin.com/NLcZTzsY
>>>>>>>>>
>>>>>>>>> It means that when the server got to process that client request
>>>> and
>>>>>>>>> started
>>>>>>>>> reading from the socket, the client was already gone. Killing a
>>>>>> client
>>>>>>> does
>>>>>>>>> that (or killing a MR that scans), so does SocketTimeoutException.
>>>>>> This
>>>>>>>>> should probably go in the book. We should also print something
>>>> nicer
>>>>>> :)
>>>>>>>>> J-D
>>>>>>>>>
>
>
> --
> Harsh J


Re: Lease does not exist exceptions

Posted by Harsh J <ha...@cloudera.com>.
Hi,

I hit this today and got down to investigating it, and one of my
colleagues discovered this thread. Since I got some more clues, I
thought I'd bump up this thread for good.

Lucian almost got the issue here. The thing we missed thinking about
is the client retry. The HBaseRPC client seems to silently retry on
timeouts. So if you take Lucian's theory below and add that a
client retry calls next(ID, Rows) yet again, you can construct this
issue:

- Client calls next(ID, Rows) first time.
- RS receives the handler-sent request, removes lease (to not expire
it during next() call) and begins work.
- RS#next hangs during work (for whatever reason we can assume - large
values or locks or whatever)
- Client times out after a minute, retries (due to default nature).
Retry seems to be silent though?
- New next(ID, Rows) call is invoked. Scanner still exists so no
UnknownScanner is thrown. But when next() tries to remove lease, we
get thrown LeaseException (and the client gets this immediately and
dies) as the other parallel handler has the lease object already
removed and held in its stuck state.
- A few secs/mins later, the original next() unfreezes, adds back
lease to the queue, tries to write back response, runs into
ClosedChannelException as the client had already thrown its original
socket away. End of clients.
- Lease-period expiry later, the lease is now formally removed without
any hitches.

Ideally, to prevent this, the rpc.timeout must be > lease period, as
was pointed out. In that case we'd have waited X more units for the
original next() to unblock and continue, and would not have retried.
That is how this is avoided, unintentionally, but it can still happen
if the next() takes very long.
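
To make that concrete, a small client-side fragment (sketch only; the values
are examples, and both properties default to 60000 ms):

// Sketch only; property names are the ones discussed in this thread.
Configuration conf = HBaseConfiguration.create();
// The scanner lease is enforced by the region servers, so the effective
// value comes from their hbase-site.xml; it appears here only to show the
// relationship.
conf.setLong("hbase.regionserver.lease.period", 120000);
// Give the client more patience than the lease period so it does not retry
// next() while the original, still-running call holds the removed lease.
conf.setLong("hbase.rpc.timeout", 180000);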

I haven't seen a LeaseException in any other case so far, so maybe we
can improve that exception's message to indicate what's going on in
simpler terms so clients can reconfigure to fix themselves?

Also we could add in some measures to prevent next()-duping, as that
is never bound to work given the lease-required system. Perhaps when
the next() stores the removed lease, we can store it somewhere global
(like ActiveLeases or summat) and deny next() duping if their
requested lease is already in ActiveLeases? Just ends up giving a
better message, not a solution.
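
A very rough sketch of that dedupe idea (the activeLeases set is invented
here purely for illustration; nothing like it exists in the code being
discussed):

// Invented sketch: refuse a duplicate next() for a scanner whose lease is
// already held by an in-flight call, instead of throwing LeaseException.
private final Set<String> activeLeases =
    Collections.synchronizedSet(new HashSet<String>());

// inside next(), before touching the lease:
if (!activeLeases.add(scannerName)) {
  throw new IOException("Scanner " + scannerName
      + " already has a next() call in progress (client retry after timeout?)");
}
try {
  lease = this.leases.removeLease(scannerName);
  // ... do the actual scanning work ...
} finally {
  activeLeases.remove(scannerName);
}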

Hope this helps others who've run into the same issue.

On Mon, Oct 24, 2011 at 10:52 PM, Jean-Daniel Cryans
<jd...@apache.org> wrote:
> So you should see the SocketTimeoutException in your *client* logs (in
> your case, mappers), not LeaseException. At this point yes you're
> going to timeout, but if you spend so much time cycling on the server
> side then you shouldn't set a high caching configuration on your
> scanner as IO isn't your bottle neck.
>
> J-D
>
> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
> <lu...@gmail.com> wrote:
>> Hi,
>>
>> The servers have been restarted (I have this configuration for more than a
>> month, so this is not the problem).
>> About the stack traces, they show exactly the same, a lot of
>> ClosedChannelConnections and LeaseExceptions.
>>
>> But I found something that could be the problem: hbase.rpc.timeout . This
>> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>> could happen the next way:
>> - the mapper makes a scanner.next call to the region server
>> - the region servers needs more than 60 seconds to execute it (I use
>> multiple filters, and it could take a lot of time)
>> - the scan client gets the timeout and cuts the connection
>> - the region server tries to send the results to the client ==>
>> ClosedChannelConnection
>>
>> I will get a deeper look into it tomorrow. If you have other suggestions,
>> please let me know!
>>
>> Thanks,
>> Lucian
>>
>> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>>
>>> Did you restart the region servers after changing the config?
>>>
>>> Are you sure it's the same exception/stack trace?
>>>
>>> J-D
>>>
>>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>>> <lu...@gmail.com> wrote:
>>> > Hi all,
>>> >
>>> > I have exactly the same problem that Eran had.
>>> > But there is something I don't understand: in my case, I have set the
>>> lease
>>> > time to 240000 (4 minutes). But most of the map tasks that are failing
>>> run
>>> > about 2 minutes. How is it possible to get a LeaseException if the task
>>> runs
>>> > less than the configured time for a lease?
>>> >
>>> > Regards,
>>> > Lucian Iordache
>>> >
>>> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>>> >
>>> >> Perfect! Thanks.
>>> >>
>>> >> -eran
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
>>> >> >wrote:
>>> >>
>>> >> > hbase.regionserver.lease.period
>>> >> >
>>> >> > Set it bigger than 60000.
>>> >> >
>>> >> > J-D
>>> >> >
>>> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>>> >> > >
>>> >> > > Thanks J-D!
>>> >> > > Since my main table is expected to continue growing I guess at some
>>> >> point
>>> >> > > even setting the cache size to 1 will not be enough. Is there a way
>>> to
>>> >> > > configure the lease timeout?
>>> >> > >
>>> >> > > -eran
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
>>> jdcryans@apache.org
>>> >> > >wrote:
>>> >> > >
>>> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
>>> >> wrote:
>>> >> > > >
>>> >> > > > > Hi J-D,
>>> >> > > > > Thanks for the detailed explanation.
>>> >> > > > > So if I understand correctly the lease we're talking about is a
>>> >> > scanner
>>> >> > > > > lease and the timeout is between two scanner calls, correct? I
>>> >> think
>>> >> > that
>>> >> > > > > make sense because I now realize that jobs that fail (some jobs
>>> >> > continued
>>> >> > > > > to
>>> >> > > > > fail even after reducing the number of map tasks as Stack
>>> >> suggested)
>>> >> > use
>>> >> > > > > filters to fetch relatively few rows out of a very large table,
>>> so
>>> >> > they
>>> >> > > > > could be spending a lot of time on the region server scanning
>>> rows
>>> >> > until
>>> >> > > > it
>>> >> > > > > reached my setCaching value which was 1000. Setting the caching
>>> >> value
>>> >> > to
>>> >> > > > 1
>>> >> > > > > seem to allow these job to complete.
>>> >> > > > > I think it has to be the above, since my rows are small, with
>>> just
>>> >> a
>>> >> > few
>>> >> > > > > columns and processing them is very quick.
>>> >> > > > >
>>> >> > > >
>>> >> > > > Excellent!
>>> >> > > >
>>> >> > > >
>>> >> > > > >
>>> >> > > > > However, there are still a couple ofw thing I don't understand:
>>> >> > > > > 1. What is the difference between setCaching and setBatch?
>>> >> > > > >
>>> >> > > >
>>> >> > > > * Set the maximum number of values to return for each call to
>>> next()
>>> >> > > >
>>> >> > > > VS
>>> >> > > >
>>> >> > > > * Set the number of rows for caching that will be passed to
>>> scanners.
>>> >> > > >
>>> >> > > > The former is useful if you have rows with millions of columns and
>>> >> you
>>> >> > > > could
>>> >> > > > setBatch to get only 1000 of them at a time. You could call that
>>> >> > intra-row
>>> >> > > > scanning.
>>> >> > > >
>>> >> > > >
>>> >> > > > > 2. Examining the region server logs more closely than I did
>>> >> yesterday
>>> >> > I
>>> >> > > > see
>>> >> > > > > a log of ClosedChannelExceptions in addition to the expired
>>> leases
>>> >> > (but
>>> >> > > > no
>>> >> > > > > UnknownScannerException), is that expected? You can see an
>>> excerpt
>>> >> of
>>> >> > the
>>> >> > > > > log from one of the region servers here:
>>> >> > http://pastebin.com/NLcZTzsY
>>> >> > > >
>>> >> > > >
>>> >> > > > It means that when the server got to process that client request
>>> and
>>> >> > > > started
>>> >> > > > reading from the socket, the client was already gone. Killing a
>>> >> client
>>> >> > does
>>> >> > > > that (or killing a MR that scans), so does SocketTimeoutException.
>>> >> This
>>> >> > > > should probably go in the book. We should also print something
>>> nicer
>>> >> :)
>>> >> > > >
>>> >> > > > J-D
>>> >> > > >
>>> >> >
>>> >>
>>> >
>>>
>>



-- 
Harsh J

Re: Lease does not exist exceptions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
So you should see the SocketTimeoutException in your *client* logs (in
your case, the mappers), not LeaseException. At this point, yes, you're
going to time out, but if you spend so much time cycling on the server
side then you shouldn't set a high caching configuration on your
scanner, as IO isn't your bottleneck.

J-D

On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
<lu...@gmail.com> wrote:
> Hi,
>
> The servers have been restarted (I have this configuration for more than a
> month, so this is not the problem).
> About the stack traces, they show exactly the same, a lot of
> ClosedChannelConnections and LeaseExceptions.
>
> But I found something that could be the problem: hbase.rpc.timeout . This
> defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
> could happen the next way:
> - the mapper makes a scanner.next call to the region server
> - the region servers needs more than 60 seconds to execute it (I use
> multiple filters, and it could take a lot of time)
> - the scan client gets the timeout and cuts the connection
> - the region server tries to send the results to the client ==>
> ClosedChannelConnection
>
> I will get a deeper look into it tomorrow. If you have other suggestions,
> please let me know!
>
> Thanks,
> Lucian
>
> On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> Did you restart the region servers after changing the config?
>>
>> Are you sure it's the same exception/stack trace?
>>
>> J-D
>>
>> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>> <lu...@gmail.com> wrote:
>> > Hi all,
>> >
>> > I have exactly the same problem that Eran had.
>> > But there is something I don't understand: in my case, I have set the
>> lease
>> > time to 240000 (4 minutes). But most of the map tasks that are failing
>> run
>> > about 2 minutes. How is it possible to get a LeaseException if the task
>> runs
>> > less than the configured time for a lease?
>> >
>> > Regards,
>> > Lucian Iordache
>> >
>> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>> >
>> >> Perfect! Thanks.
>> >>
>> >> -eran
>> >>
>> >>
>> >>
>> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
>> >> >wrote:
>> >>
>> >> > hbase.regionserver.lease.period
>> >> >
>> >> > Set it bigger than 60000.
>> >> >
>> >> > J-D
>> >> >
>> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>> >> > >
>> >> > > Thanks J-D!
>> >> > > Since my main table is expected to continue growing I guess at some
>> >> point
>> >> > > even setting the cache size to 1 will not be enough. Is there a way
>> to
>> >> > > configure the lease timeout?
>> >> > >
>> >> > > -eran
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
>> jdcryans@apache.org
>> >> > >wrote:
>> >> > >
>> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
>> >> wrote:
>> >> > > >
>> >> > > > > Hi J-D,
>> >> > > > > Thanks for the detailed explanation.
>> >> > > > > So if I understand correctly the lease we're talking about is a
>> >> > scanner
>> >> > > > > lease and the timeout is between two scanner calls, correct? I
>> >> think
>> >> > that
>> >> > > > > make sense because I now realize that jobs that fail (some jobs
>> >> > continued
>> >> > > > > to
>> >> > > > > fail even after reducing the number of map tasks as Stack
>> >> suggested)
>> >> > use
>> >> > > > > filters to fetch relatively few rows out of a very large table,
>> so
>> >> > they
>> >> > > > > could be spending a lot of time on the region server scanning
>> rows
>> >> > until
>> >> > > > it
>> >> > > > > reached my setCaching value which was 1000. Setting the caching
>> >> value
>> >> > to
>> >> > > > 1
>> >> > > > > seem to allow these job to complete.
>> >> > > > > I think it has to be the above, since my rows are small, with
>> just
>> >> a
>> >> > few
>> >> > > > > columns and processing them is very quick.
>> >> > > > >
>> >> > > >
>> >> > > > Excellent!
>> >> > > >
>> >> > > >
>> >> > > > >
>> >> > > > > However, there are still a couple ofw thing I don't understand:
>> >> > > > > 1. What is the difference between setCaching and setBatch?
>> >> > > > >
>> >> > > >
>> >> > > > * Set the maximum number of values to return for each call to
>> next()
>> >> > > >
>> >> > > > VS
>> >> > > >
>> >> > > > * Set the number of rows for caching that will be passed to
>> scanners.
>> >> > > >
>> >> > > > The former is useful if you have rows with millions of columns and
>> >> you
>> >> > > > could
>> >> > > > setBatch to get only 1000 of them at a time. You could call that
>> >> > intra-row
>> >> > > > scanning.
>> >> > > >
>> >> > > >
>> >> > > > > 2. Examining the region server logs more closely than I did
>> >> yesterday
>> >> > I
>> >> > > > see
>> >> > > > > a log of ClosedChannelExceptions in addition to the expired
>> leases
>> >> > (but
>> >> > > > no
>> >> > > > > UnknownScannerException), is that expected? You can see an
>> excerpt
>> >> of
>> >> > the
>> >> > > > > log from one of the region servers here:
>> >> > http://pastebin.com/NLcZTzsY
>> >> > > >
>> >> > > >
>> >> > > > It means that when the server got to process that client request
>> and
>> >> > > > started
>> >> > > > reading from the socket, the client was already gone. Killing a
>> >> client
>> >> > does
>> >> > > > that (or killing a MR that scans), so does SocketTimeoutException.
>> >> This
>> >> > > > should probably go in the book. We should also print something
>> nicer
>> >> :)
>> >> > > >
>> >> > > > J-D
>> >> > > >
>> >> >
>> >>
>> >
>>
>

Re: Lease does not exist exceptions

Posted by Lucian Iordache <lu...@gmail.com>.
Hi,

The servers have been restarted (I have this configuration for more than a
month, so this is not the problem).
About the stack traces, they show exactly the same, a lot of
ClosedChannelConnections and LeaseExceptions.

But I found something that could be the problem: hbase.rpc.timeout. This
defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
could happen the following way:
- the mapper makes a scanner.next call to the region server
- the region server needs more than 60 seconds to execute it (I use
multiple filters, and it could take a lot of time)
- the scan client gets the timeout and cuts the connection
- the region server tries to send the results to the client ==>
ClosedChannelConnection

I will get a deeper look into it tomorrow. If you have other suggestions,
please let me know!

Thanks,
Lucian

On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Did you restart the region servers after changing the config?
>
> Are you sure it's the same exception/stack trace?
>
> J-D
>
> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
> <lu...@gmail.com> wrote:
> > Hi all,
> >
> > I have exactly the same problem that Eran had.
> > But there is something I don't understand: in my case, I have set the
> lease
> > time to 240000 (4 minutes). But most of the map tasks that are failing
> run
> > about 2 minutes. How is it possible to get a LeaseException if the task
> runs
> > less than the configured time for a lease?
> >
> > Regards,
> > Lucian Iordache
> >
> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
> >
> >> Perfect! Thanks.
> >>
> >> -eran
> >>
> >>
> >>
> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
> >> >wrote:
> >>
> >> > hbase.regionserver.lease.period
> >> >
> >> > Set it bigger than 60000.
> >> >
> >> > J-D
> >> >
> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
> >> > >
> >> > > Thanks J-D!
> >> > > Since my main table is expected to continue growing I guess at some
> >> point
> >> > > even setting the cache size to 1 will not be enough. Is there a way
> to
> >> > > configure the lease timeout?
> >> > >
> >> > > -eran
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <
> jdcryans@apache.org
> >> > >wrote:
> >> > >
> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
> >> wrote:
> >> > > >
> >> > > > > Hi J-D,
> >> > > > > Thanks for the detailed explanation.
> >> > > > > So if I understand correctly the lease we're talking about is a
> >> > scanner
> >> > > > > lease and the timeout is between two scanner calls, correct? I
> >> think
> >> > that
> >> > > > > make sense because I now realize that jobs that fail (some jobs
> >> > continued
> >> > > > > to
> >> > > > > fail even after reducing the number of map tasks as Stack
> >> suggested)
> >> > use
> >> > > > > filters to fetch relatively few rows out of a very large table,
> so
> >> > they
> >> > > > > could be spending a lot of time on the region server scanning
> rows
> >> > until
> >> > > > it
> >> > > > > reached my setCaching value which was 1000. Setting the caching
> >> value
> >> > to
> >> > > > 1
> >> > > > > seem to allow these job to complete.
> >> > > > > I think it has to be the above, since my rows are small, with
> just
> >> a
> >> > few
> >> > > > > columns and processing them is very quick.
> >> > > > >
> >> > > >
> >> > > > Excellent!
> >> > > >
> >> > > >
> >> > > > >
> >> > > > > However, there are still a couple ofw thing I don't understand:
> >> > > > > 1. What is the difference between setCaching and setBatch?
> >> > > > >
> >> > > >
> >> > > > * Set the maximum number of values to return for each call to
> next()
> >> > > >
> >> > > > VS
> >> > > >
> >> > > > * Set the number of rows for caching that will be passed to
> scanners.
> >> > > >
> >> > > > The former is useful if you have rows with millions of columns and
> >> you
> >> > > > could
> >> > > > setBatch to get only 1000 of them at a time. You could call that
> >> > intra-row
> >> > > > scanning.
> >> > > >
> >> > > >
> >> > > > > 2. Examining the region server logs more closely than I did
> >> yesterday
> >> > I
> >> > > > see
> >> > > > > a log of ClosedChannelExceptions in addition to the expired
> leases
> >> > (but
> >> > > > no
> >> > > > > UnknownScannerException), is that expected? You can see an
> excerpt
> >> of
> >> > the
> >> > > > > log from one of the region servers here:
> >> > http://pastebin.com/NLcZTzsY
> >> > > >
> >> > > >
> >> > > > It means that when the server got to process that client request
> and
> >> > > > started
> >> > > > reading from the socket, the client was already gone. Killing a
> >> client
> >> > does
> >> > > > that (or killing a MR that scans), so does SocketTimeoutException.
> >> This
> >> > > > should probably go in the book. We should also print something
> nicer
> >> :)
> >> > > >
> >> > > > J-D
> >> > > >
> >> >
> >>
> >
>

Re: Lease does not exist exceptions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Did you restart the region servers after changing the config?

Are you sure it's the same exception/stack trace?

J-D

On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
<lu...@gmail.com> wrote:
> Hi all,
>
> I have exactly the same problem that Eran had.
> But there is something I don't understand: in my case, I have set the lease
> time to 240000 (4 minutes). But most of the map tasks that are failing run
> about 2 minutes. How is it possible to get a LeaseException if the task runs
> less than the configured time for a lease?
>
> Regards,
> Lucian Iordache
>
> On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:
>
>> Perfect! Thanks.
>>
>> -eran
>>
>>
>>
>> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
>> >wrote:
>>
>> > hbase.regionserver.lease.period
>> >
>> > Set it bigger than 60000.
>> >
>> > J-D
>> >
>> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>> > >
>> > > Thanks J-D!
>> > > Since my main table is expected to continue growing I guess at some
>> point
>> > > even setting the cache size to 1 will not be enough. Is there a way to
>> > > configure the lease timeout?
>> > >
>> > > -eran
>> > >
>> > >
>> > >
>> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jdcryans@apache.org
>> > >wrote:
>> > >
>> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
>> wrote:
>> > > >
>> > > > > Hi J-D,
>> > > > > Thanks for the detailed explanation.
>> > > > > So if I understand correctly the lease we're talking about is a
>> > scanner
>> > > > > lease and the timeout is between two scanner calls, correct? I
>> think
>> > that
>> > > > > make sense because I now realize that jobs that fail (some jobs
>> > continued
>> > > > > to
>> > > > > fail even after reducing the number of map tasks as Stack
>> suggested)
>> > use
>> > > > > filters to fetch relatively few rows out of a very large table, so
>> > they
>> > > > > could be spending a lot of time on the region server scanning rows
>> > until
>> > > > it
>> > > > > reached my setCaching value which was 1000. Setting the caching
>> value
>> > to
>> > > > 1
>> > > > > seem to allow these job to complete.
>> > > > > I think it has to be the above, since my rows are small, with just
>> a
>> > few
>> > > > > columns and processing them is very quick.
>> > > > >
>> > > >
>> > > > Excellent!
>> > > >
>> > > >
>> > > > >
>> > > > > However, there are still a couple ofw thing I don't understand:
>> > > > > 1. What is the difference between setCaching and setBatch?
>> > > > >
>> > > >
>> > > > * Set the maximum number of values to return for each call to next()
>> > > >
>> > > > VS
>> > > >
>> > > > * Set the number of rows for caching that will be passed to scanners.
>> > > >
>> > > > The former is useful if you have rows with millions of columns and
>> you
>> > > > could
>> > > > setBatch to get only 1000 of them at a time. You could call that
>> > intra-row
>> > > > scanning.
>> > > >
>> > > >
>> > > > > 2. Examining the region server logs more closely than I did
>> yesterday
>> > I
>> > > > see
>> > > > > a log of ClosedChannelExceptions in addition to the expired leases
>> > (but
>> > > > no
>> > > > > UnknownScannerException), is that expected? You can see an excerpt
>> of
>> > the
>> > > > > log from one of the region servers here:
>> > http://pastebin.com/NLcZTzsY
>> > > >
>> > > >
>> > > > It means that when the server got to process that client request and
>> > > > started
>> > > > reading from the socket, the client was already gone. Killing a
>> client
>> > does
>> > > > that (or killing a MR that scans), so does SocketTimeoutException.
>> This
>> > > > should probably go in the book. We should also print something nicer
>> :)
>> > > >
>> > > > J-D
>> > > >
>> >
>>
>

Re: Lease does not exist exceptions

Posted by Lucian Iordache <lu...@gmail.com>.
Hi all,

I have exactly the same problem that Eran had.
But there is something I don't understand: in my case, I have set the lease
time to 240000 (4 minutes). But most of the map tasks that are failing run
about 2 minutes. How is it possible to get a LeaseException if the task runs
less than the configured time for a lease?

Regards,
Lucian Iordache

On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <er...@gigya.com> wrote:

> Perfect! Thanks.
>
> -eran
>
>
>
> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
>
> > hbase.regionserver.lease.period
> >
> > Set it bigger than 60000.
> >
> > J-D
> >
> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
> > >
> > > Thanks J-D!
> > > Since my main table is expected to continue growing I guess at some
> point
> > > even setting the cache size to 1 will not be enough. Is there a way to
> > > configure the lease timeout?
> > >
> > > -eran
> > >
> > >
> > >
> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jdcryans@apache.org
> > >wrote:
> > >
> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com>
> wrote:
> > > >
> > > > > Hi J-D,
> > > > > Thanks for the detailed explanation.
> > > > > So if I understand correctly the lease we're talking about is a
> > scanner
> > > > > lease and the timeout is between two scanner calls, correct? I
> think
> > that
> > > > > make sense because I now realize that jobs that fail (some jobs
> > continued
> > > > > to
> > > > > fail even after reducing the number of map tasks as Stack
> suggested)
> > use
> > > > > filters to fetch relatively few rows out of a very large table, so
> > they
> > > > > could be spending a lot of time on the region server scanning rows
> > until
> > > > it
> > > > > reached my setCaching value which was 1000. Setting the caching
> value
> > to
> > > > 1
> > > > > seem to allow these job to complete.
> > > > > I think it has to be the above, since my rows are small, with just
> a
> > few
> > > > > columns and processing them is very quick.
> > > > >
> > > >
> > > > Excellent!
> > > >
> > > >
> > > > >
> > > > > However, there are still a couple ofw thing I don't understand:
> > > > > 1. What is the difference between setCaching and setBatch?
> > > > >
> > > >
> > > > * Set the maximum number of values to return for each call to next()
> > > >
> > > > VS
> > > >
> > > > * Set the number of rows for caching that will be passed to scanners.
> > > >
> > > > The former is useful if you have rows with millions of columns and
> you
> > > > could
> > > > setBatch to get only 1000 of them at a time. You could call that
> > intra-row
> > > > scanning.
> > > >
> > > >
> > > > > 2. Examining the region server logs more closely than I did
> yesterday
> > I
> > > > see
> > > > > a log of ClosedChannelExceptions in addition to the expired leases
> > (but
> > > > no
> > > > > UnknownScannerException), is that expected? You can see an excerpt
> of
> > the
> > > > > log from one of the region servers here:
> > http://pastebin.com/NLcZTzsY
> > > >
> > > >
> > > > It means that when the server got to process that client request and
> > > > started
> > > > reading from the socket, the client was already gone. Killing a
> client
> > does
> > > > that (or killing a MR that scans), so does SocketTimeoutException.
> This
> > > > should probably go in the book. We should also print something nicer
> :)
> > > >
> > > > J-D
> > > >
> >
>

Re: Lease does not exist exceptions

Posted by Eran Kutner <er...@gigya.com>.
Perfect! Thanks.

-eran



On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans <jd...@apache.org>wrote:

> hbase.regionserver.lease.period
>
> Set it bigger than 60000.
>
> J-D
>
> On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
> >
> > Thanks J-D!
> > Since my main table is expected to continue growing I guess at some point
> > even setting the cache size to 1 will not be enough. Is there a way to
> > configure the lease timeout?
> >
> > -eran
> >
> >
> >
> > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com> wrote:
> > >
> > > > Hi J-D,
> > > > Thanks for the detailed explanation.
> > > > So if I understand correctly the lease we're talking about is a
> scanner
> > > > lease and the timeout is between two scanner calls, correct? I think
> that
> > > > make sense because I now realize that jobs that fail (some jobs
> continued
> > > > to
> > > > fail even after reducing the number of map tasks as Stack suggested)
> use
> > > > filters to fetch relatively few rows out of a very large table, so
> they
> > > > could be spending a lot of time on the region server scanning rows
> until
> > > it
> > > > reached my setCaching value which was 1000. Setting the caching value
> to
> > > 1
> > > > seem to allow these job to complete.
> > > > I think it has to be the above, since my rows are small, with just a
> few
> > > > columns and processing them is very quick.
> > > >
> > >
> > > Excellent!
> > >
> > >
> > > >
> > > > However, there are still a couple ofw thing I don't understand:
> > > > 1. What is the difference between setCaching and setBatch?
> > > >
> > >
> > > * Set the maximum number of values to return for each call to next()
> > >
> > > VS
> > >
> > > * Set the number of rows for caching that will be passed to scanners.
> > >
> > > The former is useful if you have rows with millions of columns and you
> > > could
> > > setBatch to get only 1000 of them at a time. You could call that
> intra-row
> > > scanning.
> > >
> > >
> > > > 2. Examining the region server logs more closely than I did yesterday
> I
> > > see
> > > > a log of ClosedChannelExceptions in addition to the expired leases
> (but
> > > no
> > > > UnknownScannerException), is that expected? You can see an excerpt of
> the
> > > > log from one of the region servers here:
> http://pastebin.com/NLcZTzsY
> > >
> > >
> > > It means that when the server got to process that client request and
> > > started
> > > reading from the socket, the client was already gone. Killing a client
> does
> > > that (or killing a MR that scans), so does SocketTimeoutException. This
> > > should probably go in the book. We should also print something nicer :)
> > >
> > > J-D
> > >
>

Re: Lease does not exist exceptions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
hbase.regionserver.lease.period

Set it bigger than 60000.

J-D
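
As the rest of the thread shows, this is a region server setting, so it is
changed in hbase-site.xml on the region servers and they need a restart to
pick it up. A small client-side fragment (sketch only) to sanity-check the
values visible on your classpath, including the hbase.rpc.timeout that also
comes up in this thread:

// Sketch only: compare the two timeouts discussed in this thread.
// Both default to 60000 ms.
Configuration conf = HBaseConfiguration.create();
long leasePeriod = conf.getLong("hbase.regionserver.lease.period", 60000);
long rpcTimeout = conf.getLong("hbase.rpc.timeout", 60000);
if (rpcTimeout <= leasePeriod) {
  System.err.println("hbase.rpc.timeout (" + rpcTimeout + "ms) is not above "
      + "hbase.regionserver.lease.period (" + leasePeriod + "ms); slow "
      + "scanner.next() calls may be retried and hit LeaseException");
}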

On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <er...@gigya.com> wrote:
>
> Thanks J-D!
> Since my main table is expected to continue growing I guess at some point
> even setting the cache size to 1 will not be enough. Is there a way to
> configure the lease timeout?
>
> -eran
>
>
>
> On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
> > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com> wrote:
> >
> > > Hi J-D,
> > > Thanks for the detailed explanation.
> > > So if I understand correctly the lease we're talking about is a scanner
> > > lease and the timeout is between two scanner calls, correct? I think that
> > > make sense because I now realize that jobs that fail (some jobs continued
> > > to
> > > fail even after reducing the number of map tasks as Stack suggested) use
> > > filters to fetch relatively few rows out of a very large table, so they
> > > could be spending a lot of time on the region server scanning rows until
> > it
> > > reached my setCaching value which was 1000. Setting the caching value to
> > 1
> > > seem to allow these job to complete.
> > > I think it has to be the above, since my rows are small, with just a few
> > > columns and processing them is very quick.
> > >
> >
> > Excellent!
> >
> >
> > >
> > > However, there are still a couple ofw thing I don't understand:
> > > 1. What is the difference between setCaching and setBatch?
> > >
> >
> > * Set the maximum number of values to return for each call to next()
> >
> > VS
> >
> > * Set the number of rows for caching that will be passed to scanners.
> >
> > The former is useful if you have rows with millions of columns and you
> > could
> > setBatch to get only 1000 of them at a time. You could call that intra-row
> > scanning.
> >
> >
> > > 2. Examining the region server logs more closely than I did yesterday I
> > see
> > > a log of ClosedChannelExceptions in addition to the expired leases (but
> > no
> > > UnknownScannerException), is that expected? You can see an excerpt of the
> > > log from one of the region servers here: http://pastebin.com/NLcZTzsY
> >
> >
> > It means that when the server got to process that client request and
> > started
> > reading from the socket, the client was already gone. Killing a client does
> > that (or killing a MR that scans), so does SocketTimeoutException. This
> > should probably go in the book. We should also print something nicer :)
> >
> > J-D
> >

Re: Lease does not exist exceptions

Posted by Eran Kutner <er...@gigya.com>.
Thanks J-D!
Since my main table is expected to continue growing I guess at some point
even setting the cache size to 1 will not be enough. Is there a way to
configure the lease timeout?

-eran



On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans <jd...@apache.org>wrote:

> On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com> wrote:
>
> > Hi J-D,
> > Thanks for the detailed explanation.
> > So if I understand correctly the lease we're talking about is a scanner
> > lease and the timeout is between two scanner calls, correct? I think that
> > make sense because I now realize that jobs that fail (some jobs continued
> > to
> > fail even after reducing the number of map tasks as Stack suggested) use
> > filters to fetch relatively few rows out of a very large table, so they
> > could be spending a lot of time on the region server scanning rows until
> it
> > reached my setCaching value which was 1000. Setting the caching value to
> 1
> > seem to allow these job to complete.
> > I think it has to be the above, since my rows are small, with just a few
> > columns and processing them is very quick.
> >
>
> Excellent!
>
>
> >
> > However, there are still a couple ofw thing I don't understand:
> > 1. What is the difference between setCaching and setBatch?
> >
>
> * Set the maximum number of values to return for each call to next()
>
> VS
>
> * Set the number of rows for caching that will be passed to scanners.
>
> The former is useful if you have rows with millions of columns and you
> could
> setBatch to get only 1000 of them at a time. You could call that intra-row
> scanning.
>
>
> > 2. Examining the region server logs more closely than I did yesterday I
> see
> > a log of ClosedChannelExceptions in addition to the expired leases (but
> no
> > UnknownScannerException), is that expected? You can see an excerpt of the
> > log from one of the region servers here: http://pastebin.com/NLcZTzsY
>
>
> It means that when the server got to process that client request and
> started
> reading from the socket, the client was already gone. Killing a client does
> that (or killing a MR that scans), so does SocketTimeoutException. This
> should probably go in the book. We should also print something nicer :)
>
> J-D
>

Re: Lease does not exist exceptions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <er...@gigya.com> wrote:

> Hi J-D,
> Thanks for the detailed explanation.
> So if I understand correctly the lease we're talking about is a scanner
> lease and the timeout is between two scanner calls, correct? I think that
> makes sense because I now realize that jobs that fail (some jobs continued
> to
> fail even after reducing the number of map tasks as Stack suggested) use
> filters to fetch relatively few rows out of a very large table, so they
> could be spending a lot of time on the region server scanning rows until it
> reached my setCaching value which was 1000. Setting the caching value to 1
> seems to allow these jobs to complete.
> I think it has to be the above, since my rows are small, with just a few
> columns and processing them is very quick.
>

Excellent!


>
> However, there are still a couple of things I don't understand:
> 1. What is the difference between setCaching and setBatch?
>

* Set the maximum number of values to return for each call to next()

VS

* Set the number of rows for caching that will be passed to scanners.

The former is useful if you have rows with millions of columns and you could
setBatch to get only 1000 of them at a time. You could call that intra-row
scanning.
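
In code the two settings look like this (a sketch; the table and family
names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSettingsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // made-up table name
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));          // made-up family name
    scan.setCaching(10);   // rows fetched per round trip to the region server
    scan.setBatch(1000);   // max values (columns) returned per call to next(),
                           // i.e. intra-row chunking for very wide rows
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // each Result here carries at most 1000 values of a single row
      }
    } finally {
      scanner.close();
    }
  }
}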


> 2. Examining the region server logs more closely than I did yesterday I see
> a lot of ClosedChannelExceptions in addition to the expired leases (but no
> UnknownScannerException), is that expected? You can see an excerpt of the
> log from one of the region servers here: http://pastebin.com/NLcZTzsY


It means that when the server got to process that client request and started
reading from the socket, the client was already gone. Killing a client does
that (or killing a MR that scans), so does SocketTimeoutException. This
should probably go in the book. We should also print something nicer :)

J-D

Re: Lease does not exist exceptions

Posted by Eran Kutner <er...@gigya.com>.
Hi J-D,
Thanks for the detailed explanation.
So if I understand correctly the lease we're talking about is a scanner
lease and the timeout is between two scanner calls, correct? I think that
makes sense because I now realize that jobs that fail (some jobs continued to
fail even after reducing the number of map tasks as Stack suggested) use
filters to fetch relatively few rows out of a very large table, so they
could be spending a lot of time on the region server scanning rows until it
reached my setCaching value which was 1000. Setting the caching value to 1
seems to allow these jobs to complete.
I think it has to be the above, since my rows are small, with just a few
columns and processing them is very quick.

However, there are still a couple of things I don't understand:
1. What is the difference between setCaching and setBatch?
2. Examining the region server logs more closely than I did yesterday I see
a lot of ClosedChannelExceptions in addition to the expired leases (but no
UnknownScannerException), is that expected? You can see an excerpt of the
log from one of the region servers here: http://pastebin.com/NLcZTzsY

-eran
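
For reference, a sketch of what such a job looks like; the table name, filter
and mapper below are stand-ins, the point is the low setCaching on a filtered
scan:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FilteredScanJob {
  public static class RowMapper
      extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      // per-row processing is quick here; the time goes into server-side filtering
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "filtered-scan");
    job.setJarByClass(FilteredScanJob.class);

    Scan scan = new Scan();
    scan.setFilter(new PrefixFilter(Bytes.toBytes("someprefix")));  // stand-in filter
    // A selective filter can keep the region server busy for a long time per
    // next() call, so keep caching low enough that each call returns within
    // the lease/RPC timeouts (1 worked in the case above).
    scan.setCaching(1);
    scan.setCacheBlocks(false);  // commonly recommended for full-table MR scans

    TableMapReduceUtil.initTableMapperJob("mytable", scan, RowMapper.class,
        ImmutableBytesWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}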



On Tue, Oct 18, 2011 at 23:57, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Actually the important setting is:
>
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCaching(int)
>
> This decides how many rows are fetched each time the client exhausts its
> local cache and goes back to the server. Reasons to have setCaching low:
>
>  - Do you have a filter on? If so it could spend some time in the region
> server trying to find all the rows
>  - Are your rows fat? It might put a lot of memory pressure in the region
> server
>  - Are you spending a lot of time on each row, like Stack was saying? This
> could also be a side effect of inserting back into HBase. The issue I hit
> recently was that I was inserting a massive table into a tiny one (in terms
> of # of regions), and I was hitting the 90 seconds sleep because of too
> many
> store files. Right there waiting that time was getting over the 60 seconds
> lease timeout.
>
> Reasons to have setCaching high:
>
>  - Lots of tiny-ish rows that you process really really fast. Basically if
> your bottleneck is just getting the rows from HBase.
>
> I found that 1000 is a good number for our rows when we process them fast,
> but that 10 is just as good if we need to spend time on each row. YMMV.
>
> With all that said, I don't know if your caching is set to anything else
> than the default of 1, so this whole discussion could be a waste.
>
>
> Anyways, here's what I do see in your case. LeaseException is a rare one,
> usually you get UnknownScannerException (could it be that you have it too?
>  Do you have a log?). Looking at HRS.next, I see that the only way to get
> this is if you race with the ScannerListener. The method does this:
>
> InternalScanner s = this.scanners.get(scannerName);
> ...
> if (s == null) throw new UnknownScannerException("Name: " + scannerName);
> ...
> lease = this.leases.removeLease(scannerName);
>
> And when a scan expires (the lease was just removed from this.leases):
>
> LOG.info("Scanner " + this.scannerName + " lease expired");
> InternalScanner s = scanners.remove(this.scannerName);
>
> Which means that your exception happens after you get the InternalScanner
> in
> next(), and before you get to this.leases.removeLease the lease expiration
> already started. If you get this all the time, there might be a bigger
> issue
> or else I would expect that you see UnknownScannerException. It could be
> due
> to locking contention, I see that there's a synchronized in removeLease in
> the leases queue, but it seems unlikely since what happens in those sync
> blocks is fast.
>
> If you do get some UnknownScannerExceptions, they will show how long you
> took before going back to the server, by saying something like "65340ms
> passed since the last invocation, timeout is currently set to 60000"
> (where 65340 is a number
> I just invented, yours will be different). After that you need to find
> where
> you are spending that time.
>
> J-D
>
> On Tue, Oct 18, 2011 at 6:39 AM, Eran Kutner <er...@gigya.com> wrote:
>
> > Hi Stack,
> > Yep, reducing the number of map tasks did resolve the problem, however
> the
> > only way I found for doing it is by changing the setting in the
> > mapred-site.xml file, which means it will affect all my jobs. Do you know
> > if
> > there is a way to limit the number of concurrent map tasks a specific job
> > may run? I know it was possible with the old JobConf class from the
> mapred
> > namespace but the new Job class doesn't have the setNumMapTasks() method.
> > Is it possible to extend the lease timeout? I'm not even sure lease on
> > what,
> > HDFS blocks? What is it by default?
> >
> > As for setBatch, what would be a good value? I didn't set it before and
> > setting it didn't seem to change anything.
> >
> > Finally to answer your question regarding the intensity of the job - yes,
> > it
> > is pretty intense, getting cpu and disk IO utilization to ~90%
> >
> > Thanks a million!
> >
> > -eran
> >
> >
> >
> > On Tue, Oct 18, 2011 at 13:06, Stack <st...@duboce.net> wrote:
> >
> > > Look back in the mailing list Eran for more detailed answers but in
> > > essence, the below usually means that the client has been away from
> > > the server too long.  This can happen for a few reasons.  If you fetch
> > > lots of rows per next on a scanner, processing the batch client side
> > > may be taking you longer than the lease timeout.  Set down the
> > > prefetch size and see if that helps (I'm talking about this:
> > >
> > >
> >
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
> > > ).
> > >  Throw in a GC on client-side or over on the server-side and it might
> > > put you over your lease timeout.  Are your mapreduce jobs heavy-duty
> > > robbing resources from the running regionservers or datanodes?  Try
> > > having them run half the mappers and see if that makes it more likely
> > > your job will complete.
> > >
> > > St.Ack
> > > P.S IIRC, J-D tripped over a cause recently but I can't find it at the
> > mo.
> >
>

Re: Lease does not exist exceptions

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Actually the important setting is:

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCaching(int)

That decides how many rows are fetched each time the client exhausts its
local cache and goes back to the server. Reasons to have setCaching low:

 - Do you have a filter on? If so it could spend some time in the region
server trying to find all the rows
 - Are your rows fat? It might put a lot of memory pressure in the region
server
 - Are you spending a lot of time on each row, like Stack was saying? This
could also be a side effect of inserting back into HBase. The issue I hit
recently was that I was inserting a massive table into a tiny one (in terms
of # of regions), and I was hitting the 90 second sleep because of too many
store files. Right there, waiting that long put me over the 60 second
lease timeout.

Reasons to have setCaching high:

 - Lots of tiny-ish rows that you process really really fast. Basically if
your bottleneck is just getting the rows from HBase.

I found that 1000 is a good number for our rows when we process them fast,
but that 10 is just as good if we need to spend time on each row. YMMV.

With all that said, I don't know if your caching is set to anything else
than the default of 1, so this whole discussion could be a waste.
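
Here is roughly what that looks like wired into a job driver, just as a
sketch (the table name, the do-nothing mapper and the numbers below are all
made up; you have to tune the caching value for your own rows):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class ScanCachingExample {

  // Placeholder mapper, only here so the job definition below compiles.
  static class MyMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value,
        Context context) {
      // Process the row here; the longer this takes, the lower setCaching
      // should be, so the client goes back to the server inside the lease.
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-caching-example");
    job.setJarByClass(ScanCachingExample.class);

    Scan scan = new Scan();
    // Rows fetched per trip to the region server: low (say 10) for heavy
    // per-row work, high (say 1000) when the scan itself is the bottleneck.
    scan.setCaching(10);
    // Full-table scans should not churn the block cache.
    scan.setCacheBlocks(false);

    TableMapReduceUtil.initTableMapperJob("my_table", scan, MyMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If I remember right, hbase.client.scanner.caching in the job configuration
does the same thing when the Scan itself does not set a caching value.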


Anyways, here's what I do see in your case. LeaseException is a rare one,
usually you get UnknownScannerException (could it be that you have it too?
 Do you have a log?). Looking at HRS.next, I see that the only way to get
this is if you race with the ScannerListener. The method does this:

InternalScanner s = this.scanners.get(scannerName);
...
if (s == null) throw new UnknownScannerException("Name: " + scannerName);
...
lease = this.leases.removeLease(scannerName);

And when a scan expires (the lease was just removed from this.leases):

LOG.info("Scanner " + this.scannerName + " lease expired");
InternalScanner s = scanners.remove(this.scannerName);

Which means that your exception happens when the lease expiration kicks in
after next() has fetched the InternalScanner but before it reaches
this.leases.removeLease. If you get this all the time there might be a
bigger issue; otherwise I would expect you to see UnknownScannerException.
It could be due to lock contention, since removeLease synchronizes on the
leases queue, but that seems unlikely because what happens in those sync
blocks is fast.

If you do get some UnknownScannerExceptions, they will show how long you
took before going back to the server, with a message like "65340ms passed
since the last invocation, timeout is currently set to 60000" (where 65340
is a number I just invented, yours will be different). After that you need
to find where you are spending that time.
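
And if it turns out that you genuinely need more than 60 seconds between
next() calls, the scanner lease period itself should be tunable. If I have
the property name right it is hbase.regionserver.lease.period (60000 ms by
default), and it is a region server setting, so it goes into hbase-site.xml
on the region servers and they need a restart; changing it only on the
client side won't help. A quick sketch to sanity-check what your deployed
config carries:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LeasePeriodCheck {
  public static void main(String[] args) {
    // Loads hbase-default.xml / hbase-site.xml from the classpath, the same
    // files the region servers read their lease period from.
    Configuration conf = HBaseConfiguration.create();
    long leasePeriodMs =
        conf.getLong("hbase.regionserver.lease.period", 60000);
    System.out.println("scanner lease period (ms): " + leasePeriodMs);
  }
}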

J-D

On Tue, Oct 18, 2011 at 6:39 AM, Eran Kutner <er...@gigya.com> wrote:

> Hi Stack,
> Yep, reducing the number of map tasks did resolve the problem, however the
> only way I found for doing it is by changing the setting in the
> mapred-site.xml file, which means it will affect all my jobs. Do you know
> if
> there is a way to limit the number of concurrent map tasks a specific job
> may run? I know it was possible with the old JobConf class from the mapred
> namespace but the new Job class doesn't have the setNumMapTasks() method.
> Is it possible to extend the lease timeout? I'm not even sure lease on
> what,
> HDFS blocks? What is it by default?
>
> As for setBatch, what would be a good value? I didn't set it before and
> setting it didn't seem to change anything.
>
> Finally to answer your question regarding the intensity of the job - yes,
> it
> is pretty intense, getting cpu and disk IO utilization to ~90%
>
> Thanks a million!
>
> -eran
>
>
>
> On Tue, Oct 18, 2011 at 13:06, Stack <st...@duboce.net> wrote:
>
> > Look back in the mailing list Eran for more detailed answers but in
> > essence, the below usually means that the client has been away from
> > the server too long.  This can happen for a few reasons.  If you fetch
> > lots of rows per next on a scanner, processing the batch client side
> > may be taking you longer than the lease timeout.  Set down the
> > prefetch size and see if that helps (I'm talking about this:
> >
> >
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
> > ).
> >  Throw in a GC on client-side or over on the server-side and it might
> > put you over your lease timeout.  Are your mapreduce jobs heavy-duty
> > robbing resources from the running regionservers or datanodes?  Try
> > having them run half the mappers and see if that makes it more likely
> > your job will complete.
> >
> > St.Ack
> > P.S IIRC, J-D tripped over a cause recently but I can't find it at the
> mo.
>

Re: Lease does not exist exceptions

Posted by Eran Kutner <er...@gigya.com>.
Hi Stack,
Yep, reducing the number of map tasks did resolve the problem, however the
only way I found for doing it is by changing the setting in the
mapred-site.xml file, which means it will affect all my jobs. Do you know if
there is a way to limit the number of concurrent map tasks a specific job
may run? I know it was possible with the old JobConf class from the mapred
namespace but the new Job class doesn't have the setNumMapTasks() method.
Is it possible to extend the lease timeout? I'm not even sure what the lease
is on (HDFS blocks?). What is it by default?

As for setBatch, what would be a good value? I didn't set it before and
setting it didn't seem to change anything.

Finally, to answer your question regarding the intensity of the job - yes,
it is pretty intense, driving CPU and disk I/O utilization to ~90%.

Thanks a million!

-eran



On Tue, Oct 18, 2011 at 13:06, Stack <st...@duboce.net> wrote:

> Look back in the mailing list Eran for more detailed answers but in
> essence, the below usually means that the client has been away from
> the server too long.  This can happen for a few reasons.  If you fetch
> lots of rows per next on a scanner, processing the batch client side
> may be taking you longer than the lease timeout.  Set down the
> prefetch size and see if that helps (I'm talking about this:
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
> ).
>  Throw in a GC on client-side or over on the server-side and it might
> put you over your lease timeout.  Are your mapreduce jobs heavy-duty
> robbing resources from the running regionservers or datanodes?  Try
> having them run half the mappers and see if that makes it more likely
> your job will complete.
>
> St.Ack
> P.S IIRC, J-D tripped over a cause recently but I can't find it at the mo.
>
> On Tue, Oct 18, 2011 at 10:28 AM, Eran Kutner <er...@gigya.com> wrote:
> > Hi,
> > I'm having a problem when running map/reduce on a table with about 500
> > regions.
> > The MR job shows this kind of exception:
> > 11/10/18 06:03:39 INFO mapred.JobClient: Task Id :
> > attempt_201110030100_0086_m_000062_0, Status : FAILED
> > org.apache.hadoop.hbase.regionserver.LeaseException:
> > org.apache.hadoop.hbase.regionserver.LeaseException: lease
> > '-334679770697295011' does not exist
> >        at
> > org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1845)
> >        at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> >        at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >
> >        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> >        at
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >        at
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >        at
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
> >        at
> >
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:83)
> >        at
> >
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:1)
> >        at
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1019)
> >        at
> >
> org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1151)
> >        at
> >
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:149)
> >        at
> >
> org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:142)
> >        at
> >
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
> >        at
> > org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> >        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >        at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:264)
> >
> > the hbase logs are full of these:
> > 2011-10-18 06:07:01,425 ERROR
> > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > org.apache.hadoop.hbase.regionserver.LeaseException: lease
> > '3475143032285946374' does not exist
> >        at
> > org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1845)
> >        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> >        at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >
> >
> > and the datanodes logs have a few (seem to be a lot less than the hbase
> > errors) of these:
> > 2011-10-18 06:16:42,550 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 10.1.104.4:50010, storageID=DS-15546166-10.1.104.4-50010-1298985607414,
> > infoPort=50075, ipcPort=50020):DataXceiver
> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> > channel to be ready for write. ch :
> > java.nio.channels.SocketChannel[connected local=/10.1.104.4:50010remote=/
> > 10.1.104.1:57232]
> >        at
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:214)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:114)
> >
> > I've increased all the relevant limits I know of (which were high to
> begin
> > with), so now I have 64K file descriptors and dfs.datanode.max.xcievers
> is
> > 8192 .
> > I've restarted everything in the cluster, to make sure all the processes
> > picked up the new configurations, but I still get those errors. They always
> > begin when the map phase is around 12-14% and eventually the job fails at
> > ~50%
> > Running random scans against the same  hbase table while the job is
> running
> > seems to work fine.
> >
> > I'm using hadoop 0.20.2+923.97-1 from CDH3 and hbase 0.90.4 compiled from
> > the branch code a while ago.
> >
> > Any other setting I'm missing or other ideas of what can be causing it?
> >
> > Thanks.
> >
> > -eran
> >
>

Re: Lease does not exist exceptions

Posted by Stack <st...@duboce.net>.
Look back in the mailing list Eran for more detailed answers but in
essence, the below usually means that the client has been away from
the server too long.  This can happen for a few reasons.  If you fetch
lots of rows per next on a scanner, processing the batch client side
may be taking you longer than the lease timeout.  Set down the
prefetch size and see if that helps (I'm talking about this:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)).
 Throw in a GC on client-side or over on the server-side and it might
put you over your lease timeout.  Are your mapreduce jobs heavy-duty
robbing resources from the running regionservers or datanodes?  Try
having them run half the mappers and see if that makes it more likely
your job will complete.
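
One thing worth keeping straight, since the two knobs get mixed up easily:
setBatch caps how many columns come back per Result, so on narrow rows it
changes nothing, while setCaching caps how many rows come back per trip to
the server, and the latter is the one that interacts with the scanner
lease. A tiny sketch with made-up numbers (the class name is just for the
example):

import org.apache.hadoop.hbase.client.Scan;

public class ScanKnobs {
  public static Scan exampleScan() {
    Scan scan = new Scan();
    // Rows returned per round trip; the client only talks to the server
    // again (renewing its lease) once these cached rows are consumed.
    scan.setCaching(100);
    // Columns returned per Result; only matters for very wide rows.
    scan.setBatch(1000);
    return scan;
  }
}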

St.Ack
P.S IIRC, J-D tripped over a cause recently but I can't find it at the mo.

On Tue, Oct 18, 2011 at 10:28 AM, Eran Kutner <er...@gigya.com> wrote:
> Hi,
> I'm having a problem when running map/reduce on a table with about 500
> regions.
> The MR job shows this kind of exception:
> 11/10/18 06:03:39 INFO mapred.JobClient: Task Id :
> attempt_201110030100_0086_m_000062_0, Status : FAILED
> org.apache.hadoop.hbase.regionserver.LeaseException:
> org.apache.hadoop.hbase.regionserver.LeaseException: lease
> '-334679770697295011' does not exist
>        at
> org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1845)
>        at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>        at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
>        at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:83)
>        at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:1)
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1019)
>        at
> org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1151)
>        at
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:149)
>        at
> org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:142)
>        at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
>        at
> org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>        at org.apache.hadoop.mapred.Child.main(Child.java:264)
>
> the hbase logs are full of these:
> 2011-10-18 06:07:01,425 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.regionserver.LeaseException: lease
> '3475143032285946374' does not exist
>        at
> org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1845)
>        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>
> and the datanodes logs have a few (seem to be a lot less than the hbase
> errors) of these:
> 2011-10-18 06:16:42,550 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.104.4:50010, storageID=DS-15546166-10.1.104.4-50010-1298985607414,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.1.104.4:50010 remote=/
> 10.1.104.1:57232]
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>        at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:214)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:114)
>
> I've increased all the relevant limits I know of (which were high to begin
> with), so now I have 64K file descriptors and dfs.datanode.max.xcievers is
> 8192 .
> I've restarted everything in the cluster, to make sure all the processes
> picked up the new configurations, but I still get those errors. They always
> begin when the map phase is around 12-14% and eventually the job fails at
> ~50%
> Running random scans against the same  hbase table while the job is running
> seems to work fine.
>
> I'm using hadoop 0.20.2+923.97-1 from CDH3 and hbase 0.90.4 compiled from
> the branch code a while ago.
>
> Any other setting I'm missing or other ideas of what can be causing it?
>
> Thanks.
>
> -eran
>