Posted to user@hbase.apache.org by Jan Lukavský <ja...@firma.seznam.cz> on 2011/08/11 15:58:45 UTC

ScannerTimeoutException during MapReduce

Hi,

we've recently moved to HBase 0.90.3 (cdh3u1) from 0.20.6, which
resolved most of our previous issues, but we are now seeing many more
ScannerTimeoutExceptions than before. All these exceptions come from a
trace like this:

org.apache.hadoop.hbase.client.ScannerTimeoutException: 307127ms passed since the last invocation, timeout is currently set to 60000
	at org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1133)
	at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:143)
	at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:142)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
	at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)


After a bit of investigation, I suspect the cause is that the first
call to scanner.next() after HTable.getScanner() times out. What could
cause this? I see neither any region moving around in the cluster nor
any compaction on the regionserver side. As far as I can tell,
everything looks just fine. This would suggest that it took too long
to locate the regionserver in the call to HTable.getScanner(), but I
cannot see any reason for that. Could this issue be resolved on the
side of TableRecordReader? E.g., at TableRecordReaderImpl.java:143 the
ScannerTimeoutException could be caught and the scanner restarted a
couple more times (with the number of retries configurable, say?). A
rough sketch of what I mean is below.
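
Something like this -- not the actual TableRecordReaderImpl code;
maxRestarts is a hypothetical, configurable knob, the other names
follow the 0.90 code:

  public boolean nextKeyValue() throws IOException {
    int restarts = 0;
    while (true) {
      try {
        value = scanner.next();     // may throw ScannerTimeoutException
        break;
      } catch (ScannerTimeoutException e) {
        if (++restarts > maxRestarts) throw e;  // give up eventually
        restart(lastRow);           // reopen the scanner
        if (lastRow != null) {
          scanner.next();           // skip only a row we actually mapped
        }
      }
    }
    if (value != null && value.size() > 0) {
      key.set(value.getRow());
      lastRow = key.get();
      return true;
    }
    return false;
  }

The lastRow != null guard also relates to the bug I describe below.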

After looking at the code, it also seems to me that there may be a bug
causing the reader to skip the first row of a region. The scenario is
as follows (simplified code path after the list):
  - the reader is initialized with TableRecordReader.init()
  - then nextKeyValue() is called, causing a call to scanner.next() -
here the ScannerTimeoutException occurs
  - the scanner is restarted by a call to restart(), and then *two*
calls to scanner.next() occur, so we have lost the first row
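
The relevant path in TableRecordReaderImpl.nextKeyValue() looks
roughly like this (simplified and paraphrased from the 0.90 code, so
take the details with a grain of salt):

  try {
    value = this.scanner.next();
  } catch (IOException e) {   // ScannerTimeoutException is an IOException
    restart(lastRow);         // reopen the scanner at the last mapped row
    scanner.next();           // skip the presumed already-mapped row --
                              // but when the timeout hit the *first*
                              // next() of the region, nothing has been
                              // mapped yet, so this drops the region's
                              // first row
    value = this.scanner.next();
  }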

Can anyone confirm this?

Thanks,
  Jan


Re: ScannerTimeoutException during MapReduce

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi J-D,

thanks for the reply.

On 11.8.2011 23:37, Jean-Daniel Cryans wrote:
> The usual reason is a mix of taking a long time to process rows in
> the mapper and scanners that grab a lot of rows at a time (using
> scanner caching and maybe filters).
>
> Did you enable DEBUG for HBase in your mapreduce context? That would
> give relevant information, like whether the client was doing lots of
> retries trying to start scanning.

I'll enable it and try to get more info. Thanks.

> About the bug, can you open a jira?
Sure.
https://issues.apache.org/jira/browse/HBASE-4196

> J-D


-- 

Jan Lukavský
programmer
Seznam.cz, a.s.
Radlická 608/2
15000, Praha 5

jan.lukavsky@firma.seznam.cz
http://www.seznam.cz


Re: ScannerTimeoutException during MapReduce

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi all,

as J-D suggested, I enabled the DEBUG level in the client, and it
turned out that the problem is caused by reading a too sparse column.
Let's say we have 1 million rows containing column "CF:A" and then
only one row with "CF:B". When I set up the scanner with
scan.addColumn("CF".getBytes(), "B".getBytes()), I will probably get a
ScannerTimeoutException, because the regionserver has to scan past all
the "CF:A" rows before it finds anything to return. The solution is to
set up the scanner to read even the rows I'm not interested in and
skip them in the Mapper (e.g. scan.addFamily("CF".getBytes())); see
the snippet below. This of course has the disadvantage of much slower
processing, so my question is: could we renew the scanner's lease more
often?
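
Schematically (with our real table and column names replaced):

  // what we had -- the server may scan past a million "CF:A" rows
  // before it finds a "CF:B" cell to return, and the lease expires:
  Scan sparse = new Scan();
  sparse.addColumn("CF".getBytes(), "B".getBytes());

  // the workaround -- fetch the whole family, filter in the mapper:
  Scan workaround = new Scan();
  workaround.addFamily("CF".getBytes());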

Could there be, for example, an upper bound on the number of rows the
regionserver reads per next() call, irrespective of whether they are
discarded or not? Or alternatively a thread that would renew the lease
at some interval while the RegionServer is inside HRegionServer.next()?
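
In the meantime we will probably just raise the lease period on the
regionservers. If I read the configuration right, the relevant knob is
hbase.regionserver.lease.period (its default matches the 60000 in the
exception message), set in hbase-site.xml:

  <property>
    <name>hbase.regionserver.lease.period</name>
    <!-- 5 minutes instead of the default 60 seconds -->
    <value>300000</value>
  </property>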




Re: ScannerTimeoutException during MapReduce

Posted by Jean-Daniel Cryans <jd...@apache.org>.
The usual reason is a mix of taking a long time to process rows in
the mapper and scanners that grab a lot of rows at a time (using
scanner caching and maybe filters).
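
For example, with caching set high:

  Scan scan = new Scan();
  scan.setCaching(1000);  // each RPC fetches 1000 rows; the lease is
                          // only renewed on the next RPC, so the mapper
                          // must get through all 1000 within the timeout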

Did you enable DEBUG for HBase in your mapreduce context? That would
give relevant information, like whether the client was doing lots of
retries trying to start scanning.
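
For example, in the log4j.properties that ships with your job:

  log4j.logger.org.apache.hadoop.hbase=DEBUG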

About the bug, can you open a jira?

J-D
