Posted to issues@hbase.apache.org by "Guanghao Zhang (JIRA)" <ji...@apache.org> on 2019/02/21 11:36:00 UTC
[jira] [Commented] (HBASE-21941) Increment the default scanner timeout

    [ https://issues.apache.org/jira/browse/HBASE-21941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773993#comment-16773993 ] 

Guanghao Zhang commented on HBASE-21941:
----------------------------------------

I checked the code again. When the scanner call goes through RpcRetryingCallerImpl#callWithRetries, it uses hbase.rpc.timeout as the RPC timeout and hbase.client.scanner.timeout.period as the operation timeout.
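
To illustrate why this matters, here is a minimal sketch of how an RPC timeout and an operation timeout interact in a retry loop. It is a simplified model, not the HBase implementation: the class and method names are hypothetical, pauses between retries are ignored, and every call is assumed to hang until its timeout fires. The numbers come from this issue (rpcTimeout=60000 and attempts=16 from the log, 1200000 as the default operation timeout).

```java
public class ScanTimeoutSketch {

    /**
     * Hypothetical model: counts how many attempts fit into the operation
     * budget when every RPC hangs until its timeout fires.
     */
    static int attemptsBeforeGivingUp(long rpcTimeoutMs, long operationTimeoutMs, int maxAttempts) {
        long elapsedMs = 0;
        int attempts = 0;
        while (attempts < maxAttempts) {
            long remainingMs = operationTimeoutMs - elapsedMs;
            if (remainingMs <= 0) {
                break; // operation budget is used up, no further retry is possible
            }
            // Each call may use at most what is left of the operation budget.
            long callTimeoutMs = Math.min(rpcTimeoutMs, remainingMs);
            attempts++;
            elapsedMs += callTimeoutMs; // assumption: the call times out every time
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Operation timeout equal to the RPC timeout: one hung call spends the whole budget.
        System.out.println(attemptsBeforeGivingUp(60000L, 60000L, 16));
        // Operation timeout of 1200000 ms: all 16 attempts fit into the budget.
        System.out.println(attemptsBeforeGivingUp(60000L, 1200000L, 16));
    }
}
```

Under these assumptions, a 60000 ms operation timeout leaves no room for a retry after a single 60000 ms RPC timeout, which matches the single exception in the log below.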

> Increment the default scanner timeout
> -------------------------------------
>
>                 Key: HBASE-21941
>                 URL: https://issues.apache.org/jira/browse/HBASE-21941
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Guanghao Zhang
>            Priority: Major
>
> There are hbase.rpc.timeout and hbase.client.operation.timeout for client operations other than scan, and there is a special config, hbase.client.scanner.timeout.period, for scan. If I am not wrong, this should be the RPC timeout of the scan call, but we currently use it as the operation timeout of the scan call. The scan callable is complicated because it needs to handle the replica case. The actual call with retries is made in https://github.com/apache/hbase/blob/9a55cbb2c1dfe5a13a6ceb323ac7edd23532f4b5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ResultBoundedCompletionService.java#L80 and its callTimeout is configured by hbase.client.scanner.timeout.period. I don't think this is right.
>  
> I hit this problem when running ITBLL on branch-2.2. The verify map task failed while scanning.
> {code:java}
> 2019-02-21 03:47:20,287 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: recovered from org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=16, exceptions: 
> 2019-02-21 03:47:20,287 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Closing the previously opened scanner object.
> 2019-02-21 03:47:20,331 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Current scan={"loadColumnFamiliesOnDemand":null,"startRow":"\\xE1\\x9B\\xB4\\xF0\\xB3(JT\\xDC\\x86pf|y\\xF3\\xE9","stopRow":"","batch":-1,"cacheBlocks":false,"totalColumns":3,"maxResultSize":4194304,"families":{"big":["big"],"meta":["prev"],"tiny":["tiny"]},"caching":10000,"maxVersions":1,"timeRange":[0,9223372036854775807]}
> 2019-02-21 03:47:20,335 INFO [hconnection-0x7b44b63d-metaLookup-shared--pool4-t36] org.apache.hadoop.hbase.client.ScannerCallable: Open scanner=-4916858472898750097 for scan={"loadColumnFamiliesOnDemand":null,"startRow":"IntegrationTestBigLinkedList,\\xE1\\x9B\\xB4\\xF0\\xB3(JT\\xDC\\x86pf|y\\xF3\\xE9,99999999999999","stopRow":"IntegrationTestBigLinkedList,,","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":-1,"families":{"info":["ALL"]},"caching":5,"maxVersions":1,"timeRange":[0,9223372036854775807]} on region region=hbase:meta,,1.1588230740, hostname=c4-hadoop-tst-st26.bj,29100,1550660298519, seqNum=-1
> 2019-02-21 03:48:20,354 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: Mapper took 60023ms to process 0 rows
> 2019-02-21 03:48:20,355 INFO [main] org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=16, exceptions: Thu Feb 21 03:48:20 CST 2019, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60215: Call to c4-hadoop-tst-st30.bj/10.132.2.41:29100 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=7102, waitTime=60006, rpcTimeout=60000 row 'ᛴ�(JT܆pf|y��' on table 'IntegrationTestBigLinkedList' at region=IntegrationTestBigLinkedList,\xDD\xDD\xDD\xDD\xDD\xDD\xDD\xDD,1550661322522.d5d29d2f1e8fee42d666c117709c3a46., hostname=c4-hadoop-tst-st30.bj,29100,1550652984371, seqNum=1007960
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=16, exceptions: Thu Feb 21 03:48:20 CST 2019, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60215: Call to c4-hadoop-tst-st30.bj/10.132.2.41:29100 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=7102, waitTime=60006, rpcTimeout=60000 row 'ᛴ�(JT܆pf|y��' on table 'IntegrationTestBigLinkedList' at region=IntegrationTestBigLinkedList,\xDD\xDD\xDD\xDD\xDD\xDD\xDD\xDD,1550661322522.d5d29d2f1e8fee42d666c117709c3a46., hostname=c4-hadoop-tst-st30.bj,29100,1550652984371, seqNum=1007960
>     at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:299)
>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:242)
>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:58)
>     at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
>     at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:266)
>     at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
>     at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:309)
>     at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:594)
>     at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:237)
> {code}
>  
> When the first scan in TableRecordReaderImpl#nextKeyValue failed, it recovered by closing the old scanner and opening a new one. But the new scanner also failed after 60 seconds. It then threw the exception without trying to recover again, which failed the task.
>  
> Alternatively, if the scanner timeout is meant to be the operation timeout of the scan call, we should set the default scanner timeout to 1200000, the same as the default operation timeout.
>  
>  
>  


