You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2022/06/16 18:14:00 UTC
[jira] [Resolved] (HBASE-9675) Root-cause analysis is difficult with AsyncProcess

     [ https://issues.apache.org/jira/browse/HBASE-9675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Kyle Purtell resolved HBASE-9675.
----------------------------------------
    Resolution: Later

> Root-cause analysis is difficult with AsyncProcess
> --------------------------------------------------
>
>                 Key: HBASE-9675
>                 URL: https://issues.apache.org/jira/browse/HBASE-9675
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client
>    Affects Versions: 0.95.2
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> The AsyncProcess swallows up informative exceptions returned from the RS. For instance, diagnosing the cause of a failed mapreduce task is pretty opaque right now. The client log tells me that my async requests have been retried and rejected, but not why:
> {noformat}
> 2013-09-27 22:47:03,391 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of running tasks to be equals or less than 0, tasksSent=6184, tasksDone=6180, currentTasksDone=6180, tableName=IntegrationTestLoadAndVerify
> // repeat about 30 times
> 2013-09-27 22:53:13,886 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of running tasks to be equals or less than 0, tasksSent=12140, tasksDone=12135, currentTasksDone=12135, tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:13,890 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 4 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,030 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,179 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,251 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:14,277 WARN [htable-pool1-t51] org.apache.hadoop.hbase.client.AsyncProcess: Attempt #35/35 failed for 8 ops on hor17n37.gq1.ygridcore.net,60020,1380321099619 NOT resubmitting.region=IntegrationTestLoadAndVerify,+\xC6\x02\x00\x00\x00\x00\x00/000009_0,1380321873859.871b9482f305686a5848615f26a8a7f9., hostname=hor17n37.gq1.ygridcore.net,60020,1380321099619, seqNum=4990837
> 2013-09-27 22:53:26,862 INFO [main] org.apache.hadoop.hbase.client.AsyncProcess: : Waiting for the global number of running tasks to be equals or less than 0, tasksSent=12186, tasksDone=12185, currentTasksDone=12185, tableName=IntegrationTestLoadAndVerify
> 2013-09-27 22:53:26,915 INFO [main] org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14160ad6eb404c3
> 2013-09-27 22:53:26,920 INFO [main] org.apache.zookeeper.ZooKeeper: Session: 0x14160ad6eb404c3 closed
> 2013-09-27 22:53:26,920 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2013-09-27 22:53:26,936 ERROR [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hrt_qa (auth:SIMPLE) cause:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 times, 
> 2013-09-27 22:53:26,937 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 36 actions: ConnectException: 4 times, FailedServerException: 32 times, 
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:163)
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:147)
> 	at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:711)
> 	at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:939)
> 	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1315)
> 	at org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:237)
> 	at org.apache.hadoop.hbase.test.IntegrationTestLoadAndVerify$LoadMapper.map(IntegrationTestLoadAndVerify.java:159)
> 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1483)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> {noformat}
> Likewise, there's nothing "broken" in the RS logs. It's sending back RegionTooBusyExceptions but neither RS nor client inform the user of this.
> We need to add logging somewhere, probably on the client side, but maybe both, to tell our owner-operator what's going on. The client log doesn't say much and all you have in the RS log is a whole bunch of flush and compaction activity. In this case, it would have been nice to see something like "RegionTooBusyException: above the memstore limit". Even better, include a recommendation to get around the problem, such as "slow down, tiger, that's quite the write throughput" or "maybe you want to increase memstore size a bit". Or maybe the right approach is to advise tweaks to the client settings (retries, sleep periods, buffer size, &c.)?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)