Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/02/05 22:34:00 UTC

[jira] [Created] (HBASE-21851) slow meta can cause master to sortof deadlock and bring down the cluster

Sergey Shelukhin created HBASE-21851:
----------------------------------------

             Summary: slow meta can cause master to sortof deadlock and bring down the cluster
                 Key: HBASE-21851
                 URL: https://issues.apache.org/jira/browse/HBASE-21851
             Project: HBase
          Issue Type: Bug
            Reporter: Sergey Shelukhin


Many threads are synchronously retrying updates to meta for a very long time, so the master doesn't appear to have enough threads left to process requests.
The meta server died but its SCP (ServerCrashProcedure) is not being processed. I'm not sure whether that is because the threads are all busy or for some other reason (the ZK issue we've seen earlier in our cluster?).

{noformat}
2019-02-05 13:20:39,225 INFO  [KeepAlivePEWorker-32] assignment.RegionStateStore: pid=805758 updating hbase:meta row=7130dac84857699b8cd0061298b6fe9c, regionState=OPENING, regionLocation=server,17020,1549400274239
...
2019-02-05 13:39:42,521 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor: Worker stuck KeepAlivePEWorker-32(pid=805758), run time 19mins, 3.296sec
{noformat}

The master then starts dropping timed-out calls:
{noformat}
2019-02-05 13:39:45,877 WARN  [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] ipc.RpcServer: Dropping timed out call: callId: 7 service: RegionServerStatusService methodName: RegionServerReport size: 102 connection: ...:35743 deadline: 1549401663387 ...

RS:
2019-02-05 13:39:45,521 INFO  [RS_OPEN_REGION-regionserver/..:17020-4] regionserver.HRegionServer: Failed report transition server ...
org.apache.hadoop.hbase.CallQueueTooBigException: Call queue is full on ..., too many items queued ?
{noformat}

I think this eventually causes the RSes to kill themselves, which further increases the load on the master.

I wonder if meta retry should be async? That way other calls could be processed.
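
To make the idea concrete, here is a minimal, self-contained sketch of a non-blocking retry loop. It is not HBase code: the class AsyncMetaRetrySketch, the method retryAsync, and the flakyMetaUpdate stand-in are all hypothetical names, and the fixed retry delay is an assumption. It only illustrates that a failed attempt can be rescheduled on a timer instead of sleeping on the worker thread, so the caller thread is released between attempts.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class AsyncMetaRetrySketch {

  /**
   * Runs op and, on failure, schedules another attempt after delayMs.
   * The calling thread returns immediately with a future; it never sleeps.
   */
  public static <T> CompletableFuture<T> retryAsync(
      Supplier<CompletableFuture<T>> op, int maxAttempts, long delayMs,
      ScheduledExecutorService timer) {
    CompletableFuture<T> result = new CompletableFuture<>();
    attempt(op, 1, maxAttempts, delayMs, timer, result);
    return result;
  }

  private static <T> void attempt(
      Supplier<CompletableFuture<T>> op, int attemptNo, int maxAttempts, long delayMs,
      ScheduledExecutorService timer, CompletableFuture<T> result) {
    CompletableFuture<T> current;
    try {
      current = op.get();
    } catch (RuntimeException e) {
      // Treat a synchronous failure the same way as an asynchronous one.
      current = new CompletableFuture<>();
      current.completeExceptionally(e);
    }
    current.whenComplete((value, error) -> {
      if (error == null) {
        result.complete(value);
      } else if (attemptNo >= maxAttempts) {
        result.completeExceptionally(error);
      } else {
        // Hand the retry to the timer instead of blocking this thread.
        timer.schedule(
            () -> attempt(op, attemptNo + 1, maxAttempts, delayMs, timer, result),
            delayMs, TimeUnit.MILLISECONDS);
      }
    });
  }

  public static void main(String[] args) throws Exception {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger calls = new AtomicInteger();
    // Hypothetical stand-in for a meta update: fails twice, then succeeds.
    Supplier<CompletableFuture<String>> flakyMetaUpdate = () -> {
      CompletableFuture<String> f = new CompletableFuture<>();
      if (calls.incrementAndGet() < 3) {
        f.completeExceptionally(new IOException("meta unavailable"));
      } else {
        f.complete("OPENING");
      }
      return f;
    };
    CompletableFuture<String> done = retryAsync(flakyMetaUpdate, 5, 10, timer);
    System.out.println(done.get(5, TimeUnit.SECONDS) + " after " + calls.get() + " attempts");
    timer.shutdown();
  }
}
```

In a design along these lines, the number of in-flight meta updates would be bounded by memory rather than by the size of the worker pool, so a slow meta would delay those updates without starving unrelated requests. Whether this fits the procedure framework is a separate question that the sketch does not address.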

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)