Posted to issues@hbase.apache.org by "Shuai Lin (JIRA)" <ji...@apache.org> on 2016/06/17 13:51:05 UTC

[jira] [Created] (HBASE-16057) Regions never get online when the region server fails to handle the OpenRegionRequest before the rpc timeout

Shuai Lin created HBASE-16057:
---------------------------------

             Summary: Regions never get online when the region server fails to handle the OpenRegionRequest before the rpc timeout
                 Key: HBASE-16057
                 URL: https://issues.apache.org/jira/browse/HBASE-16057
             Project: HBase
          Issue Type: Bug
    Affects Versions: 1.0.0
            Reporter: Shuai Lin


(See the mailing list archive [here|http://mail-archives.apache.org/mod_mbox/hbase-user/201605.mbox/%3CCA+VO6itpGgRzOPBuEphSo5ROYumD9_x1=Pa9N0WMv-C+oxhJKQ@mail.gmail.com%3E] or [here with a better layout|http://search-hadoop.com/m/YGbb5BVXb164SIM&subj=Some+regions+never+get+online+after+a+region+server+crashes])


We have an HBase cluster of 3 masters and 7 region servers. One of the region servers crashed, but some regions never got online again on the other RSes. I've gone through the logs, and here is a timeline of the relevant events:

* 13:03:50 one of the region servers, rs-node7, died because of a disk
failure. The master started to split rs-node7's WALs:

{code}
2016-04-30 13:03:50,953 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
logs for rs-node7.example.com,60020,1458724695128 before assignment; region
count=133
2016-04-30 13:03:50,966 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to
split
2016-04-30 13:03:50,966 INFO
org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs
in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting]
for [rs-node7.example.com,60020,1458724695128]

{code}

* 13:10:47 WAL splitting done; the master began to re-assign the regions:

{code}
2016-04-30 13:10:47,655 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning
133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying
(and 0 regions(s) that were opening on this server)
2016-04-30 13:10:47,665 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
region(s) across 6 server(s), round-robin=true
2016-04-30 13:10:47,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node1.example.com,60020,1458720625688
2016-04-30 13:10:47,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node2.example.com,60020,1458721110988
2016-04-30 13:10:47,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node3.example.com,60020,1458721713906
2016-04-30 13:10:47,679 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s) to
rs-node4.example.com,60020,1458722335527
2016-04-30 13:10:47,691 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node5.example.com,60020,1458722992296
2016-04-30 13:10:47,702 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node6.example.com,60020,1458723856883
{code}
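
(The per-server counts are consistent with the round-robin split: five servers got 22 regions and one got 23, i.e. 5 × 22 + 23 = 133.)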


* 13:11:47 the OpenRegion RPCs sent by the master to the region servers
timed out after 1 minute:

{code}
2016-04-30 13:11:47,780 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
with rs-node3.example.com,60020,1458721713906 in order to assign regions
java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020 failed
on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
id=4, waitTime=60001, operationTimeout=60000 expired.
2016-04-30 13:11:47,782 INFO
org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
regions to server rs-node6.example.com,60020,1458723856883, reassigning them
2016-04-30 13:11:47,782 INFO
org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23
regions to server rs-node4.example.com,60020,1458722335527, reassigning them
2016-04-30 13:11:47,782 INFO
org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
regions to server rs-node3.example.com,60020,1458721713906, reassigning them
2016-04-30 13:11:47,783 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Force region state
offline {a65660e421f114e93862194f7cc35644 state=OPENING, ts=1462021907753,
server=rs-node6.example.com,60020,1458723856883}
{code}
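
The one-minute deadline here appears to be the default hbase.rpc.timeout (60000 ms), which the master's RPC client presumably applied to the openRegion calls ("waitTime=60001, operationTimeout=60000 expired" above). A minimal hbase-site.xml sketch for raising it, with an illustrative value; note this only widens the window, it doesn't fix the underlying race:

{code}
<!-- hbase-site.xml (illustrative value): raise the RPC timeout used by
     the master when calling region servers; the default is 60000 ms -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>120000</value>
</property>
{code}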

* After that, some of the regions (40 out of 130) never got online, and the
following lines were logged repeatedly in the master log:

{code}
2016-04-30 13:12:37,188 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: update
{d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087,
server=rs-node6.example.com,60020,1458723856883} the timestamp.
2016-04-30 13:12:37,188 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Region is already in
transition; waiting up to 10668ms
{code}

{code}
$ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' \
    /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
484
{code}
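
This matches the scenario in the summary: the region server failed to handle the OpenRegionRequest before the RPC timeout, so the master gave up and forced the region state offline, but the RS may still have been working on the original open, leaving the two sides disagreeing and the region in transition forever. Below is a simplified, self-contained sketch of that race; the class and names are hypothetical, not the actual AssignmentManager/region server code, and the timeouts are scaled down to milliseconds:

{code}
import java.util.concurrent.*;

// Hypothetical sketch of the race in this report: the master's client-side
// RPC deadline fires even though the region server may still complete the
// OpenRegionRequest afterwards.
public class OpenRegionTimeoutRace {
    public static void main(String[] args) throws Exception {
        ExecutorService regionServer = Executors.newSingleThreadExecutor();

        // RS side: handling the open takes longer than the master's RPC
        // timeout (150 ms here stands in for > 60 s, e.g. a busy RS).
        Future<String> open = regionServer.submit(() -> {
            Thread.sleep(150);
            return "OPENED"; // the RS eventually opens the region anyway
        });

        try {
            // Master side: wait at most "hbase.rpc.timeout" for the reply
            // (100 ms here stands in for 60000 ms).
            System.out.println("master sees: " + open.get(100, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Master's view: CallTimeoutException, so it forces the region
            // state offline and re-assigns, unaware the first open may
            // still succeed on the RS.
            System.out.println("master: call timed out, forcing region state offline");
        }

        // The RS completes the original request after the master has moved
        // on, so master and RS now disagree about who owns the region.
        System.out.println("region server eventually: " + open.get());
        regionServer.shutdown();
    }
}
{code}

For the record, a region stuck like this can usually be kicked manually from the hbase shell with {{assign 'd1c7f3f455f2529da82a2f713b5ee067'}} (the encoded region name), but the point of this report is that the master never recovers on its own.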

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)