You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Heng Chen (JIRA)" <ji...@apache.org> on 2016/06/20 06:57:05 UTC

[jira] [Commented] (HBASE-16057) Regions never get online when the region server fails to handle the OpenRegionRequest before the rpc timeout

    [ https://issues.apache.org/jira/browse/HBASE-16057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339084#comment-15339084 ] 

Heng Chen commented on HBASE-16057:
-----------------------------------

Could you paste log of region 'd1c7f3f455f2529da82a2f713b5ee067' on rs-node6.example.com ?

> Regions never get online when the region server fails to handle the OpenRegionRequest before the rpc timeout
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-16057
>                 URL: https://issues.apache.org/jira/browse/HBASE-16057
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Shuai Lin
>
> (See the mailing list archive [here|http://mail-archives.apache.org/mod_mbox/hbase-user/201605.mbox/%3CCA+VO6itpGgRzOPBuEphSo5ROYumD9_x1=Pa9N0WMv-C+oxhJKQ@mail.gmail.com%3E] or [here with a better layout|http://search-hadoop.com/m/YGbb5BVXb164SIM&subj=Some+regions+never+get+online+after+a+region+server+crashes])
> We have hbase cluster of 3 masters and 7 region servers. One of the region server crashed, but some regions never got
> online again on other RSes. I've gone through the logs, and here is the
> timeline about some of the events:
> * 13:03:50 on of the region server, rs-node7, died because of a disk
> failure. Master started to split rs-node7's WALs
> {code}
> 2016-04-30 13:03:50,953 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
> logs for rs-node7.example.com,60020,1458724695128 before assignment; region
> count=133
> 2016-04-30 13:03:50,966 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to
> split
> 2016-04-30 13:03:50,966 INFO
> org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs
> in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting]
> for [rs-node7.example.com,60020,1458724695128]
> * 13:10:47 WAL splits done, master began to re-assign regions
> 2016-04-30 13:10:47,655 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning
> 133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying
> (and 0 regions(s) that were opening on this server)
> 2016-04-30 13:10:47,665 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
> region(s) across 6 server(s), round-robin=true
> 2016-04-30 13:10:47,667 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node1.example.com,60020,1458720625688
> 2016-04-30 13:10:47,667 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node2.example.com,60020,1458721110988
> 2016-04-30 13:10:47,667 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node3.example.com,60020,1458721713906
> 2016-04-30 13:10:47,679 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s) to
> rs-node4.example.com,60020,1458722335527
> 2016-04-30 13:10:47,691 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node5.example.com,60020,1458722992296
> 2016-04-30 13:10:47,702 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node6.example.com,60020,1458723856883
> {code}
> * 13:11:47 the opening regions rpc sent by master to region servers timed
> out after 1 minutes
> {code}
> 2016-04-30 13:11:47,780 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
> with rs-node3.example.com,60020,1458721713906 in order to assign regions
> java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020 failed
> on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
> id=4, waitTime=60001, operationTimeout=60000 expired.
> 2016-04-30 13:11:47,782 INFO
> org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
> regions to server rs-node6.example.com,60020,1458723856883, reassigning them
> 2016-04-30 13:11:47,782 INFO
> org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23
> regions to server rs-node4.example.com,60020,1458722335527, reassigning them
> 2016-04-30 13:11:47,782 INFO
> org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
> regions to server rs-node3.example.com,60020,1458721713906, reassigning them
> 2016-04-30 13:11:47,783 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> offline {a65660e421f114e93862194f7cc35644 state=OPENING, ts=1462021907753,
> server=rs-node6.example.com,60020,1458723856883}
> {code}
> * After that, part of the regions (40 out of 130 regions) never got online,
> and the following lines were logged repeatly in master log:
> {code}
> 2016-04-30 13:12:37,188 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: update
> {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087,
> server=rs-node6.example.com,60020,1458723856883} the timestamp.
> 2016-04-30 13:12:37,188 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Region is already in
> transition; waiting up to 10668ms
> {code}
> {code}
> $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067'
> /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1|wc -l
> 484
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)