Posted to user@hbase.apache.org by Shuai Lin <li...@gmail.com> on 2016/05/05 01:07:33 UTC

Some regions never get online after a region server crashes

Hi list,

Last weekend one of our region servers crashed, but some of its regions never
got online again on the other RSes. I've gone through the logs, and here is a
timeline of the relevant events:

* 13:03:50 one of the region servers, rs-node7, died because of a disk
failure. The master started to split rs-node7's WALs


2016-04-30 13:03:50,953 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
logs for rs-node7.example.com,60020,1458724695128 before assignment; region
count=133
2016-04-30 13:03:50,966 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to
split
2016-04-30 13:03:50,966 INFO
org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs
in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting]
for [rs-node7.example.com,60020,1458724695128]

* 13:10:47 WAL splits done, master began to re-assign regions

2016-04-30 13:10:47,655 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning
133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying
(and 0 regions(s) that were opening on this server)
2016-04-30 13:10:47,665 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
region(s) across 6 server(s), round-robin=true
2016-04-30 13:10:47,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node1.example.com,60020,1458720625688
2016-04-30 13:10:47,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node2.example.com,60020,1458721110988
2016-04-30 13:10:47,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node3.example.com,60020,1458721713906
2016-04-30 13:10:47,679 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s) to
rs-node4.example.com,60020,1458722335527
2016-04-30 13:10:47,691 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node5.example.com,60020,1458722992296
2016-04-30 13:10:47,702 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
rs-node6.example.com,60020,1458723856883

* 13:11:47 the "open regions" RPCs sent by the master to the region servers
timed out after 1 minute

2016-04-30 13:11:47,780 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
with rs-node3.example.com,60020,1458721713906 in order to assign regions
java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020 failed
on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
id=4, waitTime=60001, operationTimeout=60000 expired.
2016-04-30 13:11:47,782 INFO
org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
regions to server rs-node6.example.com,60020,1458723856883, reassigning them
2016-04-30 13:11:47,782 INFO
org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23
regions to server rs-node4.example.com,60020,1458722335527, reassigning them
2016-04-30 13:11:47,782 INFO
org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
regions to server rs-node3.example.com,60020,1458721713906, reassigning them
2016-04-30 13:11:47,783 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Force region state
offline {a65660e421f114e93862194f7cc35644 state=OPENING, ts=1462021907753,
server=rs-node6.example.com,60020,1458723856883}


* After that, some of the regions (40 out of 130) never got online,
and the following lines were logged repeatedly in the master log:

2016-04-30 13:12:37,188 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: update
{d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087,
server=rs-node6.example.com,60020,1458723856883} the timestamp.
2016-04-30 13:12:37,188 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Region is already in
transition; waiting up to 10668ms

$ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067'
/var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1|wc -l
484
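
As a side note, the stuck regions can also be listed with the standard tools.
The two commands below are illustrative (assuming a stock HBase 1.x install)
rather than copied from my actual session:

$ echo "status 'detailed'" | hbase shell   # detailed cluster status, including the regions in transition
$ hbase hbck                               # reports inconsistencies, e.g. regions not deployed on any region server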


I've searched the mailing list archive and the HBase JIRA, but didn't find any
report of a similar situation. The closest one is HBASE-14407
<https://issues.apache.org/jira/browse/HBASE-14407>, but after reading its
discussion I don't think it's the same problem.

Anyone have a clue? Thanks!

Regards,
Shuai

Re: Some regions never get online after a region server crashes

Posted by Shuai Lin <li...@gmail.com>.
Hi Ted, could you please take a look at the log I've pasted above? Thanks!

On Thu, May 5, 2016 at 11:02 AM, Shuai Lin <li...@gmail.com> wrote:

> Hi Ted,
>
> The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8. The
> RS logs on node6 can be found here
> <http://paste.openstack.org/raw/496174/> .
>
> Thanks!
>
> Shuai
>
> On Thu, May 5, 2016 at 9:15 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> Can you pastebin related server log w.r.t.
>> d1c7f3f455f2529da82a2f713b5ee067
>> from rs-node6 ?
>>
>> Which release of hbase are you using ?
>>
>> Cheers

Re: Some regions never get online after a region server crashes

Posted by Shuai Lin <li...@gmail.com>.
Hi Ted, does my conclusion above make sense? Should I open an issue in the
HBase JIRA?

On Fri, May 20, 2016 at 9:20 PM, Shuai Lin <li...@gmail.com> wrote:

> Because the "open regions" RPC call sent by the master to the region
> server node6 timed out after 1 minute?
>
> *RPC call was sent:*
>
> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>
> *After 1 minute:*
>
> 2016-04-30 13:11:47,780 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
> with rs-node6.example.com,60020,1458723856883 in order to assign regions,
> java.io.IOException: Call to rs-node6.example.com/172.16.6.6:60020 failed
> on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
> id=4, waitTime=60001, operationTimeout=60000 expired.
>
> 2016-04-30 13:11:47,783 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>
>
> I have checked the hbase source code, but I can't find any specific timeout
> setting for the "open region" rpc call that I could tune. So I guess it's
> using the default "hbase.rpc.timeout", which defaults to 60 seconds. And since
> 20+ regions were assigned to node6 at almost the same moment, node6 got
> overloaded and couldn't finish opening all of them within one minute.
>
> So this looks like an hbase bug to me (regions never get online when the
> region server fails to handle the OpenRegionRequest before the rpc
> timeout), am I right?
>
>
> On Fri, May 20, 2016 at 12:42 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE request
>> when it was opening, leading to RegionAlreadyInTransitionException.
>>
>> Was there any clue in master log why the close request was sent ?
>>
>> Cheers
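
For reference, this is the setting my guess above refers to, with its stock
default; the snippet below only illustrates the default value (60000 ms) and
is not taken from our actual hbase-site.xml:

  <property>
    <name>hbase.rpc.timeout</name>
    <value>60000</value>  <!-- 60s, matching operationTimeout=60000 in the exception above -->
  </property>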

Re: Some regions never get online after a region server crashes

Posted by Shuai Lin <li...@gmail.com>.
Will do, thanks!

On Sat, Jun 4, 2016 at 6:55 PM, Ted Yu <yu...@gmail.com> wrote:

> I think this sounds like a bug.
>
> Search in HBase JIRA first. If there is no JIRA with the same symptom,
> consider filing one.
>
> Cheers

Re: Some regions never get online after a region server crashes

Posted by Ted Yu <yu...@gmail.com>.
I think this sounds like a bug.

Search in HBase JIRA first. If there is no JIRA with the same symptom,
consider filing one.

Cheers


Re: Some regions never get online after a region server crashes

Posted by Shuai Lin <li...@gmail.com>.
Hi Ted,

I'm kind of confused: is this normal behaviour or a bug in hbase? To me
it looks like a bug; should I file a JIRA?

Thanks,

Shuai


Re: Some regions never get online after a region server crashes

Posted by Ted Yu <yu...@gmail.com>.
There were 7 regions Master tried to close which were opening but not yet
served.

d1c7f3f455f2529da82a2f713b5ee067 was one of them.
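
One way to line the region server's view up against the master log is to grep
the RS log on rs-node6 for the stuck region. The log path below is only a
guess based on the Cloudera-managed master log path mentioned earlier in the
thread, so adjust it to your layout:

$ grep 'd1c7f3f455f2529da82a2f713b5ee067' \
    /var/log/hbase/hbase-cmf-hbase-REGIONSERVER-rs-node6.example.com.log.out* \
    | grep -iE 'open|close|transition'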

On Fri, May 27, 2016 at 12:47 AM, Shuai Lin <li...@gmail.com> wrote:

> Here is the complete log on node6 between 13:10:47 and 13:11:47:
> http://paste.openstack.org/raw/505826/
>
> The master asked node6 to open several regions. Node6 opened the first 4
> very fast (within 1 second) and got stuck at the 5th one. But there were no
> errors at that time.
>
> On Wed, May 25, 2016 at 10:12 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> In AssignmentManager#assign(), you should find:
>>
>>           // Send OPEN RPC. If it fails on a IOE or RemoteException,
>>           // regions will be assigned individually.
>>           long maxWaitTime = System.currentTimeMillis() +
>>             this.server.getConfiguration().
>>               getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>>
>> BTW can you see what caused rs-node6 to not respond around 13:11:47 ?
>>
>> Cheers

Re: Some regions never get online after a region server crashes

Posted by Shuai Lin <li...@gmail.com>.
Here is the complete log on node6 between 13:10:47 and 13:11:47:
http://paste.openstack.org/raw/505826/

The master asked node6 to open several regions. Node6 opened the first 4
very quickly (within 1 second) but got stuck at the 5th one. No errors were
logged at that time.
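
In case it helps with reading the paste, a quick way to pull every event for
the stuck region out of the node6 RS log is a grep like the one below (the
log file name is only an example; adjust it to your own layout):

$ grep 'd1c7f3f455f2529da82a2f713b5ee067' \
    /var/log/hbase/hbase-cmf-hbase-REGIONSERVER-rs-node6.example.com.log.out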

Re: Some regions never get online after a region server crashes

Posted by Ted Yu <yu...@gmail.com>.
In AssignmentManager#assign(), you should find:

          // Send OPEN RPC. If it fails on a IOE or RemoteException,
          // regions will be assigned individually.
          long maxWaitTime = System.currentTimeMillis() +
            this.server.getConfiguration().
              getLong("hbase.regionserver.rpc.startup.waittime", 60000);

BTW can you see what caused rs-node6 to not respond around 13:11:47 ?

Cheers

Re: Some regions never get online after a region server crashes

Posted by Shuai Lin <li...@gmail.com>.
Because the "opening regions" rpc call sent by the master to region server
node6 timed out after 1 minute?

*RPC call was sent:*

2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883

*After 1 minute:*

2016-04-30 13:11:47,780 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
with rs-node6.example.com,60020,1458723856883 in order to assign regions,
java.io.IOException: Call to rs-node6.example.com/172.16.6.6:60020 failed
on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
id=4, waitTime=60001, operationTimeout=60000 expired.

2016-04-30 13:11:47,783 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Force region state
offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}


I have checked the hbase source code, but I don't find any specific timeout
setting for the "open region" rpc call that I can use. So I guess it's using
the default "hbase.rpc.timeout", which defaults to 60 secs. And since there
are 20+ regions being assigned to node6 at almost the same moment, node6
gets overloaded and can't finish opening all of them within one minute.

So this looks like an hbase bug to me (regions never get online when the
region server fails to handle the OpenRegionRequest before the rpc
timeout), am I right?
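
In the meantime, as a manual workaround, regions that are stuck in
transition can usually be pushed online from the hbase shell by assigning
them explicitly (assign takes the encoded region name, i.e. the hash shown
in the master log), for example:

hbase(main):001:0> assign 'd1c7f3f455f2529da82a2f713b5ee067'

or, alternatively, by letting hbck repair all assignment inconsistencies:

$ hbase hbck -fixAssignments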


Re: Some regions never get online after a region server crashes

Posted by Ted Yu <yu...@gmail.com>.
Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE request
when it was opening, leading to RegionAlreadyInTransitionException.

Was there any clue in master log why the close request was sent ?
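
Reusing the master log path from earlier in the thread, something like the
grep below should surface any close/unassign activity for that region
(assuming the relevant lines mention the region hash):

$ grep 'd1c7f3f455f2529da82a2f713b5ee067' \
    /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | grep -i 'clos'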

Cheers

Re: Some regions never get online after a region server crashes

Posted by Shuai Lin <li...@gmail.com>.
Hi Ted,

The hbase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH 5.4.8. The
RS logs on node6 can be found here: http://paste.openstack.org/raw/496174/

Thanks!

Shuai

Re: Some regions never get online after a region server crashes

Posted by Ted Yu <yu...@gmail.com>.
Can you pastebin related server log w.r.t. d1c7f3f455f2529da82a2f713b5ee067
from rs-node6 ?

Which release of hbase are you using ?

Cheers
