Posted to user@hbase.apache.org by Austin Heyne <ah...@ccri.com> on 2018/10/01 14:46:04 UTC

Re: Regions Stuck PENDING_OPEN

I'm running HBase 1.4.4 on EMR. In following your suggestions I realized 
that the master is trying to assign the regions to dead/non-existent 
region servers. While trying to fix this problem I had killed the EMR 
cluster and started a new one. It's still trying to assign some regions 
to those region servers in the previous cluster. I tried to manually 
move one of the regions to a good region server but I'm getting 'ERROR: 
No route to host' when I try to close the region.
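
(For context, a rough sketch of that kind of inspection and manual 
reassignment against the 1.4 client API is below. The ZooKeeper quorum, 
encoded region name, and destination server are placeholders, not values 
from this cluster.)

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "zk-host:2181")  // placeholder quorum
    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin

    // List regions in transition and the server the master thinks is
    // opening each one; dead servers from the old cluster show up here.
    val status = admin.getClusterStatus
    status.getRegionsInTransition.asScala.foreach { case (encoded, state) =>
      println(s"$encoded -> ${state.getState} on ${state.getServerName}")
    }
    println("dead: " + status.getDeadServerNames.asScala.mkString(", "))

    // Manual move attempt for one region (placeholder names).
    val encodedRegion = Bytes.toBytes("0123456789abcdef0123456789abcdef")
    val destServer    = Bytes.toBytes("good-rs.example.com,16020,1538300000000")
    admin.move(encodedRegion, destServer)

    admin.close()
    connection.close()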

I've tried nuking the /hbase directory in Zookeeper but that didn't seem 
to help so I'm not sure where it's getting these references from.

-Austin


On 09/30/2018 02:38 PM, Josh Elser wrote:
> First off: You're on EMR? What version of HBase are you using? (Maybe 
> Zach or Stephen can help here too). Can you figure out the 
> RegionServer(s) which are stuck opening these PENDING_OPEN regions? 
> Can you get a jstack/thread-dump from those RS's?
>
> In terms of how the system is supposed to work: the PENDING_OPEN state 
> for a Region "R" means: the active Master has asked a RegionServer to 
> open R. That RS should have an active thread which is trying to open 
> R. Upon success, the state of R will move from PENDING_OPEN to OPEN. 
> Otherwise, the Master will try to assign R again.
>
> In the absence of any custom coprocessors (including Phoenix), this would 
> mean some subset of RegionServers are in a bad state. Figuring out 
> what those RS's are trying to do will be the next step in figuring out 
> why they're stuck like that. It might be obvious from the UI, or you 
> might have to look at hbase:meta or the master log to figure it out.
>
> One caveat, it's possible that the Master is just not doing the right 
> thing as described above. If the steps described above don't seem to 
> be matching what your system is doing, you might have to look closer 
> at the Master log. Make sure you have DEBUG on to get anything of 
> value out of the system.
>
> On 9/30/18 1:43 PM, Austin Heyne wrote:
>> I'm having a strange problem that my usual bag of tricks is having 
>> trouble sorting out. On Friday queries stopped returning for some 
>> reason. You could see them come in and there would be a resource 
>> utilization spike that would fade out after an appropriate amount of 
>> time, however, the query would never actually return. This could be 
>> related to our client code but I wasn't able to dig into it since 
>> this was the middle of the day on a production system. Since this had 
>> happened before and bouncing HBase cleared it up, I proceeded to 
>> disable tables and restart HBase. Upon bringing HBase back up, a few 
>> thousand regions are stuck in PENDING_OPEN state and refuse to move 
>> from that state. I've run hbck -repair a number of times under a few 
>> conditions (even the offline repair), have deleted everything out of 
>> /hbase in zookeeper and even migrated the cluster to new servers 
>> (EMR) with no luck. When I spin HBase up the regions are already at 
>> PENDING_OPEN even though the tables are offline.
>>
>> Any ideas on what's going on here would be a huge help.
>>
>> Thanks,
>> Austin
>>

-- 
Austin L. Heyne


Re: Regions Stuck PENDING_OPEN

Posted by Josh Elser <el...@apache.org>.
Thanks so much for sharing this, Austin! This is a wonderful write-up of 
what you did. I'm sure it will be referred to often from the archives.

If you have the old state from meta saved and have the ability to 
sanitize some of the data/server-names, we might be able to give you 
some suggestions about what happened. Let me know if that would be 
helpful. If we can't get to the bottom of how this happened, maybe we 
can figure out why hbck couldn't fix it.

On 10/3/18 12:53 PM, Austin Heyne wrote:
> Josh: Thanks for all your help! You got us going down a path that led 
> to a solution.
> 
> Thought I would just follow up with what we ended up doing (I do not 
> recommend anyone attempt this).
> 
> When this problem started I'm not sure exactly what was wrong with the 
> hbase:meta table but one of the steps we tried was to migrate the 
> database to a new cluster. I believe this cleared up the original issue 
> we were having but for some reason, perhaps because of the initial 
> problem, the migration didn't work correctly. This left us with 
> references to servers in hbase:meta that didn't exist, for regions that 
> were marked as PENDING_OPEN. When HBase would come online it was 
> behaving as if it had already asked those region servers to open those 
> regions so they were not getting reassigned to good servers. Another 
> symptom of this was that there were dead region servers listed in the 
> HBase web UI from the previous cluster (the servers that were listed in 
> the hbase:meta).
> 
> Since in the hbase:meta table regions were marked as PENDING_OPEN on 
> region servers that didn't exist we were unable to close or move the 
> regions since the master couldn't communicate with the region server. For 
> some reason hbck -fix wasn't able to repair the assignments or didn't 
> realize it needed to. This might be due to some other meta 
> inconsistencies like overlapping regions or duplicated start keys. I'm 
> unsure why it couldn't clear things up.
> 
> To repair this we initially backed up the meta directory in S3 while 
> everything was offline. Then while HBase was online and the tables were 
> disabled we used a Scala REPL to rewrite the hbase:meta entries for each 
> affected region (~4500 regions), replacing the 'server' and 'sn' with 
> valid values and setting 'state' to 'OPEN'. We then flushed/compacted 
> the meta table and took down HBase. After nuking /hbase in ZK we brought 
> everything back up. There initially was a lot of churn with region 
> assignments but after things settled everything was online. I think this 
> worked because of the state the meta table was in when HBase stopped. I 
> think it looked like a crash and HBase went through its normal repair 
> cycle of re-opening regions and using previous assignments.
> 
> Like I said, I don't recommend manually rewriting the hbase:meta table 
> but it did work for us.
> 
> Thanks,
> Austin
> 
> On 10/01/2018 01:28 PM, Josh Elser wrote:
>> That seems pretty wrong. The Master should know that old RS's are no 
> longer alive and not try to assign regions to them. I don't have enough 
> familiarity with 1.4 to say whether, hypothetically, that might be fixed in 
> a release between 1.4.5 and 1.4.7.
>>
>> I don't have specific suggestions, but can share how I'd approach it.
>>
>> I'd pick one specific region and try to trace the logic around just 
>> that one region. Start with the state in hbase:meta -- see if there is 
>> a column in meta for this old server. Expand out to WALs in HDFS. 
>> Since you can wipe ZK and this still happens, it seems clear it's not 
>> coming from ZK data.
>>
>> Compare the data you find with what DEBUG logging in the Master says, 
>> see if you can figure out some more about how the Master chose to make 
>> the decision it did. That will help lead you to what the appropriate 
>> "fix" should be.
>>
>> On 10/1/18 10:46 AM, Austin Heyne wrote:
>>> I'm running HBase 1.4.4 on EMR. In following your suggestions I 
>>> realized that the master is trying to assign the regions to 
>>> dead/non-existent region servers. While trying to fix this problem I 
>>> had killed the EMR cluster and started a new one. It's still trying 
>>> to assign some regions to those region servers in the previous 
>>> cluster. I tried to manually move one of the regions to a good region 
>>> server but I'm getting 'ERROR: No route to host' when I try to close 
>>> the region.
>>>
>>> I've tried nuking the /hbase directory in Zookeeper but that didn't 
>>> seem to help so I'm not sure where it's getting these references from.
>>>
>>> -Austin
>>>
>>>
>>> On 09/30/2018 02:38 PM, Josh Elser wrote:
>>>> First off: You're on EMR? What version of HBase are you using? (Maybe 
>>>> Zach or Stephen can help here too). Can you figure out the 
>>>> RegionServer(s) which are stuck opening these PENDING_OPEN regions? 
>>>> Can you get a jstack/thread-dump from those RS's?
>>>>
>>>> In terms of how the system is supposed to work: the PENDING_OPEN 
>>>> state for a Region "R" means: the active Master has asked a 
>>>> RegionServer to open R. That RS should have an active thread which 
>>>> is trying to open R. Upon success, the state of R will move from 
>>>> PENDING_OPEN to OPEN. Otherwise, the Master will try to assign R again.
>>>>
>>>> In the absence of any custom coprocessors (including Phoenix), this 
>>>> would mean some subset of RegionServers are in a bad state. Figuring 
>>>> out what those RS's are trying to do will be the next step in 
>>>> figuring out why they're stuck like that. It might be obvious from 
>>>> the UI, or you might have to look at hbase:meta or the master log to 
>>>> figure it out.
>>>>
>>>> One caveat, it's possible that the Master is just not doing the 
>>>> right thing as described above. If the steps described above don't 
>>>> seem to be matching what your system is doing, you might have to 
>>>> look closer at the Master log. Make sure you have DEBUG on to get 
>>>> anything of value out of the system.
>>>>
>>>> On 9/30/18 1:43 PM, Austin Heyne wrote:
>>>>> I'm having a strange problem that my usual bag of tricks is having 
>>>>> trouble sorting out. On Friday queries stopped returning for some 
>>>>> reason. You could see them come in and there would be a resource 
>>>>> utilization spike that would fade out after an appropriate amount 
>>>>> of time, however, the query would never actually return. This could 
>>>>> be related to our client code but I wasn't able to dig into it 
>>>>> since this was the middle of the day on a production system. Since 
>>>>> this had happened before and bouncing HBase cleared it up, I 
>>>>> proceeded to disable tables and restart HBase. Upon bringing HBase 
>>>>> back up, a few thousand regions are stuck in PENDING_OPEN state and 
>>>>> refuse to move from that state. I've run hbck -repair a number of 
>>>>> times under a few conditions (even the offline repair), have 
>>>>> deleted everything out of /hbase in zookeeper and even migrated the 
>>>>> cluster to new servers (EMR) with no luck. When I spin HBase up the 
>>>>> regions are already at PENDING_OPEN even though the tables are 
>>>>> offline.
>>>>>
>>>>> Any ideas on what's going on here would be a huge help.
>>>>>
>>>>> Thanks,
>>>>> Austin
>>>>>
>>>
> 

Re: Regions Stuck PENDING_OPEN

Posted by Austin Heyne <ah...@ccri.com>.
Josh: Thanks for all your help! You got us going down a path that led 
to a solution.

Thought I would just follow up with what we ended up doing (I do not 
recommend anyone attempt this).

When this problem started I'm not sure exactly what was wrong with the 
hbase:meta table but one of the steps we tried was to migrate the 
database to a new cluster. I believe this cleared up the original issue 
we were having but for some reason, perhaps because of the initial 
problem, the migration didn't work correctly. This left us with 
references to servers in hbase:meta that didn't exist, for regions that 
were marked as PENDING_OPEN. When HBase would come online it was 
behaving as if it had already asked those region servers to open those 
regions so they were not getting reassigned to good servers. Another 
symptom of this was that there were dead region servers listed in the 
HBase web UI from the previous cluster (the servers that were listed in 
the hbase:meta).
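
(A sketch of the kind of hbase:meta scan that makes those stale 
references visible is below. The 'info' column qualifiers follow the 
names used in this write-up, and the ZooKeeper quorum is a placeholder.)

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "zk-host:2181")  // placeholder
    val connection = ConnectionFactory.createConnection(conf)

    val info   = Bytes.toBytes("info")
    val server = Bytes.toBytes("server")  // host:port the region is served from
    val sn     = Bytes.toBytes("sn")      // server name the master assigned it to
    val state  = Bytes.toBytes("state")   // e.g. PENDING_OPEN, OPEN

    val meta    = connection.getTable(TableName.META_TABLE_NAME)
    val scanner = meta.getScanner(new Scan().addFamily(info))

    // Print the server columns for every region so that host names from
    // the old cluster stand out.
    scanner.asScala.foreach { result =>
      def col(q: Array[Byte]) =
        Option(result.getValue(info, q)).map(Bytes.toString).getOrElse("-")
      println(Bytes.toString(result.getRow) +
        s"  server=${col(server)}  sn=${col(sn)}  state=${col(state)}")
    }

    scanner.close(); meta.close(); connection.close()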

Since in the hbase:meta table regions were marked as PENDING_OPEN on 
region servers that didn't exist we were unable to close or move the 
regions since the master couldn't communicate with the region server. For 
some reason hbck -fix wasn't able to repair the assignments or didn't 
realize it needed to. This might be due to some other meta 
inconsistencies like overlapping regions or duplicated start keys. I'm 
unsure why it couldn't clear things up.

To repair this we initially backed up the meta directory in S3 while 
everything was offline. Then while HBase was online and the tables were 
disabled we used a Scala REPL to rewrite the hbase:meta entries for each 
affected region (~4500 regions), replacing the 'server' and 'sn' with 
valid values and setting 'state' to 'OPEN'. We then flushed/compacted 
the meta table and took down HBase. After nuking /hbase in ZK we brought 
everything back up. There initially was a lot of churn with region 
assignments but after things settled everything was online. I think this 
worked because of the state the meta table was in when HBase stopped. I 
think it looked like a crash and HBase went through its normal repair 
cycle of re-opening regions and using previous assignments.
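
(Roughly, the rewrite described above amounts to one Put per affected 
meta row, along the lines of the sketch below. The row key, replacement 
server, and value encodings, including the serverstartcode column, are 
placeholders and assumptions about the 1.4 meta layout, not the exact 
code that was run.)

    import org.apache.hadoop.hbase.{HBaseConfiguration, ServerName, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "zk-host:2181")  // placeholder
    val connection = ConnectionFactory.createConnection(conf)
    val meta  = connection.getTable(TableName.META_TABLE_NAME)
    val admin = connection.getAdmin

    val info = Bytes.toBytes("info")

    // Placeholder: one affected region's meta row key and a live RS.
    val regionRow  = Bytes.toBytes(
      "mytable,startkey,1538300000000.0123456789abcdef0123456789abcdef.")
    val liveServer = ServerName.valueOf("good-rs.example.com,16020,1538300000000")

    // Point the row at a server that actually exists and mark it OPEN.
    val put = new Put(regionRow)
      .addColumn(info, Bytes.toBytes("server"),
                 Bytes.toBytes(liveServer.getHostname + ":" + liveServer.getPort))
      .addColumn(info, Bytes.toBytes("serverstartcode"),
                 Bytes.toBytes(liveServer.getStartcode))
      .addColumn(info, Bytes.toBytes("sn"),
                 Bytes.toBytes(liveServer.getServerName))
      .addColumn(info, Bytes.toBytes("state"), Bytes.toBytes("OPEN"))
    meta.put(put)

    // Persist the edits before taking HBase down.
    admin.flush(TableName.META_TABLE_NAME)
    admin.majorCompact(TableName.META_TABLE_NAME)

    admin.close(); meta.close(); connection.close()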

Like I said, I don't recommend manually rewriting the hbase:meta table 
but it did work for us.

Thanks,
Austin

On 10/01/2018 01:28 PM, Josh Elser wrote:
> That seems pretty wrong. The Master should know that old RS's are no 
> longer alive and not try to assign regions to them. I don't have enough 
> familiarity with 1.4 to say whether, hypothetically, that might be fixed in 
> a release between 1.4.5 and 1.4.7.
>
> I don't have specific suggestions, but can share how I'd approach it.
>
> I'd pick one specific region and try to trace the logic around just 
> that one region. Start with the state in hbase:meta -- see if there is 
> a column in meta for this old server. Expand out to WALs in HDFS. 
> Since you can wipe ZK and this still happens, it seems clear it's not 
> coming from ZK data.
>
> Compare the data you find with what DEBUG logging in the Master says, 
> see if you can figure out some more about how the Master chose to make 
> the decision it did. That will help lead you to what the appropriate 
> "fix" should be.
>
> On 10/1/18 10:46 AM, Austin Heyne wrote:
>> I'm running HBase 1.4.4 on EMR. In following your suggestions I 
>> realized that the master is trying to assign the regions to 
>> dead/non-existent region servers. While trying to fix this problem I 
>> had killed the EMR cluster and started a new one. It's still trying 
>> to assign some regions to those region servers in the previous 
>> cluster. I tried to manually move one of the regions to a good region 
>> server but I'm getting 'ERROR: No route to host' when I try to close 
>> the region.
>>
>> I've tried nuking the /hbase directory in Zookeeper but that didn't 
>> seem to help so I'm not sure where it's getting these references from.
>>
>> -Austin
>>
>>
>> On 09/30/2018 02:38 PM, Josh Elser wrote:
>>> First off: You're on EMR? What version of HBase are you using? (Maybe 
>>> Zach or Stephen can help here too). Can you figure out the 
>>> RegionServer(s) which are stuck opening these PENDING_OPEN regions? 
>>> Can you get a jstack/thread-dump from those RS's?
>>>
>>> In terms of how the system is supposed to work: the PENDING_OPEN 
>>> state for a Region "R" means: the active Master has asked a 
>>> RegionServer to open R. That RS should have an active thread which 
>>> is trying to open R. Upon success, the state of R will move from 
>>> PENDING_OPEN to OPEN. Otherwise, the Master will try to assign R again.
>>>
>>> In the absence of any custom coprocessors (including Phoenix), this 
>>> would mean some subset of RegionServers are in a bad state. Figuring 
>>> out what those RS's are trying to do will be the next step in 
>>> figuring out why they're stuck like that. It might be obvious from 
>>> the UI, or you might have to look at hbase:meta or the master log to 
>>> figure it out.
>>>
>>> One caveat, it's possible that the Master is just not doing the 
>>> right thing as described above. If the steps described above don't 
>>> seem to be matching what your system is doing, you might have to 
>>> look closer at the Master log. Make sure you have DEBUG on to get 
>>> anything of value out of the system.
>>>
>>> On 9/30/18 1:43 PM, Austin Heyne wrote:
>>>> I'm having a strange problem that my usual bag of tricks is having 
>>>> trouble sorting out. On Friday queries stopped returning for some 
>>>> reason. You could see them come in and there would be a resource 
>>>> utilization spike that would fade out after an appropriate amount 
>>>> of time, however, the query would never actually return. This could 
>>>> be related to our client code but I wasn't able to dig into it 
>>>> since this was the middle of the day on a production system. Since 
>>>> this had happened before and bouncing HBase cleared it up, I 
>>>> proceeded to disable tables and restart HBase. Upon bringing HBase 
>>>> back up, a few thousand regions are stuck in PENDING_OPEN state and 
>>>> refuse to move from that state. I've run hbck -repair a number of 
>>>> times under a few conditions (even the offline repair), have 
>>>> deleted everything out of /hbase in zookeeper and even migrated the 
>>>> cluster to new servers (EMR) with no luck. When I spin HBase up the 
>>>> regions are already at PENDING_OPEN even though the tables are 
>>>> offline.
>>>>
>>>> Any ideas on what's going on here would be a huge help.
>>>>
>>>> Thanks,
>>>> Austin
>>>>
>>

-- 
Austin L. Heyne


Re: Regions Stuck PENDING_OPEN

Posted by Josh Elser <el...@apache.org>.
That seems pretty wrong. The Master should know that old RS's are no 
longer alive and not try to assign regions to them. I don't have enough 
familiarity with 1.4 to say whether, hypothetically, that might be fixed in 
a release between 1.4.5 and 1.4.7.

I don't have specific suggestions, but can share how I'd approach it.

I'd pick one specific region and try to trace the logic around just that 
one region. Start with the state in hbase:meta -- see if there is a 
column in meta for this old server. Expand out to WALs in HDFS. Since 
you can wipe ZK and this still happens, it seems clear it's not coming 
from ZK data.
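
(For a single region, that first check of hbase:meta might look 
something like the sketch below; the row key and ZooKeeper quorum are 
placeholders.)

    import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "zk-host:2181")  // placeholder
    val connection = ConnectionFactory.createConnection(conf)
    val meta = connection.getTable(TableName.META_TABLE_NAME)

    // Placeholder row key: "<table>,<startkey>,<timestamp>.<encoded name>."
    val regionRow = "mytable,startkey,1538300000000.0123456789abcdef0123456789abcdef."

    // Dump every info:* column for that region so any reference to the old
    // server (server, sn, serverstartcode, state) is visible in one place.
    // Note info:regioninfo is a serialized HRegionInfo, so it prints as binary.
    val get    = new Get(Bytes.toBytes(regionRow)).addFamily(Bytes.toBytes("info"))
    val result = meta.get(get)
    Option(result.listCells()).map(_.asScala).getOrElse(Nil).foreach { cell =>
      println("info:" + Bytes.toString(CellUtil.cloneQualifier(cell)) +
        " = " + Bytes.toString(CellUtil.cloneValue(cell)))
    }

    meta.close(); connection.close()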

Compare the data you find with what DEBUG logging in the Master says, 
see if you can figure out some more about how the Master chose to make 
the decision it did. That will help lead you to what the appropriate 
"fix" should be.

On 10/1/18 10:46 AM, Austin Heyne wrote:
> I'm running HBase 1.4.4 on EMR. In following your suggestions I realized 
> that the master is trying to assign the regions to dead/non-existent 
> region servers. While trying to fix this problem I had killed the EMR 
> cluster and started a new one. It's still trying to assign some regions 
> to those region servers in the previous cluster. I tried to manually 
> move one of the regions to a good region server but I'm getting 'ERROR: 
> No route to host' when I try to close the region.
> 
> I've tried nuking the /hbase directory in Zookeeper but that didn't seem 
> to help so I'm not sure where it's getting these references from.
> 
> -Austin
> 
> 
> On 09/30/2018 02:38 PM, Josh Elser wrote:
>> First off: You're on EMR? What version of HBase are you using? (Maybe 
>> Zach or Stephen can help here too). Can you figure out the 
>> RegionServer(s) which are stuck opening these PENDING_OPEN regions? 
>> Can you get a jstack/thread-dump from those RS's?
>>
>> In terms of how the system is supposed to work: the PENDING_OPEN state 
>> for a Region "R" means: the active Master has asked a RegionServer to 
>> open R. That RS should have an active thread which is trying to open 
>> R. Upon success, the state of R will move from PENDING_OPEN to OPEN. 
>> Otherwise, the Master will try to assign R again.
>>
>> In the absence of any custom coprocessors (including Phoenix), this would 
>> mean some subset of RegionServers are in a bad state. Figuring out 
>> what those RS's are trying to do will be the next step in figuring out 
>> why they're stuck like that. It might be obvious from the UI, or you 
>> might have to look at hbase:meta or the master log to figure it out.
>>
>> One caveat, it's possible that the Master is just not doing the right 
>> thing as described above. If the steps described above don't seem to 
>> be matching what your system is doing, you might have to look closer 
>> at the Master log. Make sure you have DEBUG on to get anything of 
>> value out of the system.
>>
>> On 9/30/18 1:43 PM, Austin Heyne wrote:
>>> I'm having a strange problem that my usual bag of tricks is having 
>>> trouble sorting out. On Friday queries stopped returning for some 
>>> reason. You could see them come in and there would be a resource 
>>> utilization spike that would fade out after an appropriate amount of 
>>> time, however, the query would never actually return. This could be 
>>> related to our client code but I wasn't able to dig into it since 
>>> this was the middle of the day on a production system. Since this had 
>>> happened before and bouncing HBase cleared it up, I proceeded to 
>>> disable tables and restart HBase. Upon bringing HBase back up, a few 
>>> thousand regions are stuck in PENDING_OPEN state and refuse to move 
>>> from that state. I've run hbck -repair a number of times under a few 
>>> conditions (even the offline repair), have deleted everything out of 
>>> /hbase in zookeeper and even migrated the cluster to new servers 
>>> (EMR) with no luck. When I spin HBase up the regions are already at 
>>> PENDING_OPEN even though the tables are offline.
>>>
>>> Any ideas on what's going on here would be a huge help.
>>>
>>> Thanks,
>>> Austin
>>>
>