
Posted to user@hbase.apache.org by Sean Busbey <bu...@apache.org> on 2018/12/06 15:45:38 UTC

How should I be getting the set of regions in transition?

This week I've run into two cases where I needed the set of regions in
transition so I could recover them, and I ran into what I think is a
gap in our operator tooling. I'm hoping folks will have some ideas
I've missed.

Depending on how this thread goes, I'll start a follow-on thread on the
dev@hbase list for implementing changes and documentation.

Case 1: HBase 1.2-ish RIT following RS crash

Cluster had a handful of region servers fail and for whatever reason a
few regions were stuck in transition. The operator I was helping
is already used to dealing with the occasional manual recovery. Their
normal process looks like this:

1) Go to the Master UI website
2) Scroll down to the Regions in Transition list
3) Find a RIT in FAILED_CLOSE / FAILED_OPEN / PENDING_OPEN
4) Confirm in the RS logs that the RS associated with that RIT is now in
good health and doesn't expect to do anything with said region
5) Run "assign" in the hbase shell for the region

Unfortunately, the cluster's HDFS was under duress and so listing
snapshot information was super slow. This caused the Master UI website
to hang prior to displaying the RIT list.

We ended up looking at the master log file.

Case 2: HBase 2.1-ish RIT following cluster wide crash

AFAICT the cluster had experienced a failure of all RS and masters. Upon
coming back up, the Master was left with ~10% of ~10K regions in a state of
PENDING_OPEN or OPENING, all with a RS that had no idea it was involved
with those regions. I'm pretty sure this is a bug; I'm still triaging
it and I don't think it's relevant to the current question.

Once I confirmed the given RS was not currently doing anything for any
of those regions, I figured I'd use HBCK2 to run assigns to get
things fixed. However, since there were around 900 RITs, the Master UI
was unusable for getting a complete list. Also, with that many all in
the same state, I want to be able to automate running against each of
them.
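
As an illustration of that kind of automation, here is a minimal Python
sketch that splits a list of encoded region names into batches and builds
one "assigns" command line per batch. The launcher (./run-hbck2.sh), the
batch size, and the generated region names are all placeholders invented
for this example; how HBCK2 is actually launched depends on the install.

```python
def batched_assigns(regions, launcher, batch_size=50):
    """Yield one argv list per batch: launcher + ["assigns"] + region names."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(regions), batch_size):
        batch = regions[start:start + batch_size]
        yield list(launcher) + ["assigns"] + list(batch)


# Hypothetical wrapper script; substitute however HBCK2 is run on the cluster.
launcher = ["./run-hbck2.sh"]

# Fake 32-character hex names, generated only so the example has input.
regions = ["%032x" % i for i in range(120)]

commands = list(batched_assigns(regions, launcher, batch_size=50))
for cmd in commands:
    # Print a short summary instead of executing anything.
    print(cmd[0], cmd[1], "(%d regions)" % (len(cmd) - 2))
```

Each argv list could be reviewed and then handed to
subprocess.run(cmd, check=True). Batching keeps any single command line
short and makes it easier to stop partway if something looks wrong.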

I ended up grepping the master log file and pulling out the WARN
messages about RIT to tease out the list of regions, then passed those
to hbck2.
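
The log scraping could look roughly like the following Python sketch. It
assumes encoded region names appear as 32-character hex strings and that
the interesting lines contain both "WARN" and "RIT". The sample log lines
are invented for illustration and are not real master log output.

```python
import re

# Assumption: encoded region names are 32 lowercase hex characters.
ENCODED_REGION = re.compile(r"\b[0-9a-f]{32}\b")


def extract_rit_regions(lines, level="WARN", marker="RIT"):
    """Return unique encoded region names from matching lines, first-seen order."""
    seen = {}
    for line in lines:
        if level in line and marker in line:
            for name in ENCODED_REGION.findall(line):
                seen.setdefault(name, None)
    return list(seen)


# Hypothetical log lines; the real message format will differ.
sample = [
    "2018-12-05 10:00:00 WARN  [example] RIT region=0123456789abcdef0123456789abcdef state=OPENING",
    "2018-12-05 10:00:01 INFO  [example] unrelated 0123456789abcdef0123456789abcdee",
    "2018-12-05 10:00:02 WARN  [example] RIT region=0123456789abcdef0123456789abcdef state=OPENING",
    "2018-12-05 10:00:03 WARN  [example] RIT region=fedcba9876543210fedcba9876543210 state=OPENING",
]

regions = extract_rit_regions(sample)
for name in regions:
    print(name)
```

On a real log, the same function could be fed an open file object instead
of the sample list, after adjusting the level and marker to whatever the
master actually logs. Repeated warnings for the same region collapse to a
single entry.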

----

Am I missing some obvious place where I can use a CLI tool to get a
list of RIT? I don't see anything in the ref guide. I looked through
the help of HBCK 1 and the shell and couldn't find anything.

I think I can use Admin.getClusterStatus() and getClusterMetrics() to
get this info from the Java API. That means there's some way to get it
in the hbase shell, but it'll probably be ugly. If there's not already
an easier way I'll want to wrap that so it's a simple command.

Re: How should I be getting the set of regions in transition?

Posted by Andrew Purtell <ap...@apache.org>.

> In the second one the master was still initializing so it'd depend on if
> master will respond to cluster status requests in that state.

HBASE-21521 proposes to change the order of master initialization so the UI
is up immediately and able to show the current initialization step and
status. I'm thinking about taking on that issue but a change of that nature
will introduce a bunch of new corner cases so we will see how far I get.


On Thu, Dec 6, 2018 at 9:29 AM Sean Busbey <bu...@apache.org> wrote:

> Excellent! that definitely would have been perfect for the first case.
> In the second one the master was still initializing so it'd depend on
> if master will respond to cluster status requests in that state.
>
> I'll go grab the patch and get ready to use it the next time I hit
> this. If it works I'm going to add some docs to the ref guide and a
> release note.
> On Thu, Dec 6, 2018 at 11:05 AM Andrew Purtell <an...@gmail.com>
> wrote:
> >
> > HBASE-21283
> >
> > > On Dec 6, 2018, at 8:55 AM, Andrew Purtell <an...@gmail.com> wrote:
> > >
> > > I recently added a shell command "rit" that displays the list of
> > > current RIT. Would that have worked? It does require that the master is
> > > responsive to a GetClusterStatus request.


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: How should I be getting the set of regions in transition?

Posted by Sean Busbey <bu...@apache.org>.

Excellent! That definitely would have been perfect for the first case.
In the second one the master was still initializing, so it'd depend on
whether the master will respond to cluster status requests in that state.

I'll go grab the patch and get ready to use it the next time I hit
this. If it works I'm going to add some docs to the ref guide and a
release note.

On Thu, Dec 6, 2018 at 11:05 AM Andrew Purtell <an...@gmail.com> wrote:
>
> HBASE-21283
>
> > On Dec 6, 2018, at 8:55 AM, Andrew Purtell <an...@gmail.com> wrote:
> >
> > I recently added a shell command "rit" that displays the list of current RIT. Would that have worked? It does require that the master is responsive to a GetClusterStatus request.

Re: How should I be getting the set of regions in transition?

Posted by Andrew Purtell <an...@gmail.com>.

HBASE-21283

> On Dec 6, 2018, at 8:55 AM, Andrew Purtell <an...@gmail.com> wrote:
> 
> I recently added a shell command "rit" that displays the list of current RIT. Would that have worked? It does require that the master is responsive to a GetClusterStatus request. 

Re: How should I be getting the set of regions in transition?

Posted by Andrew Purtell <an...@gmail.com>.

I recently added a shell command "rit" that displays the list of current RIT. Would that have worked? It does require that the master is responsive to a GetClusterStatus request. 



Re: How should I be getting the set of regions in transition?

Posted by Sean Busbey <bu...@apache.org>.

I fixed the fix version on HBASE-21410. It'll be new in 2.1.2 if the
current RC passes.

On Mon, Dec 10, 2018 at 5:49 PM Sean Busbey <bu...@apache.org> wrote:
>
> On Mon, Dec 10, 2018 at 11:32 AM Stack <st...@duboce.net> wrote:
> >
> > On Thu, Dec 6, 2018 at 7:45 AM Sean Busbey <bu...@apache.org> wrote:
> >
> > > ...
> >
> > > Once I confirmed the given RS was not currently doing anything for any
> > > of those regions I figured I'd use HBCK2 to run an assigns to get
> > > things fixed. However, since there were like 900 RITs, the Master UI
> > > was unusable for getting a complete list.
> >
> >
> >
> > How unusable Sean? Was it up?
> >
>
> It was up. but we paginate the results so there's only 5 at a time.
> I'm not going to click through ~200 pages to get the list of things to
> copy/paste.
>
> >
> > > Also with that many all in
> > > the same state I want to be able to automate running against each of
> > > them.
> > >
> > > I ended up greping the master log file and pulling out the WARN
> > > messages about RIT to tease out the list of regions, then passed those
> > > to hbck2.
> > >
> > >
> >
> > Yeah. You saw the doc over on hbck2?
> > https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
> >
>
> indeed. super helpful and why I knew how to use the assigns to bypass things.
>
> > Did you have:
> >
> > commit fa6373660f622e7520a9f2639485cc386f18ede0
> > Author: jingyuntian <ti...@gmail.com>
> > Date:   Thu Nov 8 15:30:30 2018 +0800
> >
> >     HBASE-21410 A helper page that help find all problematic regions and
> > procedures
> >
> > It dumps the problematic on the UI so can save on messing in logs.
> >
>
> Hurm. the fix version on HBASE-21410 suggest I should have had it, but
> I don't think that page is present? I must be missing something.
>
> Between that and HBASE-21283 I should have plenty to put into a
> troubleshooting blurb. :)

Re: How should I be getting the set of regions in transition?

Posted by Sean Busbey <bu...@apache.org>.

On Mon, Dec 10, 2018 at 11:32 AM Stack <st...@duboce.net> wrote:
>
> On Thu, Dec 6, 2018 at 7:45 AM Sean Busbey <bu...@apache.org> wrote:
>
> > ...
>
> > Once I confirmed the given RS was not currently doing anything for any
> > of those regions I figured I'd use HBCK2 to run an assigns to get
> > things fixed. However, since there were like 900 RITs, the Master UI
> > was unusable for getting a complete list.
>
>
>
> How unusable Sean? Was it up?
>

It was up, but we paginate the results so there are only 5 at a time.
I'm not going to click through ~200 pages to get the list of things to
copy/paste.

>
> > Also with that many all in
> > the same state I want to be able to automate running against each of
> > them.
> >
> > I ended up greping the master log file and pulling out the WARN
> > messages about RIT to tease out the list of regions, then passed those
> > to hbck2.
> >
> >
>
> Yeah. You saw the doc over on hbck2?
> https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
>

Indeed. It was super helpful and is why I knew how to use assigns to bypass things.

> Did you have:
>
> commit fa6373660f622e7520a9f2639485cc386f18ede0
> Author: jingyuntian <ti...@gmail.com>
> Date:   Thu Nov 8 15:30:30 2018 +0800
>
>     HBASE-21410 A helper page that help find all problematic regions and
> procedures
>
> It dumps the problematic on the UI so can save on messing in logs.
>

Hurm. The fix version on HBASE-21410 suggests I should have had it, but
I don't think that page is present? I must be missing something.

Between that and HBASE-21283 I should have plenty to put into a
troubleshooting blurb. :)

Re: How should I be getting the set of regions in transition?

Posted by Stack <st...@duboce.net>.

On Thu, Dec 6, 2018 at 7:45 AM Sean Busbey <bu...@apache.org> wrote:

> This week I've run into two cases where I needed the set of regions in
>  transition so I could recover them and I ran into what I think is a
> gap in our operator tooling. I'm hoping folks will have some ideas
> I've missed.
>
> Depending on how this thread goes, I'll make some follow-on on the
> dev@hbase list for implementing changes and documentation.
>
>
> ....


>
> Case 2: HBase 2.1-ish RIT following cluster wide crash
>
> AFAICT cluster had experienced a failure of all RS and masters. Upon
> coming back up Master was left with ~10% of ~10K regions in a state of
> PENDING_OPEN or OPENING all with a RS that had no idea it was involved
> with those regions. I'm pretty sure this is a bug;  I'm still triaging
> it and I don't think it's relevant to the current question.
>
>
Yeah. This sounds like an interesting case.



> Once I confirmed the given RS was not currently doing anything for any
> of those regions I figured I'd use HBCK2 to run an assigns to get
> things fixed. However, since there were like 900 RITs, the Master UI
> was unusable for getting a complete list.



How unusable, Sean? Was it up?


> Also with that many all in
> the same state I want to be able to automate running against each of
> them.
>
> I ended up greping the master log file and pulling out the WARN
> messages about RIT to tease out the list of regions, then passed those
> to hbck2.
>
>

Yeah. You saw the doc over on hbck2?
https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2

Did you have:

commit fa6373660f622e7520a9f2639485cc386f18ede0
Author: jingyuntian <ti...@gmail.com>
Date:   Thu Nov 8 15:30:30 2018 +0800

    HBASE-21410 A helper page that help find all problematic regions and
    procedures

It dumps the problematic regions and procedures on the UI, so it can save
on messing around in logs.

Thanks,
S




