Posted to user@hbase.apache.org by kiran <ki...@gmail.com> on 2013/06/05 19:20:37 UTC

Handling regionserver crashes in production cluster

Dear All,

We have a production cluster that runs HBase 0.94.1. The issue we are
facing is that whenever one regionserver goes down, the cluster becomes
unresponsive until all of its regions are reassigned to other
regionserver(s). The transition takes about 3-5 minutes, and during this
time we are unable to do any client operation on the cluster.

Is there any way to make the transition run in the background?

Also, it is acceptable for us if client operations such as scan or get
do not work on the rowkeys of regions in transition. But they are not
working on the entire cluster until all the regions are moved out of
transition. We can't afford 3-5 minutes of downtime.

-- 
Thank you
Kiran Sarvabhotla

-----Even a correct decision is wrong when it is taken late

RE: Handling regionserver crashes in production cluster

Posted by Sandeep L <sa...@outlook.com>.

We are facing the same problem too. Is it fixed in HBase 0.94.8 or 0.97.6?
If it is fixed we will migrate; can someone confirm this?

Thanks,
Sandeep

> From: nkeywal@gmail.com
> Date: Thu, 13 Jun 2013 09:00:46 +0200
> Subject: Re: Handling regionserver crashes in production cluster
> To: user@hbase.apache.org
> 
> Hum... So even a simple get shows the issue?
> It would be a (surprising) critical bug. Could you please try the 95.1 or
> the 94.8? Or write an unit test?
> 
> Thanks,
> 
> Nicolas

Re: Handling regionserver crashes in production cluster

Posted by Nicolas Liochon <nk...@gmail.com>.

Hum... So even a simple get shows the issue?
That would be a (surprising) critical bug. Could you please try 0.95.1 or
0.94.8? Or write a unit test?

Thanks,

Nicolas


On Thu, Jun 13, 2013 at 5:43 AM, kiran <ki...@gmail.com> wrote:

> Its a simple kill...
> Scan is used using startrow and stoprow
> Scan scan = new Scan(Bytes.toBytes("adidas"), Bytes.toBytes("adidas1"));
>
>
> Our cluster size is 15. The load average when I see in master is 78%...It
> is not that overloaded. but writes are happening in the cluster...
>
> Thanks
> Kiran

Re: Handling regionserver crashes in production cluster

Posted by kiran <ki...@gmail.com>.

It's a simple kill...
The scan uses a start row and a stop row:
Scan scan = new Scan(Bytes.toBytes("adidas"), Bytes.toBytes("adidas1"));
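For reference, a Scan built with a start row and a stop row covers the keys from the start row (inclusive) up to the stop row (exclusive), in sorted key order. The sketch below only models that range with plain Java strings; it is not HBase client code. The class name, method name, and sample row keys are made up for illustration, and it assumes ASCII keys so that String ordering agrees with HBase's byte ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Plain-Java illustration of start/stop row semantics; not HBase client code. */
final class ScanRangeSketch {

    /** Returns the row keys in [startRow, stopRow), in sorted order. */
    static List<String> rowsInRange(TreeSet<String> rowKeys, String startRow, String stopRow) {
        // subSet(.., true, .., false) mirrors "start inclusive, stop exclusive".
        return new ArrayList<>(rowKeys.subSet(startRow, true, stopRow, false));
    }

    public static void main(String[] args) {
        // Hypothetical row keys, chosen to show which ones fall inside the range.
        TreeSet<String> rowKeys = new TreeSet<>(Arrays.asList(
                "adida", "adidas", "adidas-shoes", "adidas0",
                "adidas1", "adidas10", "adidaz"));

        // Prints [adidas, adidas-shoes, adidas0]
        System.out.println(rowsInRange(rowKeys, "adidas", "adidas1"));
    }
}
```

All keys in this range start with the letter a, so on a table split by first letter the scan should only need the region that holds the a keys.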


Our cluster size is 15. The load average I see on the master is 78%... It
is not that overloaded, but writes are happening in the cluster...

Thanks
Kiran



On Wed, Jun 12, 2013 at 10:49 PM, Nicolas Liochon <nk...@gmail.com> wrote:

> Yeah, it should not block the other regions.
>
> For the region server, was it a kill -9 or in simple kill (the former
> triggers a recovery, the later will close the region before stopping the
> process)?
>
> How do you select the scan scope? With stop/start rows?
> Can you share the client code you're using?
> What's the cluster size? Was it already very loaded before you killed the
> region server?
>
> Nicolas



-- 
Thank you
Kiran Sarvabhotla

-----Even a correct decision is wrong when it is taken late

Re: Handling regionserver crashes in production cluster

Posted by Nicolas Liochon <nk...@gmail.com>.

Yeah, it should not block the other regions.

For the region server, was it a kill -9 or a simple kill (the former
triggers a recovery, the latter will close the regions before stopping the
process)?
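
The difference between the two kills matches general JVM behaviour: a plain kill sends SIGTERM, which lets the JVM run its shutdown hooks, while kill -9 sends SIGKILL, which stops the process before any cleanup can run. A minimal sketch of that behaviour follows. It is not HBase's actual shutdown code, and the class and method names are hypothetical.

```java
/** Sketch: a JVM shutdown hook runs on SIGTERM (plain kill) but not on SIGKILL (kill -9). */
final class ShutdownHookSketch {

    /** Registers cleanup to run on orderly JVM shutdown and returns the hook thread. */
    static Thread registerCleanup(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws InterruptedException {
        registerCleanup(() -> System.out.println("cleanup ran before exit"));

        // Optional first argument: seconds to stay alive so the process can be killed.
        long seconds = args.length > 0 ? Long.parseLong(args[0]) : 0L;
        System.out.println("running for " + seconds + "s; try kill or kill -9 on this process");
        Thread.sleep(seconds * 1000L);
    }
}
```

Run it with an argument such as 60 and send the process a plain kill: the cleanup message is printed. Send kill -9 instead and the message never appears, which is the case where another component has to notice the failure and recover.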

How do you select the scan scope? With stop/start rows?
Can you share the client code you're using?
What's the cluster size? Was it already very loaded before you killed the
region server?

Nicolas



On Wed, Jun 12, 2013 at 6:11 PM, kiran <ki...@gmail.com> wrote:

> Yes we killed the region server but datanode is still running on the
> node...
>
> Sample Test scenario: Assume, I have table with pre-splits a upto z (about
> 26 regions). I brought down region server purposefully with regions having
> prefixes c and d. Then I used client API to scan data from regions with
> prefixes other than c and d. The response was very slow and sometimes not
> coming at all.
>
> My doubt was if only regions with prefix c and d are getting relocated or
> in transition. Why is it affecting the regions with other prefixes.... But
> once the region transition is over, the response is very fast as expected.

Re: Handling regionserver crashes in production cluster

Posted by kiran <ki...@gmail.com>.

Yes, we killed the region server, but the datanode is still running on the
node...

Sample test scenario: assume I have a table pre-split from a to z (about
26 regions). I purposefully brought down the region server hosting the
regions with prefixes c and d. Then I used the client API to scan data from
regions with prefixes other than c and d. The response was very slow, and
sometimes did not come at all.

My doubt is: if only the regions with prefixes c and d are being relocated
or in transition, why does it affect the regions with other prefixes? Once
the region transition is over, the response is very fast, as expected.
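
The expectation in this scenario can be sketched in plain Java: a scan should only need the regions whose key ranges overlap the scan range, so a scan outside c and d should not depend on the regions that are in transition. The model below is an assumption-laden simplification, not HBase code. It treats the table as 26 regions with start keys a to z (a real table's first region has an empty start key), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java model of the test scenario; the region layout is hypothetical. */
final class RegionsTouchedSketch {

    /** Start keys of the regions that overlap the scan range [startRow, stopRow). */
    static NavigableSet<String> regionsForScan(TreeSet<String> regionStartKeys,
                                               String startRow, String stopRow) {
        // The region holding startRow has the greatest start key <= startRow.
        String first = regionStartKeys.floor(startRow);
        if (first == null) {
            first = regionStartKeys.first();
        }
        if (first.compareTo(stopRow) >= 0) {
            return new TreeSet<>(); // the range ends before the first region starts
        }
        // Every later region whose start key is below stopRow is touched as well.
        return regionStartKeys.subSet(first, true, stopRow, false);
    }

    /** True if the scan needs at least one region that is currently offline. */
    static boolean needsOfflineRegion(Set<String> regionsTouched, Set<String> offlineRegions) {
        return !Collections.disjoint(regionsTouched, offlineRegions);
    }

    public static void main(String[] args) {
        TreeSet<String> regionStartKeys = new TreeSet<>();
        for (char c = 'a'; c <= 'z'; c++) {
            regionStartKeys.add(String.valueOf(c));
        }
        // Regions c and d were hosted on the region server that was killed.
        Set<String> offline = new HashSet<>(Arrays.asList("c", "d"));

        NavigableSet<String> adidas = regionsForScan(regionStartKeys, "adidas", "adidas1");
        System.out.println(adidas + " " + needsOfflineRegion(adidas, offline)); // [a] false

        NavigableSet<String> catToDog = regionsForScan(regionStartKeys, "cat", "dog");
        System.out.println(catToDog + " " + needsOfflineRegion(catToDog, offline)); // [c, d] true
    }
}
```

Under this model the adidas scan touches only region a, so any slowdown seen for it during the transition has to come from somewhere other than the scanned region itself.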



On Wed, Jun 12, 2013 at 8:50 PM, rajesh babu chintaguntla <
chrajeshbabu32@gmail.com> wrote:

> You can configure below to more value to close more regions at a time.
>
>  <property>
>     <name>hbase.regionserver.executor.closeregion.threads</name>
>     <value>3</value>
>   </property>



-- 
Thank you
Kiran Sarvabhotla

-----Even a correct decision is wrong when it is taken late

Re: Handling regionserver crashes in production cluster

Posted by rajesh babu chintaguntla <ch...@gmail.com>.

You can set the property below to a higher value to close more regions at a time.

<property>
  <name>hbase.regionserver.executor.closeregion.threads</name>
  <value>3</value>
</property>


On Wed, Jun 12, 2013 at 7:38 PM, Nicolas Liochon <nk...@gmail.com> wrote:

> What was your test exactly? You killed -9 a region server but kept the
> datanode alive?
> Could you detail the queries you were doing?

Re: Handling regionserver crashes in production cluster

Posted by Nicolas Liochon <nk...@gmail.com>.

What was your test exactly? You killed -9 a region server but kept the
datanode alive?
Could you detail the queries you were doing?


On Wed, Jun 12, 2013 at 2:10 PM, kiran <ki...@gmail.com> wrote:

> It is not possible for us to migrate to new version immediately.
>
> @Anoop we purposefully brought down one regionserver, then we observed the
> website is taking too much time to respond. We observed the pattern for
> about 5 min till the regions are relocated.
> Also we issued queries in our website taking care that the queries did n't
> come under the regions in the regionserver we brought down.
>
> Is there any configuration workaround to mitigate it??
>
> Thanks
> Kiran

Re: Handling regionserver crashes in production cluster

Posted by kiran <ki...@gmail.com>.

Migrating to a new version immediately is not possible for us.

@Anoop we deliberately brought down one regionserver and then observed that
the website was taking too long to respond. This went on for about 5 min,
until the regions were relocated.
We also took care that the queries we issued from our website did not touch
the regions hosted on the regionserver we brought down.

Is there any configuration workaround to mitigate this?

Thanks
Kiran



On Thu, Jun 6, 2013 at 8:27 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Hi Kiran,
>
> Also, any chance for you to migrate to 0.94.8? There have been
> hundreds of fixes since 0.94.1...
>
> JM



-- 
Thank you
Kiran Sarvabhotla

-----Even a correct decision is wrong when it is taken late

Re: Handling regionserver crashes in production cluster

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Kiran,

Also, any chance for you to migrate to 0.94.8? There have been
hundreds of fixes since 0.94.1...

JM

2013/6/6 Anoop John <an...@gmail.com>:
> How many total RS in the cluster?  You mean u can not do any operation on
> other regions in the live clusters?  It should not happen..  Is it so
> happening that the client ops are targetted at the regions which were in
> the dead RS( and in transition now)?   Can u have a closer look and see?
> If not pls check the RS threads were they are getting blocked.
>
> -Anoop-

Re: Handling regionserver crashes in production cluster

Posted by Anoop John <an...@gmail.com>.

How many RS are there in the cluster in total?  You mean you cannot do any
operation on other regions in the live cluster?  That should not happen.
Could it be that the client ops are targeted at the regions that were on
the dead RS (and are now in transition)?  Can you take a closer look?
If not, please check the RS threads to see where they are getting blocked.

-Anoop-

On Wed, Jun 5, 2013 at 10:50 PM, kiran <ki...@gmail.com> wrote:

> Dear All,
>
> We have production cluster that runs on hbase 0.94.1. The issue we are
> facing is whenever one regionserver goes down, the cluster becomes
> unresponsive until all the regions are allocated to another
> regionserver(s). The transition is taking about 3-5 mins and during this
> time we are unable to any do client operation on the cluster.
>
> Is there any way we can make the transition to run in background ?
>
> Also, it is acceptable for us if the client operations such as scan or get
> does not work on the rowkeys of regions in transition. But, they are not
> working on the entire cluster until all the regions are moved out of
> transition. We can't afford 3-5 minutes of downtime.
>
> --
> Thank you
> Kiran Sarvabhotla
>
> -----Even a correct decision is wrong when it is taken late
>
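
One way to act on the suggestion above to check where the RS threads are getting blocked is to take a thread dump of a live regionserver with the JDK's `jstack <pid>` during the transition window and summarize the states of its RPC handler threads. The following is only a rough sketch, not something posted in this thread: the class name `ThreadDumpSummary`, the dump file argument, and the name filter "IPC Server handler" are assumptions for illustration (check the thread names in your own dump).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts thread states in a saved jstack dump. */
public class ThreadDumpSummary {

    // Matches the state line that jstack prints under each thread header.
    private static final Pattern STATE =
            Pattern.compile("^\\s*java\\.lang\\.Thread\\.State: (\\w+)");

    /** Counts threads per state, restricted to threads whose name contains nameFilter. */
    public static Map<String, Integer> countByState(String dump, String nameFilter) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean inMatchingThread = false;
        for (String line : dump.split("\\r?\\n")) {
            if (line.startsWith("\"")) {
                // Thread header line: the thread name is the first quoted string.
                int end = line.indexOf('"', 1);
                String name = end > 0 ? line.substring(1, end) : "";
                inMatchingThread = name.contains(nameFilter);
            } else if (inMatchingThread) {
                Matcher m = STATE.matcher(line);
                if (m.find()) {
                    counts.merge(m.group(1), 1, Integer::sum);
                    inMatchingThread = false; // only one state line per thread
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: ThreadDumpSummary <jstack-output-file> [name-filter]");
            return;
        }
        // Assumed default filter; adjust to the handler thread names in your dump.
        String filter = args.length > 1 ? args[1] : "IPC Server handler";
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(countByState(dump, filter));
    }
}
```

If most handler threads on the surviving regionservers show up as BLOCKED or WAITING while one regionserver is down, the stack frames under those threads in the raw dump are the place to look next.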