Posted to user@hbase.apache.org by James Baldassari <ja...@dataxu.com> on 2010/01/27 06:03:09 UTC

Region gets stuck in transition state

Hi,

I'm using the Cloudera distribution of HBase, version
0.20.0~1-1.cloudera, in a fully-distributed cluster of 10 nodes.  I'm
using all default config options except for hbase.zookeeper.quorum,
hbase.rootdir, hbase.cluster.distributed, and an updated regionservers
file containing all our region servers.
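
For reference, the relevant part of our hbase-site.xml looks roughly
like this (host names and paths here are placeholders, not our real
ones):

    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://namenode.example.com:9000/hbase</value>  <!-- placeholder NameNode -->
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>zk1.example.com,zk2.example.com,zk3.example.com</value>  <!-- placeholder quorum -->
      </property>
    </configuration>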

After running a map/reduce job which inserted around 180,000 rows into
HBase, HBase appeared to be fine.  We could do a count on our table, and
no errors were reported.  We then tried to truncate the table in
preparation for another test but were unable to do so because the region
became stuck in a transition state.  I restarted each region server
individually, but it did not fix the problem.  I tried the
disable_region and close_region commands from the hbase shell, but that
didn't work either.  After doing all of that, a status 'detailed' showed
this:

1 regionsInTransition
    name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
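
For the record, the shell commands I tried looked roughly like this,
using the region name from the status output (output elided):

    hbase(main):001:0> disable_region 'retargeting,,1264546222144'
    hbase(main):002:0> close_region 'retargeting,,1264546222144'
    hbase(main):003:0> status 'detailed'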

Then I restarted the master and all region servers, and it looked like this:

1 regionsInTransition
    name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false

I noticed messages in some of the region server logs indicating that
their zookeeper sessions had expired.  I'm not sure if this has anything
to do with the problem.  I should mention that this scenario is quite
repeatable, and the last few times it has happened we had to shut down
HBase and manually remove the /hbase root from HDFS, then start HBase
and recreate the table.
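
To be explicit, that recovery amounts to roughly the following, and it
destroys all HBase data, so it isn't something we want to keep doing:

    $ $HBASE_HOME/bin/stop-hbase.sh
    $ $HADOOP_HOME/bin/hadoop fs -rmr /hbase   # wipes the HBase root dir in HDFS
    $ $HBASE_HOME/bin/start-hbase.sh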

Any ideas what could put the region into this state or what to do to fix
it?  How can I prevent this from happening in the future?

I was also wondering whether it was normal for there to be only one
region with 180,000+ rows.  Shouldn't this region be split into several
regions and distributed among the region servers?  I'm new to HBase, so
maybe my understanding of how it's supposed to work is wrong.

Thanks,
James



Re: Region gets stuck in transition state

Posted by Ryan Rawson <ry...@gmail.com>.
Restarting the master can help. Some of these bugs were fixed in 0.20.3,
which was just released. Upgrade if you can!
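
A master-only restart is just the daemon script from $HBASE_HOME,
something like:

    $ bin/hbase-daemon.sh stop master
    $ bin/hbase-daemon.sh start master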

On Jan 26, 2010 9:03 PM, "James Baldassari" <ja...@dataxu.com> wrote:

Hi,

I'm using the Cloudera distribution of HBase, version
0.20.0~1-1.cloudera, in a fully-distributed cluster of 10 nodes.  I'm
using all default config options except for hbase.zookeeper.quorum,
hbase.rootdir, hbase.cluster.distributed, and an updated regionservers
file containing all our region servers.

After running a map/reduce job which inserted around 180,000 rows into
HBase, HBase appeared to be fine.  We could do a count on our table, and
no errors were reported.  We then tried to truncate the table in
preparation for another test but were unable to do so because the region
became stuck in a transition state.  I restarted each region server
individually, but it did not fix the problem.  I tried the
disable_region and close_region commands from the hbase shell, but that
didn't work either.  After doing all of that, a status 'detailed' showed
this:

1 regionsInTransition
    name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false

Then I restarted the master and all region servers, and it looked like this:

1 regionsInTransition
    name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false

I noticed messages in some of the region server logs indicating that
their zookeeper sessions had expired.  I'm not sure if this has anything
to do with the problem.  I should mention that this scenario is quite
repeatable, and the last few times it has happened we had to shut down
HBase and manually remove the /hbase root from HDFS, then start HBase
and recreate the table.

Any ideas what could put the region into this state or what to do to fix
it?  How can I prevent this from happening in the future?

I was also wondering whether it was normal for there to be only one
region with 180,000+ rows.  Shouldn't this region be split into several
regions and distributed among the region servers?  I'm new to HBase, so
maybe my understanding of how it's supposed to work is wrong.

Thanks,
James

Re: Region gets stuck in transition state

Posted by Stack <st...@duboce.net>.
Oh, the Cloudera lads are working on updating their distro to 0.20.3.
Will flag the list when done.
St.Ack

On Wed, Jan 27, 2010 at 2:51 PM, Stack <st...@duboce.net> wrote:
> On Wed, Jan 27, 2010 at 2:41 PM, James Baldassari <ja...@dataxu.com> wrote:
>>
>> First we shut down the master and all region servers and then manually
>> removed the /hbase root through hadoop/HDFS.  One of my colleagues
>> increased some timeout values (I think they were ZooKeeper timeouts).
>
> ticktime?
>
>> Another change was that I recreated the table without LZO compression
>> and without setting the IN_MEMORY flag.  I learned that we did not have
>> the LZO libraries installed, and the table had been created originally
>> with compression set to LZO, so I imagine that would cause problems.  I
>> didn't see any errors about it in the logs, however.  Maybe this
>> explains why we lost data during our initial testing after shutting down
>> HBase.  Perhaps it was unable to write the data to HDFS because the LZO
>> libraries were not available?
>>
>
> If LZO is enabled and the libs are not in place, no data is written, IIRC.
> It's a problem.
>
>> Anyway, everything seems to be ok for now.  We can restart HBase without
>> data loss or errors, and we can truncate the table without any problems.
>> If any other issues crop up we plan on upgrading to 0.20.3, but our
>> preference is to stay with the Cloudera distro if we can.  We're doing
>> additional testing tonight with a larger dataset, so I'll keep an eye on
>> it and post back if we learn anything new.
>
> Avoid truncating tables if you are not on 0.20.3.  It's flaky and may
> put you back in the spot you complained of originally.
>
> St.Ack
>
>>
>> Thanks again for your help.
>>
>> -James
>>
>>
>> On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
>>> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
>>> >
>>> > After running a map/reduce job which inserted around 180,000 rows into
>>> > HBase, HBase appeared to be fine.  We could do a count on our table, and
>>> > no errors were reported.  We then tried to truncate the table in
>>> > preparation for another test but were unable to do so because the region
>>> > became stuck in a transition state.
>>>
>>> Yes.  In older HBase, truncating anything larger than a small table was
>>> flaky.  It's better in 0.20.3 (I wrote our brothers over at Cloudera about
>>> updating the version they bundle, especially since 0.20.3 just went out).
>>>
>>> > I restarted each region server
>>> > individually, but it did not fix the problem.  I tried the
>>> > disable_region and close_region commands from the hbase shell, but that
>>> > didn't work either.  After doing all of that, a status 'detailed' showed
>>> > this:
>>> >
>>> > 1 regionsInTransition
>>> >    name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>>> >
>>> > Then I restarted the master and all region servers, and it looked like this:
>>> >
>>> > 1 regionsInTransition
>>> >    name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
>>>
>>>
>>> Even after a master restart?  The above is a dump of a master-internal
>>> data structure that is kept in memory.  Strange that it would pick up the
>>> same exact state on restart (as Ryan says, a restart of the master
>>> alone is usually a radical but sufficient fix).
>>>
>>> I was going to suggest that you try onlining the individual region in
>>> the shell, but I don't think that'll work either, not unless you update
>>> to a 0.20.3-era HBase.
>>>
>>> >
>>> > I noticed messages in some of the region server logs indicating that
>>> > their zookeeper sessions had expired.  I'm not sure if this has anything
>>> > to do with the problem.
>>>
>>> It could.  The regionservers will restart if their session w/ zk
>>> expires.  What's your HBase schema like?  How are you doing your
>>> upload?
>>>
>>> > I should mention that this scenario is quite
>>> > repeatable, and the last few times it has happened we had to shut down
>>> > HBase and manually remove the /hbase root from HDFS, then start HBase
>>> > and recreate the table.
>>> >
>>> For sure you've upped file descriptors and xceiver params as per the
>>> Getting Started?
>>>
>>> >
>>> > I was also wondering whether it was normal for there to be only one
>>> > region with 180,000+ rows.  Shouldn't this region be split into several
>>> > regions and distributed among the region servers?  I'm new to HBase, so
>>> > maybe my understanding of how it's supposed to work is wrong.
>>>
>>> Get the region's size on the filesystem: ./bin/hadoop fs -dus
>>> /hbase/table/regionname.  A region splits when it's above a size
>>> threshold, usually 256MB.
>>>
>>> St.Ack
>>>
>>> >
>>> > Thanks,
>>> > James
>>> >
>>> >
>>> >
>>
>>
>

Re: Region gets stuck in transition state

Posted by Stack <st...@duboce.net>.
On Wed, Jan 27, 2010 at 2:41 PM, James Baldassari <ja...@dataxu.com> wrote:
>
> First we shut down the master and all region servers and then manually
> removed the /hbase root through hadoop/HDFS.  One of my colleagues
> increased some timeout values (I think they were ZooKeeper timeouts).

ticktime?
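
If it was the tick time, the knobs would look something like this in
hbase-site.xml (values are only illustrative; ZK negotiates the session
timeout to between 2x and 20x the tick time):

    <property>
      <name>hbase.zookeeper.property.tickTime</name>
      <value>3000</value>
    </property>
    <property>
      <name>zookeeper.session.timeout</name>
      <value>60000</value>
    </property>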

> Another change was that I recreated the table without LZO compression
> and without setting the IN_MEMORY flag.  I learned that we did not have
> the LZO libraries installed, and the table had been created originally
> with compression set to LZO, so I imagine that would cause problems.  I
> didn't see any errors about it in the logs, however.  Maybe this
> explains why we lost data during our initial testing after shutting down
> HBase.  Perhaps it was unable to write the data to HDFS because the LZO
> libraries were not available?
>

If LZO is enabled and the libs are not in place, no data is written, IIRC.
It's a problem.
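
If you need to recreate the table without LZO, a sketch in the shell
(the 'data' family name is just an example, not your schema):

    hbase(main):001:0> disable 'retargeting'
    hbase(main):002:0> drop 'retargeting'
    hbase(main):003:0> create 'retargeting', {NAME => 'data', COMPRESSION => 'NONE', IN_MEMORY => false}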

> Anyway, everything seems to be ok for now.  We can restart HBase without
> data loss or errors, and we can truncate the table without any problems.
> If any other issues crop up we plan on upgrading to 0.20.3, but our
> preference is to stay with the Cloudera distro if we can.  We're doing
> additional testing tonight with a larger dataset, so I'll keep an eye on
> it and post back if we learn anything new.

Avoid truncating tables if you are not on 0.20.3.  It's flaky and may
put you back in the spot you complained of originally.

St.Ack

>
> Thanks again for your help.
>
> -James
>
>
> On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
>> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
>> >
>> > After running a map/reduce job which inserted around 180,000 rows into
>> > HBase, HBase appeared to be fine.  We could do a count on our table, and
>> > no errors were reported.  We then tried to truncate the table in
>> > preparation for another test but were unable to do so because the region
>> > became stuck in a transition state.
>>
>> Yes.  In older HBase, truncating anything larger than a small table was
>> flaky.  It's better in 0.20.3 (I wrote our brothers over at Cloudera about
>> updating the version they bundle, especially since 0.20.3 just went out).
>>
>> > I restarted each region server
>> > individually, but it did not fix the problem.  I tried the
>> > disable_region and close_region commands from the hbase shell, but that
>> > didn't work either.  After doing all of that, a status 'detailed' showed
>> > this:
>> >
>> > 1 regionsInTransition
>> >    name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>> >
>> > Then I restarted the master and all region servers, and it looked like this:
>> >
>> > 1 regionsInTransition
>> >    name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
>>
>>
>> Even after a master restart?  The above is a dump of a master-internal
>> data structure that is kept in memory.  Strange that it would pick up the
>> same exact state on restart (as Ryan says, a restart of the master
>> alone is usually a radical but sufficient fix).
>>
>> I was going to suggest that you try onlining the individual region in
>> the shell, but I don't think that'll work either, not unless you update
>> to a 0.20.3-era HBase.
>>
>> >
>> > I noticed messages in some of the region server logs indicating that
>> > their zookeeper sessions had expired.  I'm not sure if this has anything
>> > to do with the problem.
>>
>> It could.  The regionservers will restart if their session w/ zk
>> expires.  What's your HBase schema like?  How are you doing your
>> upload?
>>
>> > I should mention that this scenario is quite
>> > repeatable, and the last few times it has happened we had to shut down
>> > HBase and manually remove the /hbase root from HDFS, then start HBase
>> > and recreate the table.
>> >
>> For sure you've upped file descriptors and xceiver params as per the
>> Getting Started?
>>
>> >
>> > I was also wondering whether it was normal for there to be only one
>> > region with 180,000+ rows.  Shouldn't this region be split into several
>> > regions and distributed among the region servers?  I'm new to HBase, so
>> > maybe my understanding of how it's supposed to work is wrong.
>>
>> Get the region's size on the filesystem: ./bin/hadoop fs -dus
>> /hbase/table/regionname.  A region splits when it's above a size
>> threshold, usually 256MB.
>>
>> St.Ack
>>
>> >
>> > Thanks,
>> > James
>> >
>> >
>> >
>
>

Re: Region gets stuck in transition state

Posted by Ryan Rawson <ry...@gmail.com>.
Just FYI, there are known bugs in 0.20.0; I strongly urge you to get onto
0.20.3 ASAP, either on your own or as soon as CDH includes it.



On Wed, Jan 27, 2010 at 2:41 PM, James Baldassari <ja...@dataxu.com> wrote:
> Thank you for the suggestions.  I think we have managed to resolve the
> problem.  Due to our tight deadline on this project we weren't able to
> change one parameter, retest, and then change another, so I'm not sure
> exactly which change(s) fixed the problem.
>
> First we shut down the master and all region servers and then manually
> removed the /hbase root through hadoop/HDFS.  One of my colleagues
> increased some timeout values (I think they were ZooKeeper timeouts).
> Another change was that I recreated the table without LZO compression
> and without setting the IN_MEMORY flag.  I learned that we did not have
> the LZO libraries installed, and the table had been created originally
> with compression set to LZO, so I imagine that would cause problems.  I
> didn't see any errors about it in the logs, however.  Maybe this
> explains why we lost data during our initial testing after shutting down
> HBase.  Perhaps it was unable to write the data to HDFS because the LZO
> libraries were not available?
>
> Anyway, everything seems to be ok for now.  We can restart HBase without
> data loss or errors, and we can truncate the table without any problems.
> If any other issues crop up we plan on upgrading to 0.20.3, but our
> preference is to stay with the Cloudera distro if we can.  We're doing
> additional testing tonight with a larger dataset, so I'll keep an eye on
> it and post back if we learn anything new.
>
> Thanks again for your help.
>
> -James
>
>
> On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
>> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
>> >
>> > After running a map/reduce job which inserted around 180,000 rows into
>> > HBase, HBase appeared to be fine.  We could do a count on our table, and
>> > no errors were reported.  We then tried to truncate the table in
>> > preparation for another test but were unable to do so because the region
>> > became stuck in a transition state.
>>
>> Yes.  In older HBase, truncating anything larger than a small table was
>> flaky.  It's better in 0.20.3 (I wrote our brothers over at Cloudera about
>> updating the version they bundle, especially since 0.20.3 just went out).
>>
>> > I restarted each region server
>> > individually, but it did not fix the problem.  I tried the
>> > disable_region and close_region commands from the hbase shell, but that
>> > didn't work either.  After doing all of that, a status 'detailed' showed
>> > this:
>> >
>> > 1 regionsInTransition
>> >    name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>> >
>> > Then I restarted the master and all region servers, and it looked like this:
>> >
>> > 1 regionsInTransition
>> >    name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
>>
>>
>> Even after a master restart?  The above is a dump of a master-internal
>> data structure that is kept in memory.  Strange that it would pick up the
>> same exact state on restart (as Ryan says, a restart of the master
>> alone is usually a radical but sufficient fix).
>>
>> I was going to suggest that you try onlining the individual region in
>> the shell, but I don't think that'll work either, not unless you update
>> to a 0.20.3-era HBase.
>>
>> >
>> > I noticed messages in some of the region server logs indicating that
>> > their zookeeper sessions had expired.  I'm not sure if this has anything
>> > to do with the problem.
>>
>> It could.  The regionservers will restart if their session w/ zk
>> expires.  What's your HBase schema like?  How are you doing your
>> upload?
>>
>> > I should mention that this scenario is quite
>> > repeatable, and the last few times it has happened we had to shut down
>> > HBase and manually remove the /hbase root from HDFS, then start HBase
>> > and recreate the table.
>> >
>> For sure you've upped file descriptors and xceiver params as per the
>> Getting Started?
>>
>> >
>> > I was also wondering whether it was normal for there to be only one
>> > region with 180,000+ rows.  Shouldn't this region be split into several
>> > regions and distributed among the region servers?  I'm new to HBase, so
>> > maybe my understanding of how it's supposed to work is wrong.
>>
>> Get the region's size on the filesystem: ./bin/hadoop fs -dus
>> /hbase/table/regionname.  A region splits when it's above a size
>> threshold, usually 256MB.
>>
>> St.Ack
>>
>> >
>> > Thanks,
>> > James
>> >
>> >
>> >
>
>

Re: Region gets stuck in transition state

Posted by James Baldassari <ja...@dataxu.com>.
Thank you for the suggestions.  I think we have managed to resolve the
problem.  Due to our tight deadline on this project we weren't able to
change one parameter, retest, and then change another, so I'm not sure
exactly which change(s) fixed the problem.

First we shut down the master and all region servers and then manually
removed the /hbase root through hadoop/HDFS.  One of my colleagues
increased some timeout values (I think they were ZooKeeper timeouts).
Another change was that I recreated the table without LZO compression
and without setting the IN_MEMORY flag.  I learned that we did not have
the LZO libraries installed, and the table had been created originally
with compression set to LZO, so I imagine that would cause problems.  I
didn't see any errors about it in the logs, however.  Maybe this
explains why we lost data during our initial testing after shutting down
HBase.  Perhaps it was unable to write the data to HDFS because the LZO
libraries were not available?

Anyway, everything seems to be ok for now.  We can restart HBase without
data loss or errors, and we can truncate the table without any problems.
If any other issues crop up we plan on upgrading to 0.20.3, but our
preference is to stay with the Cloudera distro if we can.  We're doing
additional testing tonight with a larger dataset, so I'll keep an eye on
it and post back if we learn anything new. 

Thanks again for your help.

-James


On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
> >
> > After running a map/reduce job which inserted around 180,000 rows into
> > HBase, HBase appeared to be fine.  We could do a count on our table, and
> > no errors were reported.  We then tried to truncate the table in
> > preparation for another test but were unable to do so because the region
> > became stuck in a transition state.
> 
> Yes.  In older HBase, truncating anything larger than a small table was
> flaky.  It's better in 0.20.3 (I wrote our brothers over at Cloudera about
> updating the version they bundle, especially since 0.20.3 just went out).
> 
> > I restarted each region server
> > individually, but it did not fix the problem.  I tried the
> > disable_region and close_region commands from the hbase shell, but that
> > didn't work either.  After doing all of that, a status 'detailed' showed
> > this:
> >
> > 1 regionsInTransition
> >    name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
> >
> > Then I restarted the master and all region servers, and it looked like this:
> >
> > 1 regionsInTransition
> >    name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
> 
> 
> Even after a master restart?  The above is a dump of a master-internal
> data structure that is kept in memory.  Strange that it would pick up the
> same exact state on restart (as Ryan says, a restart of the master
> alone is usually a radical but sufficient fix).
> 
> I was going to suggest that you try onlining the individual region in
> the shell, but I don't think that'll work either, not unless you update
> to a 0.20.3-era HBase.
> 
> >
> > I noticed messages in some of the region server logs indicating that
> > their zookeeper sessions had expired.  I'm not sure if this has anything
> > to do with the problem.
> 
> It could.  The regionservers will restart if their session w/ zk
> expires.  What's your HBase schema like?  How are you doing your
> upload?
> 
> > I should mention that this scenario is quite
> > repeatable, and the last few times it has happened we had to shut down
> > HBase and manually remove the /hbase root from HDFS, then start HBase
> > and recreate the table.
> >
> For sure you've upped file descriptors and xceiver params as per the
> Getting Started?
> 
> >
> > I was also wondering whether it was normal for there to be only one
> > region with 180,000+ rows.  Shouldn't this region be split into several
> > regions and distributed among the region servers?  I'm new to HBase, so
> > maybe my understanding of how it's supposed to work is wrong.
> 
> Get the region's size on the filesystem: ./bin/hadoop fs -dus
> /hbase/table/regionname.  A region splits when it's above a size
> threshold, usually 256MB.
> 
> St.Ack
> 
> >
> > Thanks,
> > James
> >
> >
> >


Re: Region gets stuck in transition state

Posted by Stack <st...@duboce.net>.
On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
>
> After running a map/reduce job which inserted around 180,000 rows into
> HBase, HBase appeared to be fine.  We could do a count on our table, and
> no errors were reported.  We then tried to truncate the table in
> preparation for another test but were unable to do so because the region
> became stuck in a transition state.

Yes.  In older HBase, truncating anything larger than a small table was
flaky.  It's better in 0.20.3 (I wrote our brothers over at Cloudera about
updating the version they bundle, especially since 0.20.3 just went out).

> I restarted each region server
> individually, but it did not fix the problem.  I tried the
> disable_region and close_region commands from the hbase shell, but that
> didn't work either.  After doing all of that, a status 'detailed' showed
> this:
>
> 1 regionsInTransition
>    name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>
> Then I restarted the master and all region servers, and it looked like this:
>
> 1 regionsInTransition
>    name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false


Even after a master restart?  The above is a dump of a master-internal
data structure that is kept in memory.  Strange that it would pick up the
same exact state on restart (as Ryan says, a restart of the master
alone is usually a radical but sufficient fix).

I was going to suggest that you try onlining the individual region in
the shell, but I don't think that'll work either, not unless you update
to a 0.20.3-era HBase.
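
For the record, that would have been something like:

    hbase(main):001:0> enable_region 'retargeting,,1264546222144'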

>
> I noticed messages in some of the region server logs indicating that
> their zookeeper sessions had expired.  I'm not sure if this has anything
> to do with the problem.

It could.  The regionservers will restart if their session w/ zk
expires.  What's your HBase schema like?  How are you doing your
upload?

> I should mention that this scenario is quite
> repeatable, and the last few times it has happened we had to shut down
> HBase and manually remove the /hbase root from HDFS, then start HBase
> and recreate the table.
>
For sure you've upped file descriptors and xceiver params as per the
Getting Started?
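
The usual adjustments look roughly like this (values are only
illustrative; note the Hadoop property name really is spelled
'xcievers'):

    # /etc/security/limits.conf -- raise open files for the HBase/HDFS user
    hadoop  -  nofile  32768

    <!-- hdfs-site.xml, on every datanode -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2047</value>
    </property>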

>
> I was also wondering whether it was normal for there to be only one
> region with 180,000+ rows.  Shouldn't this region be split into several
> regions and distributed among the region servers?  I'm new to HBase, so
> maybe my understanding of how it's supposed to work is wrong.

Get the region's size on the filesystem: ./bin/hadoop fs -dus
/hbase/table/regionname.  A region splits when it's above a size
threshold, usually 256MB.
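
For example (the region directory under the table dir is the region's
encoded name; the threshold is hbase.hregion.max.filesize, 256MB by
default in this era):

    $ ./bin/hadoop fs -dus /hbase/retargeting/<encoded-region-name>

    <!-- to change the split threshold, in hbase-site.xml -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>268435456</value>  <!-- 256MB, the default -->
    </property>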

St.Ack

>
> Thanks,
> James
>
>
>