Posted to user@hbase.apache.org by Bryan Beaudreault <bb...@hubspot.com> on 2012/04/16 17:21:04 UTC

regions stuck in transition

Hello,

We've recently had a problem where regions will get stuck in transition for
a long period of time.  In fact, they don't ever appear to get
out-of-transition unless we take manual action.  Last time this happened I
restarted the master and they were cleared out.  This time I wanted to
consult the list first.

I checked the admin UI for all 24 of our servers, and the region does not
appear to be hosted anywhere.  If I look in HDFS, I do see the region there
and it has 2 files.  The first mention of this region in my HMaster logs
is:

12/04/15 17:48:06 INFO master.HMaster: balance
  hri=visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.,
  src=XXXXXXXXX.ec2.internal,60020,1334064456919,
  dest=XXXXXXXX.ec2.internal,60020,1334064197946
12/04/15 17:48:06 INFO master.AssignmentManager: Server
  serverName=XXXXXXXX.ec2.internal,60020,1334064456919, load=(requests=0,
  regions=0, usedHeap=0, maxHeap=0) returned
  org.apache.hadoop.hbase.NotServingRegionException:
  org.apache.hadoop.hbase.NotServingRegionException: Received close for
  visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
  but we are not serving it for 703fed4411f2d6ff4b3ea80506fb635e


It then repeats the same few log lines roughly every 30 minutes:

12/04/15 18:18:18 INFO master.AssignmentManager: Regions in transition
  timed out:
  visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
  state=PENDING_CLOSE, ts=1334526491544, server=null
12/04/15 18:18:18 INFO master.AssignmentManager: Region has been
  PENDING_CLOSE for too long, running forced unassign again on
  region=visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
12/04/15 18:18:18 INFO master.AssignmentManager: Server
  serverName=XXXXXXXXX.ec2.internal,60020,1334064456919, load=(requests=0,
  regions=0, usedHeap=0, maxHeap=0) returned
  org.apache.hadoop.hbase.NotServingRegionException:
  org.apache.hadoop.hbase.NotServingRegionException: Received close for
  visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
  but we are not serving it for 703fed4411f2d6ff4b3ea80506fb635e
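
For anyone reading the timed-out entry above: the ts field looks like
milliseconds since the Unix epoch (an assumption, though it lines up with the
rest of the log).  A minimal Python sketch that decodes it follows; the regex
and the function name decode_rit are illustrative only and not part of HBase.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the "state=..., ts=..." fragment of a regions-in-transition log line.
RIT_FIELDS = re.compile(r"state=(?P<state>\w+), ts=(?P<ts>\d+)")


def decode_rit(fragment):
    """Return (state, UTC datetime) parsed from a log fragment, or None."""
    match = RIT_FIELDS.search(fragment)
    if match is None:
        return None
    # Assumption: ts is milliseconds since 1970-01-01 00:00:00 UTC.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    when = epoch + timedelta(milliseconds=int(match.group("ts")))
    return match.group("state"), when


state, when = decode_rit("state=PENDING_CLOSE, ts=1334526491544, server=null")
print(state, when.isoformat())
```

This decodes to 2012-04-15 21:48:11 UTC.  If the log clock is UTC-4 (also an
assumption), that is 17:48:11 local time: a few seconds after the balance
entry at 17:48:06 and about 30 minutes before the first timeout at 18:18:18.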


Any ideas how I can avoid this, or a better solution than restarting the
HMaster?

Thanks,

Bryan

Re: regions stuck in transition

Posted by Stack <st...@duboce.net>.
On Wed, Apr 18, 2012 at 7:38 AM, Bryan Beaudreault
<bb...@hubspot.com> wrote:
> Yes, I can get data from the region through the shell.  The problem is the
> balancer cannot run when a region is in transition, so it is never running.
>  Our region servers are becoming increasingly unbalanced.
>

Yes.  This is by design.  We don't want balancer running when regions
are in-transition out on the cluster.  Unfortunately, in your case,
you have a stuck region.

> I don't want to restart a RegionServer, because it would cause a blip in
> requests for any regions on that server.  At least restarting the master
> seems to not affect reads.
>

FYI, see http://hbase.apache.org/book.html#node.management for a way
of decommissioning a regionserver in a way that minimizes the blip.

St.Ack

Re: regions stuck in transition

Posted by Bryan Beaudreault <bb...@hubspot.com>.
Yes, I can get data from the region through the shell.  The problem is the
balancer cannot run when a region is in transition, so it is never running.
 Our region servers are becoming increasingly unbalanced.

I don't want to restart a RegionServer, because it would cause a blip in
requests for any regions on that server.  At least restarting the master
seems to not affect reads.

Any other ways to avoid this happening or fix it without restarting
services?

Thanks,

Bryan

On Tue, Apr 17, 2012 at 4:38 PM, Alex Baranau <al...@gmail.com>wrote:

> I've seen similar behavior on our cluster too.
>
> Off the top of my head, you can try restarting the particular RegionServer
> those regions belong to (in the cases I saw, a single regionserver was
> usually the issue).
>
> Have you tried to access data from that region (e.g. in shell)? I think it
> should still be served.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase

Re: regions stuck in transition

Posted by Alex Baranau <al...@gmail.com>.
I've seen similar behavior on our cluster too.

Off the top of my head, you can try restarting the particular RegionServer
those regions belong to (in the cases I saw, a single regionserver was
usually the issue).

Have you tried to access data from that region (e.g. in the shell)? I think it
should still be served.

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase

Re: regions stuck in transition

Posted by Stack <st...@duboce.net>.
On Mon, Apr 16, 2012 at 8:21 AM, Bryan Beaudreault
<bb...@hubspot.com> wrote:
> We've recently had a problem where regions will get stuck in transition for
> a long period of time.  In fact, they don't ever appear to get
> out-of-transition unless we take manual action.  Last time this happened I
> restarted the master and they were cleared out.  This time I wanted to
> consult the list first.
>

Yeah, sometimes the master's notion of the cluster state goes out of
agreement w/ conditions on the ground, and restarting the master, which
forces it to reconsult the cluster, is the only way to clear up certain
states (much has been fixed around the issues that gave rise to these
conditions in later HBase versions, but you probably figured that, and
it's probably of little immediate help to you at the moment).

> I checked the admin ui for all 24 of our servers, and the region does not
> appear to be hosted anywhere.  If I look in hdfs, I do see the region there
> and it has 2 files.  The first instance of this region in my HMaster logs
> is:
>
> 12/04/15 17:48:06 INFO master.HMaster: balance
>> hri=visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.,
>> src=XXXXXXXXX.ec2.internal,60020,1334064456919,
>> dest=XXXXXXXX.ec2.internal,60020,1334064197946
>> 12/04/15 17:48:06 INFO master.AssignmentManager: Server
>> serverName=XXXXXXXX.ec2.internal,60020,1334064456919, load=(requests=0,
>> regions=0, usedHeap=0, maxHeap=0) returned
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Received close for
>> visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
>> but we are not serving it for 703fed4411f2d6ff4b3ea80506fb635e
>

This seems like a classic case of the master being out of whack w/ the
cluster; it's trying to rebalance a region that is not where it thinks it is.


> It then keeps saying the same few logs every ~30 mins:
>
> 12/04/15 18:18:18 INFO master.AssignmentManager: Regions in transition
>> timed out:

Yeah, a checker runs every 30 minutes.


>>  visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
>> state=PENDING_CLOSE, ts=1334526491544, server=null
>> 12/04/15 18:18:18 INFO master.AssignmentManager: Region has been
>> PENDING_CLOSE for too long, running forced unassign again on
>> region=visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
>> 12/04/15 18:18:18 INFO master.AssignmentManager: Server
>> serverName=XXXXXXXXX.ec2.internal,60020,1334064456919, load=(requests=0,
>> regions=0, usedHeap=0, maxHeap=0) returned
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Received close for
>> visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
>> but we are not serving it for 703fed4411f2d6ff4b3ea80506fb635e
>
>
> Any ideas how I can avoid this, or a better solution than restarting the
> HMaster?
>

Can you grep this region in your master log so we can see its history?
If it's not deployed anywhere and all your data is online, restarting
the master might be the only thing you can do in 0.90.x-era HBase to
get rid of the above.  You could also try deleting that region's znode
from zk.  Fire up the zk command line by doing ./bin/hbase zkcli and run
help; you should be able to figure it out from there.  If you can't find
the znode in zk, then for sure it's only in the master's head and a
restart of the master is the way to go.  (In later HBase versions, should
this condition arise, there is an API you can poke to make it clear the
above so you don't have to restart the master.)
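
A plain grep works for pulling the region's history, but entries that wrap
over several lines lose their continuation lines.  Below is a minimal Python
sketch that groups lines into entries before filtering.  It assumes every
entry starts with a timestamp like the ones quoted in this thread (for
example "12/04/15 17:48:06"); the function name region_history and the file
name in the usage comment are illustrative.

```python
import re

# Assumption: a new log entry starts with a timestamp such as "12/04/15 17:48:06".
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")


def region_history(lines, encoded_name):
    """Yield each complete log entry (joined onto one line) mentioning encoded_name."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        # A timestamp closes the entry collected so far.
        if ENTRY_START.match(line) and entry:
            joined = " ".join(entry)
            if encoded_name in joined:
                yield joined
            entry = []
        entry.append(line.strip())
    # Flush the last entry in the file.
    if entry:
        joined = " ".join(entry)
        if encoded_name in joined:
            yield joined


# Usage sketch (the log file name is hypothetical):
#   with open("hbase-master.log", errors="replace") as log:
#       for item in region_history(log, "703fed4411f2d6ff4b3ea80506fb635e"):
#           print(item)
```

Searching on the encoded name (the hash at the end of the region name) avoids
having to escape the \x00\x02 bytes in the full region name.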

St.Ack