Posted to user@hbase.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2012/01/17 00:45:01 UTC

Table region got stuck, doesn't move/assign

Hi,

I have a table that seems to be stuck in a state where it can't be
queried, moved or split/compacted.

The logs don't have any error statements. Our admin tried hbck to no avail.

We stopped the region server, but the table did not get reassigned (all
the others did). When browsed in the UI, this table just showed "region
server offline". (Shouldn't it get reassigned as the others did?)

Bringing the region server back online loaded it with other regions, but
not that table. The master apparently still thinks it is on that node
(data6), so all requests are failing with a "region not serving" message.

The assign/move/unassign commands have no effect (move fails, while
assign/unassign return quietly with no apparent effect).

Another weirdness: it's the only table that is showing up under
hbase/table in zk and its region is listed under /hbase/unassigned.


What else can we do? This is a prod database (cdh3u0) and we would very
much like to avoid any downtime.

Where can I read about the meaning and transitions of the ZooKeeper nodes under /hbase?

Thank you very much.
-d

Re: Table region got stuck, doesn't move/assign

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Another weirdness, and where it all started: while that region was still
where it was, any attempt to query that table hung indefinitely (the call
never returned). Now that we have killed/restarted the RS, it of course
just says 'region not serving'.


Re: Table region got stuck, doesn't move/assign

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Michael.

The problem is solved (for now) by moving the region out after restarting
the region server, although we don't really know why it happened or what
happened to that region.

The region server got stuck on any request to one particular region, and
only that one. The master was OK, as I realized later. Why it couldn't
immediately move the region, I am not sure; but as soon as we restarted
the region server and switched the table offline/online, it was able to
complete the move/reassignment of the region.

The real problem was that it happened to one (apparently random) region
on a region server but not to the others. The symptom was the region
server hanging, not returning from any scan request to that region (but
fine for the others). The condition persisted for a long time (several
days), and we did not figure it out until we caught several low-importance
jobs timing out while reading from the table containing that region. The
table experiences asynchronous reads and regular write updates (it's
actually part of an HBL cube).
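
A hang like this is easy to miss because the call never fails; it just never returns. Below is a minimal sketch of a client-side watchdog that would surface it as a timeout. It assumes a hypothetical `probe` callable that performs one small read against the table; `probe` and `call_with_deadline` are illustrative names, not part of any HBase API.

```python
import threading


def call_with_deadline(probe, timeout_s):
    """Run probe() in a daemon thread with a deadline.

    Returns (True, result) if probe finishes within timeout_s seconds and
    (False, None) if it is still running at the deadline. An exception
    raised by probe is re-raised in the caller.
    """
    outcome = {}

    def worker():
        try:
            outcome["result"] = probe()
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # Hung (or merely slow): the thread cannot be killed, only abandoned.
        return False, None
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome["result"]
```

Run periodically against each table, a `(False, None)` result would have flagged the stuck region well before batch jobs started timing out. The deadline only bounds how long the caller waits; it does not cancel the underlying request.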

I think there's a really low chance we'll ever get to the bottom of it,
so we have dropped any further triage attempts at this point. I guess we
also just need to upgrade our HBase stack in prod.

Thank you very much, sir.

-d


On Wed, Jan 18, 2012 at 9:34 AM, Stack <st...@duboce.net> wrote:
> On Mon, Jan 16, 2012 at 3:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> I have a table that seems to be stuck in a state where it can't be
>> queried, moved or split/compacted.
>>
>
> How many regions in this table?  One only?
>
>> The logs don't have any error statements. Our admin tried hbck to no avail.
>>
>
> What did your admin see?
>
>
>> We stopped the region server, but the table did not get reassigned (all
>> the others did). When browsed in the UI, this table just showed "region
>> server offline". (Shouldn't it get reassigned as the others did?)
>>
>
> Yes.  It should.
>
>> Bringing the region server back online loaded it with other regions, but
>> not that table. The master apparently still thinks it is on that node
>> (data6), so all requests are failing with a "region not serving" message.
>>
>
>
> So, there is something 'wrong' w/ that table.  Can you track it in the
> master log and see what happens when the master tries to assign it?  Maybe
> it's failing to open?
>
>> The assign/move/unassign commands have no effect (move fails, while
>> assign/unassign return quietly with no apparent effect).
>>
>> Another weirdness: it's the only table that is showing up under
>> hbase/table in zk and its region is listed under /hbase/unassigned.
>>
>
>
> Maybe it's stuck in transition?  You should see messages in the master log
> if this is the case.
>
>> Where can I read about the meaning and transitions of the ZooKeeper nodes under /hbase?
>>
>
> I don't think this is documented in the reference guide (it's a little too
> much detail for most, I'd say).  The best place to look is probably the
> source code.  See here for an entrance into the wonderful world of
> master/regionserver state transitions:
> http://hbase.apache.org/xref/org/apache/hadoop/hbase/executor/EventHandler.html#93
>
> St.Ack

Re: Table region got stuck, doesn't move/assign

Posted by Stack <st...@duboce.net>.
On Mon, Jan 16, 2012 at 3:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> I have a table that seems to be stuck in a state where it can't be
> queried, moved or split/compacted.
>

How many regions in this table?  One only?

> The logs don't have any error statements. Our admin tried hbck to no avail.
>

What did your admin see?


> We stopped the region server, but the table did not get reassigned (all
> the others did). When browsed in the UI, this table just showed "region
> server offline". (Shouldn't it get reassigned as the others did?)
>

Yes.  It should.

> Bringing the region server back online loaded it with other regions, but
> not that table. The master apparently still thinks it is on that node
> (data6), so all requests are failing with a "region not serving" message.
>


So, there is something 'wrong' w/ that table.  Can you track it in the
master log and see what happens when the master tries to assign it?  Maybe
it's failing to open?
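
One way to follow a single region through the master log is to filter for its name. Below is a minimal sketch, assuming the log is a plain text file and that the region's name (or encoded name) appears verbatim on the relevant lines. The function name, script name and both command-line arguments are hypothetical placeholders.

```python
import sys
from typing import Iterable, Iterator


def lines_mentioning(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Yield every line that contains `needle`, without its trailing newline."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Placeholder usage: python track_region.py <master-log-path> <region-name>
    if len(sys.argv) == 3:
        log_path, region = sys.argv[1], sys.argv[2]
        # errors="replace" keeps the scan going past any undecodable bytes.
        with open(log_path, errors="replace") as fh:
            for hit in lines_mentioning(fh, region):
                print(hit)
```

Reading the hits in order should show whether the master ever attempts the assignment and where the attempt stops.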

> The assign/move/unassign commands have no effect (move fails, while
> assign/unassign return quietly with no apparent effect).
>
> Another weirdness: it's the only table that is showing up under
> hbase/table in zk and its region is listed under /hbase/unassigned.
>


Maybe it's stuck in transition?  You should see messages in the master log
if this is the case.

> Where can I read about the meaning and transitions of the ZooKeeper nodes under /hbase?
>

I don't think this is documented in the reference guide (it's a little too
much detail for most, I'd say).  The best place to look is probably the
source code.  See here for an entrance into the wonderful world of
master/regionserver state transitions:
http://hbase.apache.org/xref/org/apache/hadoop/hbase/executor/EventHandler.html#93

St.Ack