Posted to dev@hbase.apache.org by Ryan Rawson <ry...@gmail.com> on 2009/04/04 11:19:57 UTC

something wrong on trunk possibly

Hey guys,

There seems to be something wrong on trunk... I used to have long map-reduce
jobs, but now they are failing, unable to commit:

2009-04-04 01:17:09,279 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: locateRegionInMeta attempt 5 of 10 failed; retrying after sleep of 8000
java.io.IOException: HRegionInfo was null or empty in .META.
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:566)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:515)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:484)
... etc

Basically, mappers get stuck on commits and make no progress until mapred
kills them.

I've spent some time banging at it - made sure that ulimit -n is good, set
the ipc handler limit to 30, cranked down the number of maps I'm doing,
etc.  To no avail.

At least I figured out how to debug hadoop jobs a bit.

Anyone have thoughts?

Re: something wrong on trunk possibly

Posted by stack <st...@duboce.net>.
On Sun, Apr 5, 2009 at 11:24 PM, Ryan Rawson <ry...@gmail.com> wrote:

> I'm hoping the new key format will help with these things a bit.  One
> solution is a dump/restore of .META. whereby you dump the known keys, then
> delete and reload the table after truncating and deleting all the store
> files.  We don't have tools for that yet iirc...



Add this to the hbfsck issue, I'd say.  It's a good idea.  Remove any
serverstartcode and/or serveraddress that does not have a corresponding
regioninfo entry (or that is on a row that does not have a region on the
filesystem).
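
To make the proposed rule concrete, here is a rough sketch of it over an in-memory model of .META. (row key mapped to column name mapped to value). It is only an illustration, not the HBase client API or any existing tool: the class name, the method name and the column names "info:regioninfo", "info:server" and "info:serverstartcode" are assumptions made for the example, and the check against regions on the filesystem is not modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MetaOrphanCleanup {
    // Illustrative column names, assumed for this sketch only.
    static final String REGIONINFO = "info:regioninfo";
    static final String SERVER = "info:server";
    static final String STARTCODE = "info:serverstartcode";

    /**
     * Removes server/startcode cells from rows that have no (or an empty)
     * regioninfo cell. Returns the keys of the rows that were cleaned.
     */
    static List<String> removeOrphans(Map<String, Map<String, String>> meta) {
        List<String> cleaned = new ArrayList<>();
        Iterator<Map.Entry<String, Map<String, String>>> it = meta.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Map<String, String>> row = it.next();
            Map<String, String> cols = row.getValue();
            String hri = cols.get(REGIONINFO);
            if (hri != null && !hri.isEmpty()) {
                continue; // row still describes a region; leave it alone
            }
            boolean removedServer = cols.remove(SERVER) != null;
            boolean removedStartcode = cols.remove(STARTCODE) != null;
            if (removedServer || removedStartcode) {
                cleaned.add(row.getKey());
            }
            if (cols.isEmpty()) {
                it.remove(); // nothing left on this row, drop it entirely
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> meta = new TreeMap<>();

        Map<String, String> live = new HashMap<>();
        live.put(REGIONINFO, "region-a");
        live.put(SERVER, "host1:60020");
        live.put(STARTCODE, "1238836749000");
        meta.put("t1,a", live);

        // Orphan row: server and startcode survived, regioninfo did not.
        Map<String, String> orphan = new HashMap<>();
        orphan.put(SERVER, "host2:60020");
        orphan.put(STARTCODE, "1238836749001");
        meta.put("t1,m", orphan);

        System.out.println("cleaned rows: " + removeOrphans(meta));
        System.out.println("remaining rows: " + meta.keySet());
    }
}
```

A real tool would have to issue the deletes against the live catalog table, which is exactly the part that proved awkward in this thread.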


> maybe hbase-1234 will help fix this problem in a fundamental way?
>

I don't think so.  HBASE-1234 will not change how deletes are done, though
it does make it possible to change how we do deletes because it introduces
the notion of key types.

St.Ack

Re: something wrong on trunk possibly

Posted by Ryan Rawson <ry...@gmail.com>.
I'm hoping the new key format will help with these things a bit.  One
solution is a dump/restore of .META. whereby you dump the known keys, then
delete and reload the table after truncating and deleting all the store
files.  We don't have tools for that yet iirc...

Maybe HBASE-1234 will help fix this problem in a fundamental way?

On Sun, Apr 5, 2009 at 1:25 PM, stack <st...@duboce.net> wrote:

> Sorry, I meant to write earlier.
>
> I've seen this condition in the past, back when deletes were not working
> properly.  What I'd see is that the HRegionInfo entry in .META. had been
> deleted but, somehow, a bug had it that the accompanying startcode and
> server entries were not deleted.   These startcode and server entries would
> bubble up during getClosest but were not easily deletable since they were
> not in the current set of .META. regions.  We seemed to have put this issue
> behind us.   Maybe your OOME during the bulk upload brought it on?
>
> At Powerset, we had a condition where a table had entries in .META. that
> had been made with an old hbase.  Updating to an hbase with the deletes fix
> was not sufficient; when these empty HRIs shine through, you can't delete
> the startcode and server entries, seemingly because "they are not in the
> table" (getting their timestamps proved awkward).  The only recourse back
> then was renaming the table so it got a new namespace in .META.
>
> St.Ack

Re: something wrong on trunk possibly

Posted by stack <st...@duboce.net>.
Sorry, I meant to write earlier.

I've seen this condition in the past, back when deletes were not working
properly.  What I'd see is that the HRegionInfo entry in .META. had been
deleted but, somehow, a bug had it that the accompanying startcode and
server entries were not deleted.   These startcode and server entries would
bubble up during getClosest but were not easily deletable since they were
not in the current set of .META. regions.  We seemed to have put this issue
behind us.   Maybe your OOME during the bulk upload brought it on?

At Powerset, we had a condition where a table had entries in .META. that had
been made with an old hbase.  Updating to an hbase with the deletes fix was
not sufficient; when these empty HRIs shine through, you can't delete the
startcode and server entries, seemingly because "they are not in the table"
(getting their timestamps proved awkward).  The only recourse back then was
renaming the table so it got a new namespace in .META.
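
For anyone following along, the way a leftover row can "bubble up" during a closest-row lookup can be modelled roughly as below. This is a toy sketch, not the HBase implementation: a java.util.TreeMap and its floorEntry() stand in for the "closest row at or before" lookup, and the class name, method name, row keys and column names are all made up for illustration.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ClosestRowSketch {
    // Illustrative column names, assumed for this sketch only.
    static final String REGIONINFO = "info:regioninfo";
    static final String SERVER = "info:server";

    /**
     * Finds the row at or before the given key and returns its regioninfo.
     * Fails if the closest row carries no regioninfo, which is the situation
     * an orphaned server/startcode row creates.
     */
    static String locate(TreeMap<String, Map<String, String>> meta, String key)
            throws IOException {
        Map.Entry<String, Map<String, String>> closest = meta.floorEntry(key);
        if (closest == null) {
            throw new IOException("no row at or before " + key);
        }
        String hri = closest.getValue().get(REGIONINFO);
        if (hri == null || hri.isEmpty()) {
            throw new IOException(
                "HRegionInfo was null or empty in row " + closest.getKey());
        }
        return hri;
    }

    public static void main(String[] args) {
        TreeMap<String, Map<String, String>> meta = new TreeMap<>();

        Map<String, String> live = new HashMap<>();
        live.put(REGIONINFO, "region-a");
        meta.put("t1,a", live);

        // Leftover row: regioninfo was deleted, the server entry was not.
        Map<String, String> orphan = new HashMap<>();
        orphan.put(SERVER, "host2:60020");
        meta.put("t1,m", orphan);

        for (String key : new String[] {"t1,c", "t1,q"}) {
            try {
                System.out.println(key + " -> " + locate(meta, key));
            } catch (IOException e) {
                System.out.println(key + " -> " + e.getMessage());
            }
        }
    }
}
```

Any key that sorts at or after the leftover row lands on it instead of on the live region row before it, so the lookup keeps failing no matter how often the client retries.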

St.Ack

On Sat, Apr 4, 2009 at 12:14 PM, Ryan Rawson <ry...@gmail.com> wrote:

> I looked at the commits on trunk, nothing new recently.
>
> Some weird corruption and scanner errors in trunk.... nuking /hbase and
> restarting fixed it, something wrong with the .META. table obviously.
>
> Looks like what is happening is findClosestBefore() returns an 'empty'
> RowResult, with absolutely no columns in it; furthermore, the row id doesn't
> appear in my 'region list' Web-UI.  So it's not an active real alive
> region, it's some other artifact that is still hanging out. Maybe it's a
> phantom delete showing up as an entry.
>
> I'm not sure it's worthwhile debugging until after HBASE-1234 comes out.
> After all, the buggy code is probably being substantially reworked and/or
> removed.
>
> -ryan

Re: something wrong on trunk possibly

Posted by Andrew Purtell <ap...@yahoo.com>.



--- On Sat, 4/4/09, Ryan Rawson <ry...@gmail.com> wrote:

> From: Ryan Rawson <ry...@gmail.com>
> Subject: Re: something wrong on trunk possibly
> To: hbase-dev@hadoop.apache.org
> Date: Saturday, April 4, 2009, 3:14 AM
> I looked at the commits on trunk, nothing new recently.
>
> Some weird corruption and scanner errors in trunk.... nuking /hbase and
> restarting fixed it, something wrong with the .META. table obviously.
>
> Looks like what is happening is findClosestBefore() returns an 'empty'
> RowResult, with absolutely no columns in it; furthermore, the row id doesn't
> appear in my 'region list' Web-UI.  So it's not an active real alive
> region, it's some other artifact that is still hanging out. Maybe it's a
> phantom delete showing up as an entry.
>
> I'm not sure it's worthwhile debugging until after HBASE-1234 comes out.
> After all, the buggy code is probably being substantially reworked and/or
> removed.
> 
> -ryan


      

Re: something wrong on trunk possibly

Posted by Ryan Rawson <ry...@gmail.com>.
I looked at the commits on trunk, nothing new recently.

Some weird corruption and scanner errors in trunk.... nuking /hbase and
restarting fixed it, something wrong with the .META. table obviously.

Looks like what is happening is findClosestBefore() returns an 'empty'
RowResult, with absolutely no columns in it; furthermore, the row id doesn't
appear in my 'region list' Web-UI.  So it's not an active real alive region,
it's some other artifact that is still hanging out. Maybe it's a phantom
delete showing up as an entry.

I'm not sure it's worthwhile debugging until after HBASE-1234 comes out.
After all, the buggy code is probably being substantially reworked and/or
removed.

-ryan

On Sat, Apr 4, 2009 at 2:19 AM, Ryan Rawson <ry...@gmail.com> wrote:

> Hey guys,
>
> There seems to be something wrong on trunk... I used to have long
> map-reduce jobs, but now they are failing, unable to commit:
>
> 2009-04-04 01:17:09,279 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: locateRegionInMeta attempt 5 of 10 failed; retrying after sleep of 8000
> java.io.IOException: HRegionInfo was null or empty in .META.
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:566)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:515)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:484)
> ... etc
>
> Basically, mappers get stuck on commits and make no progress until mapred
> kills them.
>
> I've spent some time banging at it - made sure that ulimit -n is good, set
> the ipc handler limit to 30, cranked down the number of maps I'm doing,
> etc.  To no avail.
>
> At least I figured out how to debug hadoop jobs a bit.
>
> Anyone have thoughts?
>