Posted to user@hbase.apache.org by Geoff Hendrey <gh...@decarta.com> on 2011/09/02 22:40:58 UTC

PENDING_CLOSE for too long

In the master logs, I am seeing "regions in transition timed out" and
"region has been PENDING_CLOSE for too long, running forced unasign".
Both of these log messages occur at INFO level, so I assume they are
innocuous. Should I be concerned?

 

-geoff


RE: PENDING_CLOSE for too long

Posted by Geoff Hendrey <gh...@decarta.com>.
thanks, and CCing my team

-----Original Message-----
From: Stuart Smith [mailto:stu24mail@yahoo.com] 
Sent: Monday, November 14, 2011 3:20 PM
To: user@hbase.apache.org
Subject: Re: PENDING_CLOSE for too long

Thanks Geoff!

  The slow reply was due to the saga being moved to the cloudera lists.

I ended up trying to merge all my regions (offline) using the java API (since I had gotten to about 20K regions for a given table), and messing up badly, so I just started from scratch, and have started reloading data with a new max region filesize.

This took the number of regions I had from 20K to high hundreds, and so far, hbase seems much happier - I'm only about 1/2 - 2/3 of the way to where I was before, though, so we'll see what happens, but it does seem to work a lot better :)

Btw.. if you use the merge API.. make sure you don't accidentally comment out code that sorts your region listing by key before you start merging.. the API will happily let you merge any two random regions.. creating lots of interesting overlaps.... :O


Take care,
  -stu




________________________________
From: Geoff Hendrey <gh...@decarta.com>
To: user@hbase.apache.org
Cc: user@hbase.apache.org; Stuart Smith <st...@yahoo.com>
Sent: Saturday, October 29, 2011 7:08 PM
Subject: Re: PENDING_CLOSE for too long

Stuart -

Have you disabled splitting? I believe you can work around the issue of PENDING_CLOSE by presplitting your table and disabling splitting. Worked for us.

Sent from my iPhone

On Oct 29, 2011, at 4:19 PM, "Ted Yu" <yu...@gmail.com> wrote:

> In 0.92 (to be released in 2 weeks), you can expect improvement in this
> regard.
> See HBASE-3368.
> 
> Geoff:
> Can you publish your tool on HBASE JIRA ?
> 
> Thanks
> 
> On Sat, Oct 29, 2011 at 2:35 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> 
> > Sure. I posted the code many weeks back for a tool that will repair holes
> > in .META.
> >
> > If you do a check on the list, you should find it. I'll send you the
> > latest code for that. Maybe I made some fixes after I posted the code.
> > Please ping me if I forget. I've used it to repair huge tables  (and fixed
> > subtle bugs in the process) so I'm confident it works.
> >
> > No matter what anyone tells me, I know hbase is horribly broken for the
> > use case of doing bulk writes from an mr job. It shits the bed every time
> > you pass a certain scale. For this reason we've completely rewritten our
> > code so that we use bulkloading. It's way more efficient and always works.
> >
> > Please ping me until I send you the code. Otherwise I will forget.
> >
> > Sent from my iPhone
> >
> > On Oct 29, 2011, at 1:39 PM, "Stuart Smith" <st...@yahoo.com> wrote:
> >
> > > Hello Geoff,
> > >
> > >   I usually don't show up here, since I use CDH, and good form means I
> > should stay on CDH-users,
> > > But!
> > >   I've been seeing the same issues for months:
> > >
> > >  - PENDING_CLOSE too long, master tries to reassign - I see a
> > continuous stream of these.
> > >  - WrongRegionExceptions due to overlapping regions & holes in the
> > regions.
> > >
> > > I just spent all day yesterday cribbing off of St.Ack's check_meta.rb
> > script to write a java program to fix up overlaps & holes in an offline
> > fashion (hbase down, directly on hdfs), and will start testing next week
> > (cross my fingers!).
> > >
> > > It seems like the pending close messages can be ignored?
> > > And once I test my tool, and confirm I know a little bit about what I'm
> > doing, maybe we could share notes?
> > >
> > > Take care,
> > >   -stu
> > >
> > >
> > >
> > > ________________________________
> > > From: Geoff Hendrey <gh...@decarta.com>
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Sent: Saturday, September 3, 2011 12:11 AM
> > > Subject: RE: PENDING_CLOSE for too long
> > >
> > > "Are you having trouble getting to any of your data out in tables?"
> > >
> > > depends what you mean. We see corruptions from time to time that prevent
> > > us from getting data, one way or another. Today's corruption was regions
> > > with duplicate start and end rows. We fixed that by deleting the
> > > offending regions from HDFS, and running add_table.rb to restore the
> > > meta. The other common corruption is the holes in ".META." that we
> > > repair with a little tool we wrote. We'd love to learn why we see these
> > > corruptions with such regularity (seemingly much higher than others on
> > > the list).
> > >
> > > We will implement the timeout you suggest, and see how it goes.
> > >
> > > Thanks,
> > > Geoff
> > >
> > > -----Original Message-----
> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > > Stack
> > > Sent: Friday, September 02, 2011 10:51 PM
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Subject: Re: PENDING_CLOSE for too long
> > >
> > > Are you having trouble getting to any of your data out in tables?
> > >
> > > To get rid of them, try restarting your master.
> > >
> > > Before you restart your master, do "HBASE-4126  Make timeoutmonitor
> > > timeout after 30 minutes instead of 3"; i.e. set
> > > "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
> > > hbase-site.xml.
> > >
> > > St.Ack
> > >
> > > On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
> > > wrote:
> > > > In the master logs, I am seeing "regions in transition timed out" and
> > > > "region has been PENDING_CLOSE for too long, running forced unasign".
> > > > Both of these log messages occur at INFO level, so I assume they are
> > > > innocuous. Should I be concerned?
> > > >
> > > >
> > > >
> > > > -geoff
> > > >
> > > >
> >

Re: PENDING_CLOSE for too long

Posted by lars hofhansl <lh...@yahoo.com>.
Hi Stuart,

when you get some time, could you tell us how you "mess[ed] up badly", so that others can avoid the same mistakes?

Thanks.


-- Lars



----- Original Message -----
From: Stuart Smith <st...@yahoo.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Monday, November 14, 2011 3:20 PM
Subject: Re: PENDING_CLOSE for too long

Thanks Geoff!

  The slow reply was due to the saga being moved to the cloudera lists.

I ended up trying to merge all my regions (offline) using the java API (since I had gotten to about 20K regions for a given table), and messing up badly, so I just started from scratch, and have started reloading data with a new max region filesize.

This took the number of regions I had from 20K to high hundreds, and so far, hbase seems much happier - I'm only about 1/2 - 2/3 of the way to where I was before, though, so we'll see what happens, but it does seem to work a lot better :)

Btw.. if you use the merge API.. make sure you don't accidentally comment out code that sorts your region listing by key before you start merging.. the API will happily let you merge any two random regions.. creating lots of interesting overlaps.... :O


Take care,
  -stu




________________________________
From: Geoff Hendrey <gh...@decarta.com>
To: user@hbase.apache.org
Cc: user@hbase.apache.org; Stuart Smith <st...@yahoo.com>
Sent: Saturday, October 29, 2011 7:08 PM
Subject: Re: PENDING_CLOSE for too long

Stuart -

Have you disabled splitting? I believe you can work around the issue of PENDING_CLOSE by presplitting your table and disabling splitting. Worked for us.

Sent from my iPhone

On Oct 29, 2011, at 4:19 PM, "Ted Yu" <yu...@gmail.com> wrote:

> In 0.92 (to be released in 2 weeks), you can expect improvement in this
> regard.
> See HBASE-3368.
> 
> Geoff:
> Can you publish your tool on HBASE JIRA ?
> 
> Thanks
> 
> On Sat, Oct 29, 2011 at 2:35 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> 
> > Sure. I posted the code many weeks back for a tool that will repair holes
> > in .META.
> >
> > If you do a check on the list, you should find it. I'll send you the
> > latest code for that. Maybe I made some fixes after I posted the code.
> > Please ping me if I forget. I've used it to repair huge tables  (and fixed
> > subtle bugs in the process) so I'm confident it works.
> >
> > No matter what anyone tells me, I know hbase is horribly broken for the
> > use case of doing bulk writes from an mr job. It shits the bed every time
> > you pass a certain scale. For this reason we've completely rewritten our
> > code so that we use bulkloading. It's way more efficient and always works.
> >
> > Please ping me until I send you the code. Otherwise I will forget.
> >
> > Sent from my iPhone
> >
> > On Oct 29, 2011, at 1:39 PM, "Stuart Smith" <st...@yahoo.com> wrote:
> >
> > > Hello Geoff,
> > >
> > >   I usually don't show up here, since I use CDH, and good form means I
> > should stay on CDH-users,
> > > But!
> > >   I've been seeing the same issues for months:
> > >
> > >  - PENDING_CLOSE too long, master tries to reassign - I see a
> > continuous stream of these.
> > >  - WrongRegionExceptions due to overlapping regions & holes in the
> > regions.
> > >
> > > I just spent all day yesterday cribbing off of St.Ack's check_meta.rb
> > script to write a java program to fix up overlaps & holes in an offline
> > fashion (hbase down, directly on hdfs), and will start testing next week
> > (cross my fingers!).
> > >
> > > It seems like the pending close messages can be ignored?
> > > And once I test my tool, and confirm I know a little bit about what I'm
> > doing, maybe we could share notes?
> > >
> > > Take care,
> > >   -stu
> > >
> > >
> > >
> > > ________________________________
> > > From: Geoff Hendrey <gh...@decarta.com>
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Sent: Saturday, September 3, 2011 12:11 AM
> > > Subject: RE: PENDING_CLOSE for too long
> > >
> > > "Are you having trouble getting to any of your data out in tables?"
> > >
> > > depends what you mean. We see corruptions from time to time that prevent
> > > us from getting data, one way or another. Today's corruption was regions
> > > with duplicate start and end rows. We fixed that by deleting the
> > > offending regions from HDFS, and running add_table.rb to restore the
> > > meta. The other common corruption is the holes in ".META." that we
> > > repair with a little tool we wrote. We'd love to learn why we see these
> > > corruptions with such regularity (seemingly much higher than others on
> > > the list).
> > >
> > > We will implement the timeout you suggest, and see how it goes.
> > >
> > > Thanks,
> > > Geoff
> > >
> > > -----Original Message-----
> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > > Stack
> > > Sent: Friday, September 02, 2011 10:51 PM
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Subject: Re: PENDING_CLOSE for too long
> > >
> > > Are you having trouble getting to any of your data out in tables?
> > >
> > > To get rid of them, try restarting your master.
> > >
> > > Before you restart your master, do "HBASE-4126  Make timeoutmonitor
> > > timeout after 30 minutes instead of 3"; i.e. set
> > > "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
> > > hbase-site.xml.
> > >
> > > St.Ack
> > >
> > > On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
> > > wrote:
> > > > In the master logs, I am seeing "regions in transition timed out" and
> > > > "region has been PENDING_CLOSE for too long, running forced unasign".
> > > > Both of these log messages occur at INFO level, so I assume they are
> > > > innocuous. Should I be concerned?
> > > >
> > > >
> > > >
> > > > -geoff
> > > >
> > > >
> >

RE: PENDING_CLOSE for too long

Posted by Geoff Hendrey <gh...@decarta.com>.
Oh, and by the way: in the case of a scan-for-a-single-value being super slow, a guy on our team found that the client caches region meta information aggressively. It can be turned off using hbase.client.prefetch.limit, and you will see scan-for-a-single-value become about 10x faster.
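
A minimal sketch of that client-side override, for anyone who wants to try it - the table name, row key, and the value 1 are placeholders, not what Geoff's team actually used:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefetchLimitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Dial down the client's region-location prefetching (the value 1 is illustrative).
        conf.setInt("hbase.client.prefetch.limit", 1);
        HTable table = new HTable(conf, "my_table");   // "my_table" is a placeholder
        Result r = table.get(new Get(Bytes.toBytes("some-row-key")));
        System.out.println(r);
        table.close();
    }
}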

We've also been using the merge script, but it sure is slow.

-geoff

-----Original Message-----
From: Stuart Smith [mailto:stu24mail@yahoo.com] 
Sent: Monday, November 14, 2011 3:20 PM
To: user@hbase.apache.org
Subject: Re: PENDING_CLOSE for too long

Thanks Geoff!

  The slow reply was due to the saga being moved to the cloudera lists.

I ended up trying to merge all my regions (offline) using the java API (since I had gotten to about 20K regions for a given table), and messing up badly, so I just started from scratch, and have started reloading data with a new max region filesize.

This took the number of regions I had from 20K to high hundreds, and so far, hbase seems much happier - I'm only about 1/2 - 2/3 of the way to where I was before, though, so we'll see what happens, but it does seem to work a lot better :)

Btw.. if you use the merge API.. make sure you don't accidentally comment out code that sorts your region listing by key before you start merging.. the API will happily let you merge any two random regions.. creating lots of interesting overlaps.... :O


Take care,
  -stu




________________________________
From: Geoff Hendrey <gh...@decarta.com>
To: user@hbase.apache.org
Cc: user@hbase.apache.org; Stuart Smith <st...@yahoo.com>
Sent: Saturday, October 29, 2011 7:08 PM
Subject: Re: PENDING_CLOSE for too long

Stuart -

Have you disabled splitting? I believe you can work around the issue of PENDING_CLOSE by presplitting your table and disabling splitting. Worked for us.

Sent from my iPhone

On Oct 29, 2011, at 4:19 PM, "Ted Yu" <yu...@gmail.com> wrote:

> In 0.92 (to be released in 2 weeks), you can expect improvement in this
> regard.
> See HBASE-3368.
> 
> Geoff:
> Can you publish your tool on HBASE JIRA ?
> 
> Thanks
> 
> On Sat, Oct 29, 2011 at 2:35 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> 
> > Sure. I posted the code many weeks back for a tool that will repair holes
> > in .META.
> >
> > If you do a check on the list, you should find it. I'll send you the
> > latest code for that. Maybe I made some fixes after I posted the code.
> > Please ping me if I forget. I've used it to repair huge tables  (and fixed
> > subtle bugs in the process) so I'm confident it works.
> >
> > No matter what anyone tells me, I know hbase is horribly broken for the
> > use case of doing bulk writes from an mr job. It shits the bed every time
> > you pass a certain scale. For this reason we've completely rewritten our
> > code so that we use bulkloading. It's way more efficient and always works.
> >
> > Please ping me until I send you the code. Otherwise I will forget.
> >
> > Sent from my iPhone
> >
> > On Oct 29, 2011, at 1:39 PM, "Stuart Smith" <st...@yahoo.com> wrote:
> >
> > > Hello Geoff,
> > >
> > >   I usually don't show up here, since I use CDH, and good form means I
> > should stay on CDH-users,
> > > But!
> > >   I've been seeing the same issues for months:
> > >
> > >  - PENDING_CLOSE too long, master tries to reassign - I see a
> > continuous stream of these.
> > >  - WrongRegionExceptions due to overlapping regions & holes in the
> > regions.
> > >
> > > I just spent all day yesterday cribbing off of St.Ack's check_meta.rb
> > script to write a java program to fix up overlaps & holes in an offline
> > fashion (hbase down, directly on hdfs), and will start testing next week
> > (cross my fingers!).
> > >
> > > It seems like the pending close messages can be ignored?
> > > And once I test my tool, and confirm I know a little bit about what I'm
> > doing, maybe we could share notes?
> > >
> > > Take care,
> > >   -stu
> > >
> > >
> > >
> > > ________________________________
> > > From: Geoff Hendrey <gh...@decarta.com>
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Sent: Saturday, September 3, 2011 12:11 AM
> > > Subject: RE: PENDING_CLOSE for too long
> > >
> > > "Are you having trouble getting to any of your data out in tables?"
> > >
> > > depends what you mean. We see corruptions from time to time that prevent
> > > us from getting data, one way or another. Today's corruption was regions
> > > with duplicate start and end rows. We fixed that by deleting the
> > > offending regions from HDFS, and running add_table.rb to restore the
> > > meta. The other common corruption is the holes in ".META." that we
> > > repair with a little tool we wrote. We'd love to learn why we see these
> > > corruptions with such regularity (seemingly much higher than others on
> > > the list).
> > >
> > > We will implement the timeout you suggest, and see how it goes.
> > >
> > > Thanks,
> > > Geoff
> > >
> > > -----Original Message-----
> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > > Stack
> > > Sent: Friday, September 02, 2011 10:51 PM
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Subject: Re: PENDING_CLOSE for too long
> > >
> > > Are you having trouble getting to any of your data out in tables?
> > >
> > > To get rid of them, try restarting your master.
> > >
> > > Before you restart your master, do "HBASE-4126  Make timeoutmonitor
> > > timeout after 30 minutes instead of 3"; i.e. set
> > > "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
> > > hbase-site.xml.
> > >
> > > St.Ack
> > >
> > > On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
> > > wrote:
> > > > In the master logs, I am seeing "regions in transition timed out" and
> > > > "region has been PENDING_CLOSE for too long, running forced unasign".
> > > > Both of these log messages occur at INFO level, so I assume they are
> > > > innocuous. Should I be concerned?
> > > >
> > > >
> > > >
> > > > -geoff
> > > >
> > > >
> >

Re: PENDING_CLOSE for too long

Posted by Stuart Smith <st...@yahoo.com>.
Thanks Geoff!

  The slow reply was due to the saga being moved to the cloudera lists.

I ended up trying to merge all my regions (offline) using the java API (since I had gotten to about 20K regions for a given table), and messing up badly, so I just started from scratch, and have started reloading data with a new max region filesize.

This took the number of regions I had from 20K to high hundreds, and so far, hbase seems much happier - I'm only about 1/2 - 2/3 of the way to where I was before, though, so we'll see what happens, but it does seem to work a lot better :)

Btw.. if you use the merge API.. make sure you don't accidentally comment out code that sorts your region listing by key before you start merging.. the API will happily let you merge any two random regions.. creating lots of interesting overlaps.... :O
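
For anyone following along, the sort Stuart is describing is just ordering the regions by start key before pairing them up for merges, roughly like this (it assumes you already have the List<HRegionInfo> you plan to merge; the class name is made up):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionSortHelper {
    // Sort by start key so that only adjacent regions ever get merged together.
    public static void sortByStartKey(List<HRegionInfo> regions) {
        Collections.sort(regions, new Comparator<HRegionInfo>() {
            public int compare(HRegionInfo a, HRegionInfo b) {
                return Bytes.compareTo(a.getStartKey(), b.getStartKey());
            }
        });
    }
}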


Take care,
  -stu




________________________________
From: Geoff Hendrey <gh...@decarta.com>
To: user@hbase.apache.org
Cc: user@hbase.apache.org; Stuart Smith <st...@yahoo.com>
Sent: Saturday, October 29, 2011 7:08 PM
Subject: Re: PENDING_CLOSE for too long

Stuart -

Have you disabled splitting? I believe you can work around the issue of PENDING_CLOSE by presplitting your table and disabling splitting. Worked for us.

Sent from my iPhone

On Oct 29, 2011, at 4:19 PM, "Ted Yu" <yu...@gmail.com> wrote:

> In 0.92 (to be released in 2 weeks), you can expect improvement in this
> regard.
> See HBASE-3368.
> 
> Geoff:
> Can you publish your tool on HBASE JIRA ?
> 
> Thanks
> 
> On Sat, Oct 29, 2011 at 2:35 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> 
> > Sure. I posted the code many weeks back for a tool that will repair holes
> > in .META.
> >
> > If you do a check on the list, you should find it. I'll send you the
> > latest code for that. Maybe I made some fixes after I posted the code.
> > Please ping me if I forget. I've used it to repair huge tables  (and fixed
> > subtle bugs in the process) so I'm confident it works.
> >
> > No matter what anyone tells me, I know hbase is horribly broken for the
> > use case of doing bulk writes from an mr job. It shits the bed every time
> > you pass a certain scale. For this reason we've completely rewritten our
> > code so that we use bulkloading. It's way more efficient and always works.
> >
> > Please ping me until I send you the code. Otherwise I will forget.
> >
> > Sent from my iPhone
> >
> > On Oct 29, 2011, at 1:39 PM, "Stuart Smith" <st...@yahoo.com> wrote:
> >
> > > Hello Geoff,
> > >
> > >   I usually don't show up here, since I use CDH, and good form means I
> > should stay on CDH-users,
> > > But!
> > >   I've been seeing the same issues for months:
> > >
> > >  - PENDING_CLOSE too long, master tries to reassign - I see a
> > continuous stream of these.
> > >  - WrongRegionExceptions due to overlapping regions & holes in the
> > regions.
> > >
> > > I just spent all day yesterday cribbing off of St.Ack's check_meta.rb
> > script to write a java program to fix up overlaps & holes in an offline
> > fashion (hbase down, directly on hdfs), and will start testing next week
> > (cross my fingers!).
> > >
> > > It seems like the pending close messages can be ignored?
> > > And once I test my tool, and confirm I know a little bit about what I'm
> > doing, maybe we could share notes?
> > >
> > > Take care,
> > >   -stu
> > >
> > >
> > >
> > > ________________________________
> > > From: Geoff Hendrey <gh...@decarta.com>
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Sent: Saturday, September 3, 2011 12:11 AM
> > > Subject: RE: PENDING_CLOSE for too long
> > >
> > > "Are you having trouble getting to any of your data out in tables?"
> > >
> > > depends what you mean. We see corruptions from time to time that prevent
> > > us from getting data, one way or another. Today's corruption was regions
> > > with duplicate start and end rows. We fixed that by deleting the
> > > offending regions from HDFS, and running add_table.rb to restore the
> > > meta. The other common corruption is the holes in ".META." that we
> > > repair with a little tool we wrote. We'd love to learn why we see these
> > > corruptions with such regularity (seemingly much higher than others on
> > > the list).
> > >
> > > We will implement the timeout you suggest, and see how it goes.
> > >
> > > Thanks,
> > > Geoff
> > >
> > > -----Original Message-----
> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > > Stack
> > > Sent: Friday, September 02, 2011 10:51 PM
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Subject: Re: PENDING_CLOSE for too long
> > >
> > > Are you having trouble getting to any of your data out in tables?
> > >
> > > To get rid of them, try restarting your master.
> > >
> > > Before you restart your master, do "HBASE-4126  Make timeoutmonitor
> > > timeout after 30 minutes instead of 3"; i.e. set
> > > "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
> > > hbase-site.xml.
> > >
> > > St.Ack
> > >
> > > On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
> > > wrote:
> > > > In the master logs, I am seeing "regions in transition timed out" and
> > > > "region has been PENDING_CLOSE for too long, running forced unasign".
> > > > Both of these log messages occur at INFO level, so I assume they are
> > > > innocuous. Should I be concerned?
> > > >
> > > >
> > > >
> > > > -geoff
> > > >
> > > >
> >

Re: PENDING_CLOSE for too long

Posted by Geoff Hendrey <gh...@decarta.com>.
Stuart -

Have you disabled splitting? I believe you can work around the issue of PENDING_CLOSE by presplitting your table and disabling splitting. Worked for us.
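
A rough sketch of that approach - create the table pre-split, and set the table's max region filesize far above anything a region will reach so splits effectively never trigger. The table name, column family, split keys, and size below are illustrative assumptions, not Geoff's actual settings:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitTableExample {
    public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("my_table");   // placeholder table name
        desc.addFamily(new HColumnDescriptor("cf"));                // placeholder column family
        // Max filesize far above what a region will ever hold, so HBase never splits it.
        desc.setMaxFileSize(100L * 1024 * 1024 * 1024);
        // Split keys chosen to match the expected row key distribution.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("2"), Bytes.toBytes("4"), Bytes.toBytes("6"), Bytes.toBytes("8")
        };
        admin.createTable(desc, splits);
    }
}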

Sent from my iPhone

On Oct 29, 2011, at 4:19 PM, "Ted Yu" <yu...@gmail.com> wrote:

> In 0.92 (to be released in 2 weeks), you can expect improvement in this
> regard.
> See HBASE-3368.
> 
> Geoff:
> Can you publish your tool on HBASE JIRA ?
> 
> Thanks
> 
> On Sat, Oct 29, 2011 at 2:35 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> 
> > Sure. I posted the code many weeks back for a tool that will repair holes
> > in .META.
> >
> > If you do a check on the list, you should find it. I'll send you the
> > latest code for that. Maybe I made some fixes after I posted the code.
> > Please ping me if I forget. I've used it to repair huge tables  (and fixed
> > subtle bugs in the process) so I'm confident it works.
> >
> > No matter what anyone tells me, I know hbase is horribly broken for the
> > use case of doing bulk writes from an mr job. It shits the bed every time
> > you pass a certain scale. For this reason we've completely rewritten our
> > code so that we use bulkloading. It's way more efficient and always works.
> >
> > Please ping me until I send you the code. Otherwise I will forget.
> >
> > Sent from my iPhone
> >
> > On Oct 29, 2011, at 1:39 PM, "Stuart Smith" <st...@yahoo.com> wrote:
> >
> > > Hello Geoff,
> > >
> > >   I usually don't show up here, since I use CDH, and good form means I
> > should stay on CDH-users,
> > > But!
> > >   I've been seeing the same issues for months:
> > >
> > >  - PENDING_CLOSE too long, master tries to reassign - I see a
> > continuous stream of these.
> > >  - WrongRegionExceptions due to overlapping regions & holes in the
> > regions.
> > >
> > > I just spent all day yesterday cribbing off of St.Ack's check_meta.rb
> > script to write a java program to fix up overlaps & holes in an offline
> > fashion (hbase down, directly on hdfs), and will start testing next week
> > (cross my fingers!).
> > >
> > > It seems like the pending close messages can be ignored?
> > > And once I test my tool, and confirm I know a little bit about what I'm
> > doing, maybe we could share notes?
> > >
> > > Take care,
> > >   -stu
> > >
> > >
> > >
> > > ________________________________
> > > From: Geoff Hendrey <gh...@decarta.com>
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Sent: Saturday, September 3, 2011 12:11 AM
> > > Subject: RE: PENDING_CLOSE for too long
> > >
> > > "Are you having trouble getting to any of your data out in tables?"
> > >
> > > depends what you mean. We see corruptions from time to time that prevent
> > > us from getting data, one way or another. Today's corruption was regions
> > > with duplicate start and end rows. We fixed that by deleting the
> > > offending regions from HDFS, and running add_table.rb to restore the
> > > meta. The other common corruption is the holes in ".META." that we
> > > repair with a little tool we wrote. We'd love to learn why we see these
> > > corruptions with such regularity (seemingly much higher than others on
> > > the list).
> > >
> > > We will implement the timeout you suggest, and see how it goes.
> > >
> > > Thanks,
> > > Geoff
> > >
> > > -----Original Message-----
> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > > Stack
> > > Sent: Friday, September 02, 2011 10:51 PM
> > > To: user@hbase.apache.org
> > > Cc: hbase-user@hadoop.apache.org
> > > Subject: Re: PENDING_CLOSE for too long
> > >
> > > Are you having trouble getting to any of your data out in tables?
> > >
> > > To get rid of them, try restarting your master.
> > >
> > > Before you restart your master, do "HBASE-4126  Make timeoutmonitor
> > > timeout after 30 minutes instead of 3"; i.e. set
> > > "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
> > > hbase-site.xml.
> > >
> > > St.Ack
> > >
> > > On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
> > > wrote:
> > > > In the master logs, I am seeing "regions in transition timed out" and
> > > > "region has been PENDING_CLOSE for too long, running forced unasign".
> > > > Both of these log messages occur at INFO level, so I assume they are
> > > > innocuous. Should I be concerned?
> > > >
> > > >
> > > >
> > > > -geoff
> > > >
> > > >
> >

Re: PENDING_CLOSE for too long

Posted by Ted Yu <yu...@gmail.com>.
In 0.92 (to be released in 2 weeks), you can expect improvement in this
regard.
See HBASE-3368.

Geoff:
Can you publish your tool on HBASE JIRA ?

Thanks

On Sat, Oct 29, 2011 at 2:35 PM, Geoff Hendrey <gh...@decarta.com> wrote:

> Sure. I posted the code many weeks back for a tool that will repair holes
> in .META.
>
> If you do a check on the list, you should find it. I'll send you the
> latest code for that. Maybe I made some fixes after I posted the code.
> Please ping me if I forget. I've used it to repair huge tables  (and fixed
> subtle bugs in the process) so I'm confident it works.
>
> No matter what anyone tells me, I know hbase is horribly broken for the
> use case of doing bulk writes from an mr job. It shits the bed every time
> you pass a certain scale. For this reason we've completely rewritten our
> code so that we use bulkloading. It's way more efficient and always works.
>
> Please ping me until I send you the code. Otherwise I will forget.
>
> Sent from my iPhone
>
> On Oct 29, 2011, at 1:39 PM, "Stuart Smith" <st...@yahoo.com> wrote:
>
> > Hello Geoff,
> >
> >   I usually don't show up here, since I use CDH, and good form means I
> should stay on CDH-users,
> > But!
> >   I've been seeing the same issues for months:
> >
> >  - PENDING_CLOSE too long, master tries to reassign - I see a
> continuous stream of these.
> >  - WrongRegionExceptions due to overlapping regions & holes in the
> regions.
> >
> > I just spent all day yesterday cribbing off of St.Ack's check_meta.rb
> script to write a java program to fix up overlaps & holes in an offline
> fashion (hbase down, directly on hdfs), and will start testing next week
> (cross my fingers!).
> >
> > It seems like the pending close messages can be ignored?
> > And once I test my tool, and confirm I know a little bit about what I'm
> doing, maybe we could share notes?
> >
> > Take care,
> >   -stu
> >
> >
> >
> > ________________________________
> > From: Geoff Hendrey <gh...@decarta.com>
> > To: user@hbase.apache.org
> > Cc: hbase-user@hadoop.apache.org
> > Sent: Saturday, September 3, 2011 12:11 AM
> > Subject: RE: PENDING_CLOSE for too long
> >
> > "Are you having trouble getting to any of your data out in tables?"
> >
> > depends what you mean. We see corruptions from time to time that prevent
> > us from getting data, one way or another. Today's corruption was regions
> > with duplicate start and end rows. We fixed that by deleting the
> > offending regions from HDFS, and running add_table.rb to restore the
> > meta. The other common corruption is the holes in ".META." that we
> > repair with a little tool we wrote. We'd love to learn why we see these
> > corruptions with such regularity (seemingly much higher than others on
> > the list).
> >
> > We will implement the timeout you suggest, and see how it goes.
> >
> > Thanks,
> > Geoff
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > Stack
> > Sent: Friday, September 02, 2011 10:51 PM
> > To: user@hbase.apache.org
> > Cc: hbase-user@hadoop.apache.org
> > Subject: Re: PENDING_CLOSE for too long
> >
> > Are you having trouble getting to any of your data out in tables?
> >
> > To get rid of them, try restarting your master.
> >
> > Before you restart your master, do "HBASE-4126  Make timeoutmonitor
> > timeout after 30 minutes instead of 3"; i.e. set
> > "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
> > hbase-site.xml.
> >
> > St.Ack
> >
> > On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
> > wrote:
> > > In the master logs, I am seeing "regions in transition timed out" and
> > > "region has been PENDING_CLOSE for too long, running forced unasign".
> > > Both of these log messages occur at INFO level, so I assume they are
> > > innocuous. Should I be concerned?
> > >
> > >
> > >
> > > -geoff
> > >
> > >
>

Re: PENDING_CLOSE for too long

Posted by Geoff Hendrey <gh...@decarta.com>.
Sure. I posted the code many weeks back for a tool that will repair holes in .META.

If you do a check on the list, you should find it. I'll send you the latest code for that. Maybe I made some fixes after I posted the code. Please ping me if I forget. I've used it to repair huge tables  (and fixed subtle bugs in the process) so I'm confident it works.

No matter what anyone tells me, I know hbase is horribly broken for the use case of doing bulk writes from an mr job. It shits the bed every time you pass a certain scale. For this reason we've completely rewritten our code so that we use bulkloading. It's way more efficient and always works.
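
For anyone who has not set up the bulk-load path Geoff is referring to, it is roughly the standard HFileOutputFormat + LoadIncrementalHFiles flow sketched below; the table name, paths, column family, and the toy mapper are placeholders, not his actual job:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

    // Toy mapper: each input line is "rowkey<TAB>value"; emits one Put per line.
    static class LineToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            Put put = new Put(Bytes.toBytes(parts[0]));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(Bytes.toBytes(parts[0])), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");          // placeholder; table must already exist
        Job job = new Job(conf, "bulk load sketch");
        job.setJarByClass(BulkLoadSketch.class);
        job.setMapperClass(LineToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/bulk-input"));   // placeholder paths
        Path hfileDir = new Path("/tmp/bulk-hfiles");
        FileOutputFormat.setOutputPath(job, hfileDir);
        // Wires in the reducer, partitioner, and HFileOutputFormat so the job writes sorted HFiles per region.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        if (job.waitForCompletion(true)) {
            // The "completebulkload" step: moves the finished HFiles into the live table.
            new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
        }
    }
}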

Please ping me until I send you the code. Otherwise I will forget. 

Sent from my iPhone

On Oct 29, 2011, at 1:39 PM, "Stuart Smith" <st...@yahoo.com> wrote:

> Hello Geoff,
> 
>   I usually don't show up here, since I use CDH, and good form means I should stay on CDH-users,
> But!
>   I've been seeing the same issues for months:
> 
>  - PENDING_CLOSE too long, master tries to reassign - I see a continuous stream of these.
>  - WrongRegionExceptions due to overlapping regions & holes in the regions.
> 
> I just spent all day yesterday cribbing off of St.Ack's check_meta.rb script to write a java program to fix up overlaps & holes in an offline fashion (hbase down, directly on hdfs), and will start testing next week (cross my fingers!).
> 
> It seems like the pending close messages can be ignored?
> And once I test my tool, and confirm I know a little bit about what I'm doing, maybe we could share notes?
> 
> Take care,
>   -stu
> 
> 
> 
> ________________________________
> From: Geoff Hendrey <gh...@decarta.com>
> To: user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Sent: Saturday, September 3, 2011 12:11 AM
> Subject: RE: PENDING_CLOSE for too long
> 
> "Are you having trouble getting to any of your data out in tables?"
> 
> depends what you mean. We see corruptions from time to time that prevent
> us from getting data, one way or another. Today's corruption was regions
> with duplicate start and end rows. We fixed that by deleting the
> offending regions from HDFS, and running add_table.rb to restore the
> meta. The other common corruption is the holes in ".META." that we
> repair with a little tool we wrote. We'd love to learn why we see these
> corruptions with such regularity (seemingly much higher than others on
> the list).
> 
> We will implement the timeout you suggest, and see how it goes.
> 
> Thanks,
> Geoff
> 
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> Sent: Friday, September 02, 2011 10:51 PM
> To: user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: Re: PENDING_CLOSE for too long
> 
> Are you having trouble getting to any of your data out in tables?
> 
> To get rid of them, try restarting your master.
> 
> Before you restart your master, do "HBASE-4126  Make timeoutmonitor
> timeout after 30 minutes instead of 3"; i.e. set
> "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
> hbase-site.xml.
> 
> St.Ack
> 
> On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
> wrote:
> > In the master logs, I am seeing "regions in transition timed out" and
> > "region has been PENDING_CLOSE for too long, running forced unasign".
> > Both of these log messages occur at INFO level, so I assume they are
> > innocuous. Should I be concerned?
> >
> >
> >
> > -geoff
> >
> >

Re: tool to move out consecutive regions

Posted by Stack <st...@duboce.net>.
Thanks Geoff.  Mind making a JIRA and attaching the code as a patch?
Copying and pasting from email might not work so well.  Thanks boss,
St.Ack

On Mon, Oct 31, 2011 at 10:46 AM, Geoff Hendrey <gh...@decarta.com> wrote:
> Hi Guys -
>
> This is a fairly complete little Tool (Configured) whose purpose is to move out a whole slew of regions into a backup directory and restore .META. when done. We found that we needed to do this when a huge volume of keys had been generated into a production table, and it turned out the whole set of keys had an incorrect prefix. Thus, what we really wanted to do was move the data out of all the regions into some backup directory in one fell swoop. This tool accepts some parameters with -D (hadoop arguments). It will remove a slew of contiguous regions, relink the .META., and place the removed data in a backup directory in HDFS. It has been tested on big tables and includes catches for some of the more subtle "gotchas", like being careful when parsing region names to guard against rowkeys actually containing commas. It worked for me, but use at your own risk.
>
> Basically you give it -Dregion.remove.regionname.start=STARTREGION and region.remove.regionname.end=ENDREGION and all the data between STARTREGION and ENDREGION will be moved out of your table, where STARTREGION and ENDREGION are region names.
>
> import java.io.IOException;
> import java.io.InputStream;
> import java.util.Iterator;
> import java.util.logging.Level;
> import java.util.logging.Logger;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.conf.Configured;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.HConstants;
> import org.apache.hadoop.hbase.HRegionInfo;
> import org.apache.hadoop.hbase.HTableDescriptor;
> import org.apache.hadoop.hbase.NotServingRegionException;
> import org.apache.hadoop.hbase.client.Delete;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HBaseAdmin;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.hbase.util.FSUtils;
> import org.apache.hadoop.hbase.util.Writables;
> import org.apache.hadoop.util.Tool;
> import org.apache.hadoop.util.ToolRunner;
>
> /**
>  * @author ghendrey
>  */
> public class RemoveRegions extends Configured implements Tool {
>
>    public static void main(String[] args) throws Exception {
>        int exitCode = ToolRunner.run(new RemoveRegions(), args);
>        System.exit(exitCode);
>    }
>
>    private static void deleteMetaRow(HRegionInfo closedRegion, HTable hMetaTable) throws IOException {
>        Delete del = new Delete(closedRegion.getRegionName()); //Delete the original row from .META.
>        hMetaTable.delete(del);
>        System.out.println("Deleted the region's row from .META. " + closedRegion.getRegionNameAsString());
>    }
>
>    private static HRegionInfo closeRegion(Result result, HBaseAdmin admin) throws RuntimeException, IOException {
>
>        byte[] bytes = result.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
>        HRegionInfo closedRegion = Writables.getHRegionInfo(bytes);
>
>        try {
>            admin.closeRegion(closedRegion.getRegionName(), null); //. Close the existing region if open.
>            System.out.println("Closed the Region " + closedRegion.getRegionNameAsString());
>        } catch (Exception nse) {
>            System.out.println("Skipped closing the region because: " + nse.getMessage());
>        }
>        return closedRegion;
>    }
>
>    private static HRegionInfo getRegionInfo(String exclusiveStartRegionName, Configuration hConfig) throws IOException {
>        HTable readTable = new HTable(hConfig, Bytes.toBytes(".META."));
>        Get readGet = new Get(Bytes.toBytes(exclusiveStartRegionName));
>        Result readResult = readTable.get(readGet);
>        byte[] readBytes = readResult.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
>        HRegionInfo regionInfo = Writables.getHRegionInfo(readBytes); //Read the existing hregioninfo.
>        System.out.println("got region info: " + regionInfo);
>        return regionInfo;
>    }
>
>    private static void createBackupDir(Configuration conf) throws IOException {
>
>        String path = conf.get("region.remove.backupdir", "regionBackup-" + System.currentTimeMillis());
>        Path backupDirPath = new Path(path);
>        FileSystem fs = backupDirPath.getFileSystem(conf);
>        FSUtils.DirFilter dirFilt = new FSUtils.DirFilter(fs);
>        System.out.println("creating backup dir: " + backupDirPath.toString());
>        fs.mkdirs(backupDirPath);
>    }
>
>    public int run(String[] strings) throws Exception {
>        try {
>            System.setProperty("javax.xml.parsers.DocumentBuilderFactory", "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
>            Configuration conf = getConf();
>            Configuration hConfig = HBaseConfiguration.create(conf);
>            hConfig.set("hbase.zookeeper.quorum", System.getProperty("hbase.zookeeper.quorum", "doop2.dt.sv4.decarta.com,doop3.dt.sv4.decarta.com,doop4.dt.sv4.decarta.com,doop5.dt.sv4.decarta.com,doop7.dt.sv4.decarta.com,doop8.dt.sv4.decarta.com,doop9.dt.sv4.decarta.com,doop10.dt.sv4.decarta.com"));
>            HBaseAdmin admin = new HBaseAdmin(hConfig);
>            HBaseAdmin.checkHBaseAvailable(hConfig);
>
>
>            System.out.println("regions will be moved out from between region.remove.regionname.start and region.remove.regionname.end (exclusive)");
>            String exclusiveStartRegionName = conf.get("region.remove.regionname.start");
>            if (null == exclusiveStartRegionName) {
>                throw new RuntimeException("Current implementation requires an exclusive region.remove.regionname.start");
>            }
>            System.out.println("region.remove.regionname.start=" + exclusiveStartRegionName);
>            String exclusiveEndRegionName = conf.get("region.remove.regionname.end");
>            if (null == exclusiveEndRegionName) {
>
>                throw new RuntimeException("Current implementation requires an exclusive region.remove.regionname.end");
>            }
>            System.out.println("region.remove.regionname.end=" + exclusiveEndRegionName);
>
>            //CREATE A BACKUP DIR FOR THE REGION DATA TO BE MOVED INTO
>            createBackupDir(hConfig);
>
>
>            Path hbaseRootPath = FSUtils.getRootDir(hConfig);
>            if (null == hbaseRootPath) {
>                throw new RuntimeException("couldn't determine hbase root dir");
>            } else {
>                System.out.println("hbase rooted at " + hbaseRootPath.toString());
>            }
>
>            HTable hMetaTable = new HTable(hConfig, Bytes.toBytes(".META."));
>            System.out.println("connected to .META.");
>
>            //get region info for start and end regions
>            HRegionInfo exclusiveStartRegionInfo = getRegionInfo(exclusiveStartRegionName, hConfig);
>            HRegionInfo exclusiveEndRegionInfo = getRegionInfo(exclusiveEndRegionName, hConfig);
>
>
>            //CLOSE all the regions starting with the exclusiveStartRegionName (including it), and up to but excluding closing the exclusiveEndRegionName
>            //and DELETE rows from .META.
>            Scan scan = new Scan(Bytes.toBytes(exclusiveStartRegionName), Bytes.toBytes(exclusiveEndRegionName));
>            ResultScanner metaScanner = hMetaTable.getScanner(scan);
>            int i = 0;
>            for (Iterator<Result> iter = metaScanner.iterator(); iter.hasNext();) {
>                Result res = iter.next();
>                //CLOSE REGION
>                HRegionInfo closedRegion = closeRegion(res, admin);
>                //MOVE ACTUAL DATA OUT OF HBASE HDFS INTO BACKUP AREA
>                moveDataToBackup(closedRegion, hConfig);
>                //DELETE ROW FROM META TABLE
>                deleteMetaRow(closedRegion, hMetaTable);
>            }
>
>            //now reinsert the startrow into .META. with its endrow pointing to the startrow of the exclusiveEndRegionInfo
>            //This effectively "relinks" the linked list of .META., now that all the interstitial region-rows have been removed from .META.
>            relinkStartRow(exclusiveStartRegionInfo, exclusiveEndRegionInfo, hConfig, admin);
>
>
>            return 0;
>
>        } catch (Exception ex) {
>            throw new RuntimeException(ex.getMessage(), ex);
>        }
>
>    }
>
>    private void relinkStartRow(HRegionInfo exclusiveStartRegionInfo, HRegionInfo exclusiveEndRegionInfo, Configuration hConfig, HBaseAdmin admin) throws IllegalArgumentException, IOException {
>        //Now we are going to recreate the region info for exclusiveStartRegion, such that its endKey points to the startKey
>        //of the exclusiveEndRegion.
>        HTableDescriptor descriptor = new HTableDescriptor(exclusiveStartRegionInfo.getTableDesc()); //Use existing hregioninfo htabledescriptor and this construction
>        // Just changing the End key , nothing else. This performs the "unlink" step
>        byte[] startKey = exclusiveStartRegionInfo.getStartKey();
>        byte[] endKey = exclusiveEndRegionInfo.getStartKey();
>        HRegionInfo newStartRegion = new HRegionInfo(descriptor, startKey, endKey);
>        byte[] value = Writables.getBytes(newStartRegion);
>        Put put = new Put(newStartRegion.getRegionName()); //  Same time stamp from the record.
>        put.add(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER, value); //Insert the new entry in .META. using new hregioninfo name as row key and add an info:regioninfo whose contents is the serialized new hregioninfo.
>        HTable metaTable = new HTable(hConfig, ".META.");
>        metaTable.put(put);
>        System.out.println("New row in .META.: " + newStartRegion.getRegionNameAsString() + " End key is " + Bytes.toString(exclusiveEndRegionInfo.getStartKey()));
>        admin.assign(newStartRegion.getRegionName(), true); //Assign the new region.
>        System.out.println("Assigned the new region " + newStartRegion.getRegionNameAsString());
>    }
>
>    private static void moveDataToBackup(HRegionInfo closedRegion, Configuration conf) throws IOException {
>
>
>        Path rootPath = FSUtils.getRootDir(conf);
>        String tablename = closedRegion.getRegionNameAsString().split(",")[0]; //split regionname on comma. tablename comes before first comma
>        Path tablePath = new Path(rootPath, tablename);
>        String[] dotSplit = closedRegion.getRegionNameAsString().split("\\.", 0);
>        String regionId = dotSplit[dotSplit.length - 1]; //split regionname on dot. regionId between last two dots
>        Path regionPath = new Path(tablePath, regionId);
>        System.out.println(regionPath);
>        FileSystem fs = FileSystem.get(conf);
>
>        Path regionBackupPath = new Path(conf.get("region.remove.backupdir", "regionBackup-" + System.currentTimeMillis()) + "/" + regionId);
>
>        //Path regionBackupPath = new Path(backupPath, regionId);
>        System.out.println("moving to: " + regionBackupPath);
>        fs.rename(regionPath, regionBackupPath);
>
>    }
> }
>
> -----Original Message-----
> From: Stuart Smith [mailto:stu24mail@yahoo.com]
> Sent: Saturday, October 29, 2011 1:39 PM
> To: user@hbase.apache.org
> Subject: Re: PENDING_CLOSE for too long
>
> Hello Geoff,
>
>   I usually don't show up here, since I use CDH, and good form means I should stay on CDH-users,
> But!
>   I've been seeing the same issues for months:
>
>  - PENDING_CLOSE too long, master tries to reassign - I see a continuous stream of these.
>  - WrongRegionExceptions due to overlapping regions & holes in the regions.
>
> I just spent all day yesterday cribbing off of St.Ack's check_meta.rb script to write a java program to fix up overlaps & holes in an offline fashion (hbase down, directly on hdfs), and will start testing next week (cross my fingers!).
>
> It seems like the pending close messages can be ignored?
> And once I test my tool, and confirm I know a little bit about what I'm doing, maybe we could share notes?
>
> Take care,
>   -stu
>
>
>
> ________________________________
> From: Geoff Hendrey <gh...@decarta.com>
> To: user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Sent: Saturday, September 3, 2011 12:11 AM
> Subject: RE: PENDING_CLOSE for too long
>
> "Are you having trouble getting to any of your data out in tables?"
>
> depends what you mean. We see corruptions from time to time that prevent
> us from getting data, one way or another. Today's corruption was regions
> with duplicate start and end rows. We fixed that by deleting the
> offending regions from HDFS, and running add_table.rb to restore the
> meta. The other common corruption is the holes in ".META." that we
> repair with a little tool we wrote. We'd love to learn why we see these
> corruptions with such regularity (seemingly much higher than others on
> the list).
>
> We will implement the timeout you suggest, and see how it goes.
>
> Thanks,
> Geoff
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> Sent: Friday, September 02, 2011 10:51 PM
> To: user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: Re: PENDING_CLOSE for too long
>
> Are you having trouble getting to any of your data out in tables?
>
> To get rid of them, try restarting your master.
>
> Before you restart your master, do "HBASE-4126  Make timeoutmonitor
> timeout after 30 minutes instead of 3"; i.e. set
> "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
> hbase-site.xml.
>
> St.Ack
>
> On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
> wrote:
>> In the master logs, I am seeing "regions in transition timed out" and
>> "region has been PENDING_CLOSE for too long, running forced unasign".
>> Both of these log messages occur at INFO level, so I assume they are
>> innocuous. Should I be concerned?
>>
>>
>>
>> -geoff
>>
>>
>

tool to move out consecutive regions

Posted by Geoff Hendrey <gh...@decarta.com>.
Hi Guys -

This is a fairly complete little Tool (Configured) whose purpose is to move out a whole slew of regions into a backup directory and restore .META. when done. We found that we needed to do this when a huge volume of keys had been generated into a production table, and it turned out the whole set of keys had an incorrect prefix. Thus, what we really wanted to do was move the data out of all the regions into some backup directory in one fell swoop. This tool accepts some parameters with -D (hadoop arguments). It will remove a slew of contiguous regions, relink the .META., and place the removed data in a backup directory in HDFS. It has been tested on big tables and includes catches for some of the more subtle "gotchas", like being careful when parsing region names to guard against rowkeys actually containing commas. It worked for me, but use at your own risk.

Basically you give it -Dregion.remove.regionname.start=STARTREGION and region.remove.regionname.end=ENDREGION and all the data between STARTREGION and ENDREGION will be moved out of your table, where STARTREGION and ENDREGION are region names.
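
An illustrative invocation (the jar name and backup path are made up; substitute the two full region names exactly as they appear in .META.):

hadoop jar your-region-tools.jar RemoveRegions \
    -Dregion.remove.regionname.start='<full name of the first region in the range>' \
    -Dregion.remove.regionname.end='<full name of the region that ends the range>' \
    -Dregion.remove.backupdir=/backups/regionBackup-20111031

Note that the code below takes the ZooKeeper quorum from a Java system property (with a hard-coded default), not from the -D Hadoop arguments, so point that at your own quorum before running.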

import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.NotServingRegionException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.FSUtils;
import org.apache.hadoop.hbase.util.Writables;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author ghendrey
 */
public class RemoveRegions extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new RemoveRegions(), args);
        System.exit(exitCode);
    }

    private static void deleteMetaRow(HRegionInfo closedRegion, HTable hMetaTable) throws IOException {
        Delete del = new Delete(closedRegion.getRegionName()); //Delete the original row from .META.
        hMetaTable.delete(del);
        System.out.println("Deleted the region's row from .META. " + closedRegion.getRegionNameAsString());
    }

    private static HRegionInfo closeRegion(Result result, HBaseAdmin admin) throws RuntimeException, IOException {

        byte[] bytes = result.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
        HRegionInfo closedRegion = Writables.getHRegionInfo(bytes);

        try {
            admin.closeRegion(closedRegion.getRegionName(), null); //. Close the existing region if open.
            System.out.println("Closed the Region " + closedRegion.getRegionNameAsString());
        } catch (Exception nse) {
            System.out.println("Skipped closing the region because: " + nse.getMessage());
        }
        return closedRegion;
    }

    private static HRegionInfo getRegionInfo(String exclusiveStartRegionName, Configuration hConfig) throws IOException {
        HTable readTable = new HTable(hConfig, Bytes.toBytes(".META."));
        Get readGet = new Get(Bytes.toBytes(exclusiveStartRegionName));
        Result readResult = readTable.get(readGet);
        byte[] readBytes = readResult.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
        HRegionInfo regionInfo = Writables.getHRegionInfo(readBytes); //Read the existing hregioninfo.
        System.out.println("got region info: " + regionInfo);
        return regionInfo;
    }

    private static void createBackupDir(Configuration conf) throws IOException {

        String path = conf.get("region.remove.backupdir", "regionBackup-" + System.currentTimeMillis());
        Path backupDirPath = new Path(path);
        FileSystem fs = backupDirPath.getFileSystem(conf);
        FSUtils.DirFilter dirFilt = new FSUtils.DirFilter(fs);
        System.out.println("creating backup dir: " + backupDirPath.toString());
        fs.mkdirs(backupDirPath);
    }

    public int run(String[] strings) throws Exception {
        try {
            System.setProperty("javax.xml.parsers.DocumentBuilderFactory", "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
            Configuration conf = getConf();
            Configuration hConfig = HBaseConfiguration.create(conf);
            hConfig.set("hbase.zookeeper.quorum", System.getProperty("hbase.zookeeper.quorum", "doop2.dt.sv4.decarta.com,doop3.dt.sv4.decarta.com,doop4.dt.sv4.decarta.com,doop5.dt.sv4.decarta.com,doop7.dt.sv4.decarta.com,doop8.dt.sv4.decarta.com,doop9.dt.sv4.decarta.com,doop10.dt.sv4.decarta.com"));
            HBaseAdmin admin = new HBaseAdmin(hConfig);
            HBaseAdmin.checkHBaseAvailable(hConfig);


            System.out.println("regions will be moved out from between region.remove.regionname.start and region.remove.regionname.end (exclusive)");
            String exclusiveStartRegionName = conf.get("region.remove.regionname.start");
            if (null == exclusiveStartRegionName) {
                throw new RuntimeException("Current implementation requires an exclusive region.remove.regionname.start");
            }
            System.out.println("region.remove.regionname.start=" + exclusiveStartRegionName);
            String exclusiveEndRegionName = conf.get("region.remove.regionname.end");
            if (null == exclusiveEndRegionName) {

                throw new RuntimeException("Current implementation requires an exclusive region.remove.regionname.end");
            }
            System.out.println("region.remove.regionname.end=" + exclusiveEndRegionName);

            //CREATE A BACKUP DIR FOR THE REGION DATA TO BE MOVED INTO
            createBackupDir(hConfig);


            Path hbaseRootPath = FSUtils.getRootDir(hConfig);
            if (null == hbaseRootPath) {
                throw new RuntimeException("couldn't determine hbase root dir");
            } else {
                System.out.println("hbase rooted at " + hbaseRootPath.toString());
            }

            HTable hMetaTable = new HTable(hConfig, Bytes.toBytes(".META."));
            System.out.println("connected to .META.");

            //get region info for start and end regions
            HRegionInfo exclusiveStartRegionInfo = getRegionInfo(exclusiveStartRegionName, hConfig);
            HRegionInfo exclusiveEndRegionInfo = getRegionInfo(exclusiveEndRegionName, hConfig);


            //CLOSE all the regions starting with the exclusiveStartRegionName (including it), and up to but excluding closing the exclusiveEndRegionName
            //and DELETE rows from .META.
            Scan scan = new Scan(Bytes.toBytes(exclusiveStartRegionName), Bytes.toBytes(exclusiveEndRegionName));
            ResultScanner metaScanner = hMetaTable.getScanner(scan);
            for (Result res : metaScanner) {
                //CLOSE REGION
                HRegionInfo closedRegion = closeRegion(res, admin);
                //MOVE ACTUAL DATA OUT OF HBASE HDFS INTO BACKUP AREA
                moveDataToBackup(closedRegion, hConfig);
                //DELETE ROW FROM META TABLE
                deleteMetaRow(closedRegion, hMetaTable);
            }
            metaScanner.close();

            //now reinsert the startrow into .META. with its endrow pointing to the startrow of the exclusiveEndRegionInfo
            //This effectively "relinks" the linked list of .META., now that all the interstitial region-rows have been removed from .META.
            relinkStartRow(exclusiveStartRegionInfo, exclusiveEndRegionInfo, hConfig, admin);


            return 0;

        } catch (Exception ex) {
            throw new RuntimeException(ex.getMessage(), ex);
        }

    }

    private void relinkStartRow(HRegionInfo exclusiveStartRegionInfo, HRegionInfo exclusiveEndRegionInfo, Configuration hConfig, HBaseAdmin admin) throws IllegalArgumentException, IOException {
        //Now we are going to recreate the region info for exclusiveStartRegion, such that its endKey points to the startKey
        //of the exclusiveEndRegion.
        HTableDescriptor descriptor = new HTableDescriptor(exclusiveStartRegionInfo.getTableDesc()); // Reuse the table descriptor from the existing HRegionInfo.
        // Only the end key changes, nothing else; this is the "relink" step.
        byte[] startKey = exclusiveStartRegionInfo.getStartKey();
        byte[] endKey = exclusiveEndRegionInfo.getStartKey();
        HRegionInfo newStartRegion = new HRegionInfo(descriptor, startKey, endKey);
        byte[] value = Writables.getBytes(newStartRegion);
        Put put = new Put(newStartRegion.getRegionName()); // Row key in .META. is the new region name.
        put.add(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER, value); // info:regioninfo holds the serialized new HRegionInfo.
        HTable metaTable = new HTable(hConfig, ".META.");
        metaTable.put(put);
        System.out.println("New row in .META.: " + newStartRegion.getRegionNameAsString() + " End key is " + Bytes.toString(exclusiveEndRegionInfo.getStartKey()));
        admin.assign(newStartRegion.getRegionName(), true); //Assign the new region.
        System.out.println("Assigned the new region " + newStartRegion.getRegionNameAsString());
    }

    private static void moveDataToBackup(HRegionInfo closedRegion, Configuration conf) throws IOException {


        Path rootPath = FSUtils.getRootDir(conf);
        String tablename = closedRegion.getRegionNameAsString().split(",")[0]; //split regionname on comma. tablename comes before first comma
        Path tablePath = new Path(rootPath, tablename);
        String[] dotSplit = closedRegion.getRegionNameAsString().split("\\.", 0);
        String regionId = dotSplit[dotSplit.length - 1]; // the encoded region name (between the last two dots) doubles as the region's directory name on HDFS
        Path regionPath = new Path(tablePath, regionId);
        System.out.println(regionPath);
        FileSystem fs = FileSystem.get(conf);

        Path regionBackupPath = new Path(conf.get("region.remove.backupdir", "regionBackup-" + System.currentTimeMillis()) + "/" + regionId);

        System.out.println("moving to: " + regionBackupPath);
        fs.rename(regionPath, regionBackupPath);

    }
}
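
For anyone who wants to run the tool above: it reads its parameters from the Hadoop Configuration, so it is presumably meant to be driven through ToolRunner (the run(String[])/getConf() pair suggests the class extends Configured and implements Tool). Below is a minimal, hypothetical driver showing the shape of an invocation; "RegionRemover" and the two region names are placeholders, not the actual class or region names.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class RegionRemoverDriver {
    public static void main(String[] args) throws Exception {
        // Plain Hadoop Configuration; the tool layers HBaseConfiguration on top inside run().
        Configuration conf = new Configuration();
        // Both properties can also be supplied on the command line as
        // -Dregion.remove.regionname.start=... and -Dregion.remove.regionname.end=...
        conf.set("region.remove.regionname.start", "mytable,aaa,1316049234029.abc123");   // placeholder region name
        conf.set("region.remove.regionname.end", "mytable,mmm,1316049234031.def456");     // placeholder region name
        // Optional: where the removed region directories get moved to on HDFS.
        conf.set("region.remove.backupdir", "regionBackup-manual");
        int exit = ToolRunner.run(conf, new RegionRemover(), args);   // RegionRemover stands in for the tool class above
        System.exit(exit);
    }
}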

-----Original Message-----
From: Stuart Smith [mailto:stu24mail@yahoo.com] 
Sent: Saturday, October 29, 2011 1:39 PM
To: user@hbase.apache.org
Subject: Re: PENDING_CLOSE for too long

Hello Geoff,

  I usually don't show up here, since I use CDH, and good form means I should stay on CDH-users,
But!
  I've been seeing the same issues for months:

 - PENDING_CLOSE too long, master tries to reassign - I see a continuous stream of these.
 - WrongRegionExceptions due to overlapping regions & holes in the regions.

I just spent all day yesterday cribbing off of St.Ack's check_meta.rb script to write a java program to fix up overlaps & holes in an offline fashion (hbase down, directly on hdfs), and will start testing next week (cross my fingers!).

It seems like the pending close messages can be ignored?
And once I test my tool, and confirm I know a little bit about what I'm doing, maybe we could share notes?

Take care,
  -stu



________________________________
From: Geoff Hendrey <gh...@decarta.com>
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Sent: Saturday, September 3, 2011 12:11 AM
Subject: RE: PENDING_CLOSE for too long

"Are you having trouble getting to any of your data out in tables?"

depends what you mean. We see corruptions from time to time that prevent
us from getting data, one way or another. Today's corruption was regions
with duplicate start and end rows. We fixed that by deleting the
offending regions from HDFS, and running add_table.rb to restore the
meta. The other common corruption is the holes in ".META." that we
repair with a little tool we wrote. We'd love to learn why we see these
corruptions with such regularity (seemingly much higher than others on
the list).

We will implement the timeout you suggest, and see how it goes.

Thanks,
Geoff

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
Stack
Sent: Friday, September 02, 2011 10:51 PM
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Subject: Re: PENDING_CLOSE for too long

Are you having trouble getting to any of your data out in tables?

To get rid of them, try restarting your master.

Before you restart your master, do "HBASE-4126  Make timeoutmonitor
timeout after 30 minutes instead of 3"; i.e. set
"hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
hbase-site.xml.

St.Ack

On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
wrote:
> In the master logs, I am seeing "regions in transition timed out" and
> "region has been PENDING_CLOSE for too long, running forced unasign".
> Both of these log messages occur at INFO level, so I assume they are
> innocuous. Should I be concerned?
>
>
>
> -geoff
>
>

RE: PENDING_CLOSE for too long

Posted by Geoff Hendrey <gh...@decarta.com>.
Attached is my original email to the list, which contains code for a tool to repair your "hole" in .META.



-----Original Message-----
From: Stuart Smith [mailto:stu24mail@yahoo.com] 
Sent: Saturday, October 29, 2011 1:39 PM
To: user@hbase.apache.org
Subject: Re: PENDING_CLOSE for too long

Hello Geoff,

  I usually don't show up here, since I use CDH, and good form means I should stay on CDH-users,
But!
  I've been seeing the same issues for months:

 - PENDING_CLOSE too long, master tries to reassign - I see a continuous stream of these.
 - WrongRegionExceptions due to overlapping regions & holes in the regions.

I just spent all day yesterday cribbing off of St.Ack's check_meta.rb script to write a java program to fix up overlaps & holes in an offline fashion (hbase down, directly on hdfs), and will start testing next week (cross my fingers!).

It seems like the pending close messages can be ignored?
And once I test my tool, and confirm I know a little bit about what I'm doing, maybe we could share notes?

Take care,
  -stu



________________________________
From: Geoff Hendrey <gh...@decarta.com>
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Sent: Saturday, September 3, 2011 12:11 AM
Subject: RE: PENDING_CLOSE for too long

"Are you having trouble getting to any of your data out in tables?"

depends what you mean. We see corruptions from time to time that prevent
us from getting data, one way or another. Today's corruption was regions
with duplicate start and end rows. We fixed that by deleting the
offending regions from HDFS, and running add_table.rb to restore the
meta. The other common corruption is the holes in ".META." that we
repair with a little tool we wrote. We'd love to learn why we see these
corruptions with such regularity (seemingly much higher than others on
the list).

We will implement the timeout you suggest, and see how it goes.

Thanks,
Geoff

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
Stack
Sent: Friday, September 02, 2011 10:51 PM
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Subject: Re: PENDING_CLOSE for too long

Are you having trouble getting to any of your data out in tables?

To get rid of them, try restarting your master.

Before you restart your master, do "HBASE-4126  Make timeoutmonitor
timeout after 30 minutes instead of 3"; i.e. set
"hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
hbase-site.xml.

St.Ack

On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
wrote:
> In the master logs, I am seeing "regions in transition timed out" and
> "region has been PENDING_CLOSE for too long, running forced unasign".
> Both of these log messages occur at INFO level, so I assume they are
> innocuous. Should I be concerned?
>
>
>
> -geoff
>
>

Re: PENDING_CLOSE for too long

Posted by Stuart Smith <st...@yahoo.com>.
Hello Geoff,

  I usually don't show up here, since I use CDH, and good form means I should stay on CDH-users,
But!
  I've been seeing the same issues for months:

 - PENDING_CLOSE too long, master tries to reassign - I see a continuous stream of these.
 - WrongRegionExceptions due to overlapping regions & holes in the regions.

I just spent all day yesterday cribbing off of St.Ack's check_meta.rb script to write a java program to fix up overlaps & holes in an offline fashion (hbase down, directly on hdfs), and will start testing next week (cross my fingers!).
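
(For what it's worth, a bare-bones sketch of that kind of offline check -- not Stuart's actual tool -- could look roughly like the code below. It assumes the 0.90-era layout where each region directory under the table dir holds a serialized .regioninfo file; the class name and details are illustrative only.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.FSUtils;

public class OfflineRegionChecker {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Path tableDir = new Path(FSUtils.getRootDir(conf), args[0]); // args[0] = table name
        FileSystem fs = tableDir.getFileSystem(conf);

        // Read the serialized HRegionInfo out of each region directory's .regioninfo file.
        List<HRegionInfo> regions = new ArrayList<HRegionInfo>();
        for (FileStatus regionDir : fs.listStatus(tableDir, new FSUtils.DirFilter(fs))) {
            Path regionInfoFile = new Path(regionDir.getPath(), ".regioninfo");
            if (!fs.exists(regionInfoFile)) continue; // skip non-region dirs such as .tmp
            FSDataInputStream in = fs.open(regionInfoFile);
            HRegionInfo hri = new HRegionInfo();
            hri.readFields(in);
            in.close();
            regions.add(hri);
        }

        // Sort by start key, then flag holes (gap between one region's end key and the
        // next region's start key) and overlaps (next start key sorts before the end key).
        Collections.sort(regions);
        for (int i = 0; i + 1 < regions.size(); i++) {
            int cmp = Bytes.compareTo(regions.get(i).getEndKey(), regions.get(i + 1).getStartKey());
            if (cmp < 0) {
                System.out.println("HOLE between " + regions.get(i).getRegionNameAsString()
                        + " and " + regions.get(i + 1).getRegionNameAsString());
            } else if (cmp > 0) {
                System.out.println("OVERLAP between " + regions.get(i).getRegionNameAsString()
                        + " and " + regions.get(i + 1).getRegionNameAsString());
            }
        }
    }
}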

It seems like the pending close messages can be ignored?
And once I test my tool, and confirm I know a little bit about what I'm doing, maybe we could share notes?

Take care,
  -stu



________________________________
From: Geoff Hendrey <gh...@decarta.com>
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Sent: Saturday, September 3, 2011 12:11 AM
Subject: RE: PENDING_CLOSE for too long

"Are you having trouble getting to any of your data out in tables?"

depends what you mean. We see corruptions from time to time that prevent
us from getting data, one way or another. Today's corruption was regions
with duplicate start and end rows. We fixed that by deleting the
offending regions from HDFS, and running add_table.rb to restore the
meta. The other common corruption is the holes in ".META." that we
repair with a little tool we wrote. We'd love to learn why we see these
corruptions with such regularity (seemingly much higher than others on
the list).

We will implement the timeout you suggest, and see how it goes.

Thanks,
Geoff

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
Stack
Sent: Friday, September 02, 2011 10:51 PM
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Subject: Re: PENDING_CLOSE for too long

Are you having trouble getting to any of your data out in tables?

To get rid of them, try restarting your master.

Before you restart your master, do "HBASE-4126  Make timeoutmonitor
timeout after 30 minutes instead of 3"; i.e. set
"hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
hbase-site.xml.

St.Ack

On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
wrote:
> In the master logs, I am seeing "regions in transition timed out" and
> "region has been PENDING_CLOSE for too long, running forced unasign".
> Both of these log messages occur at INFO level, so I assume they are
> innocuous. Should I be concerned?
>
>
>
> -geoff
>
>

RE: PENDING_CLOSE for too long

Posted by Geoff Hendrey <gh...@decarta.com>.
"Are you having trouble getting to any of your data out in tables?"

depends what you mean. We see corruptions from time to time that prevent
us from getting data, one way or another. Today's corruption was regions
with duplicate start and end rows. We fixed that by deleting the
offending regions from HDFS, and running add_table.rb to restore the
meta. The other common corruption is the holes in ".META." that we
repair with a little tool we wrote. We'd love to learn why we see these
corruptions with such regularity (seemingly much higher than others on
the list).

We will implement the timeout you suggest, and see how it goes.

Thanks,
Geoff

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
Stack
Sent: Friday, September 02, 2011 10:51 PM
To: user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Subject: Re: PENDING_CLOSE for too long

Are you having trouble getting to any of your data out in tables?

To get rid of them, try restarting your master.

Before you restart your master, do "HBASE-4126  Make timeoutmonitor
timeout after 30 minutes instead of 3"; i.e. set
"hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
hbase-site.xml.

St.Ack

On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com>
wrote:
> In the master logs, I am seeing "regions in transition timed out" and
> "region has been PENDING_CLOSE for too long, running forced unasign".
> Both of these log messages occur at INFO level, so I assume they are
> innocuous. Should I be concerned?
>
>
>
> -geoff
>
>

Re: PENDING_CLOSE for too long

Posted by Stack <st...@duboce.net>.
Are you having trouble getting to any of your data out in tables?

To get rid of them, try restarting your master.

Before you restart your master, do "HBASE-4126  Make timeoutmonitor
timeout after 30 minutes instead of 3"; i.e. set
"hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
hbase-site.xml.

St.Ack

On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <gh...@decarta.com> wrote:
> In the master logs, I am seeing "regions in transition timed out" and
> "region has been PENDING_CLOSE for too long, running forced unasign".
> Both of these log messages occur at INFO level, so I assume they are
> innocuous. Should I be concerned?
>
>
>
> -geoff
>
>
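
For concreteness, the hbase-site.xml change Stack describes is just this property on the master (1800000 ms = 30 minutes); the master needs a restart afterwards for it to take effect:

<property>
  <name>hbase.master.assignment.timeoutmonitor.timeout</name>
  <value>1800000</value>
</property>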