Posted to user@hbase.apache.org by jackie macmillian <ja...@gmail.com> on 2020/08/04 11:46:54 UTC

Ghost Regions Problem

Hi all,

We have a cluster running HBase 2.2.0 on Hadoop 2.9.2.
A few weeks ago we ran into trouble with active/standby namenode election: some
network problems made the ZKFC services compete over which namenode should be
active. As a result, both namenodes were active for a short time and all of the
region server services restarted themselves. We solved that part by adjusting a
few timeout parameters, but the real story began afterwards.
After the region servers came back up, all of our HBase tables were unstable.
For example, take a table with 200 regions: 196 of them came online, but 4 were
stuck in an intermediate state such as CLOSING or OPENING, and in the end the
tables themselves were stuck in DISABLING/ENABLING states. Meanwhile HBase had
piled up lots of procedure locks and the MasterProcWALs directory kept growing.
To work around that, I used HBCK2 to release the stuck regions, and once I
managed to enable a table, I created an empty copy of it from its descriptor
and bulk loaded all the HFiles of the corrupt table into the new one. You might
ask why I didn't just keep using the re-enabled table: I couldn't, because
although I was able to bypass the locked procedures, there were far too many of
them to resolve one by one. If you use HBCK2 to bypass those locks but leave
the procedures as they are, it is only a cosmetic fix; the regions never really
come online. So I figured it would be much faster to create a brand new table
and load all the data into it. The bulk load was successful and the new table
became online and scannable.
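For anyone curious, the bulk load can be done with the standard HBase 2.x bulk
load tool along these lines (a sketch only; the table names, column family and
HDFS paths are placeholders, assuming the default namespace, and this is
repeated per region and column family):

    # Stage the HFiles of one region's column family 'cf' from the old table.
    hdfs dfs -mkdir -p /tmp/bulkload/cf
    hdfs dfs -cp /hbase/data/default/old_table/<encoded-region>/cf/* /tmp/bulkload/cf/
    # Load the staged files into the new table.
    hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles /tmp/bulkload new_table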
The next step was to disable the old table and drop it. But since the HMaster
was still dealing with all those locks and procedures, I wasn't able to disable
it; some regions got stuck in DISABLING state again. So I set the table's state
to DISABLED with HBCK2 and was then able to drop it.
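The HBCK2 invocations were roughly along these lines (a sketch: the procedure
ids and table name are placeholders, and the jar name depends on the HBCK2
build you use):

    # Bypass stuck procedures, overriding held locks and including children.
    hbase hbck -j hbase-hbck2-<version>.jar bypass -o -r 1234 1235 1236
    # Force the table state recorded in hbase:meta so the drop can proceed.
    hbase hbck -j hbase-hbck2-<version>.jar setTableState old_table DISABLED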
After I got all my tables online and dropped all the old ones, the
MasterProcWALs directory was the last stop on the way to a clean HBase, I
thought :) I moved the MasterProcWALs directory aside and restarted the active
master. The new active master took over and, voilà, the master procedures and
locks were gone and all my tables were online as needed. I scanned hbase:meta
and saw no regions other than the ones that were online.
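To be explicit about "moved aside": I sidelined the directory in HDFS and then
restarted the active master, something like this (the /hbase prefix is a
placeholder for whatever hbase.rootdir points to in your setup):

    # Sideline the procedure store so the next active master starts clean.
    hdfs dfs -mv /hbase/MasterProcWALs /hbase/MasterProcWALs.sidelined
    # Sanity check afterwards: only live regions should remain in meta.
    echo "scan 'hbase:meta', {COLUMNS => ['info:regioninfo']}" | hbase shell -n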
Until now, that is. Remember the regions that were stuck and had to be
force-closed so the tables could be disabled and dropped? Whenever a region
server crashes and restarts now, the master tries to assign those regions to
other region servers, but the region servers decline the assignment because
there is no table descriptor for them. Take a look at HBASE-22780
<https://issues.apache.org/jira/browse/HBASE-22780>; it describes exactly the
same problem.
I tried creating a single-region table with the same name as the old one. That
worked: the ghost regions attached themselves to the new table, I disabled and
dropped it again, and verified that hbase:meta no longer contained those
regions. But after the next region server crash they came back out of nowhere.
So I figured out that when a region server goes down, the HMaster does not read
hbase:meta to reassign that server's regions to other servers. I've read that
the master keeps an in-memory representation of hbase:meta so it can handle
assignments as quickly as possible. I can clean the ghost regions out of
hbase:meta as described, but I also need to force the masters to load that
clean copy into their in-memory representation. How can I achieve that? Assume
I have cleaned up the meta table; what next? A rolling restart of the HMasters?
Do the standby masters share the same in-memory meta with the active one? If
so, I don't think a rolling restart would solve the problem. Or should I shut
all the masters down and start them again so that they rebuild their in-memory
state from the meta table?
Any help would be appreciated.
Thank you for your patience :)

jackie

Re: Ghost Regions Problem

Posted by Wellington Chevreuil <we...@gmail.com>.
I mentioned assigns as a possible solution for your original issue (before
you dropped/recreated/bulk loaded the original table). It obviously will never
work for these "ghost" regions, because they don't belong to any table.

Yes, a rolling restart of the masters will make them read state from meta
again. Can you confirm how you originally cleaned up the problem, especially
whether you manually deleted regions from meta?
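(By rolling restart I mean stopping and starting one master at a time,
standbys first, e.g. with the standard scripts on each master host; adjust to
however your cluster is managed:

    bin/hbase-daemon.sh stop master
    bin/hbase-daemon.sh start master

The master that becomes active after the failover rebuilds its assignment
state from hbase:meta during initialisation.)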

Re: Ghost Regions Problem

Posted by jackie macmillian <ja...@gmail.com>.
Thanks for your response, Wellington.

Unfortunately, the HBCK2 assigns method does not work here, because the table
descriptor is missing both in the meta table and in memory. The actual table
and most of its regions had been dropped successfully. When you try to assign
the remaining ghost regions, they get stuck as described in HBASE-22780.

One way to get rid of those regions is to create a new table with the old
table's name. Suppose you have 4 ghost regions: if you create a 1-region
table, the 4 ghosts attach to it, forming a 5-region table, and you can then
disable and drop that table successfully. The trouble is that we have many
tables and regions in this state, so it is hard to track them all down.
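In HBase shell terms the workaround looks roughly like this ('old_table' and
the column family name are placeholders for the real ones; the ghost regions
re-attach right after the create):

    create 'old_table', 'cf'
    disable 'old_table'
    drop 'old_table'
    # Check that no rows are left for it in hbase:meta.
    scan 'hbase:meta', {ROWPREFIXFILTER => 'old_table,'}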

To cut a long story short, the HMaster assumes its in-memory representation of
the meta table is intact, but in fact it is not. I need a way to force all
masters to rebuild their in-memory representation from the clean hbase:meta
table. Does a rolling restart of all masters do that, or do I have to shut all
of them down so that they go through initialization on startup?

Re: Ghost Regions Problem

Posted by Wellington Chevreuil <we...@gmail.com>.
>
>  If you use HBCK2 to bypass those locks but leave the procedures as they
> are, it is only a cosmetic fix; the regions never really come online.
>
You can use the HBCK2 *assigns* method to bring those regions online (it
accepts multiple regions as input).
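For example, something like this (the encoded region names below are
placeholders; pass the encoded name, i.e. the hash suffix of the full region
name, and point -j at your HBCK2 jar):

    hbase hbck -j hbase-hbck2-<version>.jar assigns \
        de00010733901a05f5a2a3a382e27dd4 d8c1dcd1f4ba0d4d9e7318fdecc8f3bb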

> I've read that the master keeps an in-memory representation of
> hbase:meta
>
Yes, masters read the meta table only during initialisation; from then on,
since every change to meta is orchestrated by the active master, it assumes
its in-memory representation of meta is the truth. What exact steps did you
follow when you say you dropped those ghost regions? If that involved any
manual deletion of region dirs/files in HDFS, or direct manipulation of the
meta table via the client API, then that explains the master inconsistency.
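By direct manipulation I mean anything along the lines of deleting region rows
by hand, for instance from the shell (the row key here is purely illustrative):

    # Deleting meta rows behind the master's back leaves its in-memory
    # assignment state out of sync with hbase:meta.
    deleteall 'hbase:meta', 'old_table,,1596540000000.de00010733901a05f5a2a3a382e27dd4.'

Anything in that style would explain what you are seeing.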

