Posted to user@hbase.apache.org by Bill Graham <bi...@gmail.com> on 2011/01/22 19:27:45 UTC

Lost .META., lost tables

Hi,

Last night while experimenting with getting lzo set up I managed to
somehow lose all .META. data and all my tables. My regions still exist
in HDFS, but the shell tells me I have no tables. At this point I'm
pretty sure I need to reinstall HBase clean-slate on HDFS, hence
losing all data, but I'm sharing my story in case there are JIRAs to
be created or lessons to be learned.

Specifics:
- 4-node cluster running 0.90.0.rc1
- 1 table of a few GBs and 24 regions, let's call it TableA
- CDH3b2

1. Just for kicks I decided to issue an alter table command to change
COMPRESSION to 'lzo' for TableA to see what would happen. I hadn't yet
taken any steps to install the native LZO libs for HBase (they're
already set up on the Hadoop/HDFS side), so this was probably a stupid
thing to do. After issuing the command I wasn't able to re-enable the
table, nor could I fully disable it. I was in a state somewhere in
between the two, as described in a thread earlier this week: the shell
said enabled, master.jsp said disabled, and calls to do either would
time out. The master server was logging the same exceptions as in
HBASE-3406 ad infinitum, and hbck -fix wasn't doing anything. After
bouncing the entire cluster a few times (master, RSs, ZooKeepers) and
re-running hbck -fix, I finally got back to a normal state with
COMPRESSION set back to 'none'.
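
For reference, the shell sequence was roughly the following. I'm
reconstructing it from memory, and 'cf1' stands in for the real column
family name:

  hbase> disable 'TableA'
  hbase> alter 'TableA', {NAME => 'cf1', COMPRESSION => 'lzo'}
  hbase> enable 'TableA'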

Besides HBASE-3406, maybe there's another JIRA here: the shell happily
permits setting COMPRESSION => 'lzo' when LZO isn't actually set up,
and that leaves the table in a nasty state.

At this point I should have been grateful and called it a night, but
noooooo... Instead I shut down the cluster again, symlinked HBase's
lib/native to the same dir in my Hadoop home (which is LZO-enabled),
and restarted the cluster. All seemed OK.
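
Concretely, the symlinking was something along these lines (the paths
below are illustrative placeholders, not my real install locations):

  # Point HBase's native-lib dir at the Hadoop one that already has the
  # LZO native libraries; $HADOOP_HOME and $HBASE_HOME are placeholders.
  cd $HBASE_HOME/lib
  mv native native.orig
  ln -s $HADOOP_HOME/lib/native native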

2. At this point I decided to experiment with a new table after
reading http://wiki.apache.org/hadoop/UsingLzoCompression more
closely. After creating 'mytable' with LZO enabled, I saw behavior
similar to what I'd seen in 1., so I used the same techniques to try
to just drop the table. After bouncing the cluster and issuing an hbck
-fix, the shell reported that HBase had no tables at all. It seemed
like all the .META. data had been wiped out, but I still had all of my
orphaned regions in HDFS. This was very bad.
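
Again for reference, the create and the attempted drop looked roughly
like this (again with 'cf1' as a stand-in for the real column family):

  hbase> create 'mytable', {NAME => 'cf1', COMPRESSION => 'lzo'}
  hbase> disable 'mytable'
  hbase> drop 'mytable'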

It was clear that these tables weren't coming back, so in a last-ditch
effort I stopped the HBase cluster, the SNN and the NN, and restored
HDFS from the checkpoint taken about an hour before. Now everything
was out of whack: HBase wouldn't even come up, -ROOT- couldn't be
located, .log/ files weren't being read properly, and things were a
mess.

One could make the argument that I was beating on HBase a bit and
maybe even trying to break things, but it didn't take a lot of effort
to get to a pretty dire state.

thanks,
Bill

Re: Lost .META., lost tables

Posted by Stack <st...@duboce.net>.
On Sat, Jan 22, 2011 at 11:25 AM, Ted Dunning <td...@maprtech.com> wrote:
> Is it necessary to checkpoint ZK when checkpointing an hbase cluster?
>

I'd think so on restart, Ted; it should be OK.  We clear ZK state on a
clean restart of the cluster.
St.Ack


Re: Lost .META., lost tables

Posted by Ted Dunning <td...@maprtech.com>.
Is it necessary to checkpoint ZK when checkpointing an HBase cluster?


Re: Lost .META., lost tables

Posted by Bill Graham <bi...@gmail.com>.
Thanks for all the pointers, Stack. I've since re-initialized HBase, so
many of the diagnostic steps you've suggested no longer apply, but
they've got me better armed for the next time I need them.

I've attached the master logs from the time I created mytable to when
I finally shut down the cluster to wipe out HDFS and start over. I've
heavily sed'ed the sensitive info into dummy placeholders, so let me
know if anything doesn't make sense.
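
The scrubbing was along these lines; the pattern and hostnames here
are made up for illustration:

  sed -e 's/prod-hbase-[0-9]*\.example\.com/hostN.example.com/g' \
      hbase-master.log > hbase-master.scrubbed.log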

I actually still have the old region data saved aside. Just curious:
is there an easy way to import that into the new table without writing
an MR job? If it's easy to save the data I will, but I can survive
without it.

More comments below.

thanks,
Bill


On Sun, Jan 23, 2011 at 11:16 AM, Stack <st...@duboce.net> wrote:
> On Sat, Jan 22, 2011 at 10:27 AM, Bill Graham <bi...@gmail.com> wrote:
>> Hi,
>>
>> Last night while experimenting with getting lzo set up I managed to
>> somehow lose all .META. data and all my tables. My regions still exist
>> in HDFS, but the shell tells me I have no tables.
>
> If you scan .META., what's it say?
>
> hbase> scan '.META.'
>
>
> Is it empty?

That's a great question; I forgot that .META. is just a table that I
can scan. Next time I'll try that.

>
> You are running NTP on your cluster and all machines are close in
> time? (Edits are set with server's local time; perhaps .META. region
> moved to a machine whose clock was way behind?)
>

Yes, NTP is running and clocks are in sync.

>
>
>> At this point I'm
>> pretty sure I need to reinstall HBase clean-slate on HDFS, hence
>> losing all data, but I'm sharing my story in case there are JIRAs to
>> be created or lessons to be learned.
>>
>
> In 0.89.x and previous, bin/add_table.rb would rebuild your .META.
> You could try it.  You will probably have to restart your cluster
> after it's done to have HBase assign the tables (the regular process
> that would do this on a live cluster has been removed in 0.90 and
> replaced w/ a different mechanism -- the script needs updating, or
> rather replacing, but that's not done yet).
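
Noting this for the archives: I believe the script is run against the
table's directory in HDFS via HBase's JRuby runner, something like the
line below, though I haven't tried it myself and the exact arguments
may differ:

  $ ./bin/hbase org.jruby.Main bin/add_table.rb /hbase/TableA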
>
>
>> Specifics:
>> - 4 Node cluster running 0.90.0.rc1
>> - 1 table of a few GBs and 24 regions, let's call it TableA
>> - CDH3b2
>>
>
> Append was enabled on this cluster?  (I'm not sure if CDH enables
> append by default. Here is the flag:
>
> <property>
>  <name>dfs.support.append</name>
>  <value>true</value>
>  <description>This branch of HDFS supports reliable append/sync.
>  </description>
> </property>
>
> )
>
> If not enabled, then on crash edits to .META. may have been lost?

Yes, append is enabled.

>
>
>> 1. Just for kicks I decided to issue an alter table command to change
>> COMPRESSION to 'lzo' for TableA to see what would happen. I hadn't yet
>> taken any steps to install the native lzo libs in HBase (they exist in
>> HDFS), so this was probably a stupid thing to do. After issuing the
>> command I wasn't able to re-enable the table, nor could I fully
>> disable it. I was in a state somewhere in between the two, as
>> described in a thread earlier this week.
>
> Yeah. Sounds like the "Wayne-scenario".  If the LZO libs are not
> properly installed, regions won't deploy.  They fail in a messy way.
> You can add some insurance with something like this facility:
> http://hbase.apache.org/hbase.regionserver.codecs.html.  There is also
> a tool to test for a proper LZO install in HBase (see 'Testing
> Compression is enabled' in
> 'http://wiki.apache.org/hadoop/UsingLzoCompression').
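
For anyone else hitting this: my reading of those docs (not yet
verified on my cluster) is that the compression self-test and the
startup codec check look roughly like the following.

  $ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest /tmp/lzo-testfile lzo

  <property>
    <name>hbase.regionserver.codecs</name>
    <value>lzo</value>
    <description>If a listed codec is unavailable, the region server
    refuses to start rather than failing later when a region needs
    it.</description>
  </property>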
>
>> The shell said enabled, the
>> master.jsp said disabled. Calls to do either would time out. The
>> master server was logging the same exceptions as in HBASE-3406 ad
>> infinitum. hbck -fix wasn't doing anything. After bouncing the entire
>> cluster a few times (master, RSs, zookeepers), I was able to finally
>> get back to normal state, with COMPRESSION set to 'none'  with hbck
>> -fix.
>>
>
> Sorry for pain caused.  Enable/disable is flakey in 0.89.x and previous.
>
> Should be better in 0.90.0.
>
>
>> Besides HBASE-3406, maybe there's another JIRA here where the shell
>> permits setting COMPRESSION => 'lzo' when lzo isn't set up and leaves
>> the table in a nasty state.
>>
>
> Please add a comment to hbase-3406 and any substantiating evidence if
> you can since that issue is a little ungrounded at the mo.

Will do. It should be at least easy to reproduce now.

>
>
>> At this point I should have been grateful and called it a night, but
>> noooooo... Instead I shut down the cluster again and symlinked
>> lib/native to the same dir in my hadoop home, which is lzo-enabled and
>> I restarted the cluster. All seemed ok.
>>
>
> OK.  Serves you right for sticking with it (smile).

I know, I've learned this lesson before. When I'm working too late into
the evening, trying just one more thing while the eyes/brain are groggy,
bad things happen.

>
>
>> 2. At this point I decided to experiment with a new table after
>> reading http://wiki.apache.org/hadoop/UsingLzoCompression more
>> closely. After creating 'mytable' with lzo enabled, I saw similar
>> behavior as I did in 1., so I used the same techniques to try to
>> just drop the table. After bouncing the cluster and issuing an hbck
>> -fix, the shell reported that HBase had no tables at all. It seemed
>> like all the .META. data was wiped out but I still had all of my
>> orphaned regions in HDFS. This was very bad.
>>
>
> Yeah.  You have that master log?  You think hbck -fix really 'fixed'
> your cluster?

I can't say for sure, but I recall that neither drop table nor hbck
-fix would work, so I restarted the cluster. Then hbck -summary still
showed an inconsistent state, so I ran hbck -fix and things seemed
OK. The drop table command then succeeded, but 'list' showed no
tables. I didn't run 'list' before the drop table command, so it's
possible the tables were gone before I ran drop. Actually, they could
have been gone after the bounce but before the hbck -fix command.
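
In case it helps anyone retracing this, hbck is run from the OS shell
against the live cluster; -fix is the repair mode I was using:

  $ ./bin/hbase hbck          # report inconsistencies
  $ ./bin/hbase hbck -fix     # attempt to repair what it finds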


>
>
>> It was clear that these tables weren't coming back so in a last ditch
>> effort I stopped the HBase cluster, the SNN and the NN and I restored
>> HDFS from the checkpoint taken about an hour before.
>
> Checkpoint?  A distcp or something?

I restored to an SNN checkpoint, which now that I think of it makes no
sense. That just restores the HDFS namespace (the file metadata) to the
last checkpoint, not the file contents.

>
>> Now everything
>> was out of whack and HBase wouldn't even come up and -ROOT- couldn't
>> be located, .log/ files weren't being read properly and things were a
>> mess.
>>
>
> Hmm.  You think you didn't get the edits that were up in RS memory?
> You didn't flush all regions before checkpointing?

No, I didn't flush the regions. I think the cluster couldn't start
because it wasn't able to read files in the .log/ directory in HDFS.
There were errors in the logs about trying to split logs and not being
able to find some. In HDFS, all the files under the .log/ dir were
empty.
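
Lesson noted for next time: flush the tables from the HBase shell
before taking any HDFS-level snapshot, so the MemStore contents get
written out to HFiles first, e.g.:

  hbase> flush 'TableA'
  hbase> flush '.META.'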

>
>
>> One could make the argument that I was beating on HBase a bit and
>> maybe even trying to break things, but it didn't take a lot of effort
>> to get to a pretty dire state.
>>
>
> Not good.  If you can figure out a damaging sequence of steps, stick
> them in an issue and I'll try it over here.  Enabling LZO w/o native
> support messing stuff up is sort of a known issue, though we should
> handle it more gracefully for sure.

Unfortunately I can't recall the specifics and sequence of all the
things I was trying with enough confidence to make a clear JIRA. It
was some combination of disabling/enabling/deleting a table, hbck -fix
and restarting the cluster that did it.

>
> St.Ack
>
