Posted to user@hbase.apache.org by mike anderson <sa...@gmail.com> on 2009/06/18 18:10:57 UTC

table contents disappeared

I had about 30,000 rows in my table 'cached_parsedtext'.  This morning when
I checked, HBase appeared to be down (the master server web UI was not
responding and the shell crashed when I tried to count rows). I tried doing
a nice shutdown via bin/stop-hbase, but this hung for about 20 minutes, so I
gave up and did a kill -9 on the HBase processes (what else was I supposed
to do!?). Upon restarting I discovered that all of the rows were gone. I
browsed the filesystem and saw that some of the metadata still existed in
hadoop dfs. Is there a way to rebuild the table? (After the force kill I
also did a nice restart of HBase and Hadoop -- same result.)
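
(By "browsed the filesystem" I mean listing HBase's root directory in HDFS,
roughly like this -- the path assumes the default /hbase root, which may
differ on other setups:)

    # show what is left of the table's region dirs and store files in HDFS
    bin/hadoop dfs -lsr /hbase/cached_parsedtext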

A few of the relevant-looking log entries are included below for those who
speak the language; they don't really mean much to me.

logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,038 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: cached_parsedtext,,1244838542607: safeMode=false from 10.0.16.91:60020
logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,038 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: cached_parsedtext,,1244838542607 open on 10.0.16.91:60020
logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,039 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: updating row cached_parsedtext,,1244838542607 in region .META.,,1 with startcode 1245337882941 and server 10.0.16.91:60020
logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:31:31,595 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region cached_parsedtext,,1244838542607 to the only server 10.0.16.91:60020
logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:31:34,823 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: cached_parsedtext,,1244838542607: safeMode=false from 10.0.16.91:60020




Ideally I'd love to get my table back, but if not, learning how to avoid
this in the future would be great.


Thanks in advance,
Mike

Re: table contents disappeared

Posted by Yabo-Arber Xu <ar...@gmail.com>.
The same situation happened to me a couple of weeks ago. Our cluster is also
10 nodes, and I haven't been able to find a solution so far.

Best,
Arber

Re: table contents disappeared

Posted by Ryan Rawson <ry...@gmail.com>.
To reiterate, this is strictly a bug in HDFS: our write-ahead logs have the
data, but HDFS isn't giving it up.

HADOOP-4379 promises a fix for this issue, though it makes recovery slow.

Hadoop 0.21 promises the whizbang solution, but that's still a couple of
months out.

-ryan

Re: table contents disappeared

Posted by stack <st...@duboce.net>.
We now note in the filesystem the vitals that could be used by a script
reconstructing .META. (see the .regioninfo file under each region dir in
the filesystem).  We've not yet written such a reconstruction script
(waiting on someone who needs it badly enough, I suppose).
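
In the meantime, a rough manual sketch of what such a script would do --
commands are illustrative and assume the default /hbase root dir in HDFS:

    # list the surviving region dirs for the table; each should hold a .regioninfo
    bin/hadoop dfs -lsr /hbase/cached_parsedtext | grep '\.regioninfo$'

    # the file carries the region's vitals (table, start/end keys, region id)
    # in serialized form, which is what a reconstruction script would parse
    # and re-insert as a row in .META.; REGION_DIR stands in for one of the
    # directories the first command prints
    bin/hadoop dfs -cat /hbase/cached_parsedtext/REGION_DIR/.regioninfo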

In 0.20.0, still no flush in hadoop.  There may be a workaround.  Will let
the list know if it proves viable (HADOOP-4379).  The HDFS team has committed
to a working flush in Hadoop 0.21.

St.Ack

Re: table contents disappeared

Posted by mike anderson <sa...@gmail.com>.
0.19.3, HDFS, 10 nodes, fully distributed.

Is there a way to rebuild what was lost (even partially)? Will this problem
be fixed in 0.20?


Re: table contents disappeared

Posted by stack <st...@duboce.net>.
You are on what version of HBase?

My guess is it's 0.19.x?

How many nodes?  Are you using HDFS or the local fs?

The log you posted doesn't show any issues.

So, as to what happened: I speculate that you loaded up your table and then
hit some issue -- did you up your file descriptors, xceivers, etc.? -- that
caused the hang, and that the uploads, in particular the edits recording the
creation of your table and the addition of its regions, had not been
persisted. The hung HBase plus your kill -9 -- there is nothing else you can
do when it won't respond, though you could try ./bin/hbase-daemon.sh stop
regionserver on each of your regionservers to try to bring them down nicely
-- meant the catalog table edits were lost, so it appears your table is lost
(HDFS does not have a working flush/sync/append in Hadoop 0.19.x, so HBase
can lose data).
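
For concreteness, the kind of checks and the gentler shutdown I mean --
commands are illustrative, and the limits are the usual recommendations
rather than values verified for this cluster:

    # per-node checks: a low open-file limit or datanode xceiver ceiling is
    # the usual suspect behind hangs under load
    ulimit -n                                    # raise well above the 1024 default
    grep -A1 'max.xcievers' conf/hadoop-site.xml # dfs.datanode.max.xcievers (Hadoop's spelling) should be bumped too

    # a gentler last resort than kill -9: run this on each regionserver host
    ./bin/hbase-daemon.sh stop regionserver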

In the head of the 0.19 branch we've done work to narrow the window in which
edits can be lost (.META. flushes every few k or so).  I need to put up a
0.19.4 release candidate (I'm held up tracing a new issue here on our home
cluster).

St.Ack




