Posted to user@hbase.apache.org by Jean-Marc Spaggiari <je...@spaggiari.org> on 2013/06/02 03:31:10 UTC

Multiple different failures

Hi,

Today I faced a power outage. 4 computers stayed up: the 3 ZK servers,
the Master, the NN and 2 DN/RS. They were on a UPS.

While everything was coming back up... Guess what... I faced a 2nd one!

After bringing HBase up, about 97% of my data was missing.  (19M rows
in my main table)

I ran HBCK which found many issues and fixed, I think, all of them.
(1013M rows in my main table now).

I have not been able to identify why I lost all of that, but I noticed
2 small things about HBCK.

1) I had about 900 un-assigned regions in a table. Here is a log example:

ERROR: Region { meta =>
work_proposed,\xC9\x1F\x1F\x0F\x00\x00\x00\x00http://www.lawyerlocate.ca/lawyers/city_subs.php?province=5&city=956&category=2&subcategory=202,1366811662932.fdf1d3bf27c7c8bae77711b85473bb2d.,
hdfs => hdfs://node3:9000/hbase/work_proposed/fdf1d3bf27c7c8bae77711b85473bb2d,
deployed =>  } not deployed on any region server.
Trying to fix unassigned region...
13/06/01 17:37:11 INFO util.HBaseFsckRepair: Region still in
transition, waiting for it to become assigned: {NAME =>
'work_proposed,\xC9\x1F\x1F\x0F\x00\x00\x00\x00http://www.lawyerlocate.ca/lawyers/city_subs.php?province=5&city=956&category=2&subcategory=202,1366811662932.fdf1d3bf27c7c8bae77711b85473bb2d.',
STARTKEY => '\xC9\x1F\x1F\x0F\x00\x00\x00\x00http://www.lawyerlocate.ca/lawyers/city_subs.php?province=5&city=956&category=2&subcategory=202',
ENDKEY => '\xC9\x86\x19\x8E\x00\x00\x00\x00http://home.yorkbbs.ca/MemberPostsList.aspx?spaceid=576287',
ENCODED => fdf1d3bf27c7c8bae77711b85473bb2d,}

So regions got re-assigned one by one... It was SOOOOO long... Shouldn't
HBCK try to re-assign all those regions in parallel, or at least with as
many threads as we have region servers? Today it waits for the
current region to be fully assigned and open before continuing, which
takes a while.
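
A minimal sketch of the idea in point 1, assuming one worker thread per
region server. This is not HBCK's actual code: `RegionAssigner`,
`assignAndWait` and `assignAll` are hypothetical names that stand in for
"assign this region and wait until it is open".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelAssignSketch {

    /** Hypothetical stand-in for "assign this region and wait until it is open". */
    interface RegionAssigner {
        void assignAndWait(String encodedRegionName) throws Exception;
    }

    /**
     * Assigns all regions using a fixed pool with one worker per region server.
     * Returns the number of regions whose assignment completed without error.
     */
    static int assignAll(List<String> regions, int regionServerCount, RegionAssigner assigner)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regionServerCount));
        List<Future<?>> pending = new ArrayList<>();
        for (String region : regions) {
            // The lambda returns a value, so it is submitted as a Callable
            // and may throw checked exceptions.
            pending.add(pool.submit(() -> {
                assigner.assignAndWait(region);
                return null;
            }));
        }
        pool.shutdown(); // no new tasks; already submitted ones keep running

        int succeeded = 0;
        for (Future<?> f : pending) {
            try {
                f.get(); // blocks until this region's assignment has finished
                succeeded++;
            } catch (ExecutionException e) {
                System.err.println("Assignment failed: " + e.getCause());
            }
        }
        return succeeded;
    }

    public static void main(String[] args) throws Exception {
        List<String> regions = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            regions.add("region-" + i); // made-up region names
        }
        // Simulated assignment: each region takes 5 ms to "open".
        RegionAssigner simulated = name -> Thread.sleep(5);
        int ok = assignAll(regions, 2, simulated);
        System.out.println(ok + " of " + regions.size() + " regions assigned");
    }
}
```

With a pool like this, the total wait is roughly the sequential time
divided by the number of workers, as long as the Master can handle the
concurrent assignments.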



2) It might be good for HBCK to display the date/time on all lines. That
helps to estimate the remaining time. Hole detection does not display
it, and neither do some other fixes.
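
A small sketch of what point 2 could look like, assuming the same
date/time layout as the timestamped INFO line in the log above. The class
and method names (`TimestampedReport`, `stamp`, `estimateRemaining`) are
made up for illustration and are not part of HBCK.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TimestampedReport {

    // Same layout as the INFO line in the log above, e.g. "13/06/01 17:37:11".
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss");

    /** Prefixes a report line with the given date/time. */
    static String stamp(LocalDateTime when, String message) {
        return STAMP.format(when) + " " + message;
    }

    /** Linear estimate of the remaining time from the items processed so far. */
    static Duration estimateRemaining(Duration elapsed, int done, int total) {
        if (done <= 0) {
            throw new IllegalArgumentException("need at least one completed item");
        }
        long perItemMillis = elapsed.toMillis() / done;
        return Duration.ofMillis(perItemMillis * (total - done));
    }

    public static void main(String[] args) {
        LocalDateTime when = LocalDateTime.of(2013, 6, 1, 17, 37, 11);
        System.out.println(stamp(when, "Trying to fix unassigned region..."));
        // 100 of 900 regions done after 10 minutes: about 80 minutes to go.
        System.out.println(estimateRemaining(Duration.ofMinutes(10), 100, 900).toMinutes());
    }
}
```

The estimate is only as good as the assumption that every region takes
about the same time to fix.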

The 2nd point is easy to fix, but the first one might be a bit more
tricky. What do you think about it?



JM

Re: Multiple different failures

Posted by ramkrishna vasudevan <ra...@gmail.com>.
> So regions got re-assigned one by one... It was SOOOOO long... Shouldn't
> HBCK try to re-assign all those regions in parallel, or at least with as
> many threads as we have region servers?

This point can be looked into. We also need to check the code first to see
how it works now.

Regards
Ram


On Sun, Jun 2, 2013 at 7:26 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org
> wrote:

> Hi Varun,
>
> The data was no longer visible in HBase because entries were missing in
> META. I had only 100 regions in my table, instead of the expected
> 1000. So it "disappeared"... But the data was still there in HDFS. It's
> very hard to really, definitively lose data with HBase/Hadoop. So HBCK
> was able to recover this data from HDFS and restore the missing
> parts into HBase. In the end, everything is back. My HBase is still
> not running because of the distributed log split, but I will
> look at it. The main thing is that I'm back to 1B rows in my table...
>
> JM

Re: Multiple different failures

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Varun,

The data was no longer visible in HBase because entries were missing in
META. I had only 100 regions in my table, instead of the expected
1000. So it "disappeared"... But the data was still there in HDFS. It's
very hard to really, definitively lose data with HBase/Hadoop. So HBCK
was able to recover this data from HDFS and restore the missing
parts into HBase. In the end, everything is back. My HBase is still
not running because of the distributed log split, but I will
look at it. The main thing is that I'm back to 1B rows in my table...
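
To illustrate the situation described above (region data present in HDFS
but no entry in META), here is a sketch of the comparison as a plain set
difference. It is not how HBCK is implemented: `MetaHdfsDiff` and
`missingFromMeta` are hypothetical names, and apart from the encoded
region name taken from the log, the values are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class MetaHdfsDiff {

    /** Returns the regions that have a directory in HDFS but no entry in META. */
    static Set<String> missingFromMeta(Set<String> regionDirsInHdfs, Set<String> regionsInMeta) {
        Set<String> missing = new TreeSet<>(regionDirsInHdfs); // sorted copy
        missing.removeAll(regionsInMeta);
        return missing;
    }

    public static void main(String[] args) {
        // Encoded region names as they would appear under the table directory in HDFS.
        Set<String> inHdfs = new HashSet<>(Arrays.asList(
                "fdf1d3bf27c7c8bae77711b85473bb2d", // from the log above
                "hypothetical-region-1",
                "hypothetical-region-2"));
        // Entries that survived in META (made up for this example).
        Set<String> inMeta = new HashSet<>(Arrays.asList("hypothetical-region-1"));

        for (String region : missingFromMeta(inHdfs, inMeta)) {
            System.out.println("In HDFS but not in META: " + region);
        }
    }
}
```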

JM

2013/6/1 Varun Sharma <va...@pinterest.com>:
> Are you saying 97% of the data was lost, or was it offline until the region
> servers came back up?
>
> Varun

Re: Multiple different failures

Posted by Varun Sharma <va...@pinterest.com>.
Are you saying 97% of the data was lost, or was it offline until the region
servers came back up?

Varun

