Posted to user@hbase.apache.org by Bluemetrix Development <bm...@gmail.com> on 2010/03/03 21:41:33 UTC

Re: hbase shell count crashes

For completeness' sake, I'll update here.
The issues with the shell count and rowcounter crashing were fixed by upping:
- open files to 32K (ulimit -n)
- dfs.datanode.max.xcievers to 2048
(I had overlooked this when moving to a larger cluster)
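
For reference, a sketch of where the second setting lives, assuming a stock Hadoop layout (file path and restart step may differ on your distribution). The open-files limit is raised separately for the user running the daemons (e.g. `ulimit -n 32768`, persisted in /etc/security/limits.conf):

```xml
<!-- conf/hdfs-site.xml on every datanode; restart the datanodes afterwards.
     Note: the historical misspelling "xcievers" is the actual property name. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2048</value>
</property>
```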

As for recovering from crashes, I haven't had much luck.
I'm only running a 3-server cluster, so that may be an issue,
but when one server goes down, it doesn't seem to be easy
to recover the HBase table data after getting everything restarted again.
I've usually had to wipe hdfs and start from scratch.
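
The recovery sequence Michael suggests below (fsck, stop HBase, move /hbase aside, copy it back, restart) can be sketched as a script. This is a dry run by default - it only prints each command - since the script names and paths are assumptions about a typical 0.20-era install; review it against your own cluster and set DRY_RUN=0 to actually execute.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps quoted below (hypothetical paths/scripts).
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"                     # always show the step
    [ "$DRY_RUN" = "1" ] || "$@"    # execute only when dry-run is off
}

run hadoop fsck /                       # 1. check HDFS health first
run stop-hbase.sh                       # 2. shut HBase down cleanly
run hadoop fs -mv /hbase /hbase.old     # 3. move the HBase root dir aside
run start-hbase.sh                      # 4. optional sanity-check restart
run stop-hbase.sh                       #    (stop again before copying back)
run hadoop fs -cp /hbase.old /hbase     # 5. copy the data back
run start-hbase.sh                      # 6. restart on the copied files
```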

On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
<bm...@gmail.com> wrote:
> Hi, Thanks for the suggestions. I'll make note of this.
> (I've decided to reinsert, as with time constraints it is probably
> quicker than trying to debug and recover.)
> So, I guess I am more concerned about trying to prevent this from
> happening again.
> Is it possible that a shell count caused enough load to crash hbase?
> Or that nodes becoming unavailable due to heavy network load could
> cause data corruption?
>
> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
> <mi...@hotmail.com> wrote:
>>
>> Try this...
>>
>> 1 run hadoop fsck /
>> 2 shut down hbase
>> 3 mv /hbase to /hbase.old
>> 4 restart /hbase (optional just for a sanity check)
>> 5 copy /hbase.old back to /hbase
>> 6 restart
>>
>> This may not help, but it can't hurt.
>> Depending on the size of your hbase database, it could take a while. On our sandbox, we suffer from zookeeper and hbase failures when there's a heavy load on the network. (Don't ask, the sandbox was just a play area on whatever hardware we could find.) Doing this copy cleaned up a database that wouldn't fully come up. May do the same for you.
>>
>> HTH
>>
>> -Mike
>>
>>
>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>> Subject: Re: hbase shell count crashes
>>> From: bmdevelopment@gmail.com
>>> To: hbase-user@hadoop.apache.org
>>>
>>> Hi,
>>> So after a few more attempts and crashes from trying the shell count,
>>> I ran the MR rowcounter and noticed that the number of rows was less
>>> than it should have been - even on smaller test tables.
>>> This led me to start looking through the logs and perform a few
>>> compacts on META and restarts of hbase. Unfortunately, now two tables
>>> are entirely missing - no longer show up under the shell list command.
>>>
>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>> lot of this in the master log.
>>>
>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>> info:regioninfo is empty for row:
>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>> info:serverstartcode
>>>
>>> Came across this in the regionserver log:
>>> 2010-02-16 23:58:33,851 WARN
>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>> because its empty. HBASE-646 DATA LOSS?
>>>
>>> Any ideas if the tables are recoverable? It's not a big deal for me to
>>> re-insert from scratch as this is still in testing phase,
>>> but I would be curious to find out what led to these issues, in order
>>> to possibly fix them or at least not repeat them.
>>>
>>> Thanks
>>>
>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>> <bm...@gmail.com> wrote:
>>> > Hi, Thanks for the explanation.
>>> >
>>> > Yes, I was able to cat the file from all three of my region servers:
>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>>> >
>>> > I have never come across this before, but this is the first time I've
>>> > had 7M rows in the db.
>>> > Is there anything going on that would bog down the network and cause
>>> > this file to be unreachable?
>>> >
>>> > I have 3 servers. The master is running the jobtracker, namenode and hmaster.
>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>> >
>>> > Appreciate the help.
>>> >
>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> >> This line
>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>> >> blk_-6288142015045035704_88516
>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>> >>
>>> >> means that the region server wasn't able to fetch a block for the .META.
>>> >> table (the table where all region addresses are). Are you able to open that
>>> >> file using the bin/hadoop command line utility?
>>> >>
>>> >> J-D
>>> >>
>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>>> >> bmdevelopment@gmail.com> wrote:
>>> >>
>>> >>> Hi,
>>> >>> I'm currently trying to run a count in hbase shell and it crashes
>>> >>> right towards the end.
>>> >>> This in turn seems to crash HBase, or at least causes the regionservers
>>> >>> to become unavailable.
>>> >>>
>>> >>> Here's the tail end of the count output:
>>> >>> http://pastebin.com/m465346d0
>>> >>>
>>> >>> I'm on version 0.20.2 and running this command:
>>> >>> > count 'table', 1000000
>>> >>>
>>> >>> Anyone with similar issues or ideas on this?
>>> >>> Please let me know if you need further info.
>>> >>> Thanks
>>> >>>
>>> >>
>>> >
>>
>

Re: hbase shell count crashes

Posted by Bluemetrix Development <bm...@gmail.com>.
Thanks. I'll take a look at that in depth as soon as I have a chance.

Seriously though, brilliant work, and thanks to all involved - it's progressed
a great deal even in the last 9 months I've been following and
using the product.
Really enjoying it.

On Wed, Mar 3, 2010 at 5:58 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> Mmm then you might be hitting http://issues.apache.org/jira/browse/HBASE-2244
>
> As you can see we are working hard to stabilize HBase as much as possible ;)
>
> J-D

Re: hbase shell count crashes

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Mmm then you might be hitting http://issues.apache.org/jira/browse/HBASE-2244

As you can see we are working hard to stabilize HBase as much as possible ;)

J-D

On Wed, Mar 3, 2010 at 2:56 PM, Bluemetrix Development
<bm...@gmail.com> wrote:
> Yes, upgrading to 0.20.3 should be added to my list above. I have
> since done this.
> Thanks very much for that.

Re: hbase shell count crashes

Posted by Bluemetrix Development <bm...@gmail.com>.
Yes, upgrading to 0.20.3 should be added to my list above. I have
since done this.
Thanks very much for that.

On Wed, Mar 3, 2010 at 4:44 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> There were a lot of problems with Hadoop pre-0.20.2 for clusters
> smaller than 10 nodes, especially 3-node clusters handling a node failure.
> If you are talking about just the region servers: you are on 0.20.2,
> and 0.20.3 has stability fixes.
>
> J-D

Re: hbase shell count crashes

Posted by Jean-Daniel Cryans <jd...@apache.org>.
There were a lot of problems with Hadoop pre-0.20.2 for clusters
smaller than 10 nodes, especially 3-node clusters handling a node failure.
If you are talking about just the region servers: you are on 0.20.2,
and 0.20.3 has stability fixes.

J-D

On Wed, Mar 3, 2010 at 12:41 PM, Bluemetrix Development
<bm...@gmail.com> wrote:
> For completeness' sake, I'll update here.
> The issues with the shell count and rowcounter crashing were fixed by upping:
> - open files to 32K (ulimit -n)
> - dfs.datanode.max.xcievers to 2048
> (I had overlooked this when moving to a larger cluster)
>
> As for recovering from crashes, I haven't had much luck.
> I'm only running a 3-server cluster, so that may be an issue,
> but when one server goes down, it doesn't seem to be easy
> to recover the HBase table data after getting everything restarted again.
> I've usually had to wipe hdfs and start from scratch.
>
> On Wed, Feb 17, 2010 at 12:59 PM, Bluemetrix Development
> <bm...@gmail.com> wrote:
>> Hi, Thanks for the suggestions. I'll make note of this.
>> (I've decided to reinsert, as with time constraints it is probably
>> quicker than trying to debug and recover.)
>> So, I guess I am more concerned about trying to prevent this from
>> happening again.
>> Is it possible that a shell count caused enough load to crash hbase?
>> Or that nodes becoming unavailable due to heavy network load could
>> cause data corruption?
>>
>> On Wed, Feb 17, 2010 at 12:42 PM, Michael Segel
>> <mi...@hotmail.com> wrote:
>>>
>>> Try this...
>>>
>>> 1 run hadoop fsck /
>>> 2 shut down hbase
>>> 3 mv /hbase to /hbase.old
>>> 4 restart /hbase (optional just for a sanity check)
>>> 5 copy /hbase.old back to /hbase
>>> 6 restart
>>>
>>> This may not help, but it can't hurt.
>>> Depending on the size of your hbase database, it could take a while. On our sandbox, we suffer from zookeeper and hbase failures when there's a heavy load on the network. (Don't ask, the sandbox was just a play area on whatever hardware we could find.) Doing this copy cleaned up a database that wouldn't fully come up. May do the same for you.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>>> Date: Wed, 17 Feb 2010 10:50:59 -0500
>>>> Subject: Re: hbase shell count crashes
>>>> From: bmdevelopment@gmail.com
>>>> To: hbase-user@hadoop.apache.org
>>>>
>>>> Hi,
>>>> So after a few more attempts and crashes from trying the shell count,
>>>> I ran the MR rowcounter and noticed that the number of rows were less
>>>> than they should have been - even on smaller test tables.
>>>> This led me to start looking through the logs and perform a few
>>>> compacts on META and restarts of hbase. Unfortunately, now two tables
>>>> are entirely missing - no longer show up under the shell list command.
>>>>
>>>> I'm not entirely sure what to look for in the logs, but I've noticed a
>>>> lot of this in the master log.
>>>>
>>>> 2010-02-16 23:59:25,856 WARN org.apache.hadoop.hbase.master.HMaster:
>>>> info:regioninfo is empty for row:
>>>> UserData_0209,e834d76faddee14b,1266316478685; has keys: info:server,
>>>> info:serverstartcode
>>>>
>>>> Came across this in the regionserver log:
>>>> 2010-02-16 23:58:33,851 WARN
>>>> org.apache.hadoop.hbase.regionserver.Store: Skipping
>>>> hdfs://upp1.bmeu.com:50001/hbase/.META./1028785192/info/4080287239754005013
>>>> because its empty. HBASE-646 DATA LOSS?
>>>>
>>>> Any ideas if the tables are recoverable? Its not a big deal for me to
>>>> re-insert from scratch as this is still in testing phase,
>>>> but would be curious to find out what has led to these issues in order
>>>> to possibly fix or at least not repeat.
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Feb 16, 2010 at 2:43 PM, Bluemetrix Development
>>>> <bm...@gmail.com> wrote:
>>>> > Hi, Thanks for the explanation.
>>>> >
>>>> > Yes, I was able to cat the file from all three of my region servers:
>>>> > hadoop fs -cat /hbase/.META./1028785192/info/8254845156484129698 > tmp.out
>>>> >
>>>> > I have never came across this before, but this is the first time I've
>>>> > had 7M rows in the db.
>>>> > Is there anything going on that would bog down the network and cause
>>>> > this file to be unreachable?
>>>> >
>>>> > I have 3 servers. The master is running the jobtracker, namenode and hmaster.
>>>> > And all 3 are running datanodes, regionservers and zookeeper.
>>>> >
>>>> > Appreciate the help.
>>>> >
>>>> > On Tue, Feb 16, 2010 at 2:11 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>>> >> This line
>>>> >> java.io.IOException: java.io.IOException: Could not obtain block:
>>>> >> blk_-6288142015045035704_88516
>>>> >> file=/hbase/.META./1028785192/info/8254845156484129698
>>>> >>
>>>> >> Means that the region server wasn't able to fetch a block for the .META.
>>>> >> table (the table where all region addresses are). Are you able to open that
>>>> >> file using the bin/hadoop command line utility?
>>>> >>
>>>> >> J-D
>>>> >>
>>>> >> On Tue, Feb 16, 2010 at 11:08 AM, Bluemetrix Development <
>>>> >> bmdevelopment@gmail.com> wrote:
>>>> >>
>>>> >>> Hi,
>>>> >>> I'm currently trying to run a count in hbase shell and it crashes
>>>> >>> right towards the end.
>>>> >>> This is turn seems to crash hbase or at least causes the regionservers
>>>> >>> to become unavailable.
>>>> >>>
>>>> >>> Here's the tail end of the count output:
>>>> >>> http://pastebin.com/m465346d0
>>>> >>>
>>>> >>> I'm on version 0.20.2 and running this command:
>>>> >>> > count 'table', 1000000
>>>> >>>
>>>> >>> Anyone with similar issues or ideas on this?
>>>> >>> Please let me know if you need further info.
>>>> >>> Thanks
>>>> >>>
>>>> >>
>>>> >
>>>
>>> _________________________________________________________________
>>> Hotmail: Trusted email with powerful SPAM protection.
>>> http://clk.atdmt.com/GBL/go/201469227/direct/01/
>>
>