Posted to user@hbase.apache.org by Jean-Adrien <ad...@jeanjean.ch> on 2008/10/17 10:01:40 UTC

Regionserver fails to serve region

Hello again.
This is my last message for today

I often get an exception in my HBase client. A regionserver fails to serve
a region when the client gets a row from the HBase cluster.

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server 192.168.1.15:60020 for region
table-0.3,:testrow79063200,1223872616091, row ':testrow22102600', but failed
after 10 attempts.

The attempts above can be:
1:
java.io.IOException: java.io.IOException: Premeture EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
2-10:
java.io.IOException: java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)

After that, every time the client tries to reach the same region, all 10
attempts are
java.io.IOException: java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)

In other words, once a region is in this state, every further set of 10
attempts fails with the same NPE.
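For reference, the 10 attempts come from the client retry settings. A minimal
sketch of the relevant hbase-site.xml entries (property names as in
hbase-default.xml, values only as examples):

  <property>
    <name>hbase.client.retries.number</name>
    <value>10</value>    <!-- attempts before RetriesExhaustedException -->
  </property>
  <property>
    <name>hbase.client.pause</name>
    <value>10000</value> <!-- milliseconds to wait between attempts -->
  </property>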

Another 10-attempt scenario I have seen:
1-10:
IPC Server handler 3 on 60020, call getRow([B@1ec7483, [B@d54a92, null,
1224105427910, -1) from 192.168.1.11:55371: error: java.io.IOException:
Cannot open filename
/hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
java.io.IOException: Cannot open filename
/hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
        at
org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1171)

This is preceded, in the affected regionserver's log, by the line:

2008-10-15 23:19:30,461 INFO org.apache.hadoop.dfs.DFSClient: Could not
obtain block blk_-3759213227484579481_226277 from any node: 
java.io.IOException: No live nodes contain current block

If I look for this block in the Hadoop master log, I find

2008-10-15 23:03:45,276 INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask
192.168.1.13:50010 to delete  [...] blk_-3759213227484579481_226277 [...]
(many more blocks)

about 16 minutes earlier.
In both cases the regionserver fails to serve the affected region until I
restart HBase (not Hadoop).
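To double-check whether the blocks of that region's files are really gone, a
run of fsck scoped to the region directory from the log above should show it,
e.g.:

  bin/hadoop fsck /hbase/table-0.3/1739432898 -files -blocks -locations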

I have no way of knowing whether such a failure is temporary (and for how
long) or whether I really need to restart. But I noticed that the failure
does not recover within the next 3-4 hours.

One last question, by the way:
Why is the replication factor of my HBase files in DFS 3, when my Hadoop
cluster is configured to keep only 2 copies?
Is it because the default (hadoop-default.xml) config file of the Hadoop
client embedded in the HBase distribution overrides the cluster
configuration for the mapfiles it creates?
Is that a good configuration scheme, or is it preferable to let the HBase
Hadoop client load the hadoop-site.xml file I have set up for the running
Hadoop server instance, by adding the Hadoop conf directory to the HBase
classpath, and therefore have the same configuration on the client as on
the server?
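For illustration, the two options I have in mind would look roughly like this
(assuming the usual property and variable names):

  <!-- Option 1: set the replication factor in the config HBase's DFS client
       sees (hbase-site.xml, or a hadoop-site.xml on HBase's classpath) -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

  # Option 2: point HBase at the running cluster's configuration,
  # e.g. in conf/hbase-env.sh:
  export HBASE_CLASSPATH=/path/to/hadoop/conf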

Have a nice day.
Thank you for your advice.

-- Jean-Adrien

Cluster setup:
4 regionservers / datanodes
1 of them is also the master / namenode.
java-6-sun
Total size of hdfs: 81.98 GB (replication factor 3)
fsck -> healthy
hadoop: 0.18.1
hbase: 0.18.0 (jar of hadoop replaced with 0.18.1)
1 GB RAM per node






Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi.
I'll enable DEBUG and run the cluster this weekend; I need it stable until
the weekend.
But, per Murphy's Law, it will not happen with DEBUG on :-).

BTW, there was no reason to run the balancer at the beginning; last time the
balancer was only needed after 30 hours of work.
After 30 hours of work the log files are larger than they should be :-)

I'll run it and let you know after the weekend.

Thank you for your assistance and patience.


On Tue, Nov 11, 2008 at 6:47 PM, Michael Stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> Hi.DEBUG wasn't enabled , because it decrease the performance and increase
>> log size.
>>
>>
> Sure. But maybe leave it on while we're trying to figure issues.
>
>  Regarding the ulimit - yes it's upped for 32K.
>>
>>
> Good.
>
>  You remember correct - during massive load i run the balancer and from
>> this
>> time everything is started to behave strange.
>>  Currently , i can't tell you the the regions that are in the table - i
>> re-formatted hdfs ( this was the only way i can get my cluster back to
>> work).
>>
>>
> Sure (If DEBUG was on, it records in log how many -- just FYI).
>
>  I have 7 datatnodes , 6 of them are running region server and one is
>> Hmaster.
>>
>>
> Are things running for you now?  Have you tried another upload without the
> balancer?
> St.Ack
>

Re: Regionserver fails to serve region

Posted by Michael Stack <st...@duboce.net>.
Slava Gorelik wrote:
> Hi.DEBUG wasn't enabled , because it decrease the performance and increase
> log size.
>   
Sure. But maybe leave it on while we're trying to figure issues.

> Regarding the ulimit - yes it's upped for 32K.
>   
Good.

> You remember correct - during massive load i run the balancer and from this
> time everything is started to behave strange.
>   
> Currently , i can't tell you the the regions that are in the table - i
> re-formatted hdfs ( this was the only way i can get my cluster back to
> work).
>   
Sure (If DEBUG was on, it records in log how many -- just FYI).

> I have 7 datatnodes , 6 of them are running region server and one is
> Hmaster.
>   
Are things running for you now?  Have you tried another upload without 
the balancer?
St.Ack

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. DEBUG wasn't enabled because it decreases performance and increases
log size.
Regarding the ulimit - yes, it's upped to 32K.
You remember correctly - during the massive load I ran the balancer, and from
that point everything started to behave strangely.

Currently, I can't tell you how many regions are in the table - I
re-formatted HDFS (this was the only way I could get my cluster back to
work).

I have 7 datanodes, 6 of them are running a region server and one is the
HMaster.

Best Regards.

On Tue, Nov 11, 2008 at 1:08 AM, stack <st...@duboce.net> wrote:

> I took a look.
>
> First, enable DEBUG.  See the hbase FAQ for how.
>
> Looking, I see that all was running fine till:
>
> 2008-11-03 14:10:08,261 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.X.X.Y:60020. Already tried 0 time(s).
>
> ...in the middle of an attempt at scanning the .META. region.
>
> Looking through regionserver logs, they are all fine till about that above
> time when I start to see variations on:
>
> 2008-11-03 14:08:46,440 INFO org.apache.hadoop.dfs.DFSClient: Could not
> obtain block blk_1223341017118968735_305051 from any node:
>  java.io.IOException: No live nodes contain current block
>
> ....and
>
> 2008-11-03 14:08:43,660 INFO org.apache.hadoop.dfs.DFSClient: Exception in
> createBlockOutputStream java.io.IOException: Bad connect ack with
> firstBadLink 10.X.X.Y:50010
> 2008-11-03 14:08:43,660 INFO org.apache.hadoop.dfs.DFSClient: Abandoning
> block blk_6726606309673852040_314096
>
> Your hdfs went bad for some reason around above time.  I don't see any
> obvious explanation for why it went bad.  You were running balancer at the
> time IIRC?
>
> Could you netstat your running datanodes and see how many concurrent
> connections you had running?  Was 1024 enough?  You had configured a max of
> 1024?  I don't see the ulimit print out in these logs so presume its > 1024.
>
> How many regions do you have in your table when it starts to go wonky?  You
> have 6 datanodes running beside your 6 regionservers?
>
> St.Ack
>
>
> Slava Gorelik wrote:
>
>> Hi Michael.
>> I'm sending logs, in 2 parts (2 messages)
>> Part 1
>>
>>
>> On Tue, Nov 4, 2008 at 11:44 PM, Slava Gorelik <slava.gorelik@gmail.com<mailto:
>> slava.gorelik@gmail.com>> wrote:
>>
>>    Thank You. Now it's clear.
>>
>>
>>    On Tue, Nov 4, 2008 at 11:31 PM, stack <stack@duboce.net
>>    <ma...@duboce.net>> wrote:
>>
>>        Slava Gorelik wrote:
>>
>>            One more regarding the blockCache, how changes in store
>>            files (as i
>>            understand those are MapFiles) are reflected on client
>>            side cache. If we are
>>            talking about more than one client that doing a changes ?
>>            If each client has
>>            different part of the MapFile ? or something else ?
>>
>>
>>        The block cache cache is over in the server. Its a cache for
>>        store files which never change once written.  Did I say
>>        client-side cache?  I should have been more clear.  The client
>>        in this case is the regionserver itself.   The cache is so the
>>        regionserver saves on its trips over the network visiting
>>        datanodes.
>>        St.Ack
>>
>>
>>
>>            Best Regards.
>>
>>            On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik
>>            <slava.gorelik@gmail.com
>>            <ma...@gmail.com>>wrote:
>>
>>
>>                I can try to reproduce it again, but before this i
>>                would like to send you a
>>                logs.
>>                Best Regards.
>>
>>
>>                On Tue, Nov 4, 2008 at 10:05 PM, stack
>>                <stack@duboce.net <ma...@duboce.net>> wrote:
>>
>>
>>                    Then we should try and figure if there is an issue
>>                    in the balancer, or
>>                    maybe there is something missing if we are not
>>                    doing a big upload in a
>>                    manner that balances the upload across HDFS?
>>                    St.Ack
>>
>>                    Slava Gorelik wrote:
>>
>>
>>                        Sure, i'll arrange logs tomorrow.About
>>                        balancer, to wait when the massive
>>                        work is finished is good in testing
>>                        environment but in production it's
>>                        not
>>                        relevant :-)
>>
>>                        Best Regards.
>>
>>                        On Tue, Nov 4, 2008 at 9:48 PM, stack
>>                        <stack@duboce.net <ma...@duboce.net>>
>>
>>                        wrote:
>>
>>
>>
>>
>>                            Slava Gorelik wrote:
>>
>>
>>
>>
>>                                Hi.Regarding the failure of new block
>>                                creation - i failed to run hbase
>>                                till
>>                                i reformatted HDFS again.
>>
>>
>>
>>
>>
>>                            I'd be interested in the logs.
>>
>>                             I just wandering if hadoop re balancing
>>                            is necessary? Will it balance
>>
>>
>>
>>                                itself
>>                                ? As i understand hadoop balancer is
>>                                moving data between data nodes,
>>                                but
>>                                in
>>                                my case this is during massive (8
>>                                clients just adding a records - about
>>                                400
>>                                requests for all region servers - 6).
>>                                So, is it good idea to run
>>                                balancer during heavy load ?
>>
>>
>>
>>
>>
>>                            I don't have sufficient experience running
>>                            the balancer.  Perhaps wait
>>                            till
>>                            upload is done, then run it?
>>
>>                            St.Ack
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi Michael. Did you have a chance to look at the logs?

Best Regards.

On Wed, Nov 5, 2008 at 1:24 PM, Slava Gorelik <sl...@gmail.com>wrote:

> Hi Michael.3'rd part (last) of the logs
>
>
> On Tue, Nov 4, 2008 at 11:31 PM, stack <st...@duboce.net> wrote:
>
>> Slava Gorelik wrote:
>>
>>> One more regarding the blockCache, how changes in store files (as i
>>> understand those are MapFiles) are reflected on client side cache. If we
>>> are
>>> talking about more than one client that doing a changes ? If each client
>>> has
>>> different part of the MapFile ? or something else ?
>>>
>>>
>>
>> The block cache cache is over in the server. Its a cache for store files
>> which never change once written.  Did I say client-side cache?  I should
>> have been more clear.  The client in this case is the regionserver itself.
>> The cache is so the regionserver saves on its trips over the network
>> visiting datanodes.
>> St.Ack
>>
>>
>>
>>  Best Regards.
>>>
>>> On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik <slava.gorelik@gmail.com
>>> >wrote:
>>>
>>>
>>>
>>>> I can try to reproduce it again, but before this i would like to send
>>>> you a
>>>> logs.
>>>> Best Regards.
>>>>
>>>>
>>>> On Tue, Nov 4, 2008 at 10:05 PM, stack <st...@duboce.net> wrote:
>>>>
>>>>
>>>>
>>>>> Then we should try and figure if there is an issue in the balancer, or
>>>>> maybe there is something missing if we are not doing a big upload in a
>>>>> manner that balances the upload across HDFS?
>>>>> St.Ack
>>>>>
>>>>> Slava Gorelik wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Sure, i'll arrange logs tomorrow.About balancer, to wait when the
>>>>>> massive
>>>>>> work is finished is good in testing environment but in production it's
>>>>>> not
>>>>>> relevant :-)
>>>>>>
>>>>>> Best Regards.
>>>>>>
>>>>>> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Slava Gorelik wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi.Regarding the failure of new block creation - i failed to run
>>>>>>>> hbase
>>>>>>>> till
>>>>>>>> i reformatted HDFS again.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I'd be interested in the logs.
>>>>>>>
>>>>>>>  I just wandering if hadoop re balancing is necessary? Will it
>>>>>>> balance
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> itself
>>>>>>>> ? As i understand hadoop balancer is moving data between data nodes,
>>>>>>>> but
>>>>>>>> in
>>>>>>>> my case this is during massive (8 clients just adding a records -
>>>>>>>> about
>>>>>>>> 400
>>>>>>>> requests for all region servers - 6). So, is it good idea to run
>>>>>>>> balancer during heavy load ?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I don't have sufficient experience running the balancer.  Perhaps
>>>>>>> wait
>>>>>>> till
>>>>>>> upload is done, then run it?
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi Michael. Here is the 3rd (last) part of the logs.

On Tue, Nov 4, 2008 at 11:31 PM, stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> One more regarding the blockCache, how changes in store files (as i
>> understand those are MapFiles) are reflected on client side cache. If we
>> are
>> talking about more than one client that doing a changes ? If each client
>> has
>> different part of the MapFile ? or something else ?
>>
>>
>
> The block cache cache is over in the server. Its a cache for store files
> which never change once written.  Did I say client-side cache?  I should
> have been more clear.  The client in this case is the regionserver itself.
> The cache is so the regionserver saves on its trips over the network
> visiting datanodes.
> St.Ack
>
>
>
>  Best Regards.
>>
>> On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik <slava.gorelik@gmail.com
>> >wrote:
>>
>>
>>
>>> I can try to reproduce it again, but before this i would like to send you
>>> a
>>> logs.
>>> Best Regards.
>>>
>>>
>>> On Tue, Nov 4, 2008 at 10:05 PM, stack <st...@duboce.net> wrote:
>>>
>>>
>>>
>>>> Then we should try and figure if there is an issue in the balancer, or
>>>> maybe there is something missing if we are not doing a big upload in a
>>>> manner that balances the upload across HDFS?
>>>> St.Ack
>>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>
>>>>
>>>>> Sure, i'll arrange logs tomorrow.About balancer, to wait when the
>>>>> massive
>>>>> work is finished is good in testing environment but in production it's
>>>>> not
>>>>> relevant :-)
>>>>>
>>>>> Best Regards.
>>>>>
>>>>> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Slava Gorelik wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi.Regarding the failure of new block creation - i failed to run
>>>>>>> hbase
>>>>>>> till
>>>>>>> i reformatted HDFS again.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I'd be interested in the logs.
>>>>>>
>>>>>>  I just wandering if hadoop re balancing is necessary? Will it balance
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> itself
>>>>>>> ? As i understand hadoop balancer is moving data between data nodes,
>>>>>>> but
>>>>>>> in
>>>>>>> my case this is during massive (8 clients just adding a records -
>>>>>>> about
>>>>>>> 400
>>>>>>> requests for all region servers - 6). So, is it good idea to run
>>>>>>> balancer during heavy load ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I don't have sufficient experience running the balancer.  Perhaps wait
>>>>>> till
>>>>>> upload is done, then run it?
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
I took a look.

First, enable DEBUG.  See the hbase FAQ for how.
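A minimal way to do that, assuming the stock log4j setup, is to add the
following to conf/log4j.properties on the master and regionservers and then
restart:

  log4j.logger.org.apache.hadoop.hbase=DEBUG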

Looking, I see that all was running fine till:

2008-11-03 14:10:08,261 INFO org.apache.hadoop.ipc.Client: Retrying 
connect to server: /10.X.X.Y:60020. Already tried 0 time(s).

...in the middle of an attempt at scanning the .META. region.

Looking through regionserver logs, they are all fine till about that 
above time when I start to see variations on:

2008-11-03 14:08:46,440 INFO org.apache.hadoop.dfs.DFSClient: Could not 
obtain block blk_1223341017118968735_305051 from any node:  
java.io.IOException: No live nodes contain current block

....and

2008-11-03 14:08:43,660 INFO org.apache.hadoop.dfs.DFSClient: Exception 
in createBlockOutputStream java.io.IOException: Bad connect ack with 
firstBadLink 10.X.X.Y:50010
2008-11-03 14:08:43,660 INFO org.apache.hadoop.dfs.DFSClient: Abandoning 
block blk_6726606309673852040_314096

Your hdfs went bad for some reason around the above time.  I don't see any 
obvious explanation for why it went bad.  You were running the balancer at 
the time, IIRC?

Could you netstat your running datanodes and see how many concurrent 
connections you had running?  Was 1024 enough?  You had configured a max 
of 1024?  I don't see the ulimit printout in these logs, so I presume it's > 1024.
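Something along these lines on each datanode would give a rough picture
(50010 being the default datanode port; adjust if yours differs):

  ulimit -n                        # open-file limit in the current shell (run as the daemon user)
  netstat -an | grep -c ':50010'   # approximate count of connections on the datanode port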

How many regions do you have in your table when it starts to go wonky?  
You have 6 datanodes running beside your 6 regionservers?

St.Ack


Slava Gorelik wrote:
> Hi Michael.
> I'm sending logs, in 2 parts (2 messages)
> Part 1
>
>
> On Tue, Nov 4, 2008 at 11:44 PM, Slava Gorelik 
> <slava.gorelik@gmail.com <ma...@gmail.com>> wrote:
>
>     Thank You. Now it's clear.
>
>
>     On Tue, Nov 4, 2008 at 11:31 PM, stack <stack@duboce.net
>     <ma...@duboce.net>> wrote:
>
>         Slava Gorelik wrote:
>
>             One more regarding the blockCache, how changes in store
>             files (as i
>             understand those are MapFiles) are reflected on client
>             side cache. If we are
>             talking about more than one client that doing a changes ?
>             If each client has
>             different part of the MapFile ? or something else ?
>              
>
>
>         The block cache cache is over in the server. Its a cache for
>         store files which never change once written.  Did I say
>         client-side cache?  I should have been more clear.  The client
>         in this case is the regionserver itself.   The cache is so the
>         regionserver saves on its trips over the network visiting
>         datanodes.
>         St.Ack
>
>
>
>             Best Regards.
>
>             On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik
>             <slava.gorelik@gmail.com
>             <ma...@gmail.com>>wrote:
>
>              
>
>                 I can try to reproduce it again, but before this i
>                 would like to send you a
>                 logs.
>                 Best Regards.
>
>
>                 On Tue, Nov 4, 2008 at 10:05 PM, stack
>                 <stack@duboce.net <ma...@duboce.net>> wrote:
>
>                    
>
>                     Then we should try and figure if there is an issue
>                     in the balancer, or
>                     maybe there is something missing if we are not
>                     doing a big upload in a
>                     manner that balances the upload across HDFS?
>                     St.Ack
>
>                     Slava Gorelik wrote:
>
>                          
>
>                         Sure, i'll arrange logs tomorrow.About
>                         balancer, to wait when the massive
>                         work is finished is good in testing
>                         environment but in production it's
>                         not
>                         relevant :-)
>
>                         Best Regards.
>
>                         On Tue, Nov 4, 2008 at 9:48 PM, stack
>                         <stack@duboce.net <ma...@duboce.net>>
>                         wrote:
>
>
>
>                                
>
>                             Slava Gorelik wrote:
>
>
>
>                                      
>
>                                 Hi.Regarding the failure of new block
>                                 creation - i failed to run hbase
>                                 till
>                                 i reformatted HDFS again.
>
>
>
>
>                                            
>
>                             I'd be interested in the logs.
>
>                              I just wandering if hadoop re balancing
>                             is necessary? Will it balance
>
>
>                                      
>
>                                 itself
>                                 ? As i understand hadoop balancer is
>                                 moving data between data nodes,
>                                 but
>                                 in
>                                 my case this is during massive (8
>                                 clients just adding a records - about
>                                 400
>                                 requests for all region servers - 6).
>                                 So, is it good idea to run
>                                 balancer during heavy load ?
>
>
>
>
>                                            
>
>                             I don't have sufficient experience running
>                             the balancer.  Perhaps wait
>                             till
>                             upload is done, then run it?
>
>                             St.Ack
>
>
>
>                                      
>
>
>                                
>
>                          
>
>
>              
>
>
>
>


Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi Michael. I'm sending the logs in 2 parts (2 messages).
Part 1


On Tue, Nov 4, 2008 at 11:44 PM, Slava Gorelik <sl...@gmail.com>wrote:

> Thank You. Now it's clear.
>
>
> On Tue, Nov 4, 2008 at 11:31 PM, stack <st...@duboce.net> wrote:
>
>> Slava Gorelik wrote:
>>
>>> One more regarding the blockCache, how changes in store files (as i
>>> understand those are MapFiles) are reflected on client side cache. If we
>>> are
>>> talking about more than one client that doing a changes ? If each client
>>> has
>>> different part of the MapFile ? or something else ?
>>>
>>>
>>
>> The block cache cache is over in the server. Its a cache for store files
>> which never change once written.  Did I say client-side cache?  I should
>> have been more clear.  The client in this case is the regionserver itself.
>> The cache is so the regionserver saves on its trips over the network
>> visiting datanodes.
>> St.Ack
>>
>>
>>
>>  Best Regards.
>>>
>>> On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik <slava.gorelik@gmail.com
>>> >wrote:
>>>
>>>
>>>
>>>> I can try to reproduce it again, but before this i would like to send
>>>> you a
>>>> logs.
>>>> Best Regards.
>>>>
>>>>
>>>> On Tue, Nov 4, 2008 at 10:05 PM, stack <st...@duboce.net> wrote:
>>>>
>>>>
>>>>
>>>>> Then we should try and figure if there is an issue in the balancer, or
>>>>> maybe there is something missing if we are not doing a big upload in a
>>>>> manner that balances the upload across HDFS?
>>>>> St.Ack
>>>>>
>>>>> Slava Gorelik wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Sure, i'll arrange logs tomorrow.About balancer, to wait when the
>>>>>> massive
>>>>>> work is finished is good in testing environment but in production it's
>>>>>> not
>>>>>> relevant :-)
>>>>>>
>>>>>> Best Regards.
>>>>>>
>>>>>> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Slava Gorelik wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi.Regarding the failure of new block creation - i failed to run
>>>>>>>> hbase
>>>>>>>> till
>>>>>>>> i reformatted HDFS again.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I'd be interested in the logs.
>>>>>>>
>>>>>>>  I just wandering if hadoop re balancing is necessary? Will it
>>>>>>> balance
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> itself
>>>>>>>> ? As i understand hadoop balancer is moving data between data nodes,
>>>>>>>> but
>>>>>>>> in
>>>>>>>> my case this is during massive (8 clients just adding a records -
>>>>>>>> about
>>>>>>>> 400
>>>>>>>> requests for all region servers - 6). So, is it good idea to run
>>>>>>>> balancer during heavy load ?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I don't have sufficient experience running the balancer.  Perhaps
>>>>>>> wait
>>>>>>> till
>>>>>>> upload is done, then run it?
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Thank You. Now it's clear.

On Tue, Nov 4, 2008 at 11:31 PM, stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> One more regarding the blockCache, how changes in store files (as i
>> understand those are MapFiles) are reflected on client side cache. If we
>> are
>> talking about more than one client that doing a changes ? If each client
>> has
>> different part of the MapFile ? or something else ?
>>
>>
>
> The block cache cache is over in the server. Its a cache for store files
> which never change once written.  Did I say client-side cache?  I should
> have been more clear.  The client in this case is the regionserver itself.
> The cache is so the regionserver saves on its trips over the network
> visiting datanodes.
> St.Ack
>
>
>
>  Best Regards.
>>
>> On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik <slava.gorelik@gmail.com
>> >wrote:
>>
>>
>>
>>> I can try to reproduce it again, but before this i would like to send you
>>> a
>>> logs.
>>> Best Regards.
>>>
>>>
>>> On Tue, Nov 4, 2008 at 10:05 PM, stack <st...@duboce.net> wrote:
>>>
>>>
>>>
>>>> Then we should try and figure if there is an issue in the balancer, or
>>>> maybe there is something missing if we are not doing a big upload in a
>>>> manner that balances the upload across HDFS?
>>>> St.Ack
>>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>
>>>>
>>>>> Sure, i'll arrange logs tomorrow.About balancer, to wait when the
>>>>> massive
>>>>> work is finished is good in testing environment but in production it's
>>>>> not
>>>>> relevant :-)
>>>>>
>>>>> Best Regards.
>>>>>
>>>>> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Slava Gorelik wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi.Regarding the failure of new block creation - i failed to run
>>>>>>> hbase
>>>>>>> till
>>>>>>> i reformatted HDFS again.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I'd be interested in the logs.
>>>>>>
>>>>>>  I just wandering if hadoop re balancing is necessary? Will it balance
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> itself
>>>>>>> ? As i understand hadoop balancer is moving data between data nodes,
>>>>>>> but
>>>>>>> in
>>>>>>> my case this is during massive (8 clients just adding a records -
>>>>>>> about
>>>>>>> 400
>>>>>>> requests for all region servers - 6). So, is it good idea to run
>>>>>>> balancer during heavy load ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I don't have sufficient experience running the balancer.  Perhaps wait
>>>>>> till
>>>>>> upload is done, then run it?
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi Michael.
Second part of the logs (there will be 3 parts, due to the mailing list size limit).


On Tue, Nov 4, 2008 at 11:31 PM, stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> One more regarding the blockCache, how changes in store files (as i
>> understand those are MapFiles) are reflected on client side cache. If we
>> are
>> talking about more than one client that doing a changes ? If each client
>> has
>> different part of the MapFile ? or something else ?
>>
>>
>
> The block cache cache is over in the server. Its a cache for store files
> which never change once written.  Did I say client-side cache?  I should
> have been more clear.  The client in this case is the regionserver itself.
> The cache is so the regionserver saves on its trips over the network
> visiting datanodes.
> St.Ack
>
>
>
>  Best Regards.
>>
>> On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik <slava.gorelik@gmail.com
>> >wrote:
>>
>>
>>
>>> I can try to reproduce it again, but before this i would like to send you
>>> a
>>> logs.
>>> Best Regards.
>>>
>>>
>>> On Tue, Nov 4, 2008 at 10:05 PM, stack <st...@duboce.net> wrote:
>>>
>>>
>>>
>>>> Then we should try and figure if there is an issue in the balancer, or
>>>> maybe there is something missing if we are not doing a big upload in a
>>>> manner that balances the upload across HDFS?
>>>> St.Ack
>>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>
>>>>
>>>>> Sure, i'll arrange logs tomorrow.About balancer, to wait when the
>>>>> massive
>>>>> work is finished is good in testing environment but in production it's
>>>>> not
>>>>> relevant :-)
>>>>>
>>>>> Best Regards.
>>>>>
>>>>> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Slava Gorelik wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi.Regarding the failure of new block creation - i failed to run
>>>>>>> hbase
>>>>>>> till
>>>>>>> i reformatted HDFS again.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I'd be interested in the logs.
>>>>>>
>>>>>>  I just wandering if hadoop re balancing is necessary? Will it balance
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> itself
>>>>>>> ? As i understand hadoop balancer is moving data between data nodes,
>>>>>>> but
>>>>>>> in
>>>>>>> my case this is during massive (8 clients just adding a records -
>>>>>>> about
>>>>>>> 400
>>>>>>> requests for all region servers - 6). So, is it good idea to run
>>>>>>> balancer during heavy load ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I don't have sufficient experience running the balancer.  Perhaps wait
>>>>>> till
>>>>>> upload is done, then run it?
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Slava Gorelik wrote:
> One more regarding the blockCache, how changes in store files (as i
> understand those are MapFiles) are reflected on client side cache. If we are
> talking about more than one client that doing a changes ? If each client has
> different part of the MapFile ? or something else ?
>   

The block cache is over in the server. It's a cache for store files 
which never change once written.  Did I say client-side cache?  I should 
have been clearer.  The client in this case is the regionserver 
itself.   The cache is there so the regionserver saves on its trips over the 
network visiting datanodes.
St.Ack


> Best Regards.
>
> On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik <sl...@gmail.com>wrote:
>
>   
>> I can try to reproduce it again, but before this i would like to send you a
>> logs.
>> Best Regards.
>>
>>
>> On Tue, Nov 4, 2008 at 10:05 PM, stack <st...@duboce.net> wrote:
>>
>>     
>>> Then we should try and figure if there is an issue in the balancer, or
>>> maybe there is something missing if we are not doing a big upload in a
>>> manner that balances the upload across HDFS?
>>> St.Ack
>>>
>>> Slava Gorelik wrote:
>>>
>>>       
>>>> Sure, i'll arrange logs tomorrow.About balancer, to wait when the massive
>>>> work is finished is good in testing environment but in production it's
>>>> not
>>>> relevant :-)
>>>>
>>>> Best Regards.
>>>>
>>>> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>>>>
>>>>
>>>>
>>>>         
>>>>> Slava Gorelik wrote:
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>>> Hi.Regarding the failure of new block creation - i failed to run hbase
>>>>>> till
>>>>>> i reformatted HDFS again.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>> I'd be interested in the logs.
>>>>>
>>>>>  I just wandering if hadoop re balancing is necessary? Will it balance
>>>>>
>>>>>
>>>>>           
>>>>>> itself
>>>>>> ? As i understand hadoop balancer is moving data between data nodes,
>>>>>> but
>>>>>> in
>>>>>> my case this is during massive (8 clients just adding a records - about
>>>>>> 400
>>>>>> requests for all region servers - 6). So, is it good idea to run
>>>>>> balancer during heavy load ?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>> I don't have sufficient experience running the balancer.  Perhaps wait
>>>>> till
>>>>> upload is done, then run it?
>>>>>
>>>>> St.Ack
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>
>>>>         
>>>       
>
>   


Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
One more question regarding the blockCache: how are changes in store files (as I
understand it, those are MapFiles) reflected in the client-side cache? What if we
are talking about more than one client making changes? What if each client has a
different part of the MapFile? Or something else?
Best Regards.

On Tue, Nov 4, 2008 at 11:10 PM, Slava Gorelik <sl...@gmail.com>wrote:

> I can try to reproduce it again, but before this i would like to send you a
> logs.
> Best Regards.
>
>
> On Tue, Nov 4, 2008 at 10:05 PM, stack <st...@duboce.net> wrote:
>
>> Then we should try and figure if there is an issue in the balancer, or
>> maybe there is something missing if we are not doing a big upload in a
>> manner that balances the upload across HDFS?
>> St.Ack
>>
>> Slava Gorelik wrote:
>>
>>> Sure, i'll arrange logs tomorrow.About balancer, to wait when the massive
>>> work is finished is good in testing environment but in production it's
>>> not
>>> relevant :-)
>>>
>>> Best Regards.
>>>
>>> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>>>
>>>
>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>
>>>>
>>>>> Hi.Regarding the failure of new block creation - i failed to run hbase
>>>>> till
>>>>> i reformatted HDFS again.
>>>>>
>>>>>
>>>>>
>>>>>
>>>> I'd be interested in the logs.
>>>>
>>>>  I just wandering if hadoop re balancing is necessary? Will it balance
>>>>
>>>>
>>>>> itself
>>>>> ? As i understand hadoop balancer is moving data between data nodes,
>>>>> but
>>>>> in
>>>>> my case this is during massive (8 clients just adding a records - about
>>>>> 400
>>>>> requests for all region servers - 6). So, is it good idea to run
>>>>> balancer during heavy load ?
>>>>>
>>>>>
>>>>>
>>>>>
>>>> I don't have sufficient experience running the balancer.  Perhaps wait
>>>> till
>>>> upload is done, then run it?
>>>>
>>>> St.Ack
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
I can try to reproduce it again, but before that I would like to send you
the logs.
Best Regards.


On Tue, Nov 4, 2008 at 10:05 PM, stack <st...@duboce.net> wrote:

> Then we should try and figure if there is an issue in the balancer, or
> maybe there is something missing if we are not doing a big upload in a
> manner that balances the upload across HDFS?
> St.Ack
>
> Slava Gorelik wrote:
>
>> Sure, i'll arrange logs tomorrow.About balancer, to wait when the massive
>> work is finished is good in testing environment but in production it's not
>> relevant :-)
>>
>> Best Regards.
>>
>> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>>
>>
>>
>>> Slava Gorelik wrote:
>>>
>>>
>>>
>>>> Hi.Regarding the failure of new block creation - i failed to run hbase
>>>> till
>>>> i reformatted HDFS again.
>>>>
>>>>
>>>>
>>>>
>>> I'd be interested in the logs.
>>>
>>>  I just wandering if hadoop re balancing is necessary? Will it balance
>>>
>>>
>>>> itself
>>>> ? As i understand hadoop balancer is moving data between data nodes, but
>>>> in
>>>> my case this is during massive (8 clients just adding a records - about
>>>> 400
>>>> requests for all region servers - 6). So, is it good idea to run
>>>> balancer during heavy load ?
>>>>
>>>>
>>>>
>>>>
>>> I don't have sufficient experience running the balancer.  Perhaps wait
>>> till
>>> upload is done, then run it?
>>>
>>> St.Ack
>>>
>>>
>>>
>>
>>
>>
>
>

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Then we should try and figure if there is an issue in the balancer, or 
maybe there is something missing if we are not doing a big upload in a 
manner that balances the upload across HDFS?
St.Ack

Slava Gorelik wrote:
> Sure, i'll arrange logs tomorrow.About balancer, to wait when the massive
> work is finished is good in testing environment but in production it's not
> relevant :-)
>
> Best Regards.
>
> On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:
>
>   
>> Slava Gorelik wrote:
>>
>>     
>>> Hi.Regarding the failure of new block creation - i failed to run hbase
>>> till
>>> i reformatted HDFS again.
>>>
>>>
>>>       
>> I'd be interested in the logs.
>>
>>  I just wandering if hadoop re balancing is necessary? Will it balance
>>     
>>> itself
>>> ? As i understand hadoop balancer is moving data between data nodes, but
>>> in
>>> my case this is during massive (8 clients just adding a records - about
>>> 400
>>> requests for all region servers - 6). So, is it good idea to run
>>> balancer during heavy load ?
>>>
>>>
>>>       
>> I don't have sufficient experience running the balancer.  Perhaps wait till
>> upload is done, then run it?
>>
>> St.Ack
>>
>>     
>
>   


Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Sure, I'll arrange the logs tomorrow. About the balancer: waiting until the
massive work is finished is fine in a testing environment, but in production
it's not practical :-)

Best Regards.

On Tue, Nov 4, 2008 at 9:48 PM, stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> Hi.Regarding the failure of new block creation - i failed to run hbase
>> till
>> i reformatted HDFS again.
>>
>>
> I'd be interested in the logs.
>
>  I just wandering if hadoop re balancing is necessary? Will it balance
>> itself
>> ? As i understand hadoop balancer is moving data between data nodes, but
>> in
>> my case this is during massive (8 clients just adding a records - about
>> 400
>> requests for all region servers - 6). So, is it good idea to run
>> balancer during heavy load ?
>>
>>
> I don't have sufficient experience running the balancer.  Perhaps wait till
> upload is done, then run it?
>
> St.Ack
>

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Slava Gorelik wrote:
> Hi.Regarding the failure of new block creation - i failed to run hbase till
> i reformatted HDFS again.
>   
I'd be interested in the logs.

> I just wandering if hadoop re balancing is necessary? Will it balance itself
> ? As i understand hadoop balancer is moving data between data nodes, but in
> my case this is during massive (8 clients just adding a records - about 400
> requests for all region servers - 6). So, is it good idea to run
> balancer during heavy load ?
>   
I don't have sufficient experience running the balancer.  Perhaps wait 
till upload is done, then run it?
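If you do run it under load, one way to soften the impact -- assuming the
Hadoop 0.18.x property name -- is to cap the balancer bandwidth in
hadoop-site.xml and use a modest threshold:

  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>1048576</value>  <!-- roughly 1 MB/sec per datanode -->
  </property>

  bin/start-balancer.sh -threshold 10   # stop it any time with bin/stop-balancer.sh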

St.Ack

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. Regarding the failure of new block creation - I failed to run HBase until
I reformatted HDFS again.
I'm just wondering whether Hadoop rebalancing is necessary - will it balance
itself? As I understand it, the Hadoop balancer moves data between datanodes,
but in my case this happens during a massive load (8 clients just adding
records - about 400 requests across all 6 region servers). So, is it a good
idea to run the balancer during heavy load?

Best Regards.


On Tue, Nov 4, 2008 at 9:33 PM, stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> Hi.I happened yesterday after 28 hours of running i started the hadoop
>> balancer after 2 hours working (with some exception- ca't move block) my
>> hbase started to throws exception that can't create new block. It happened
>> on on couple of region servers and then master is failed to connect to
>> then
>> and eventually it crashed. Tomorrow I can cut last 2-3 hours from logs
>> (they
>> are huge) and send you.
>>
>>
>
> Sorry about that.  I presumed it safe given that balancer has been around a
> few releases and we're running it here continuously w/o issue.  Did estart
> fix things or are there now missing blocks?
>
>> BTW, some i sent email to list couple of days ago about blockCache
>> parameter
>> on column family descriptor, what is it and how it affect on performance ?
>>
>>
> Sorry, missed it.
>
> The blockcache is client-side caching of pieces of store files.  You can
> set the size of the blocks to cache client-side.  It uses java Soft
> References.  Blocks are evicted on roughly an LRU basis when memory is low.
>  It was added a good while ago by Tom White.
>
> By default it has been off but as of HBASE-953 commit of about a week or so
> ago, after some playing and tuning, the default has been flipped and now
> blockcache is on by default.  Block caching along with other performance
> improvements including rpc fixes and J-D's scanner pre-fetching and batch
> writing, will make the 0.19.0 release run faster than its predecessors in
> many regards.
>
> Some rough benchmarking running our performance test -- keep in mind, this
> is not-very-real-world just a single client going against a single
> regionserver (see wiki for more) -- shows writes running at ~3X speed, scans
> at ~7X, sequential reads at ~2X and random reads anywhere from slower to 2
> to 3 times faster dependent on how well the block cache is helping (or
> hindering).  If the regionserver has more memory, random reads run faster.
>  If not enough, regionserver is just spinning filling cache and random read
> times plummet.  I'll put up some numbers when we come closer to the 0.19.0
> release.
>
> If you do enable block cache, be sure to update your hbase-default.xml.
>  The old block size tends to provoke OOMEs.
> St.Ack
>

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Slava Gorelik wrote:
> Hi.I happened yesterday after 28 hours of running i started the hadoop
> balancer after 2 hours working (with some exception- ca't move block) my
> hbase started to throws exception that can't create new block. It happened
> on on couple of region servers and then master is failed to connect to then
> and eventually it crashed. Tomorrow I can cut last 2-3 hours from logs (they
> are huge) and send you.
>   

Sorry about that.  I presumed it safe given that the balancer has been 
around a few releases and we're running it here continuously w/o issue.  
Did a restart fix things, or are there now missing blocks?
> BTW, some i sent email to list couple of days ago about blockCache parameter
> on column family descriptor, what is it and how it affect on performance ?
>   
Sorry, missed it.

The blockcache is client-side caching of pieces of store files.  You can 
set the size of the blocks to cache client-side.  It uses java Soft 
References.  Blocks are evicted on roughly an LRU basis when memory is 
low.  It was added a good while ago by Tom White.

By default it has been off but as of HBASE-953 commit of about a week or 
so ago, after some playing and tuning, the default has been flipped and 
now blockcache is on by default.  Block caching along with other 
performance improvements including rpc fixes and J-D's scanner 
pre-fetching and batch writing, will make the 0.19.0 release run faster 
than its predecessors in many regards.

Some rough benchmarking running our performance test -- keep in mind, 
this is not-very-real-world just a single client going against a single 
regionserver (see wiki for more) -- shows writes running at ~3X speed, 
scans at ~7X, sequential reads at ~2X and random reads anywhere from 
slower to 2 to 3 times faster dependent on how well the block cache is 
helping (or hindering).  If the regionserver has more memory, random 
reads run faster.  If not enough, regionserver is just spinning filling 
cache and random read times plummet.  I'll put up some numbers when we 
come closer to the 0.19.0 release.

If you do enable block cache, be sure to update your hbase-default.xml.  
The old block size tends to provoke OOMEs.
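For reference, block caching is toggled per column family; in the shell of
later releases it looks like the line below (the attribute name may differ
in 0.18/0.19, so treat this as a sketch):

  create 'mytable', {NAME => 'mycf', BLOCKCACHE => 'true'}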
St.Ack

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. It happened yesterday: after 28 hours of running, I started the Hadoop
balancer; after 2 hours of it working (with some exceptions - can't move block),
my HBase started to throw exceptions that it can't create a new block. It
happened on a couple of region servers, and then the master failed to connect
to them and eventually it crashed. Tomorrow I can cut the last 2-3 hours from
the logs (they are huge) and send them to you.

Thank you for your assistance.

BTW, I sent an email to the list a couple of days ago about the blockCache
parameter on the column family descriptor - what is it, and how does it affect
performance?

Best Regards.



On Tue, Nov 4, 2008 at 7:27 PM, stack <st...@duboce.net> wrote:

> Did it?  It shouldn't.  If it does, then its a bug we need to figure.
>  Whats its log say?  Otherwise you could do the rebalance manually but your
> results will be spotter (Up replication, let it run a while, then
> selectively shutdown nodes checking filesystem as you go to make sure at
> least one replica is still being served and then restore original
> replication level, etc.).
> St.Ack
>
> Slava Gorelik wrote:
>
>> Can i run balancer on hadoop during massive load on Hbase ?Last time i did
>> it i killed my data :-)
>>
>> On Tue, Nov 4, 2008 at 10:45 AM, Michael Stack <st...@duboce.net> wrote:
>>
>>
>>
>>> Slava Gorelik wrote:
>>>
>>>
>>>
>>>> Hi Michael.After reformatting HDFS, Hbase started to work as a Swiss
>>>> Clock.
>>>> Worked with 8 clients about 30 hours intensive load.
>>>>
>>>>
>>>>
>>>>
>>> Thanks for reporting back to the list.
>>>
>>>
>>>
>>>> Just small question, after about 28 hours (when i came back to work) i
>>>> found
>>>> that one of 7 datanodes in Hadoop is about 98% usage and all other about
>>>> 30%, is it normal ?
>>>>
>>>>
>>>>
>>>>
>>> I haven't kept a close eye on HDFS usage during hbase upload.  I do know
>>> that out-of-balance would seem to be a common condition and that its been
>>> reported on the list that hbase runs faster on a balanced HDFS.  You
>>> might
>>> try out the hadoop balancer.  There's a little note on it here in the
>>> hbase
>>> FAQ: http://wiki.apache.org/hadoop/Hbase/FAQ#7.
>>>
>>> St.Ack
>>>
>>>
>>>
>>
>>
>>
>
>

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Did it?  It shouldn't.  If it does, then it's a bug we need to figure out.  
What does its log say?  Otherwise you could do the rebalance manually, but 
your results will be spottier (up replication, let it run a while, then 
selectively shut down nodes, checking the filesystem as you go to make sure at 
least one replica is still being served, and then restore the original 
replication level, etc.).
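A rough sketch of that manual route, with example paths and replication
factors:

  bin/hadoop dfs -setrep -R -w 3 /hbase   # up replication and wait for it to take
  # ...stop one datanode at a time, checking the filesystem between steps...
  bin/hadoop fsck /hbase -files -blocks
  bin/hadoop dfs -setrep -R -w 2 /hbase   # restore the original replication level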
St.Ack

Slava Gorelik wrote:
> Can i run balancer on hadoop during massive load on Hbase ?Last time i did
> it i killed my data :-)
>
> On Tue, Nov 4, 2008 at 10:45 AM, Michael Stack <st...@duboce.net> wrote:
>
>   
>> Slava Gorelik wrote:
>>
>>     
>>> Hi Michael.After reformatting HDFS, Hbase started to work as a Swiss
>>> Clock.
>>> Worked with 8 clients about 30 hours intensive load.
>>>
>>>
>>>       
>> Thanks for reporting back to the list.
>>
>>     
>>> Just small question, after about 28 hours (when i came back to work) i
>>> found
>>> that one of 7 datanodes in Hadoop is about 98% usage and all other about
>>> 30%, is it normal ?
>>>
>>>
>>>       
>> I haven't kept a close eye on HDFS usage during hbase upload.  I do know
>> that out-of-balance would seem to be a common condition and that its been
>> reported on the list that hbase runs faster on a balanced HDFS.  You might
>> try out the hadoop balancer.  There's a little note on it here in the hbase
>> FAQ: http://wiki.apache.org/hadoop/Hbase/FAQ#7.
>>
>> St.Ack
>>
>>     
>
>   


Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Can I run the balancer on Hadoop during a massive load on HBase? Last time I
did it, I killed my data :-)

On Tue, Nov 4, 2008 at 10:45 AM, Michael Stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> Hi Michael.After reformatting HDFS, Hbase started to work as a Swiss
>> Clock.
>> Worked with 8 clients about 30 hours intensive load.
>>
>>
> Thanks for reporting back to the list.
>
>> Just small question, after about 28 hours (when i came back to work) i
>> found
>> that one of 7 datanodes in Hadoop is about 98% usage and all other about
>> 30%, is it normal ?
>>
>>
> I haven't kept a close eye on HDFS usage during hbase upload.  I do know
> that out-of-balance would seem to be a common condition and that its been
> reported on the list that hbase runs faster on a balanced HDFS.  You might
> try out the hadoop balancer.  There's a little note on it here in the hbase
> FAQ: http://wiki.apache.org/hadoop/Hbase/FAQ#7.
>
> St.Ack
>

Re: Regionserver fails to serve region

Posted by Michael Stack <st...@duboce.net>.
Slava Gorelik wrote:
> Hi Michael.After reformatting HDFS, Hbase started to work as a Swiss Clock.
> Worked with 8 clients about 30 hours intensive load.
>   
Thanks for reporting back to the list.
> Just small question, after about 28 hours (when i came back to work) i found
> that one of 7 datanodes in Hadoop is about 98% usage and all other about
> 30%, is it normal ?
>   
I haven't kept a close eye on HDFS usage during an hbase upload.  I do know 
that being out of balance would seem to be a common condition and that it's 
been reported on the list that hbase runs faster on a balanced HDFS.  
You might try out the hadoop balancer.  There's a little note on it here 
in the hbase FAQ: http://wiki.apache.org/hadoop/Hbase/FAQ#7.

St.Ack

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi Michael. After reformatting HDFS, HBase started to work like a Swiss clock.
It worked with 8 clients under about 30 hours of intensive load.

Just a small question: after about 28 hours (when I came back to work) I found
that one of the 7 datanodes in Hadoop is at about 98% usage and all the others
at about 30% - is that normal?

Best Regards.



On Fri, Oct 31, 2008 at 10:16 PM, Slava Gorelik <sl...@gmail.com>wrote:

> Hi.No problem with silly question :-) Yes, sure i replaced, here the list
> of folder that begins with 73*:
>
> drwxr-xr-x   - XXXXXXXXX supergroup          0 2008-10-29 11:13 /hbase/BizDB/732078971/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:13 /hbase/BizDB/732215319/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:13 /hbase/BizDB/733411255/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:14 /hbase/BizDB/733598097/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 10:50 /hbase/BizDB/734145833/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:09 /hbase/BizDB/735612900/BusinessObject
> drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:15 /hbase/BizDB/738009120/BusinessObject
>
> There no 735893330 folder.
> Scanning.META. in shell is not easy at all. .META. is huge and simple scan
> without providing specific column will give me about 10 min only listing of
> .META. content, so i failed  to find the 735893330, may be you can give me
> the name of the column, where this info is placed ?
>
> I think i'll reformat the HDFS and will start it from clean environment and
> the we'll see. I'll do it this Sunday and let you know.
>
> Best Regards and Big Thank You for your patience and assistance.
>
>
> On Fri, Oct 31, 2008 at 4:47 AM, Michael Stack <st...@duboce.net> wrote:
>
>> Slava Gorelik wrote:
>>
>>> Hi.I also noticed this exception.
>>> Strange that this exception is happened every time on the same
>>> regionserver.
>>> Tried to find directory hdfs://X:9000/hbase/BizDB/735893330 - not exist.
>>>  Very strange, but history folder in hadoop is empty.
>>>
>>>
>> It is odd indeed that the system keeps trying to load a region that does
>> not exist.
>>
>> I don't think it necessarily the same regionserver that is responsible.
>>  I'd think it an attribute of the region that we're trying to deploy on that
>> server.
>>
>> Silly question: you did replace 'X' with your machine name in the above?
>>
>> If you restart, it still tries to load this nonexistent region?
>>
>> If so, the .META. table is not consistent with whats on the filesystem.
>>  They've gotten out of sync.  Describing how to repair is involved.
>>
>>  Reformatting HDFS  will help ?
>>>
>>>
>>>
>> Do a "scan '.META.'" in the shell.  Do you see your region listed (look at
>> the encoded names attribute to find 735893330.
>>
>> If your table is damaged -- i'd guess it because ulimit was bad up to this
>> -- the best thing might to start over.
>>
>>  One more things in a last minute, i found that one node in cluster has
>>> totally different time, could this cause for such a problems ?
>>>
>>>
>> We thought we'd fixed all problems that could arise from time skew, but
>> you never know.  In our requirements, clocks must be synced.  Fix this too
>> if you can before reloading.
>>
>>  P.S. About logs, is it possible to send to some email - each log file
>>> compressed is about 1mb, and only in 3 files i found exceptions.
>>>
>>>
>>>
>> There probably is such a functionality but I'm not familiar.  Can you put
>> them under a webserver at your place so I can grab them?  You can send me
>> the URL offlist if you like.
>>
>> Thanks for your patience Slava.  We'll figure it.
>>
>> St.Ack
>>
>>
>>  On Thu, Oct 30, 2008 at 10:25 PM, stack <st...@duboce.net> wrote:
>>>
>>>
>>>
>>>> Can you put them someplace that I can pull them?
>>>>
>>>> I took another look at your logs.  I see that a region is missing files.
>>>>  That means it will never open and just keep trying.  Grep your logs for
>>>> FileNotFound.  You'll see this:
>>>>
>>>>
>>>> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
>>>> File does not exist:
>>>>
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
>>>>
>>>> hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException:
>>>> File does not exist:
>>>>
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data
>>>>
>>>> Try shutting down, and removing these files.   Remove the following
>>>> directories:
>>>>
>>>>
>>>>
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
>>>>
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
>>>>
>>>>
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
>>>>
>>>> hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637
>>>>
>>>> Then retry restarting.
>>>>
>>>> You can try and figure how these files got lost by going back in your
>>>> history.
>>>>
>>>>
>>>> St.Ack
>>>>
>>>>
>>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>
>>>>
>>>>> Michael,still have the problem, but the logs files are very big (50MB
>>>>> each)
>>>>> even compressed they are bigger than limit for this mailing list.
>>>>> Most of the problems are happened during compaction (i see in the log),
>>>>> may
>>>>> be i can send some parts from logs ?
>>>>>
>>>>> Best Regards.
>>>>>
>>>>> On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <
>>>>> slava.gorelik@gmail.com
>>>>>
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Sorry, my mistake, i did it for wrong user name.Thanks, updating now,
>>>>>> soon
>>>>>> will try again.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <
>>>>>> slava.gorelik@gmail.com
>>>>>>
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi.Very strange, i see in limits.conf that it's upped.
>>>>>>> I attached the limits.conf, please have a  look, may be i did it
>>>>>>> wrong.
>>>>>>>
>>>>>>> Best Regards.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 30, 2008 at 7:52 PM, stack <st...@duboce.net> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Thanks for the logs Slava.  I notice that you have not upped the
>>>>>>>> ulimit
>>>>>>>> on your cluster.  See the head of your logs where we print out the
>>>>>>>> ulimit.
>>>>>>>>  Its 1024.  This could be one cause of your grief especially when
>>>>>>>> you
>>>>>>>> seemingly have many regions (>1000).  Please try upping it.
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Slava Gorelik wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi.
>>>>>>>>> I enabled DEBUG log level and now I'm sending all logs (archived)
>>>>>>>>> including fsck run result.
>>>>>>>>> Today my program starting to fail couple of minutes from the begin,
>>>>>>>>> it's
>>>>>>>>> very easy to reproduce the problem, cluster became very unstable.
>>>>>>>>>
>>>>>>>>> Best Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net <mailto:
>>>>>>>>> stack@duboce.net>> wrote:
>>>>>>>>>
>>>>>>>>>  See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>>>>>>>
>>>>>>>>>  St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>      Hi.First of all i want to say thank you for you assistance !!!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      DEBUG on hadoop or hbase ? And how can i enable ?
>>>>>>>>>      fsck said that HDFS is healthy.
>>>>>>>>>
>>>>>>>>>      Best Regards and Thank You
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net
>>>>>>>>>      <ma...@duboce.net>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>              Hi.HDFS capacity is about 800gb (8 datanodes) and the
>>>>>>>>>              current usage is
>>>>>>>>>              about
>>>>>>>>>              30GB. This is after total re-format of the HDFS that
>>>>>>>>>              was made a hour
>>>>>>>>>              before.
>>>>>>>>>
>>>>>>>>>              BTW, the logs i sent are from the first exception that
>>>>>>>>>              i found in them.
>>>>>>>>>              Best Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Please enable DEBUG and retry.  Send me all logs.  What
>>>>>>>>>          does the fsck on
>>>>>>>>>          HDFS say?  There is something seriously wrong with your
>>>>>>>>>          cluster that you are
>>>>>>>>>          having so much trouble getting it running.  Lets try and
>>>>>>>>>          figure it.
>>>>>>>>>
>>>>>>>>>          St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>              On Tue, Oct 28, 2008 at 7:12 PM, stack
>>>>>>>>>              <stack@duboce.net <ma...@duboce.net>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                  I took a quick look Slava (Thanks for sending the
>>>>>>>>>                  files).   Here's a few
>>>>>>>>>                  notes:
>>>>>>>>>
>>>>>>>>>                  + The logs are from after the damage is done; the
>>>>>>>>>                  transition from good to
>>>>>>>>>                  bad is missing.  If I could see that, that would
>>>>>>>>> help
>>>>>>>>>                  + But what seems to be plain is that that your
>>>>>>>>>                  HDFS is very sick.  See
>>>>>>>>>                  this
>>>>>>>>>                  from head of one of the regionserver logs:
>>>>>>>>>
>>>>>>>>>                  2008-10-27 23:41:12,682 WARN
>>>>>>>>>                  org.apache.hadoop.dfs.DFSClient:
>>>>>>>>>                  DataStreamer
>>>>>>>>>                  Exception: java.io.IOException: Unable to create
>>>>>>>>>                  new block.
>>>>>>>>>                   at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>>>>>>                   at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>>>>>>                   at
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>>>>>>
>>>>>>>>>                  2008-10-27 23:41:12,682 WARN
>>>>>>>>>                  org.apache.hadoop.dfs.DFSClient: Error
>>>>>>>>>                  Recovery for block blk_-5188192041705782716_60000
>>>>>>>>>                  bad datanode[0]
>>>>>>>>>                  2008-10-27 23:41:12,685 ERROR
>>>>>>>>>
>>>>>>>>>  org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>>>>>>>                  Compaction/Split
>>>>>>>>>                  failed for region
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>>>>>>                  java.io.IOException: Could not get block
>>>>>>>>>                  locations. Aborting...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                  If HDFS is ailing, hbase is too.  In fact, the
>>>>>>>>>                  regionservers will shut
>>>>>>>>>                  themselves to protect themselves against damaging
>>>>>>>>>                  or losing data:
>>>>>>>>>
>>>>>>>>>                  2008-10-27 23:41:12,688 FATAL
>>>>>>>>>                  org.apache.hadoop.hbase.regionserver.Flusher:
>>>>>>>>>                  Replay of hlog required. Forcing server restart
>>>>>>>>>
>>>>>>>>>                  So, whats up with your HDFS?  Not enough space
>>>>>>>>>                  alloted?  What happens if
>>>>>>>>>                  you run "./bin/hadoop fsck /"?  Does that give you
>>>>>>>>>                  a clue as to what
>>>>>>>>>                  happened?  Dig in the datanode and namenode logs.
>>>>>>>>>                   Look for where the
>>>>>>>>>                  exceptions start.  It might give you a clue.
>>>>>>>>>
>>>>>>>>>                  + The suse regionserver log had garbage in it.
>>>>>>>>>
>>>>>>>>>                  St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                  Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                      Hi.
>>>>>>>>>                      My happiness was very short :-( After i
>>>>>>>>>                      successfully added 1M rows (50k
>>>>>>>>>                      each row) i tried to add 10M rows.
>>>>>>>>>                      And after 3-4 working hours it started to
>>>>>>>>>                      dying. First one region server
>>>>>>>>>                      is died, after another one and eventually all
>>>>>>>>>                      cluster is dead.
>>>>>>>>>
>>>>>>>>>                      I attached log files (relevant part, archived)
>>>>>>>>>                      from region servers and
>>>>>>>>>                      from the master.
>>>>>>>>>
>>>>>>>>>                      Best Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                      On Mon, Oct 27, 2008 at 11:19 AM, Slava
>>>>>>>>> Gorelik
>>>>>>>>> <
>>>>>>>>>                      slava.gorelik@gmail.com
>>>>>>>>>                      <ma...@gmail.com><mailto:
>>>>>>>>>                      slava.gorelik@gmail.com
>>>>>>>>>                      <ma...@gmail.com>>> wrote:
>>>>>>>>>
>>>>>>>>>                       Hi.
>>>>>>>>>                       So far so good, after changing the file
>>>>>>>>>                      descriptors
>>>>>>>>>                       and dfs.datanode.socket.write.timeout,
>>>>>>>>>                      dfs.datanode.max.xcievers
>>>>>>>>>                       my cluster works stable.
>>>>>>>>>                       Thank You and Best Regards.
>>>>>>>>>
>>>>>>>>>                       P.S. Regarding deleting multiple columns
>>>>>>>>>                      missing functionality i
>>>>>>>>>                       filled jira :
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/HBASE-961
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                       On Sun, Oct 26, 2008 at 12:58 AM, Michael
>>>>>>>>>                      Stack <stack@duboce.net <mailto:
>>>>>>>>> stack@duboce.net
>>>>>>>>>                                 <mailto:stack@duboce.net
>>>>>>>>>
>>>>>>>>>                      <ma...@duboce.net>>> wrote:
>>>>>>>>>
>>>>>>>>>                           Slava Gorelik wrote:
>>>>>>>>>
>>>>>>>>>                               Hi.Haven't tried yet them, i'll try
>>>>>>>>>                      tomorrow morning. In
>>>>>>>>>                               general cluster is
>>>>>>>>>                               working well, the problems begins if
>>>>>>>>>                      i'm trying to add 10M
>>>>>>>>>                               rows, after 1.2M
>>>>>>>>>                               if happened.
>>>>>>>>>
>>>>>>>>>                           Anything else running beside the
>>>>>>>>>                      regionserver or datanodes
>>>>>>>>>                           that would suck resources?  When
>>>>>>>>>                      datanodes begin to slow, we
>>>>>>>>>                           begin to see the issue Jean-Adrien's
>>>>>>>>>                      configurations address.
>>>>>>>>>                            Are you uploading using MapReduce?  Are
>>>>>>>>>                      TTs running on same
>>>>>>>>>                           nodes as the datanode and regionserver?
>>>>>>>>>                       How are you doing the
>>>>>>>>>                           upload?  Describe what your uploader
>>>>>>>>>                      looks like (Sorry if
>>>>>>>>>                           you've already done this).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                I already changed the limit of files
>>>>>>>>>                      descriptors,
>>>>>>>>>
>>>>>>>>>                           Good.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                I'll try
>>>>>>>>>                               to change the properties:
>>>>>>>>>                                <property>
>>>>>>>>>                      <name>dfs.datanode.socket.write.timeout</name>
>>>>>>>>>                                <value>0</value>
>>>>>>>>>                               </property>
>>>>>>>>>
>>>>>>>>>                               <property>
>>>>>>>>>
>>>>>>>>>  <name>dfs.datanode.max.xcievers</name>
>>>>>>>>>                                <value>1023</value>
>>>>>>>>>                               </property>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                           Yeah, try it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                               And let you know, is any other
>>>>>>>>>                      prescriptions ? Did i miss
>>>>>>>>>                               something ?
>>>>>>>>>
>>>>>>>>>                               BTW, off topic, but i sent e-mail
>>>>>>>>>                      recently to the list and
>>>>>>>>>                               i can't see it:
>>>>>>>>>                               Is it possible to delete multiple
>>>>>>>>>                      columns in any way by
>>>>>>>>>                               regex : for example
>>>>>>>>>                               colum_name_* ?
>>>>>>>>>
>>>>>>>>>                           Not that I know of.  If its not in the
>>>>>>>>>                      API, it should be.
>>>>>>>>>                            Mind filing a JIRA?
>>>>>>>>>
>>>>>>>>>                           Thanks Slava.
>>>>>>>>>                           St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. No problem with the silly question :-) Yes, sure, I replaced it. Here is
the list of folders that begin with 73*:

drwxr-xr-x   - XXXXXXXXX supergroup          0 2008-10-29 11:13
/hbase/BizDB/732078971/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:13
/hbase/BizDB/732215319/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:13
/hbase/BizDB/733411255/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:14
/hbase/BizDB/733598097/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 10:50
/hbase/BizDB/734145833/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:09
/hbase/BizDB/735612900/BusinessObject
drwxr-xr-x   - XXXXXXXX supergroup          0 2008-10-29 11:15
/hbase/BizDB/738009120/BusinessObject

There is no 735893330 folder.
Scanning .META. in the shell is not easy at all. .META. is huge, and a simple
scan without specifying a column gives me about 10 minutes of just listing the
.META. content, so I failed to find 735893330. Maybe you can give me the name
of the column where this info is placed?
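
(For what it's worth, I guess I could pipe the scan through grep from the
command line -- just a sketch, assuming the hbase shell reads commands from
stdin:

   echo "scan '.META.'" | ./bin/hbase shell | grep 735893330

but that still walks the whole table, so a column to restrict the scan to
would help a lot.)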

I think I'll reformat HDFS and start from a clean environment, and then we'll
see. I'll do it this Sunday and let you know.

Best Regards and Big Thank You for your patience and assistance.



Re: Regionserver fails to serve region

Posted by Michael Stack <st...@duboce.net>.
Slava Gorelik wrote:
> Hi.I also noticed this exception.
> Strange that this exception is happened every time on the same regionserver.
> Tried to find directory hdfs://X:9000/hbase/BizDB/735893330 - not exist.
>  Very strange, but history folder in hadoop is empty.
>   
It is odd indeed that the system keeps trying to load a region that does 
not exist.

I don't think it's necessarily the same regionserver that is responsible.
I'd think it's an attribute of the region that we're trying to deploy on
that server.

Silly question: you did replace 'X' with your machine name in the above?

If you restart, it still tries to load this nonexistent region?

If so, the .META. table is not consistent with what's on the filesystem.
They've gotten out of sync.  Describing how to repair it is involved.

> Reformatting HDFS  will help ?
>
>   
Do a "scan '.META.'" in the shell.  Do you see your region listed (look 
at the encoded names attribute to find 735893330.
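
Something along these lines might speed up the hunt -- a rough sketch only;
the encoded name is part of the value in the info:regioninfo column of .META.,
and the exact scan syntax differs a bit between shell versions (newer shells
take a {COLUMNS => [...]} argument -- check 'help' in yours):

   echo "scan '.META.', {COLUMNS => ['info:regioninfo']}" | ./bin/hbase shell | grep 735893330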

If your table is damaged -- I'd guess it's because ulimit was bad up to
this point -- the best thing might be to start over.

> One more things in a last minute, i found that one node in cluster has
> totally different time, could this cause for such a problems ?
>   
We thought we'd fixed all problems that could arise from time skew, but 
you never know.  In our requirements, clocks must be synced.  Fix this 
too if you can before reloading.
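
A quick way to eyeball the skew and pull the stray node back in line --
assuming ntpdate is installed and that the hostnames and NTP server below are
stand-ins for your own:

   # compare wall clocks across the cluster
   for h in node1 node2 node3 node4; do ssh $h date; done

   # on the node that drifted, one-shot sync as root (then keep ntpd running)
   ntpdate pool.ntp.org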

> P.S. About logs, is it possible to send to some email - each log file
> compressed is about 1mb, and only in 3 files i found exceptions.
>
>   
There probably is such functionality, but I'm not familiar with it.  Can you
put them under a webserver at your place so I can grab them?  You can
send me the URL offlist if you like.
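
If you don't have a webserver handy, something as simple as this, run from
the directory holding the compressed logs, would do -- assuming a Python 2
install on that box and a port reachable from outside:

   cd /path/to/logs        # wherever the .gz files were gathered
   python -m SimpleHTTPServer 8080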

Thanks for your patience Slava.  We'll figure it.
St.Ack




Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. I also noticed this exception.
Strange that this exception happens every time on the same regionserver.
I tried to find the directory hdfs://X:9000/hbase/BizDB/735893330 - it does
not exist. Very strange, but the history folder in hadoop is empty.
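
(I looked for it with a plain listing, roughly -- the path is as in my setup:

   ./bin/hadoop fs -ls /hbase/BizDB | grep 735893330

and nothing comes back.)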

Would reformatting HDFS help?

One more thing at the last minute: I found that one node in the cluster has a
totally different time. Could this be the cause of such problems?

P.S. About the logs: is it possible to send them to some email address? Each
log file compressed is about 1MB, and I found exceptions in only 3 of the
files.



Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Can you put them someplace that I can pull them?

I took another look at your logs.  I see that a region is missing 
files.  That means it will never open and just keep trying.  Grep your 
logs for FileNotFound.  You'll see this:

hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: 
File does not exist: 
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: 
File does not exist: 
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data

Try shutting down, and removing these files.   Remove the following 
directories:

hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637

Then retry restarting.

You can try to figure out how these files got lost by going back through
your history.
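
For what it's worth, here is roughly what the cleanup looks like from the
shell (a sketch only: run it with hbase shut down, double-check the paths
against your own logs first, and adjust the log file names to yours; the
paths are the ones above, relative to your default filesystem):

  # confirm which store files the regionserver keeps complaining about
  grep FileNotFound hbase-*-regionserver-*.log

  # remove the orphaned mapfiles/info pairs
  bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
  bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/info/647541142630058906
  bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
  bin/hadoop fs -rmr /hbase/BizDB/735893330/BusinessObject/info/2243545870343537637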

St.Ack



Slava Gorelik wrote:
> Michael,still have the problem, but the logs files are very big (50MB each)
> even compressed they are bigger than limit for this mailing list.
> Most of the problems are happened during compaction (i see in the log), may
> be i can send some parts from logs ?
>
> Best Regards.
>
> On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <sl...@gmail.com>wrote:
>
>   
>> Sorry, my mistake, i did it for wrong user name.Thanks, updating now, soon
>> will try again.
>>
>>
>> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <sl...@gmail.com>wrote:
>>
>>     
>>> Hi.Very strange, i see in limits.conf that it's upped.
>>> I attached the limits.conf, please have a  look, may be i did it wrong.
>>>
>>> Best Regards.
>>>
>>>
>>> On Thu, Oct 30, 2008 at 7:52 PM, stack <st...@duboce.net> wrote:
>>>
>>>       
>>>> Thanks for the logs Slava.  I notice that you have not upped the ulimit
>>>> on your cluster.  See the head of your logs where we print out the ulimit.
>>>>  Its 1024.  This could be one cause of your grief especially when you
>>>> seemingly have many regions (>1000).  Please try upping it.
>>>> St.Ack
>>>>
>>>>
>>>>
>>>>
>>>> Slava Gorelik wrote:
>>>>
>>>>         
>>>>> Hi.
>>>>> I enabled DEBUG log level and now I'm sending all logs (archived)
>>>>> including fsck run result.
>>>>> Today my program starting to fail couple of minutes from the begin, it's
>>>>> very easy to reproduce the problem, cluster became very unstable.
>>>>>
>>>>> Best Regards.
>>>>>
>>>>>
>>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net <mailto:
>>>>> stack@duboce.net>> wrote:
>>>>>
>>>>>    See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>>>
>>>>>    St.Ack
>>>>>
>>>>>
>>>>>    Slava Gorelik wrote:
>>>>>
>>>>>        Hi.First of all i want to say thank you for you assistance !!!
>>>>>
>>>>>
>>>>>        DEBUG on hadoop or hbase ? And how can i enable ?
>>>>>        fsck said that HDFS is healthy.
>>>>>
>>>>>        Best Regards and Thank You
>>>>>
>>>>>
>>>>>        On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net
>>>>>        <ma...@duboce.net>> wrote:
>>>>>
>>>>>
>>>>>            Slava Gorelik wrote:
>>>>>
>>>>>
>>>>>                Hi.HDFS capacity is about 800gb (8 datanodes) and the
>>>>>                current usage is
>>>>>                about
>>>>>                30GB. This is after total re-format of the HDFS that
>>>>>                was made a hour
>>>>>                before.
>>>>>
>>>>>                BTW, the logs i sent are from the first exception that
>>>>>                i found in them.
>>>>>                Best Regards.
>>>>>
>>>>>
>>>>>
>>>>>            Please enable DEBUG and retry.  Send me all logs.  What
>>>>>            does the fsck on
>>>>>            HDFS say?  There is something seriously wrong with your
>>>>>            cluster that you are
>>>>>            having so much trouble getting it running.  Lets try and
>>>>>            figure it.
>>>>>
>>>>>            St.Ack
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                On Tue, Oct 28, 2008 at 7:12 PM, stack
>>>>>                <stack@duboce.net <ma...@duboce.net>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                    I took a quick look Slava (Thanks for sending the
>>>>>                    files).   Here's a few
>>>>>                    notes:
>>>>>
>>>>>                    + The logs are from after the damage is done; the
>>>>>                    transition from good to
>>>>>                    bad is missing.  If I could see that, that would help
>>>>>                    + But what seems to be plain is that that your
>>>>>                    HDFS is very sick.  See
>>>>>                    this
>>>>>                    from head of one of the regionserver logs:
>>>>>
>>>>>                    2008-10-27 23:41:12,682 WARN
>>>>>                    org.apache.hadoop.dfs.DFSClient:
>>>>>                    DataStreamer
>>>>>                    Exception: java.io.IOException: Unable to create
>>>>>                    new block.
>>>>>                     at
>>>>>
>>>>>
>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>>                     at
>>>>>
>>>>>
>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>>                     at
>>>>>
>>>>>
>>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>>
>>>>>                    2008-10-27 23:41:12,682 WARN
>>>>>                    org.apache.hadoop.dfs.DFSClient: Error
>>>>>                    Recovery for block blk_-5188192041705782716_60000
>>>>>                    bad datanode[0]
>>>>>                    2008-10-27 23:41:12,685 ERROR
>>>>>
>>>>>  org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>>>                    Compaction/Split
>>>>>                    failed for region
>>>>>
>>>>>  BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>>                    java.io.IOException: Could not get block
>>>>>                    locations. Aborting...
>>>>>
>>>>>
>>>>>                    If HDFS is ailing, hbase is too.  In fact, the
>>>>>                    regionservers will shut
>>>>>                    themselves to protect themselves against damaging
>>>>>                    or losing data:
>>>>>
>>>>>                    2008-10-27 23:41:12,688 FATAL
>>>>>                    org.apache.hadoop.hbase.regionserver.Flusher:
>>>>>                    Replay of hlog required. Forcing server restart
>>>>>
>>>>>                    So, whats up with your HDFS?  Not enough space
>>>>>                    alloted?  What happens if
>>>>>                    you run "./bin/hadoop fsck /"?  Does that give you
>>>>>                    a clue as to what
>>>>>                    happened?  Dig in the datanode and namenode logs.
>>>>>                     Look for where the
>>>>>                    exceptions start.  It might give you a clue.
>>>>>
>>>>>                    + The suse regionserver log had garbage in it.
>>>>>
>>>>>                    St.Ack
>>>>>
>>>>>
>>>>>                    Slava Gorelik wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                        Hi.
>>>>>                        My happiness was very short :-( After i
>>>>>                        successfully added 1M rows (50k
>>>>>                        each row) i tried to add 10M rows.
>>>>>                        And after 3-4 working hours it started to
>>>>>                        dying. First one region server
>>>>>                        is died, after another one and eventually all
>>>>>                        cluster is dead.
>>>>>
>>>>>                        I attached log files (relevant part, archived)
>>>>>                        from region servers and
>>>>>                        from the master.
>>>>>
>>>>>                        Best Regards.
>>>>>
>>>>>
>>>>>
>>>>>                        On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <
>>>>>                        slava.gorelik@gmail.com
>>>>>                        <ma...@gmail.com><mailto:
>>>>>                        slava.gorelik@gmail.com
>>>>>                        <ma...@gmail.com>>> wrote:
>>>>>
>>>>>                         Hi.
>>>>>                         So far so good, after changing the file
>>>>>                        descriptors
>>>>>                         and dfs.datanode.socket.write.timeout,
>>>>>                        dfs.datanode.max.xcievers
>>>>>                         my cluster works stable.
>>>>>                         Thank You and Best Regards.
>>>>>
>>>>>                         P.S. Regarding deleting multiple columns
>>>>>                        missing functionality i
>>>>>                         filled jira :
>>>>>                        https://issues.apache.org/jira/browse/HBASE-961
>>>>>
>>>>>
>>>>>
>>>>>                         On Sun, Oct 26, 2008 at 12:58 AM, Michael
>>>>>                        Stack <stack@duboce.net <mailto:stack@duboce.net
>>>>>           
>>>>>                         <mailto:stack@duboce.net
>>>>>
>>>>>                        <ma...@duboce.net>>> wrote:
>>>>>
>>>>>                             Slava Gorelik wrote:
>>>>>
>>>>>                                 Hi.Haven't tried yet them, i'll try
>>>>>                        tomorrow morning. In
>>>>>                                 general cluster is
>>>>>                                 working well, the problems begins if
>>>>>                        i'm trying to add 10M
>>>>>                                 rows, after 1.2M
>>>>>                                 if happened.
>>>>>
>>>>>                             Anything else running beside the
>>>>>                        regionserver or datanodes
>>>>>                             that would suck resources?  When
>>>>>                        datanodes begin to slow, we
>>>>>                             begin to see the issue Jean-Adrien's
>>>>>                        configurations address.
>>>>>                              Are you uploading using MapReduce?  Are
>>>>>                        TTs running on same
>>>>>                             nodes as the datanode and regionserver?
>>>>>                         How are you doing the
>>>>>                             upload?  Describe what your uploader
>>>>>                        looks like (Sorry if
>>>>>                             you've already done this).
>>>>>
>>>>>
>>>>>                                  I already changed the limit of files
>>>>>                        descriptors,
>>>>>
>>>>>                             Good.
>>>>>
>>>>>
>>>>>                                  I'll try
>>>>>                                 to change the properties:
>>>>>                                  <property>
>>>>>                        <name>dfs.datanode.socket.write.timeout</name>
>>>>>                                  <value>0</value>
>>>>>                                 </property>
>>>>>
>>>>>                                 <property>
>>>>>                                  <name>dfs.datanode.max.xcievers</name>
>>>>>                                  <value>1023</value>
>>>>>                                 </property>
>>>>>
>>>>>
>>>>>                             Yeah, try it.
>>>>>
>>>>>
>>>>>                                 And let you know, is any other
>>>>>                        prescriptions ? Did i miss
>>>>>                                 something ?
>>>>>
>>>>>                                 BTW, off topic, but i sent e-mail
>>>>>                        recently to the list and
>>>>>                                 i can't see it:
>>>>>                                 Is it possible to delete multiple
>>>>>                        columns in any way by
>>>>>                                 regex : for example
>>>>>                                 colum_name_* ?
>>>>>
>>>>>                             Not that I know of.  If its not in the
>>>>>                        API, it should be.
>>>>>                              Mind filing a JIRA?
>>>>>
>>>>>                             Thanks Slava.
>>>>>                             St.Ack
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>
>   


Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Michael, I still have the problem, but the log files are very big (50MB each);
even compressed they are bigger than the limit for this mailing list.
Most of the problems happen during compaction (I can see it in the log); maybe
I can send some parts of the logs?
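
For example, I could carve out just the stretch around the compaction
errors and gzip it, something like this (the log file name and the line
range are only an example):

  # find where the compaction trouble starts
  grep -n -i compact hbase-clmanager-regionserver-ILREDHAT012.log | head

  # cut that window out and compress it
  sed -n '120000,130000p' hbase-clmanager-regionserver-ILREDHAT012.log \
      | gzip > regionserver-compaction-part.log.gz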

Best Regards.

On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <sl...@gmail.com>wrote:

> Sorry, my mistake, i did it for wrong user name.Thanks, updating now, soon
> will try again.
>
>
> On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <sl...@gmail.com>wrote:
>
>> Hi.Very strange, i see in limits.conf that it's upped.
>> I attached the limits.conf, please have a  look, may be i did it wrong.
>>
>> Best Regards.
>>
>>
>> On Thu, Oct 30, 2008 at 7:52 PM, stack <st...@duboce.net> wrote:
>>
>>> Thanks for the logs Slava.  I notice that you have not upped the ulimit
>>> on your cluster.  See the head of your logs where we print out the ulimit.
>>>  Its 1024.  This could be one cause of your grief especially when you
>>> seemingly have many regions (>1000).  Please try upping it.
>>> St.Ack
>>>
>>>
>>>
>>>
>>> Slava Gorelik wrote:
>>>
>>>> Hi.
>>>> I enabled DEBUG log level and now I'm sending all logs (archived)
>>>> including fsck run result.
>>>> Today my program starting to fail couple of minutes from the begin, it's
>>>> very easy to reproduce the problem, cluster became very unstable.
>>>>
>>>> Best Regards.
>>>>
>>>>
>>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net <mailto:
>>>> stack@duboce.net>> wrote:
>>>>
>>>>    See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>>
>>>>    St.Ack
>>>>
>>>>
>>>>    Slava Gorelik wrote:
>>>>
>>>>        Hi.First of all i want to say thank you for you assistance !!!
>>>>
>>>>
>>>>        DEBUG on hadoop or hbase ? And how can i enable ?
>>>>        fsck said that HDFS is healthy.
>>>>
>>>>        Best Regards and Thank You
>>>>
>>>>
>>>>        On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net
>>>>        <ma...@duboce.net>> wrote:
>>>>
>>>>
>>>>            Slava Gorelik wrote:
>>>>
>>>>
>>>>                Hi.HDFS capacity is about 800gb (8 datanodes) and the
>>>>                current usage is
>>>>                about
>>>>                30GB. This is after total re-format of the HDFS that
>>>>                was made a hour
>>>>                before.
>>>>
>>>>                BTW, the logs i sent are from the first exception that
>>>>                i found in them.
>>>>                Best Regards.
>>>>
>>>>
>>>>
>>>>            Please enable DEBUG and retry.  Send me all logs.  What
>>>>            does the fsck on
>>>>            HDFS say?  There is something seriously wrong with your
>>>>            cluster that you are
>>>>            having so much trouble getting it running.  Lets try and
>>>>            figure it.
>>>>
>>>>            St.Ack
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>                On Tue, Oct 28, 2008 at 7:12 PM, stack
>>>>                <stack@duboce.net <ma...@duboce.net>> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>                    I took a quick look Slava (Thanks for sending the
>>>>                    files).   Here's a few
>>>>                    notes:
>>>>
>>>>                    + The logs are from after the damage is done; the
>>>>                    transition from good to
>>>>                    bad is missing.  If I could see that, that would help
>>>>                    + But what seems to be plain is that that your
>>>>                    HDFS is very sick.  See
>>>>                    this
>>>>                    from head of one of the regionserver logs:
>>>>
>>>>                    2008-10-27 23:41:12,682 WARN
>>>>                    org.apache.hadoop.dfs.DFSClient:
>>>>                    DataStreamer
>>>>                    Exception: java.io.IOException: Unable to create
>>>>                    new block.
>>>>                     at
>>>>
>>>>
>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>>                     at
>>>>
>>>>
>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>>                     at
>>>>
>>>>
>>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>>
>>>>                    2008-10-27 23:41:12,682 WARN
>>>>                    org.apache.hadoop.dfs.DFSClient: Error
>>>>                    Recovery for block blk_-5188192041705782716_60000
>>>>                    bad datanode[0]
>>>>                    2008-10-27 23:41:12,685 ERROR
>>>>
>>>>  org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>>                    Compaction/Split
>>>>                    failed for region
>>>>
>>>>  BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>>                    java.io.IOException: Could not get block
>>>>                    locations. Aborting...
>>>>
>>>>
>>>>                    If HDFS is ailing, hbase is too.  In fact, the
>>>>                    regionservers will shut
>>>>                    themselves to protect themselves against damaging
>>>>                    or losing data:
>>>>
>>>>                    2008-10-27 23:41:12,688 FATAL
>>>>                    org.apache.hadoop.hbase.regionserver.Flusher:
>>>>                    Replay of hlog required. Forcing server restart
>>>>
>>>>                    So, whats up with your HDFS?  Not enough space
>>>>                    alloted?  What happens if
>>>>                    you run "./bin/hadoop fsck /"?  Does that give you
>>>>                    a clue as to what
>>>>                    happened?  Dig in the datanode and namenode logs.
>>>>                     Look for where the
>>>>                    exceptions start.  It might give you a clue.
>>>>
>>>>                    + The suse regionserver log had garbage in it.
>>>>
>>>>                    St.Ack
>>>>
>>>>
>>>>                    Slava Gorelik wrote:
>>>>
>>>>
>>>>
>>>>
>>>>                        Hi.
>>>>                        My happiness was very short :-( After i
>>>>                        successfully added 1M rows (50k
>>>>                        each row) i tried to add 10M rows.
>>>>                        And after 3-4 working hours it started to
>>>>                        dying. First one region server
>>>>                        is died, after another one and eventually all
>>>>                        cluster is dead.
>>>>
>>>>                        I attached log files (relevant part, archived)
>>>>                        from region servers and
>>>>                        from the master.
>>>>
>>>>                        Best Regards.
>>>>
>>>>
>>>>
>>>>                        On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <
>>>>                        slava.gorelik@gmail.com
>>>>                        <ma...@gmail.com><mailto:
>>>>                        slava.gorelik@gmail.com
>>>>                        <ma...@gmail.com>>> wrote:
>>>>
>>>>                         Hi.
>>>>                         So far so good, after changing the file
>>>>                        descriptors
>>>>                         and dfs.datanode.socket.write.timeout,
>>>>                        dfs.datanode.max.xcievers
>>>>                         my cluster works stable.
>>>>                         Thank You and Best Regards.
>>>>
>>>>                         P.S. Regarding deleting multiple columns
>>>>                        missing functionality i
>>>>                         filled jira :
>>>>                        https://issues.apache.org/jira/browse/HBASE-961
>>>>
>>>>
>>>>
>>>>                         On Sun, Oct 26, 2008 at 12:58 AM, Michael
>>>>                        Stack <stack@duboce.net <mailto:stack@duboce.net
>>>> >
>>>>                         <mailto:stack@duboce.net
>>>>
>>>>                        <ma...@duboce.net>>> wrote:
>>>>
>>>>                             Slava Gorelik wrote:
>>>>
>>>>                                 Hi.Haven't tried yet them, i'll try
>>>>                        tomorrow morning. In
>>>>                                 general cluster is
>>>>                                 working well, the problems begins if
>>>>                        i'm trying to add 10M
>>>>                                 rows, after 1.2M
>>>>                                 if happened.
>>>>
>>>>                             Anything else running beside the
>>>>                        regionserver or datanodes
>>>>                             that would suck resources?  When
>>>>                        datanodes begin to slow, we
>>>>                             begin to see the issue Jean-Adrien's
>>>>                        configurations address.
>>>>                              Are you uploading using MapReduce?  Are
>>>>                        TTs running on same
>>>>                             nodes as the datanode and regionserver?
>>>>                         How are you doing the
>>>>                             upload?  Describe what your uploader
>>>>                        looks like (Sorry if
>>>>                             you've already done this).
>>>>
>>>>
>>>>                                  I already changed the limit of files
>>>>                        descriptors,
>>>>
>>>>                             Good.
>>>>
>>>>
>>>>                                  I'll try
>>>>                                 to change the properties:
>>>>                                  <property>
>>>>                        <name>dfs.datanode.socket.write.timeout</name>
>>>>                                  <value>0</value>
>>>>                                 </property>
>>>>
>>>>                                 <property>
>>>>                                  <name>dfs.datanode.max.xcievers</name>
>>>>                                  <value>1023</value>
>>>>                                 </property>
>>>>
>>>>
>>>>                             Yeah, try it.
>>>>
>>>>
>>>>                                 And let you know, is any other
>>>>                        prescriptions ? Did i miss
>>>>                                 something ?
>>>>
>>>>                                 BTW, off topic, but i sent e-mail
>>>>                        recently to the list and
>>>>                                 i can't see it:
>>>>                                 Is it possible to delete multiple
>>>>                        columns in any way by
>>>>                                 regex : for example
>>>>                                 colum_name_* ?
>>>>
>>>>                             Not that I know of.  If its not in the
>>>>                        API, it should be.
>>>>                              Mind filing a JIRA?
>>>>
>>>>                             Thanks Slava.
>>>>                             St.Ack
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Sorry, my mistake, I did it for the wrong user name. Thanks, updating now; I
will try again soon.


On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <sl...@gmail.com>wrote:

> Hi.Very strange, i see in limits.conf that it's upped.
> I attached the limits.conf, please have a  look, may be i did it wrong.
>
> Best Regards.
>
>
> On Thu, Oct 30, 2008 at 7:52 PM, stack <st...@duboce.net> wrote:
>
>> Thanks for the logs Slava.  I notice that you have not upped the ulimit on
>> your cluster.  See the head of your logs where we print out the ulimit.  Its
>> 1024.  This could be one cause of your grief especially when you seemingly
>> have many regions (>1000).  Please try upping it.
>> St.Ack
>>
>>
>>
>>
>> Slava Gorelik wrote:
>>
>>> Hi.
>>> I enabled DEBUG log level and now I'm sending all logs (archived)
>>> including fsck run result.
>>> Today my program starting to fail couple of minutes from the begin, it's
>>> very easy to reproduce the problem, cluster became very unstable.
>>>
>>> Best Regards.
>>>
>>>
>>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net <mailto:
>>> stack@duboce.net>> wrote:
>>>
>>>    See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>>
>>>    St.Ack
>>>
>>>
>>>    Slava Gorelik wrote:
>>>
>>>        Hi.First of all i want to say thank you for you assistance !!!
>>>
>>>
>>>        DEBUG on hadoop or hbase ? And how can i enable ?
>>>        fsck said that HDFS is healthy.
>>>
>>>        Best Regards and Thank You
>>>
>>>
>>>        On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net
>>>        <ma...@duboce.net>> wrote:
>>>
>>>
>>>            Slava Gorelik wrote:
>>>
>>>
>>>                Hi.HDFS capacity is about 800gb (8 datanodes) and the
>>>                current usage is
>>>                about
>>>                30GB. This is after total re-format of the HDFS that
>>>                was made a hour
>>>                before.
>>>
>>>                BTW, the logs i sent are from the first exception that
>>>                i found in them.
>>>                Best Regards.
>>>
>>>
>>>
>>>            Please enable DEBUG and retry.  Send me all logs.  What
>>>            does the fsck on
>>>            HDFS say?  There is something seriously wrong with your
>>>            cluster that you are
>>>            having so much trouble getting it running.  Lets try and
>>>            figure it.
>>>
>>>            St.Ack
>>>
>>>
>>>
>>>
>>>
>>>
>>>                On Tue, Oct 28, 2008 at 7:12 PM, stack
>>>                <stack@duboce.net <ma...@duboce.net>> wrote:
>>>
>>>
>>>
>>>
>>>                    I took a quick look Slava (Thanks for sending the
>>>                    files).   Here's a few
>>>                    notes:
>>>
>>>                    + The logs are from after the damage is done; the
>>>                    transition from good to
>>>                    bad is missing.  If I could see that, that would help
>>>                    + But what seems to be plain is that that your
>>>                    HDFS is very sick.  See
>>>                    this
>>>                    from head of one of the regionserver logs:
>>>
>>>                    2008-10-27 23:41:12,682 WARN
>>>                    org.apache.hadoop.dfs.DFSClient:
>>>                    DataStreamer
>>>                    Exception: java.io.IOException: Unable to create
>>>                    new block.
>>>                     at
>>>
>>>
>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>>                     at
>>>
>>>
>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>>                     at
>>>
>>>
>>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>>
>>>                    2008-10-27 23:41:12,682 WARN
>>>                    org.apache.hadoop.dfs.DFSClient: Error
>>>                    Recovery for block blk_-5188192041705782716_60000
>>>                    bad datanode[0]
>>>                    2008-10-27 23:41:12,685 ERROR
>>>
>>>  org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>                    Compaction/Split
>>>                    failed for region
>>>
>>>  BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>>                    java.io.IOException: Could not get block
>>>                    locations. Aborting...
>>>
>>>
>>>                    If HDFS is ailing, hbase is too.  In fact, the
>>>                    regionservers will shut
>>>                    themselves to protect themselves against damaging
>>>                    or losing data:
>>>
>>>                    2008-10-27 23:41:12,688 FATAL
>>>                    org.apache.hadoop.hbase.regionserver.Flusher:
>>>                    Replay of hlog required. Forcing server restart
>>>
>>>                    So, whats up with your HDFS?  Not enough space
>>>                    alloted?  What happens if
>>>                    you run "./bin/hadoop fsck /"?  Does that give you
>>>                    a clue as to what
>>>                    happened?  Dig in the datanode and namenode logs.
>>>                     Look for where the
>>>                    exceptions start.  It might give you a clue.
>>>
>>>                    + The suse regionserver log had garbage in it.
>>>
>>>                    St.Ack
>>>
>>>
>>>                    Slava Gorelik wrote:
>>>
>>>
>>>
>>>
>>>                        Hi.
>>>                        My happiness was very short :-( After i
>>>                        successfully added 1M rows (50k
>>>                        each row) i tried to add 10M rows.
>>>                        And after 3-4 working hours it started to
>>>                        dying. First one region server
>>>                        is died, after another one and eventually all
>>>                        cluster is dead.
>>>
>>>                        I attached log files (relevant part, archived)
>>>                        from region servers and
>>>                        from the master.
>>>
>>>                        Best Regards.
>>>
>>>
>>>
>>>                        On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <
>>>                        slava.gorelik@gmail.com
>>>                        <ma...@gmail.com><mailto:
>>>                        slava.gorelik@gmail.com
>>>                        <ma...@gmail.com>>> wrote:
>>>
>>>                         Hi.
>>>                         So far so good, after changing the file
>>>                        descriptors
>>>                         and dfs.datanode.socket.write.timeout,
>>>                        dfs.datanode.max.xcievers
>>>                         my cluster works stable.
>>>                         Thank You and Best Regards.
>>>
>>>                         P.S. Regarding deleting multiple columns
>>>                        missing functionality i
>>>                         filled jira :
>>>                        https://issues.apache.org/jira/browse/HBASE-961
>>>
>>>
>>>
>>>                         On Sun, Oct 26, 2008 at 12:58 AM, Michael
>>>                        Stack <stack@duboce.net <ma...@duboce.net>
>>>                         <mailto:stack@duboce.net
>>>
>>>                        <ma...@duboce.net>>> wrote:
>>>
>>>                             Slava Gorelik wrote:
>>>
>>>                                 Hi.Haven't tried yet them, i'll try
>>>                        tomorrow morning. In
>>>                                 general cluster is
>>>                                 working well, the problems begins if
>>>                        i'm trying to add 10M
>>>                                 rows, after 1.2M
>>>                                 if happened.
>>>
>>>                             Anything else running beside the
>>>                        regionserver or datanodes
>>>                             that would suck resources?  When
>>>                        datanodes begin to slow, we
>>>                             begin to see the issue Jean-Adrien's
>>>                        configurations address.
>>>                              Are you uploading using MapReduce?  Are
>>>                        TTs running on same
>>>                             nodes as the datanode and regionserver?
>>>                         How are you doing the
>>>                             upload?  Describe what your uploader
>>>                        looks like (Sorry if
>>>                             you've already done this).
>>>
>>>
>>>                                  I already changed the limit of files
>>>                        descriptors,
>>>
>>>                             Good.
>>>
>>>
>>>                                  I'll try
>>>                                 to change the properties:
>>>                                  <property>
>>>                        <name>dfs.datanode.socket.write.timeout</name>
>>>                                  <value>0</value>
>>>                                 </property>
>>>
>>>                                 <property>
>>>                                  <name>dfs.datanode.max.xcievers</name>
>>>                                  <value>1023</value>
>>>                                 </property>
>>>
>>>
>>>                             Yeah, try it.
>>>
>>>
>>>                                 And let you know, is any other
>>>                        prescriptions ? Did i miss
>>>                                 something ?
>>>
>>>                                 BTW, off topic, but i sent e-mail
>>>                        recently to the list and
>>>                                 i can't see it:
>>>                                 Is it possible to delete multiple
>>>                        columns in any way by
>>>                                 regex : for example
>>>                                 colum_name_* ?
>>>
>>>                             Not that I know of.  If its not in the
>>>                        API, it should be.
>>>                              Mind filing a JIRA?
>>>
>>>                             Thanks Slava.
>>>                             St.Ack
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. Very strange, I see in limits.conf that it's upped.
I attached the limits.conf; please have a look, maybe I did it wrong.
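
(In case it matters: the limit that counts is the one seen by the account
that actually starts the daemons. Assuming that user is 'clmanager', going
by the log file names, I can check it with

  su - clmanager -c 'ulimit -n'

and by looking at the head of the regionserver log where, as you say, the
ulimit is printed.)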

Best Regards.


On Thu, Oct 30, 2008 at 7:52 PM, stack <st...@duboce.net> wrote:

> Thanks for the logs Slava.  I notice that you have not upped the ulimit on
> your cluster.  See the head of your logs where we print out the ulimit.  Its
> 1024.  This could be one cause of your grief especially when you seemingly
> have many regions (>1000).  Please try upping it.
> St.Ack
>
>
>
>
> Slava Gorelik wrote:
>
>> Hi.
>> I enabled DEBUG log level and now I'm sending all logs (archived)
>> including fsck run result.
>> Today my program starting to fail couple of minutes from the begin, it's
>> very easy to reproduce the problem, cluster became very unstable.
>>
>> Best Regards.
>>
>>
>> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net <mailto:
>> stack@duboce.net>> wrote:
>>
>>    See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>>
>>    St.Ack
>>
>>
>>    Slava Gorelik wrote:
>>
>>        Hi.First of all i want to say thank you for you assistance !!!
>>
>>
>>        DEBUG on hadoop or hbase ? And how can i enable ?
>>        fsck said that HDFS is healthy.
>>
>>        Best Regards and Thank You
>>
>>
>>        On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net
>>        <ma...@duboce.net>> wrote:
>>
>>
>>            Slava Gorelik wrote:
>>
>>
>>                Hi.HDFS capacity is about 800gb (8 datanodes) and the
>>                current usage is
>>                about
>>                30GB. This is after total re-format of the HDFS that
>>                was made a hour
>>                before.
>>
>>                BTW, the logs i sent are from the first exception that
>>                i found in them.
>>                Best Regards.
>>
>>
>>
>>            Please enable DEBUG and retry.  Send me all logs.  What
>>            does the fsck on
>>            HDFS say?  There is something seriously wrong with your
>>            cluster that you are
>>            having so much trouble getting it running.  Lets try and
>>            figure it.
>>
>>            St.Ack
>>
>>
>>
>>
>>
>>
>>                On Tue, Oct 28, 2008 at 7:12 PM, stack
>>                <stack@duboce.net <ma...@duboce.net>> wrote:
>>
>>
>>
>>
>>                    I took a quick look Slava (Thanks for sending the
>>                    files).   Here's a few
>>                    notes:
>>
>>                    + The logs are from after the damage is done; the
>>                    transition from good to
>>                    bad is missing.  If I could see that, that would help
>>                    + But what seems to be plain is that that your
>>                    HDFS is very sick.  See
>>                    this
>>                    from head of one of the regionserver logs:
>>
>>                    2008-10-27 23:41:12,682 WARN
>>                    org.apache.hadoop.dfs.DFSClient:
>>                    DataStreamer
>>                    Exception: java.io.IOException: Unable to create
>>                    new block.
>>                     at
>>
>>
>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>>                     at
>>
>>
>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>>                     at
>>
>>
>>  org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>>
>>                    2008-10-27 23:41:12,682 WARN
>>                    org.apache.hadoop.dfs.DFSClient: Error
>>                    Recovery for block blk_-5188192041705782716_60000
>>                    bad datanode[0]
>>                    2008-10-27 23:41:12,685 ERROR
>>
>>  org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>                    Compaction/Split
>>                    failed for region
>>
>>  BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>>                    java.io.IOException: Could not get block
>>                    locations. Aborting...
>>
>>
>>                    If HDFS is ailing, hbase is too.  In fact, the
>>                    regionservers will shut
>>                    themselves to protect themselves against damaging
>>                    or losing data:
>>
>>                    2008-10-27 23:41:12,688 FATAL
>>                    org.apache.hadoop.hbase.regionserver.Flusher:
>>                    Replay of hlog required. Forcing server restart
>>
>>                    So, whats up with your HDFS?  Not enough space
>>                    alloted?  What happens if
>>                    you run "./bin/hadoop fsck /"?  Does that give you
>>                    a clue as to what
>>                    happened?  Dig in the datanode and namenode logs.
>>                     Look for where the
>>                    exceptions start.  It might give you a clue.
>>
>>                    + The suse regionserver log had garbage in it.
>>
>>                    St.Ack
>>
>>
>>                    Slava Gorelik wrote:
>>
>>
>>
>>
>>                        Hi.
>>                        My happiness was very short :-( After i
>>                        successfully added 1M rows (50k
>>                        each row) i tried to add 10M rows.
>>                        And after 3-4 working hours it started to
>>                        dying. First one region server
>>                        is died, after another one and eventually all
>>                        cluster is dead.
>>
>>                        I attached log files (relevant part, archived)
>>                        from region servers and
>>                        from the master.
>>
>>                        Best Regards.
>>
>>
>>
>>                        On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <
>>                        slava.gorelik@gmail.com
>>                        <ma...@gmail.com><mailto:
>>                        slava.gorelik@gmail.com
>>                        <ma...@gmail.com>>> wrote:
>>
>>                         Hi.
>>                         So far so good, after changing the file
>>                        descriptors
>>                         and dfs.datanode.socket.write.timeout,
>>                        dfs.datanode.max.xcievers
>>                         my cluster works stable.
>>                         Thank You and Best Regards.
>>
>>                         P.S. Regarding deleting multiple columns
>>                        missing functionality i
>>                         filled jira :
>>                        https://issues.apache.org/jira/browse/HBASE-961
>>
>>
>>
>>                         On Sun, Oct 26, 2008 at 12:58 AM, Michael
>>                        Stack <stack@duboce.net <ma...@duboce.net>
>>                         <mailto:stack@duboce.net
>>
>>                        <ma...@duboce.net>>> wrote:
>>
>>                             Slava Gorelik wrote:
>>
>>                                 Hi.Haven't tried yet them, i'll try
>>                        tomorrow morning. In
>>                                 general cluster is
>>                                 working well, the problems begins if
>>                        i'm trying to add 10M
>>                                 rows, after 1.2M
>>                                 if happened.
>>
>>                             Anything else running beside the
>>                        regionserver or datanodes
>>                             that would suck resources?  When
>>                        datanodes begin to slow, we
>>                             begin to see the issue Jean-Adrien's
>>                        configurations address.
>>                              Are you uploading using MapReduce?  Are
>>                        TTs running on same
>>                             nodes as the datanode and regionserver?
>>                         How are you doing the
>>                             upload?  Describe what your uploader
>>                        looks like (Sorry if
>>                             you've already done this).
>>
>>
>>                                  I already changed the limit of files
>>                        descriptors,
>>
>>                             Good.
>>
>>
>>                                  I'll try
>>                                 to change the properties:
>>                                  <property>
>>                        <name>dfs.datanode.socket.write.timeout</name>
>>                                  <value>0</value>
>>                                 </property>
>>
>>                                 <property>
>>                                  <name>dfs.datanode.max.xcievers</name>
>>                                  <value>1023</value>
>>                                 </property>
>>
>>
>>                             Yeah, try it.
>>
>>
>>                                 And let you know, is any other
>>                        prescriptions ? Did i miss
>>                                 something ?
>>
>>                                 BTW, off topic, but i sent e-mail
>>                        recently to the list and
>>                                 i can't see it:
>>                                 Is it possible to delete multiple
>>                        columns in any way by
>>                                 regex : for example
>>                                 colum_name_* ?
>>
>>                             Not that I know of.  If its not in the
>>                        API, it should be.
>>                              Mind filing a JIRA?
>>
>>                             Thanks Slava.
>>                             St.Ack
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Thanks for the logs Slava.  I notice that you have not upped the ulimit 
on your cluster.  See the head of your logs where we print out the 
ulimit.  It's 1024.  This could be one cause of your grief, especially 
since you seemingly have many regions (>1000).  Please try upping it.
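
Upping it usually means adding nofile entries to /etc/security/limits.conf
for whichever account actually runs the hadoop and hbase daemons, then
starting the daemons again from a fresh login so the new limit is picked
up.  A sketch (the user name and value are examples only; substitute your
own):

  clmanager  soft  nofile  32768
  clmanager  hard  nofile  32768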
St.Ack




Slava Gorelik wrote:
> Hi.
> I enabled DEBUG log level and now I'm sending all logs (archived) 
> including fsck run result.
> Today my program starting to fail couple of minutes from the begin, 
> it's very easy to reproduce the problem, cluster became very unstable.
>
> Best Regards.
>
>
> On Tue, Oct 28, 2008 at 11:05 PM, stack <stack@duboce.net 
> <ma...@duboce.net>> wrote:
>
>     See http://wiki.apache.org/hadoop/Hbase/FAQ#5
>
>     St.Ack
>
>
>     Slava Gorelik wrote:
>
>         Hi.First of all i want to say thank you for you assistance !!!
>
>
>         DEBUG on hadoop or hbase ? And how can i enable ?
>         fsck said that HDFS is healthy.
>
>         Best Regards and Thank You
>
>
>         On Tue, Oct 28, 2008 at 8:45 PM, stack <stack@duboce.net
>         <ma...@duboce.net>> wrote:
>
>          
>
>             Slava Gorelik wrote:
>
>                
>
>                 Hi.HDFS capacity is about 800gb (8 datanodes) and the
>                 current usage is
>                 about
>                 30GB. This is after total re-format of the HDFS that
>                 was made a hour
>                 before.
>
>                 BTW, the logs i sent are from the first exception that
>                 i found in them.
>                 Best Regards.
>
>
>                      
>
>             Please enable DEBUG and retry.  Send me all logs.  What
>             does the fsck on
>             HDFS say?  There is something seriously wrong with your
>             cluster that you are
>             having so much trouble getting it running.  Lets try and
>             figure it.
>
>             St.Ack
>
>
>
>
>
>                
>
>                 On Tue, Oct 28, 2008 at 7:12 PM, stack
>                 <stack@duboce.net <ma...@duboce.net>> wrote:
>
>
>
>                      
>
>                     I took a quick look Slava (Thanks for sending the
>                     files).   Here's a few
>                     notes:
>
>                     + The logs are from after the damage is done; the
>                     transition from good to
>                     bad is missing.  If I could see that, that would help
>                     + But what seems to be plain is that that your
>                     HDFS is very sick.  See
>                     this
>                     from head of one of the regionserver logs:
>
>                     2008-10-27 23:41:12,682 WARN
>                     org.apache.hadoop.dfs.DFSClient:
>                     DataStreamer
>                     Exception: java.io.IOException: Unable to create
>                     new block.
>                      at
>
>                     org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>                      at
>
>                     org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>                      at
>
>                     org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>
>                     2008-10-27 23:41:12,682 WARN
>                     org.apache.hadoop.dfs.DFSClient: Error
>                     Recovery for block blk_-5188192041705782716_60000
>                     bad datanode[0]
>                     2008-10-27 23:41:12,685 ERROR
>                     org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>                     Compaction/Split
>                     failed for region
>                     BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
>                     java.io.IOException: Could not get block
>                     locations. Aborting...
>
>
>                     If HDFS is ailing, hbase is too.  In fact, the
>                     regionservers will shut
>                     themselves to protect themselves against damaging
>                     or losing data:
>
>                     2008-10-27 23:41:12,688 FATAL
>                     org.apache.hadoop.hbase.regionserver.Flusher:
>                     Replay of hlog required. Forcing server restart
>
>                     So, whats up with your HDFS?  Not enough space
>                     alloted?  What happens if
>                     you run "./bin/hadoop fsck /"?  Does that give you
>                     a clue as to what
>                     happened?  Dig in the datanode and namenode logs.
>                      Look for where the
>                     exceptions start.  It might give you a clue.
>
>                     + The suse regionserver log had garbage in it.
>
>                     St.Ack
>
>
>                     Slava Gorelik wrote:
>
>
>
>                            
>
>                         Hi.
>                         My happiness was very short :-( After i
>                         successfully added 1M rows (50k
>                         each row) i tried to add 10M rows.
>                         And after 3-4 working hours it started to
>                         dying. First one region server
>                         is died, after another one and eventually all
>                         cluster is dead.
>
>                         I attached log files (relevant part, archived)
>                         from region servers and
>                         from the master.
>
>                         Best Regards.
>
>
>
>                         On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <
>                         slava.gorelik@gmail.com
>                         <ma...@gmail.com><mailto:
>                         slava.gorelik@gmail.com
>                         <ma...@gmail.com>>> wrote:
>
>                          Hi.
>                          So far so good, after changing the file
>                         descriptors
>                          and dfs.datanode.socket.write.timeout,
>                         dfs.datanode.max.xcievers
>                          my cluster works stable.
>                          Thank You and Best Regards.
>
>                          P.S. Regarding deleting multiple columns
>                         missing functionality i
>                          filled jira :
>                         https://issues.apache.org/jira/browse/HBASE-961
>
>
>
>                          On Sun, Oct 26, 2008 at 12:58 AM, Michael
>                         Stack <stack@duboce.net <ma...@duboce.net>
>                          <mailto:stack@duboce.net
>                         <ma...@duboce.net>>> wrote:
>
>                              Slava Gorelik wrote:
>
>                                  Hi.Haven't tried yet them, i'll try
>                         tomorrow morning. In
>                                  general cluster is
>                                  working well, the problems begins if
>                         i'm trying to add 10M
>                                  rows, after 1.2M
>                                  if happened.
>
>                              Anything else running beside the
>                         regionserver or datanodes
>                              that would suck resources?  When
>                         datanodes begin to slow, we
>                              begin to see the issue Jean-Adrien's
>                         configurations address.
>                               Are you uploading using MapReduce?  Are
>                         TTs running on same
>                              nodes as the datanode and regionserver?
>                          How are you doing the
>                              upload?  Describe what your uploader
>                         looks like (Sorry if
>                              you've already done this).
>
>
>                                   I already changed the limit of files
>                         descriptors,
>
>                              Good.
>
>
>                                   I'll try
>                                  to change the properties:
>                                   <property>
>                         <name>dfs.datanode.socket.write.timeout</name>
>                                   <value>0</value>
>                                  </property>
>
>                                  <property>
>                                   <name>dfs.datanode.max.xcievers</name>
>                                   <value>1023</value>
>                                  </property>
>
>
>                              Yeah, try it.
>
>
>                                  And let you know, is any other
>                         prescriptions ? Did i miss
>                                  something ?
>
>                                  BTW, off topic, but i sent e-mail
>                         recently to the list and
>                                  i can't see it:
>                                  Is it possible to delete multiple
>                         columns in any way by
>                                  regex : for example
>                                  colum_name_* ?
>
>                              Not that I know of.  If its not in the
>                         API, it should be.
>                               Mind filing a JIRA?
>
>                              Thanks Slava.
>                              St.Ack
>
>
>
>
>
>
>                                  
>
>                      
>
>                
>
>
>          
>
>
>


Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. I enabled the DEBUG log level and now I'm sending all logs (archived),
including the fsck run result.
Today my program started to fail within a couple of minutes of starting;
it's very easy to reproduce the problem, and the cluster became very unstable.
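
(For the archives, in case it helps someone else: turning on DEBUG for
hbase is typically one log4j line in conf/log4j.properties on each node,
plus a restart, e.g.

  log4j.logger.org.apache.hadoop.hbase=DEBUG

See also the FAQ entry quoted below.)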

Best Regards.


On Tue, Oct 28, 2008 at 11:05 PM, stack <st...@duboce.net> wrote:

> See http://wiki.apache.org/hadoop/Hbase/FAQ#5
> St.Ack

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
See http://wiki.apache.org/hadoop/Hbase/FAQ#5
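
(For the archives: that boils down to raising the log4j level for the HBase
classes in conf/log4j.properties, e.g. the lines below, and restarting the
daemons; same idea on the Hadoop side for the DFS classes. The exact logger
names depend on your version, so treat these as an example.)

log4j.logger.org.apache.hadoop.hbase=DEBUG
log4j.logger.org.apache.hadoop.dfs=DEBUG
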
St.Ack


Slava Gorelik wrote:
> Hi.First of all i want to say thank you for you assistance !!!
>
> DEBUG on hadoop or hbase ? And how can i enable ?
> fsck said that HDFS is healthy.
>
> Best Regards and Thank You


Re: Regionserver fails to serve region

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Slava,

http://wiki.apache.org/hadoop/Hbase/FAQ#5

J-D

On Tue, Oct 28, 2008 at 3:31 PM, Slava Gorelik <sl...@gmail.com>wrote:

> Hi.First of all i want to say thank you for you assistance !!!
>
> DEBUG on hadoop or hbase ? And how can i enable ?
> fsck said that HDFS is healthy.
>
> Best Regards and Thank You
>

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. First of all I want to say thank you for your assistance!!!

DEBUG on Hadoop or HBase? And how can I enable it?
fsck said that HDFS is healthy.

Best Regards and Thank You


On Tue, Oct 28, 2008 at 8:45 PM, stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> Hi.HDFS capacity is about 800gb (8 datanodes) and the current usage is
>> about
>> 30GB. This is after total re-format of the HDFS that was made a hour
>> before.
>>
>> BTW, the logs i sent are from the first exception that i found in them.
>> Best Regards.
>>
>>
> Please enable DEBUG and retry.  Send me all logs.  What does the fsck on
> HDFS say?  There is something seriously wrong with your cluster that you are
> having so much trouble getting it running.  Lets try and figure it.
>
> St.Ack

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Slava Gorelik wrote:
> Hi.HDFS capacity is about 800gb (8 datanodes) and the current usage is about
> 30GB. This is after total re-format of the HDFS that was made a hour before.
>
> BTW, the logs i sent are from the first exception that i found in them.
> Best Regards.
>   
Please enable DEBUG and retry.  Send me all the logs.  What does fsck on 
HDFS say?  There is something seriously wrong with your cluster for you 
to be having so much trouble getting it running.  Let's try to figure it out.

St.Ack





Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. HDFS capacity is about 800 GB (8 datanodes) and the current usage is about
30 GB. This is after a total re-format of the HDFS that was done an hour before.

BTW, the logs I sent start from the first exception that I found in them.
Best Regards.



Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
I took a quick look Slava (Thanks for sending the files).   Here's a few 
notes:

+ The logs are from after the damage is done; the transition from good 
to bad is missing.  If I could see that, that would help.
+ But what seems to be plain is that your HDFS is very sick.  See 
this from the head of one of the regionserver logs:

2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: 
DataStreamer Exception: java.io.IOException: Unable to create new block.
    at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
    at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
    at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)

2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error 
Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
2008-10-27 23:41:12,685 ERROR 
org.apache.hadoop.hbase.regionserver.CompactSplitThread: 
Compaction/Split failed for region 
BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
java.io.IOException: Could not get block locations. Aborting...


If HDFS is ailing, hbase is too.  In fact, the regionservers will shut 
themselves down to protect against damaging or losing data:

2008-10-27 23:41:12,688 FATAL 
org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. 
Forcing server restart

So, what's up with your HDFS?  Not enough space allotted?  What happens if 
you run "./bin/hadoop fsck /" (a more verbose invocation is sketched below)?  
Does that give you a clue as to what happened?  Dig in the datanode and 
namenode logs.  Look for where the exceptions start.  It might give you a clue.

+ The suse regionserver log had garbage in it.
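
The more verbose fsck mentioned above would be along these lines (stock fsck
flags; "/" covers the whole filesystem):

./bin/hadoop fsck / -files -blocks -locations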

St.Ack


Slava Gorelik wrote:
> Hi.
> My happiness was very short :-( After i successfully added 1M rows 
> (50k each row) i tried to add 10M rows.
> And after 3-4 working hours it started to dying. First one region 
> server is died, after another one and eventually all cluster is dead.
>
> I attached log files (relevant part, archived) from region servers and 
> from the master.
>
> Best Regards.


Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. My happiness was very short :-( After I successfully added 1M rows (50k
each) I tried to add 10M rows.
After 3-4 hours of work it started dying: first one region server died, then
another one, and eventually the whole cluster was dead.

I attached log files (relevant part, archived) from region servers and from
the master.

Best Regards.



On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <sl...@gmail.com>wrote:

> Hi.So far so good, after changing the file descriptors
> and dfs.datanode.socket.write.timeout, dfs.datanode.max.xcievers my cluster
> works stable.
>
> Thank You and Best Regards.
>
> P.S. Regarding deleting multiple columns missing functionality i filled
> jira : https://issues.apache.org/jira/browse/HBASE-961

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. So far so good: after changing the file descriptor limit and the
dfs.datanode.socket.write.timeout and dfs.datanode.max.xcievers properties,
my cluster is running stably.

Thank You and Best Regards.

P.S. Regarding the missing delete-multiple-columns functionality, I filed a
JIRA: https://issues.apache.org/jira/browse/HBASE-961
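
For anyone hitting the same file descriptor limit, the change is typically
something along these lines on Linux (the user name and values below are only
an example; log in again afterwards and verify with "ulimit -n"):

  # /etc/security/limits.conf, for the user running the Hadoop/HBase daemons
  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768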




Re: Regionserver fails to serve region

Posted by Michael Stack <st...@duboce.net>.
Slava Gorelik wrote:
> Hi.Haven't tried yet them, i'll try tomorrow morning. In general cluster is
> working well, the problems begins if i'm trying to add 10M rows, after 1.2M
> if happened.
Anything else running besides the regionserver or datanodes that would 
suck resources?  When datanodes begin to slow, we begin to see the issue 
Jean-Adrien's configurations address.  Are you uploading using 
MapReduce?  Are TTs running on same nodes as the datanode and 
regionserver?  How are you doing the upload?  Describe what your 
uploader looks like (Sorry if you've already done this).

>  I already changed the limit of files descriptors,
Good.

>  I'll try
> to change the properties:
>  <property> <name>dfs.datanode.socket.write.timeout</name>
>  <value>0</value>
> </property>
>
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>1023</value>
> </property>
>
>   
Yeah, try it.

> And let you know, is any other prescriptions ? Did i miss something ?
>
> BTW, off topic, but i sent e-mail recently to the list and i can't see it:
> Is it possible to delete multiple columns in any way by regex : for example
> colum_name_* ?
>   
Not that I know of.  If it's not in the API, it should be.  Mind filing a 
JIRA?

Thanks Slava.
St.Ack

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. I haven't tried them yet; I'll try tomorrow morning. In general the cluster
is working well; the problems begin when I try to add 10M rows (it happened
after about 1.2M). I already changed the limit on file descriptors, and I'll
try to change the properties:
 <property> <name>dfs.datanode.socket.write.timeout</name>
 <value>0</value>
</property>

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1023</value>
</property>

And I'll let you know. Are there any other prescriptions? Did I miss something?

BTW, off topic, but I sent an e-mail to the list recently and I can't see it:
is it possible to delete multiple columns in any way by regex, for example
column_name_* ?

Best Regards.

On Sun, Oct 26, 2008 at 12:03 AM, Michael Stack <st...@duboce.net> wrote:

> Does your cluster still not work Slava?  Have you seen the recent
> prescription from Jean-Adrien on the list for a problem that looks related?
> St.Ack

Re: Regionserver fails to serve region

Posted by Michael Stack <st...@duboce.net>.
Does your cluster still not work Slava?  Have you seen the recent 
prescription from Jean-Adrien on the list for a problem that looks related?
St.Ack

Slava Gorelik wrote:
> Hi.Most of the time i get Premeture [sic] EOF from inputStream , some times
> it also "No live nodes contain current block".
> No, I don't have memory issue.
>
> Best Regards.


Re: Regionserver fails to serve region

Posted by Jean-Adrien <ad...@jeanjean.ch>.
Hello again.


stack-3 wrote:
> 
> 
>  
> I could be wrong, but I don't see how.  You are running start-dfs.sh 
> over in HADOOP_HOME, not in HBASE_HOME.  Unless you somehow have 
> CLASSPATHs intermingled, datanode startup should not be picking up 
> content of HBASE_HOME/conf.
> 
> 

Yes, in fact that was my mistake: I hadn't run the finalizeUpgrade admin
command on Hadoop. The rollback files took a lot of space, which may be
related. Anyway, all the other Hadoop settings are working.
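
For the record, finalizing is a single dfsadmin call run against the namenode,
once you are sure you will not need to roll back:

  bin/hadoop dfsadmin -finalizeUpgrade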


stack-3 wrote:
> 
> 
> I owe you other answers/support.  In particular, I need to try running 
> dfs.datanode.socket.write.timeout = 0 to see if I get same problem as 
> you.  Let me know if anything else you'd have me try.
> 
> 

Finally I set the parameter increasing the limit of xcievers (HADOOP-3633 /
HADOOP-3859), because this limit is reached during HBase startup if I disable
the channel timeout, as I wrote in my previous message.

Then, using both properties:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

<property>
   <name>dfs.datanode.max.xcievers</name>
   <value>1023</value>
</property>

Hadoop has been running with this configuration for 12 hours without any trace
of HADOOP-3831, and therefore no "Premeture" in HBase.

I don't know why this timeout happens without these parameters in my
configuration (probably the hardware is the cause).

What could be done on the HBase side is maybe to try to reproduce such a
Hadoop failure (e.g. by giving low values to these Hadoop parameters) and to
see why the regions that suffer one of these failures are not accessible
anymore until HBase is restarted.

Another thing I changed, which may be linked to the Hadoop behaviour: I added
the -server parameter to the JVM, since by default the client VM was used on a
1 GB RAM / single-processor machine.
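
If it helps anyone, one way to pass the flag is through the env scripts, e.g.
conf/hbase-env.sh and conf/hadoop-env.sh; whether your version's bin scripts
honour these exact variables is an assumption to verify first:

  export HBASE_OPTS="-server $HBASE_OPTS"
  export HADOOP_OPTS="-server $HADOOP_OPTS"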


stack-3 wrote:
> 
> 
> Thanks for all the excellent diagnosis.
> St.Ack
> 
> 

You're welcome. It's a pleasure to contribute. Hope that helps.
Have a nice day.

-- J.-A.

-- 
View this message in context: http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20126171.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: Regionserver fails to serve region

Posted by Michael Stack <st...@duboce.net>.
Jean-Adrien wrote:
> ..
> Stack, you ask me if my hard disks were full. I said one is. Why did you
> link the above problem with that. Because of the du problem noticed in
> HADOOP-3232 ? I don't think I'm affected by this problem, my BlockReport
> process duration is less than a second. 
>   
We were seeing HADOOP-3831 on our cluster (hadoop 0.18.0 and hbase 
0.18.1RC1).  After a rebalance of the hdfs content, brought on by the 
observation that loading was lopsided, the issue went away.   Thought -- 
not proven -- is that the lopsidedness was causing disks to fill which 
eventually led to 3831.
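
For reference, HDFS ships a balancer for that kind of lopsidedness; a typical
invocation is below, and the threshold (in percent) is only an example value:

  bin/start-balancer.sh -threshold 10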

...
> Another question by the way:
> We saw that the hadoop-default.xml is used by hbase client, it overrides the
> replication factor; ok. But could it override the dfs.datanode.du.reserved /
> dfs.datanode.pct properties ? (which sounds to be policy of datanode rather
> than client). I said that my settings doesn't seem to affect the behaviour
> of datanodes.
>   
I could be wrong, but I don't see how.  You are running start-dfs.sh 
over in HADOOP_HOME, not in HBASE_HOME.  Unless you somehow have 
CLASSPATHs intermingled, datanode startup should not be picking up 
content of HBASE_HOME/conf.

I owe you other answers/support.  In particular, I need to try running 
dfs.datanode.socket.write.timeout = 0 to see if I get the same problem as 
you.  Let me know if there is anything else you'd have me try.

Thanks for all the excellent diagnosis.
St.Ack

Re: Regionserver fails to serve region

Posted by Jean-Adrien <ad...@jeanjean.ch>.
I made more tests.

Regarding HADOOP-3831, it is possible to disable the channel timeout using
the property:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

I tried this, but I couldn't launch HBase anymore.
During the startup phase a lot of accesses are made to the mapfiles of the
-ROOT- region, and it looks like the sockets are not closed: the regionserver
responsible for serving the -ROOT- region suddenly fails to get the concerned
mapfile blocks because of:

2008-10-21 14:21:09,212 ERROR org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.11:50010,
storageID=DS-316339081-192.168.1.11-50010-1218034818875, infoPort=50075,
ipcPort=50020):DataXceiver: java.io.IOException: xceiverCount 257 exceeds
the limit of concurrent xcievers 256
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1030)
        at java.lang.Thread.run(Thread.java:619)

Which corresponds to the limit introduced in HADOOP-3633.

Maybe the "Premeture" error caused by the channel timeout comes from a
socket that is not closed by HBase regionserver ?

It is possible to configure this xcievers limit (see HADOOP-3859), but the
name of the parameter is some kind of secret. Anyway, depending on the cause
of the high number of concurrent accesses, it may be useless to increase this
parameter. So I returned to my previous configuration, removing the
dfs.datanode.socket.write.timeout=0 property.

I noticed there is a lot of work done at HBase startup time. I thought it was
only the regionservers opening regions, but it seems that the longer the
cluster has been running, the more files there are to process. Is that
correct?

Have a nice day.

-- J.-A.



-- 
View this message in context: http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20094637.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: Regionserver fails to serve region

Posted by Jean-Adrien <ad...@jeanjean.ch>.
Hi,

I made some more observations about the (my) "Premeture" problem.
It is clearly the problem described in HADOOP-3831:
http://issues.apache.org/jira/browse/HADOOP-3831

The datanodes time out on the channel they open with HBase (8 min); see the
datanode log below.
Sometimes this error is reported to my client, when it happens during one of
my requests, but it happens on several other occasions as well, as seen in the
regionserver log (about every 10 minutes).

I said that restarting HBase restores access to the region; in fact,
restarting my client is enough.

Since neither my client nor HBase should be preparing data for 8 minutes, I
believe it is either
- an I/O throughput problem in my case, or
- a kind of deadlock in the channel; but other people would have noticed it.

I monitor the I/O and CPU using iostat (10-second interval) and the Hadoop
datanode log, and I have:


---- datanode log ----
2008-10-21 11:11:56,766 WARN org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.10:50010,
storageID=DS-969720570-192.168.1.10-50010-1218034818982, infoPort=50075,
ipcPort=50020):Got exception while serving blk_-321855630121782024_300805 to
/192.168.1.10:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.1.10:50010
remote=/192.168.1.10:44764]
        at
org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at
org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at
org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at
org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

2008-10-21 11:11:56,767 ERROR org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.10:50010,
storageID=DS-969720570-192.168.1.10-50010-1218034818982, infoPort=50075,
ipcPort=50020):DataXceiver: java.net.SocketTimeoutException: 480000 millis
timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.1.10:50010
remote=/192.168.1.10:44764]
        at
org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at
org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at
org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at
org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

[...]

2008-10-21 11:15:52,614 WARN org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.10:50010,
storageID=DS-969720570-192.168.1.10-50010-1218034818982, infoPort=50075,
ipcPort=50020):Got exception while serving blk_6873767988458539960_302970 to
/192.168.1.10:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.1.10:50010
remote=/192.168.1.10:45482]
        at
org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at
org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at
org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at
org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

2008-10-21 11:15:52,615 ERROR org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.1.10:50010,
storageID=DS-969720570-192.168.1.10-50010-1218034818982, infoPort=50075,
ipcPort=50020):DataXceiver: java.net.SocketTimeoutException: 480000 millis
timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.1.10:50010
remote=/192.168.1.10:45482]
        at
org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at
org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at
org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at
org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at
org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

Between these messages I have bursts of apparently normal operation, and
pauses of 2-3 minutes.

e.g.
2008-10-21 11:18:23,126 INFO org.apache.hadoop.dfs.DataNode: Received block
blk_5598767900914020531_303199 of size 9 from /192.168.1.11
2008-10-21 11:18:23,126 INFO org.apache.hadoop.dfs.DataNode: PacketResponder
0 for block blk_5598767900914020531_303199 terminating
2008-10-21 11:18:23,372 INFO org.apache.hadoop.dfs.DataNode: Receiving block
blk_1637433000135864223_303201 src: /192.168.1.13:60729 dest: /192.

If we look at iostat during the same period, I typically observe:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.40    0.00    3.50    0.00    0.00   82.10

i.e. no iowait, and mostly idle during the last 10 minutes, which makes me
think that it is not a performance problem.

Then I thought about what is special in my cluster, and I notice that I often
update values in HBase tables by doing batch updates with an existing
timestamp; but I cannot see the correlation with our problem, and moreover the
updates work and the data are not corrupted.

Another difference is that I haven't finalized my Hadoop upgrade yet... Once
again, I can't see any correlation. Anyway, these could be clues.

To be continued...


Stack, you asked me if my hard disks were full. I said one is. Why did you
link the above problem with that? Because of the du problem noticed in
HADOOP-3232? I don't think I'm affected by that problem; my BlockReport
process duration is less than a second.

Note that the other nodes still have hard disk space remaining, and I
observe the same channel timeout problem on them.
My next step is to monitor IO activity and try to see whether there is a
correlation between the failures and some potential overload.

Another question, by the way:
We saw that the hadoop-default.xml embedded in the hbase distribution is
used by the hbase client and overrides the replication factor; ok. But could
it also override the dfs.datanode.du.reserved / dfs.datanode.du.pct
properties? (Those sound like datanode policy rather than client settings.)
I mentioned that my settings don't seem to affect the behaviour of the
datanodes.
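
To check what the client side actually resolves, I might run a small test
like the one below (a sketch only; I'm assuming a plain Configuration picks
up the hadoop-default.xml / hadoop-site.xml that come first on the HBase
classpath). As far as I understand, dfs.replication is applied client-side
at file creation time, whereas dfs.datanode.du.reserved is read only by the
datanode process from its own configuration, so a client-side value should
not matter for it; but I'd like to confirm.

import org.apache.hadoop.conf.Configuration;

// Sketch: print the DFS settings the HBase client would actually see.
// Run it with the HBase classpath so the same hadoop-default.xml /
// hadoop-site.xml are picked up as in the regionserver JVM.
public class PrintClientDfsConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();  // loads hadoop-default.xml, then hadoop-site.xml
    System.out.println("dfs.replication          = " + conf.get("dfs.replication"));
    System.out.println("dfs.datanode.du.reserved = " + conf.get("dfs.datanode.du.reserved"));
  }
}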


Have a good day.
-- Jean-Adrien


Slava Gorelik wrote:
> 
> Hi.Most of the time i get Premeture [sic] EOF from inputStream , some
> times
> it also "No live nodes contain current block".
> No, I don't have memory issue.
> 
> Best Regards.
> 
> On Mon, Oct 20, 2008 at 7:46 PM, stack <st...@duboce.net> wrote:
> 
>> Slava Gorelik wrote:
>>
>>> Hi.I have similar problem.
>>> My configuration is 8 machines with 4gb ram with default heap size for
>>> hbase.
>>>
>>>
>>
>> Which part Slava?  You ran out of disk and you started to get "Premeture
>> [sic] EOF from inputStream"?  Or the NPEs?  Or you are seeing "No live
>> nodes
>> contain current block"?  You don't have J-A's memory issues I presume?
>>
>> St.Ack
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20086165.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. Most of the time I get "Premeture [sic] EOF from inputStream"; sometimes
I also get "No live nodes contain current block".
No, I don't have a memory issue.

Best Regards.

On Mon, Oct 20, 2008 at 7:46 PM, stack <st...@duboce.net> wrote:

> Slava Gorelik wrote:
>
>> Hi.I have similar problem.
>> My configuration is 8 machines with 4gb ram with default heap size for
>> hbase.
>>
>>
>
> Which part Slava?  You ran out of disk and you started to get "Premeture
> [sic] EOF from inputStream"?  Or the NPEs?  Or you are seeing "No live nodes
> contain current block"?  You don't have J-A's memory issues I presume?
>
> St.Ack
>

Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
Slava Gorelik wrote:
> Hi.I have similar problem.
> My configuration is 8 machines with 4gb ram with default heap size for
> hbase.
>   

Which part Slava?  You ran out of disk and you started to get "Premeture 
[sic] EOF from inputStream"?  Or the NPEs?  Or are you seeing "No live 
nodes contain current block"?  You don't have J-A's memory issues, I presume?

St.Ack

Re: Regionserver fails to serve region

Posted by Slava Gorelik <sl...@gmail.com>.
Hi. I have a similar problem.
My configuration is 8 machines with 4 GB of RAM, with the default heap size
for HBase.


On Mon, Oct 20, 2008 at 11:38 AM, Jean-Adrien <ad...@jeanjean.ch> wrote:

>
>
> stack-3 wrote:
> >
> > First, see the Jon Gray response.  His postulate that the root of your
> > issues are machines swapping seems likely to me.
> >
> >
> > See below for some particular answers to your queries (thanks for the
> > detail).
> >
> > Jean-Adrien wrote:
> >> The attempts of above can be:
> >> 1.
> >> java.io.IOException: java.io.IOException: Premeture EOF from inputStream
> >>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
> >>
> >
> > Did you say your disks had filled?  If so, this is likely cause of above
> > (but on our cluster here, we've also been seeing the above and are
> > looking at HADOOP-3831)
> >
> >
>
> Yes one is.
>
>
> stack-3 wrote:
> >
> >> 2-10
> >> java.io.IOException: java.io.IOException: java.lang.NullPointerException
> >>         at
> >> org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)
> >>
> >>
> > Was there more stacktrace on this error?  May I see it?  Above should
> > never happen (smile).
> >
>
> Sure. Enjoy. Take in account that it's happen after the above Premeture
> EOF.
>
>
> 2008-10-14 14:23:55,705 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 7 on 60020, call getRow([B@17dc1ef, [B@1474316, null,
> 9223372036854775807, -1) from 192.168.1.10:49676: error:
> java.io.IOException: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>        at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)
>        at
>
> org.apache.hadoop.hbase.HStoreKey$HStoreKeyWritableComparator.compare(HStoreKey.java:593)
>        at
> org.apache.hadoop.io.MapFile$Reader.seekInternal(MapFile.java:436)
>        at org.apache.hadoop.io.MapFile$Reader.getClosest(MapFile.java:558)
>        at org.apache.hadoop.io.MapFile$Reader.getClosest(MapFile.java:541)
>        at
>
> org.apache.hadoop.hbase.regionserver.HStoreFile$BloomFilterMapFile$Reader.getClosest(HStoreFile.java:761)
>        at
>
> org.apache.hadoop.hbase.regionserver.HStore.getFullFromMapFile(HStore.java:1179)
>        at
> org.apache.hadoop.hbase.regionserver.HStore.getFull(HStore.java:1160)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.getFull(HRegion.java:1221)
>        at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRow(HRegionServer.java:1036)
>        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:554)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>
>
> stack-3 wrote:
> >
> >
> >> Another 10 attempts scenario I have seen:
> >> 1-10:
> >> IPC Server handler 3 on 60020, call getRow([B@1ec7483, [B@d54a92, null,
> >> 1224105427910, -1) from 192.168.1.11:55371: error: java.io.IOException:
> >> Cannot open filename
> >> /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
> >> java.io.IOException: Cannot open filename
> >> /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
> >>         at
> >>
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1171)
> >>
> >> Preceded, in concerned regionsserver log, by the line:
> >>
> >> 2008-10-15 23:19:30,461 INFO org.apache.hadoop.dfs.DFSClient: Could not
> >> obtain block blk_-3759213227484579481_226277 from any node:
> >> java.io.IOException: No live nodes contain current block
> >>
> >>
> > hdfs is hosed; it lost a block from the named file.  If hdfs is hosed,
> > so is hbase.
> >
> >
> >> If I look for this block in the hadoop master log I can find
> >>
> >> 2008-10-15 23:03:45,276 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> >> ask
> >> 192.168.1.13:50010 to delete  [...] blk_-3759213227484579481_226277
> [...]
> >> (many more blocks)
> >>
> >
> > This is interesting.  I wonder why hdfs is deleting a block that
> > subsequently a regionserver is trying to use?   Can you correlate the
> > blocks' story with hbase actions?  (Thats probably an unfair question to
> > ask since it would require deep detective work on hbase logs trying to
> > trace the file whose block is missing and its hosting region as it moved
> > around the cluster).
> >
> >
>
> I have noticed no correlation for now. I'll try to play the detective a
> bit.
> If I notice something, I'll post it there.
>
>
> stack-3 wrote:
> >
> >
> >
> >> about 16 min before.
> >> In both cases the regionserver fails to serve the concerned region until
> >> I
> >> restart hbase (not hadoop).
> >>
> >>
> > Not hadoop?  And if you ran an fsck on the filesystem, its healthy?
> >
> >
>
> Not hadoop. Fsck says it's healthly.
>
>
> stack-3 wrote:
> >
> >
> >> One last question by the way:
> >> Why the replication factor of my hbase files in dfs is 3, when my hadoop
> >> cluster is configured to keep only 2 copies ?
> >>
> > See http://wiki.apache.org/hadoop/Hbase/FAQ#12.
> >
> >> Is it because the default (hadoop-default.xml) config file of the hadoop
> >> client, which is embedded in hbase distrib overrides the cluster
> >> configuration for the mapfiles created ?
> > Yes.
> >
> > Thanks for the questions J-A.
> > St.Ack
> >
> >
>
> Thank you too.
>
> --
> View this message in context:
> http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20066104.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
>

Re: Regionserver fails to serve region

Posted by Jean-Adrien <ad...@jeanjean.ch>.

stack-3 wrote:
> 
> First, see the Jon Gray response.  His postulate that the root of your 
> issues are machines swapping seems likely to me.
> 
> 
> See below for some particular answers to your queries (thanks for the 
> detail).
> 
> Jean-Adrien wrote:
>> The attempts of above can be:
>> 1.
>> java.io.IOException: java.io.IOException: Premeture EOF from inputStream
>>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>   
> 
> Did you say your disks had filled?  If so, this is likely cause of above 
> (but on our cluster here, we've also been seeing the above and are 
> looking at HADOOP-3831)
> 
> 

Yes, one is. 


stack-3 wrote:
> 
>> 2-10
>> java.io.IOException: java.io.IOException: java.lang.NullPointerException
>>         at
>> org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)
>>
>>   
> Was there more stacktrace on this error?  May I see it?  Above should 
> never happen (smile).
> 

Sure. Enjoy. Take into account that it happens after the above Premeture EOF.


2008-10-14 14:23:55,705 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 60020, call getRow([B@17dc1ef, [B@1474316, null,
9223372036854775807, -1) from 192.168.1.10:49676: error:
java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)
        at
org.apache.hadoop.hbase.HStoreKey$HStoreKeyWritableComparator.compare(HStoreKey.java:593)
        at
org.apache.hadoop.io.MapFile$Reader.seekInternal(MapFile.java:436)
        at org.apache.hadoop.io.MapFile$Reader.getClosest(MapFile.java:558)
        at org.apache.hadoop.io.MapFile$Reader.getClosest(MapFile.java:541)
        at
org.apache.hadoop.hbase.regionserver.HStoreFile$BloomFilterMapFile$Reader.getClosest(HStoreFile.java:761)
        at
org.apache.hadoop.hbase.regionserver.HStore.getFullFromMapFile(HStore.java:1179)
        at
org.apache.hadoop.hbase.regionserver.HStore.getFull(HStore.java:1160)
        at
org.apache.hadoop.hbase.regionserver.HRegion.getFull(HRegion.java:1221)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRow(HRegionServer.java:1036)
        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:554)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)


stack-3 wrote:
> 
> 
>> Another 10 attempts scenario I have seen:
>> 1-10:
>> IPC Server handler 3 on 60020, call getRow([B@1ec7483, [B@d54a92, null,
>> 1224105427910, -1) from 192.168.1.11:55371: error: java.io.IOException:
>> Cannot open filename
>> /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
>> java.io.IOException: Cannot open filename
>> /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
>>         at
>> org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1171)
>>
>> Preceded, in concerned regionsserver log, by the line:
>>
>> 2008-10-15 23:19:30,461 INFO org.apache.hadoop.dfs.DFSClient: Could not
>> obtain block blk_-3759213227484579481_226277 from any node: 
>> java.io.IOException: No live nodes contain current block
>>
>>   
> hdfs is hosed; it lost a block from the named file.  If hdfs is hosed, 
> so is hbase.
> 
> 
>> If I look for this block in the hadoop master log I can find
>>
>> 2008-10-15 23:03:45,276 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
>> ask
>> 192.168.1.13:50010 to delete  [...] blk_-3759213227484579481_226277 [...]
>> (many more blocks)
>>   
> 
> This is interesting.  I wonder why hdfs is deleting a block that 
> subsequently a regionserver is trying to use?   Can you correlate the 
> blocks' story with hbase actions?  (Thats probably an unfair question to 
> ask since it would require deep detective work on hbase logs trying to 
> trace the file whose block is missing and its hosting region as it moved 
> around the cluster).
> 
> 

I have noticed no correlation so far. I'll try to play detective a bit.
If I notice something, I'll post it here.


stack-3 wrote:
> 
> 
> 
>> about 16 min before.
>> In both cases the regionserver fails to serve the concerned region until
>> I
>> restart hbase (not hadoop).
>>
>>   
> Not hadoop?  And if you ran an fsck on the filesystem, its healthy?
> 
> 

Not hadoop. Fsck says it's healthy. 


stack-3 wrote:
> 
> 
>> One last question by the way:
>> Why the replication factor of my hbase files in dfs is 3, when my hadoop
>> cluster is configured to keep only 2 copies ?
>>   
> See http://wiki.apache.org/hadoop/Hbase/FAQ#12.
> 
>> Is it because the default (hadoop-default.xml) config file of the hadoop
>> client, which is embedded in hbase distrib overrides the cluster
>> configuration for the mapfiles created ?
> Yes.
> 
> Thanks for the questions J-A.
> St.Ack
> 
> 

Thank you too.

-- 
View this message in context: http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20066104.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: Regionserver fails to serve region

Posted by stack <st...@duboce.net>.
First, see the Jon Gray response.  His postulate that the root of your 
issues is machines swapping seems likely to me.

See below for some particular answers to your queries (thanks for the 
detail).

Jean-Adrien wrote:
> The attempts of above can be:
> 1.
> java.io.IOException: java.io.IOException: Premeture EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>   

Did you say your disks had filled?  If so, this is the likely cause of the 
above (but on our cluster here, we've also been seeing it and are 
looking at HADOOP-3831)

> 2-10
> java.io.IOException: java.io.IOException: java.lang.NullPointerException
>         at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)
>
>   
Was there more stacktrace on this error?  May I see it?  Above should 
never happen (smile).

...

> Another 10 attempts scenario I have seen:
> 1-10:
> IPC Server handler 3 on 60020, call getRow([B@1ec7483, [B@d54a92, null,
> 1224105427910, -1) from 192.168.1.11:55371: error: java.io.IOException:
> Cannot open filename
> /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
> java.io.IOException: Cannot open filename
> /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
>         at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1171)
>
> Preceded, in concerned regionsserver log, by the line:
>
> 2008-10-15 23:19:30,461 INFO org.apache.hadoop.dfs.DFSClient: Could not
> obtain block blk_-3759213227484579481_226277 from any node: 
> java.io.IOException: No live nodes contain current block
>
>   
hdfs is hosed; it lost a block from the named file.  If hdfs is hosed, 
so is hbase.


> If I look for this block in the hadoop master log I can find
>
> 2008-10-15 23:03:45,276 INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask
> 192.168.1.13:50010 to delete  [...] blk_-3759213227484579481_226277 [...]
> (many more blocks)
>   

This is interesting.  I wonder why hdfs is deleting a block that a 
regionserver subsequently tries to use?   Can you correlate the 
block's story with hbase actions?  (That's probably an unfair question to 
ask since it would require deep detective work on hbase logs, trying to 
trace the file whose block is missing and its hosting region as it moved 
around the cluster).
> about 16 min before.
> In both cases the regionserver fails to serve the concerned region until I
> restart hbase (not hadoop).
>
>   
Not hadoop?  And if you ran an fsck on the filesystem, it's healthy?

> One last question by the way:
> Why the replication factor of my hbase files in dfs is 3, when my hadoop
> cluster is configured to keep only 2 copies ?
>   
See http://wiki.apache.org/hadoop/Hbase/FAQ#12.

> Is it because the default (hadoop-default.xml) config file of the hadoop
> client, which is embedded in hbase distrib overrides the cluster
> configuration for the mapfiles created ?
Yes.

Thanks for the questions J-A.
St.Ack

RE: Regionserver fails to serve region

Posted by Jean-Adrien <ad...@jeanjean.ch>.
Sure, I saw it; excuse me, I wrote this one a day before I posted it, and you
answered my first mail in the meantime.
Anyway, I'll fix my cluster setup in order to have a better memory
allocation, and ensure there is no swapping that overloads the IO.

Thanks. 
Have a nice day




Jonathan Gray-8 wrote:
> 
> Jean-Adrien,
> 
> Did you see my reply to your previous email?
> 
> I think your machines are underpowered for your current setup and it's
> creating all kinds of problems.  If you have swapping going on in a
> regionserver/datanode, that must be addressed because it usually leads to
> odd behavior in hdfs, timeouts, starvation, etc...
> 
> Decrease your allotted heap sizes to fit within available memory, or add
> more memory.
> 
> JG
> 
> -----Original Message-----
> From: Jean-Adrien [mailto:adv1@jeanjean.ch] 
> Sent: Friday, October 17, 2008 1:02 AM
> To: hbase-user@hadoop.apache.org
> Subject: Regionserver fails to serve region
> 
> 
> Hello again.
> This is my last message for today
> 
> I have often an exception in my HBase client. A regionserver fails to
> serve
> a region when the client get a row on the HBase cluster.
> 
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact
> region server 192.168.1.15:60020 for region
> table-0.3,:testrow79063200,1223872616091, row ':testrow22102600', but
> failed
> after 10 attempts.
> 
> The attempts of above can be:
> 1.
> java.io.IOException: java.io.IOException: Premeture EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
> 2-10
> java.io.IOException: java.io.IOException: java.lang.NullPointerException
>         at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)
> 
> After what. Every time the client try to reach the same region the attemps
> 1-10 are
> java.io.IOException: java.io.IOException: java.lang.NullPointerException
>         at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)
> 
> In this case, if the client try to reach the same region again, all next
> 10
> attemps are the NPE.
> 
> Another 10 attempts scenario I have seen:
> 1-10:
> IPC Server handler 3 on 60020, call getRow([B@1ec7483, [B@d54a92, null,
> 1224105427910, -1) from 192.168.1.11:55371: error: java.io.IOException:
> Cannot open filename
> /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
> java.io.IOException: Cannot open filename
> /hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
>         at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1171)
> 
> Preceded, in concerned regionsserver log, by the line:
> 
> 2008-10-15 23:19:30,461 INFO org.apache.hadoop.dfs.DFSClient: Could not
> obtain block blk_-3759213227484579481_226277 from any node: 
> java.io.IOException: No live nodes contain current block
> 
> If I look for this block in the hadoop master log I can find
> 
> 2008-10-15 23:03:45,276 INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask
> 192.168.1.13:50010 to delete  [...] blk_-3759213227484579481_226277 [...]
> (many more blocks)
> 
> about 16 min before.
> In both cases the regionserver fails to serve the concerned region until I
> restart hbase (not hadoop).
> 
> I have no clue to know if such a failure is temporary (how long) or I
> really
> need to restart. But I noticed that the failure doesn't recover in the
> next
> 3-4 hours.
> 
> One last question by the way:
> Why the replication factor of my hbase files in dfs is 3, when my hadoop
> cluster is configured to keep only 2 copies ?
> Is it because the default (hadoop-default.xml) config file of the hadoop
> client, which is embedded in hbase distrib overrides the cluster
> configuration for the mapfiles created ? 
> Is that a good configuration scheme, or is it preferable to allow the
> hbase
> hadoop client to load the hadoop-site.xml file I have set for the running
> instance of hadoop server, adding the hadoop conf directory in the hbase
> classpath,
> and therefore having the same configuration in client than in server ?
> 
> Have a nice day.
> Thank you for your advises.
> 
> -- Jean-Adrien
> 
> Cluster setup:
> 4 regionsservers / datanodes
> 1 is master / namenode as well.
> java-6-sun
> Total size of hdfs: 81.98 GB (replication factor 3)
> fsck -> healthy
> hadoop: 0.18.1
> hbase: 0.18.0 (jar of hadoop replaced with 0.18.1)
> 1Gb ram per node
> 
> 
> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20028553
> .html
> Sent from the HBase User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20065884.html
Sent from the HBase User mailing list archive at Nabble.com.


RE: Regionserver fails to serve region

Posted by Jonathan Gray <jl...@streamy.com>.
Jean-Adrien,

Did you see my reply to your previous email?

I think your machines are underpowered for your current setup and it's
creating all kinds of problems.  If you have swapping going on in a
regionserver/datanode, that must be addressed because it usually leads to
odd behavior in hdfs, timeouts, starvation, etc...

Decrease your allotted heap sizes to fit within available memory, or add
more memory.

JG

-----Original Message-----
From: Jean-Adrien [mailto:adv1@jeanjean.ch] 
Sent: Friday, October 17, 2008 1:02 AM
To: hbase-user@hadoop.apache.org
Subject: Regionserver fails to serve region


Hello again.
This is my last message for today

I have often an exception in my HBase client. A regionserver fails to serve
a region when the client get a row on the HBase cluster.

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server 192.168.1.15:60020 for region
table-0.3,:testrow79063200,1223872616091, row ':testrow22102600', but failed
after 10 attempts.

The attempts of above can be:
1.
java.io.IOException: java.io.IOException: Premeture EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
2-10
java.io.IOException: java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)

After what. Every time the client try to reach the same region the attemps
1-10 are
java.io.IOException: java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hbase.HStoreKey.compareTo(HStoreKey.java:354)

In this case, if the client try to reach the same region again, all next 10
attemps are the NPE.

Another 10 attempts scenario I have seen:
1-10:
IPC Server handler 3 on 60020, call getRow([B@1ec7483, [B@d54a92, null,
1224105427910, -1) from 192.168.1.11:55371: error: java.io.IOException:
Cannot open filename
/hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
java.io.IOException: Cannot open filename
/hbase/table-0.3/1739432898/header/mapfiles/4558585535524295446/data
        at
org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1171)

Preceded, in concerned regionsserver log, by the line:

2008-10-15 23:19:30,461 INFO org.apache.hadoop.dfs.DFSClient: Could not
obtain block blk_-3759213227484579481_226277 from any node: 
java.io.IOException: No live nodes contain current block

If I look for this block in the hadoop master log I can find

2008-10-15 23:03:45,276 INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask
192.168.1.13:50010 to delete  [...] blk_-3759213227484579481_226277 [...]
(many more blocks)

about 16 min before.
In both cases the regionserver fails to serve the concerned region until I
restart hbase (not hadoop).

I have no clue to know if such a failure is temporary (how long) or I really
need to restart. But I noticed that the failure doesn't recover in the next
3-4 hours.

One last question by the way:
Why the replication factor of my hbase files in dfs is 3, when my hadoop
cluster is configured to keep only 2 copies ?
Is it because the default (hadoop-default.xml) config file of the hadoop
client, which is embedded in hbase distrib overrides the cluster
configuration for the mapfiles created ? 
Is that a good configuration scheme, or is it preferable to allow the hbase
hadoop client to load the hadoop-site.xml file I have set for the running
instance of hadoop server, adding the hadoop conf directory in the hbase
classpath,
and therefore having the same configuration in client than in server ?

Have a nice day.
Thank you for your advises.

-- Jean-Adrien

Cluster setup:
4 regionsservers / datanodes
1 is master / namenode as well.
java-6-sun
Total size of hdfs: 81.98 GB (replication factor 3)
fsck -> healthy
hadoop: 0.18.1
hbase: 0.18.0 (jar of hadoop replaced with 0.18.1)
1Gb ram per node




-- 
View this message in context:
http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20028553
.html
Sent from the HBase User mailing list archive at Nabble.com.