Posted to user@hbase.apache.org by Michael Segel <mi...@hotmail.com> on 2010/03/03 17:43:47 UTC

Trying to understand HBase/ZooKeeper Logs

Hi,

I'm trying to debug an issue where I am getting 'partial' failures. For some reason the region servers seem to end up with multiple 'live' server instances on a node. (We start with 3 servers and the next morning we see 4, 5 or 6 servers, where a single node has multiple servers listed as 'live'.) Yet if you do a list or a scan on a table, an exception gets thrown. (The next time we have a failure I'll include the exception....)

I've set all of the logging to DEBUG, so I should be picking up as much information as possible.

The master log shows the following:
2010-03-02 20:05:40,712 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-03-02 20:05:45,000 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 3ms
2010-03-02 20:06:05,032 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 35ms
2010-03-02 20:06:24,998 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 0ms
2010-03-02 20:06:39,563 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.6666666666666667
2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.8.237.230:60020, regionname: -ROOT-,,0, startKey: <>}
2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.8.237.232:60020, regionname: .META.,,1, startKey: <>}

(Hopefully this formats ok...)

I'm trying to understand what I'm seeing.
Am I correct in saying that this is the master node pinging the ZooKeeper leader, as a way of maintaining a heartbeat to check that ZooKeeper is alive?

On the region servers, every node shows roughly the following:
2010-03-03 09:31:52,086 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=2.9179459MB (3059688), Free=352.64456MB (369774616), Max=355.5625MB (372834304), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-03-03 09:31:52,222 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
2010-03-03 09:32:12,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
2010-03-03 09:32:32,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms


Going through the logs, I see 0-1ms response times from the region servers to ZooKeeper.

I'm trying to track down why I'm having partial failures.
That is, on a region server node, I see multiple 'live' server instances, where only one is actually alive.
(This problem is intermittent and I haven't seen a failure [yet] since I turned on the debugging.)

Is it normal to see pings as long as 50ms when the master pings ZooKeeper?

Thx

-Mike

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Patrick Hunt <ph...@apache.org>.
Also check the ZK server logs and see if you notice any session 
expirations (esp during this timeframe). "grep -i expir <zk server logs>"
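
For example (the log path is just an illustration; point it at wherever
your ZK servers actually write their logs):

   % grep -i expir /var/log/zookeeper/zookeeper.log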

Patrick

Jean-Daniel Cryans wrote:
> Michael,
> 
> Grep your master log for "Received report from unknown server" and if
> you do find it, it means that you have DNS flapping. This may explain
> why you see a "new instance" which in this case would be the master
> registering the region server a second or third time. The patch in
> this jira fixes the issue:
> https://issues.apache.org/jira/browse/HBASE-2174
> 
> J-D
> 
> On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <mi...@hotmail.com> wrote:
>>
>>
>>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>>> From: phunt@apache.org
>>> To: hbase-user@hadoop.apache.org
>>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> [SNIP]
>>> There are a few issues involved with the ping time:
>>>
>>> 1) the network (obv :-) )
>>> 2) the zk server - if the server is highly loaded the pings may take
>>> longer. The heartbeat is also a "health check" that the client is doing
>>> against the server (as much as it is a "health check" for the server
>>> that the client is still live). The HB is routed "all the way" through
>>> the ZK server, ie through the processing pipeline. So if the server were
>>> stalled it would not respond immediately (vs say reading the HB at the
>>> thread that reads data from the client). You can see the min/max/avg
>>> request latencies on the zk server by using the "stat" four-letter word. See
>>> the ZK admin docs on this http://bit.ly/dglVld
>>> 3) the zk client - clients can only process HB responses if they are
>>> running. Say the JVM GC runs in blocking mode, this will block all
>>> client threads (incl the zk client thread) and the HB response will sit
>>> until the GC is finished. This is why HBase RSs typically use very very
>>> large (from our, zk, perspective) session timeouts.
>>>
>>> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>>>
>>> I can't shed direct light on this (ie what's the problem in hbase that
>>> could cause your issue). I'll let jd/stack comment on that.
>>>
>>> Patrick
>>>
>> Thanks for the quick response.
>>
>> I'm trying to track down the issue of why we're getting a lot of 'partial' failures. Unfortunately this is currently a lot like watching a pot boil. :-(
>>
>> What I am calling a 'partial failure' is that the region servers are spawning second or even third instances where only the last one appears to be live.
>>
>> From what I can tell, there's a spike of network activity that causes one of the processes to think that there is something wrong and spawn a new instance.
>>
>> Is this a good description?
>>
>> Because some of the failures occur late at night with no load on the system, I suspect that we have issues with the network but I can't definitively say.
>>
>> Which process is the most sensitive to network latency issues?
>>
>> Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't be sure.
>>
>> Thx
>>
>> -Mike

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Michael,

Grep your master log for "Received report from unknown server" and if
you do find it, it means that you have DNS flapping. This may explain
why you see a "new instance" which in this case would be the master
registering the region server a second or third time. The patch in
this jira fixes the issue:
https://issues.apache.org/jira/browse/HBASE-2174
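
A quick way to check for both symptoms (the log path and hostname below
are examples only):

   % grep 'Received report from unknown server' /var/log/hbase/hbase-master-*.log

and then compare forward and reverse DNS on each region server, e.g.:

   % host rs1.example.com
   % host 10.8.237.230

If the reverse lookup of a region server's IP doesn't return the name it
registered with, that's the flapping.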

J-D

On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <mi...@hotmail.com> wrote:
>
>
>
>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>> From: phunt@apache.org
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
>> There are a few issues involved with the ping time:
>>
>> 1) the network (obv :-) )
>> 2) the zk server - if the server is highly loaded the pings may take
>> longer. The heartbeat is also a "health check" that the client is doing
>> against the server (as much as it is a "health check" for the server
>> that the client is still live). The HB is routed "all the way" through
>> the ZK server, ie through the processing pipeline. So if the server were
>> stalled it would not respond immediately (vs say reading the HB at the
>> thread that reads data from the client). You can see the min/max/avg
>> request latencies on the zk server by using the "stat" four-letter word. See
>> the ZK admin docs on this http://bit.ly/dglVld
>> 3) the zk client - clients can only process HB responses if they are
>> running. Say the JVM GC runs in blocking mode, this will block all
>> client threads (incl the zk client thread) and the HB response will sit
>> until the GC is finished. This is why HBase RSs typically use very very
>> large (from our, zk, perspective) session timeouts.
>>
>> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>>
>> I can't shed direct light on this (ie what's the problem in hbase that
>> could cause your issue). I'll let jd/stack comment on that.
>>
>> Patrick
>>
>
> Thanks for the quick response.
>
> I'm trying to track down the issue of why we're getting a lot of 'partial' failures. Unfortunately this is currently a lot like watching a pot boil. :-(
>
> What I am calling a 'partial failure' is that the region servers are spawning second or even third instances where only the last one appears to be live.
>
> From what I can tell, there's a spike of network activity that causes one of the processes to think that there is something wrong and spawn a new instance.
>
> Is this a good description?
>
> Because some of the failures occur late at night with no load on the system, I suspect that we have issues with the network but I can't definitively say.
>
> Which process is the most sensitive to network latency issues?
>
> Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't be sure.
>
> Thx
>
> -Mike

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Chad Metcalf <ch...@cloudera.com>.
The src RPMs we package are always available. In this case it's in the
same repo.

http://archive.cloudera.com/redhat/cdh/contrib/hbase-0.20-0.20.3-1.cloudera.src.rpm

Cheers
Chad

On Wed, Mar 3, 2010 at 3:19 PM, Andrew Purtell <ap...@apache.org> wrote:

> I built the HBase RPMs for Cloudera. Just for future reference if someone
> needs patched
> versions of those RPMs, it's easy enough for me to spin them for you. Just
> drop me a
> note.
>
> And/or you may want to send a note to Cloudera explaining your needs.
>
> I put together a version of Cloudera-ized HBase with HBASE-2174 and
> HBASE-2180 applied
> and put it here:
>
>    https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
>
> Seems like a generally useful thing. To use:
>
>   % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
>   % cd /usr/src/redhat/SPECS
>   % rpmbuild -bb --target noarch hbase.spec
>
> and install or update using the resulting HBase binary RPMs in
> /usr/src/redhat/RPMS/noarch.
>
>   - Andy
>
>
>
> ----- Original Message ----
> > From: Michael Segel <mi...@hotmail.com>
> > To: hbase-user@hadoop.apache.org
> > Sent: Thu, March 4, 2010 3:08:49 AM
> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >
> >
> > Thanks.
> >
> > Cloudera's release doesn't ship with the source code, but luckily we have
> > the source when we wanted to test 20.3 code.
> >
> > Thanks again!
>
>
>
>
>

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Stack <st...@duboce.net>.
Want to paste the complaint?  Maybe someone like Chad has seen it before?
Thanks Michael
St.Ack

On Fri, Mar 5, 2010 at 7:34 AM, Michael Segel <mi...@hotmail.com> wrote:
>
> Yes,
>
> I could find lzo-devel and brought it down, but it complained about a dependency,
> which I thought was weird.
>
> Unfortunately I'm running out of time on this... got other fun stuff to do.
>
>
>
>> Date: Thu, 4 Mar 2010 20:17:31 -0800
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> From: stack@duboce.net
>> To: hbase-user@hadoop.apache.org
>>
>> If you look for lzo-devel up in the usual repos, can you not find it?
>> St.Ack
>>
>> On Thu, Mar 4, 2010 at 8:04 PM, Michael Segel <mi...@hotmail.com> wrote:
>> >
>> > Ok,
>> >
>> > Ran in to a little problem....
>> > # rpmbuild -bb --target noarch hbase.spec
>> > Building target platforms: noarch
>> > Building for target noarch
>> > error: Failed build dependencies:
>> >        lzo-devel is needed by hbase-0.20-0.20.3-2.cloudera.noarch
>> >        /usr/bin/git is needed by hbase-0.20-0.20.3-2.cloudera.noarch
>> >
>> > So I did download and set up git, but the lzo-devel is a bit problematic.
>> >
>> > Is it possible to get the RPM from you, or if you know of a good work around, I'd appreciate it.
>> >
>> > Thx
>> >
>> > -Mike
>> >
>> >
>> >> Date: Wed, 3 Mar 2010 15:19:58 -0800
>> >> From: apurtell@apache.org
>> >> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> >> To: hbase-user@hadoop.apache.org
>> >> CC: chad@cloudera.com
>> >>
>> >> I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
>> >> versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
>> >> note.
>> >>
>> >> And/or you may want to send a note to Cloudera explaining your needs.
>> >>
>> >> I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
>> >> and put it here:
>> >>
>> >>     https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
>> >>
>> >> Seems like a generally useful thing. To use:
>> >>
>> >>    % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
>> >>    % cd /usr/src/redhat/SPECS
>> >>    % rpmbuild -bb --target noarch hbase.spec
>> >>
>> >> and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
>> >>
>> >>    - Andy
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> > From: Michael Segel <mi...@hotmail.com>
>> >> > To: hbase-user@hadoop.apache.org
>> >> > Sent: Thu, March 4, 2010 3:08:49 AM
>> >> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
>> >> >
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Cloudera's release doesn't ship with the source code, but luckily we have the
>> >> > source when we wanted to test 20.3 code.
>> >> >
>> >> > Thanks again!
>> >>
>> >>
>> >>
>> >>
>> >

RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.
Yes,

I could find lzo-devel and brought it down, but it complained about a dependency,
which I thought was weird.

Unfortunately I'm running out of time on this... got other fun stuff to do.



> Date: Thu, 4 Mar 2010 20:17:31 -0800
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> From: stack@duboce.net
> To: hbase-user@hadoop.apache.org
> 
> If you look for lzo-devel up in the usual repos, can you not find it?
> St.Ack
> 
> On Thu, Mar 4, 2010 at 8:04 PM, Michael Segel <mi...@hotmail.com> wrote:
> >
> > Ok,
> >
> > Ran in to a little problem....
> > # rpmbuild -bb --target noarch hbase.spec
> > Building target platforms: noarch
> > Building for target noarch
> > error: Failed build dependencies:
> >        lzo-devel is needed by hbase-0.20-0.20.3-2.cloudera.noarch
> >        /usr/bin/git is needed by hbase-0.20-0.20.3-2.cloudera.noarch
> >
> > So I did download and set up git, but the lzo-devel is a bit problematic.
> >
> > Is it possible to get the RPM from you, or if you know of a good workaround, I'd appreciate it.
> >
> > Thx
> >
> > -Mike
> >
> >
> >> Date: Wed, 3 Mar 2010 15:19:58 -0800
> >> From: apurtell@apache.org
> >> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> >> To: hbase-user@hadoop.apache.org
> >> CC: chad@cloudera.com
> >>
> >> I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
> >> versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
> >> note.
> >>
> >> And/or you may want to send a note to Cloudera explaining your needs.
> >>
> >> I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
> >> and put it here:
> >>
> >>     https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
> >>
> >> Seems like a generally useful thing. To use:
> >>
> >>    % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
> >>    % cd /usr/src/redhat/SPECS
> >>    % rpmbuild -bb --target noarch hbase.spec
> >>
> >> and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
> >>
> >>    - Andy
> >>
> >>
> >>
> >> ----- Original Message ----
> >> > From: Michael Segel <mi...@hotmail.com>
> >> > To: hbase-user@hadoop.apache.org
> >> > Sent: Thu, March 4, 2010 3:08:49 AM
> >> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >> >
> >> >
> >> > Thanks.
> >> >
> >> > Cloudera's release doesn't ship with the source code, but luckily we have the
> >> > source when we wanted to test 20.3 code.
> >> >
> >> > Thanks again!
> >>
> >>
> >>
> >>
> >

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Stack <st...@duboce.net>.
If you look for lzo-devel up in the usual repos, can you not find it?
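
For example, on a RHEL/CentOS box something like this should turn it up
(a sketch; assumes an EPEL-style extras repo is configured):

   % yum search lzo
   % sudo yum install lzo lzo-devel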
St.Ack

On Thu, Mar 4, 2010 at 8:04 PM, Michael Segel <mi...@hotmail.com> wrote:
>
> Ok,
>
> Ran in to a little problem....
> # rpmbuild -bb --target noarch hbase.spec
> Building target platforms: noarch
> Building for target noarch
> error: Failed build dependencies:
>        lzo-devel is needed by hbase-0.20-0.20.3-2.cloudera.noarch
>        /usr/bin/git is needed by hbase-0.20-0.20.3-2.cloudera.noarch
>
> So I did download and set up git, but the lzo-devel is a bit problematic.
>
> Is it possible to get the RPM from you, or if you know of a good workaround, I'd appreciate it.
>
> Thx
>
> -Mike
>
>
>> Date: Wed, 3 Mar 2010 15:19:58 -0800
>> From: apurtell@apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> To: hbase-user@hadoop.apache.org
>> CC: chad@cloudera.com
>>
>> I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
>> versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
>> note.
>>
>> And/or you may want to send a note to Cloudera explaining your needs.
>>
>> I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
>> and put it here:
>>
>>     https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
>>
>> Seems like a generally useful thing. To use:
>>
>>    % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
>>    % cd /usr/src/redhat/SPECS
>>    % rpmbuild -bb --target noarch hbase.spec
>>
>> and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
>>
>>    - Andy
>>
>>
>>
>> ----- Original Message ----
>> > From: Michael Segel <mi...@hotmail.com>
>> > To: hbase-user@hadoop.apache.org
>> > Sent: Thu, March 4, 2010 3:08:49 AM
>> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
>> >
>> >
>> > Thanks.
>> >
>> > Cloudera's release doesn't ship with the source code, but luckily we have the
>> > source when we wanted to test 20.3 code.
>> >
>> > Thanks again!
>>
>>
>>
>>
>

RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.
Ok,

Ran in to a little problem....
# rpmbuild -bb --target noarch hbase.spec
Building target platforms: noarch
Building for target noarch
error: Failed build dependencies:
        lzo-devel is needed by hbase-0.20-0.20.3-2.cloudera.noarch
        /usr/bin/git is needed by hbase-0.20-0.20.3-2.cloudera.noarch

So I did download and set up git, but the lzo-devel is a bit problematic.

Is it possible to get the RPM from you, or if you know of a good workaround, I'd appreciate it.

Thx

-Mike


> Date: Wed, 3 Mar 2010 15:19:58 -0800
> From: apurtell@apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> To: hbase-user@hadoop.apache.org
> CC: chad@cloudera.com
> 
> I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
> versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
> note.
> 
> And/or you may want to send a note to Cloudera explaining your needs.
> 
> I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
> and put it here: 
> 
>     https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
> 
> Seems like a generally useful thing. To use:
> 
>    % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
>    % cd /usr/src/redhat/SPECS
>    % rpmbuild -bb --target noarch hbase.spec
> 
> and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
> 
>    - Andy
> 
> 
> 
> ----- Original Message ----
> > From: Michael Segel <mi...@hotmail.com>
> > To: hbase-user@hadoop.apache.org
> > Sent: Thu, March 4, 2010 3:08:49 AM
> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
> > 
> > 
> > Thanks.
> > 
> > Cloudera's release doesn't ship with the source code, but luckily we have the 
> > source when we wanted to test 20.3 code.
> > 
> > Thanks again!
> 
> 
>       
> 
 		 	   		  

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Andrew Purtell <ap...@apache.org>.
I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
note.

And/or you may want to send a note to Cloudera explaining your needs.

I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
and put it here: 

    https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm

Seems like a generally useful thing. To use:

   % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
   % cd /usr/src/redhat/SPECS
   % rpmbuild -bb --target noarch hbase.spec

and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
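
For example (exact file names depend on what the spec produces; adjust to
whatever actually lands in RPMS/noarch):

   % rpm -Uvh /usr/src/redhat/RPMS/noarch/hbase-0.20-*.noarch.rpm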

   - Andy



----- Original Message ----
> From: Michael Segel <mi...@hotmail.com>
> To: hbase-user@hadoop.apache.org
> Sent: Thu, March 4, 2010 3:08:49 AM
> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> 
> 
> Thanks.
> 
> Cloudera's release doesn't ship with the source code, but luckily we have the 
> source when we wanted to test 20.3 code.
> 
> Thanks again!


      


RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.
Thanks.

Cloudera's release doesn't ship with the source code, but luckily we have the source when we wanted to test 20.3 code.

Thanks again!


> Date: Wed, 3 Mar 2010 10:54:26 -0800
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> From: jdcryans@apache.org
> To: hbase-user@hadoop.apache.org
> 
> So get the patch in your hbase root, on linux do: wget
> https://issues.apache.org/jira/secure/attachment/12436659/HBASE-2174_0.20.3.patch
> 
> then run: patch -p0 < HBASE-2174_0.20.3.patch
> 
> finally compile: ant tar
> 
> The new tar will be in build/
> 
> J-D
> 
> On Wed, Mar 3, 2010 at 10:52 AM, Michael Segel
> <mi...@hotmail.com> wrote:
> >
> > Hey!
> >
> > Thanks for the responses.
> > It looks like the patch I was pointed to may solve the issue.
> >
> > We've had some network latency issues. Again the 50ms was something I found quickly in the logs and if I had a failure after turning on all of the debugging, I think I could have drilled down to the issue.
> >
> > I don't manage the DNS setup, so I can't say what's 'strange' or different.
> > The only thing that I know we did was set up a CNAME alias to the NameNode and JobTracker to make it easier to 'hide' the cloud and then give developers an easy-to-remember name to point to. I don't think that should cause it, although it could be something in how they set up their reverse DNS. If the patch works, I'll be happy.
> >
> > Now for the $64,000 question.
> > Any pointers on how to apply the patch?
> > I'm just used to pulling the distro from the website...
> >
> > Thanks again!
> >
> > -Mike
> >
> >
> >> From: jlist@streamy.com
> >> To: hbase-user@hadoop.apache.org
> >> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >> Date: Wed, 3 Mar 2010 10:21:10 -0800
> >>
> >> What version of HBase are you running?  There were some recent fixes related
> >> to DNS issues causing regionservers to check in to the master as a different
> >> name.  Anything strange about the network or DNS setup of your cluster?
> >>
> >> ZooKeeper is sensitive to pauses and network latency, as is any
> >> fault-tolerant distributed system.  ZK and HBase must determine when
> >> something has "failed", and the primary way is that it has not responded
> >> within some period of time.  50ms is negligible from a fault-detection
> >> standpoint, but 50 seconds is not.
> >>
> >> -----Original Message-----
> >> From: Michael Segel [mailto:michael_segel@hotmail.com]
> >> Sent: Wednesday, March 03, 2010 9:29 AM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >>
> >>
> >>
> >>
> >> > Date: Wed, 3 Mar 2010 09:17:06 -0800
> >> > From: phunt@apache.org
> >> > To: hbase-user@hadoop.apache.org
> >> > Subject: Re: Trying to understand HBase/ZooKeeper Logs
> >> [SNIP]
> >> > There are a few issues involved with the ping time:
> >> >
> >> > 1) the network (obv :-) )
> >> > 2) the zk server - if the server is highly loaded the pings may take
> >> > longer. The heartbeat is also a "health check" that the client is doing
> >> > against the server (as much as it is a "health check" for the server
> >> > that the client is still live). The HB is routed "all the way" through
> >> > the ZK server, ie through the processing pipeline. So if the server were
> >> > stalled it would not respond immediately (vs say reading the HB at the
> >> > thread that reads data from the client). You can see the min/max/avg
> >> > request latencies on the zk server by using the "stat" four-letter word. See
> >> > the ZK admin docs on this http://bit.ly/dglVld
> >> > 3) the zk client - clients can only process HB responses if they are
> >> > running. Say the JVM GC runs in blocking mode, this will block all
> >> > client threads (incl the zk client thread) and the HB response will sit
> >> > until the GC is finished. This is why HBase RSs typically use very very
> >> > large (from our, zk, perspective) session timeouts.
> >> >
> >> > 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> >> >
> >> > I can't shed direct light on this (ie what's the problem in hbase that
> >> > could cause your issue). I'll let jd/stack comment on that.
> >> >
> >> > Patrick
> >> >
> >>
> >> Thanks for the quick response.
> >>
> >> I'm trying to track down the issue of why we're getting a lot of 'partial'
> >> failures. Unfortunately this is currently a lot like watching a pot boil.
> >> :-(
> >>
> >> What I am calling a 'partial failure' is that the region servers are
> >> spawning second or even third instances where only the last one appears to
> >> be live.
> >>
> >> From what I can tell, there's a spike of network activity that
> >> causes one of the processes to think that there is something wrong and spawn
> >> a new instance.
> >>
> >> Is this a good description?
> >>
> >> Because some of the failures occur late at night with no load on the system,
> >> I suspect that we have issues with the network but I can't definitively say.
> >>
> >> Which process is the most sensitive to network latency issues?
> >>
> >> Sorry, still relatively new to HBase and I'm trying to track down a nasty
> >> issue that causes HBase to fail on an almost regular basis. I think it's a
> >> networking issue, but I can't be sure.
> >>
> >> Thx
> >>
> >> -Mike
> >>
> >>
> >>
> >>
> >>

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Jean-Daniel Cryans <jd...@apache.org>.
So get the patch in your hbase root, on linux do: wget
https://issues.apache.org/jira/secure/attachment/12436659/HBASE-2174_0.20.3.patch

then run: patch -p0 < HBASE-2174_0.20.3.patch

finally compile: ant tar

The new tar will be in build/
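
(If you want to check that it applies cleanly before touching anything,
GNU patch has a dry-run mode:

   % patch -p0 --dry-run < HBASE-2174_0.20.3.patch

and only patch for real once that reports no failed hunks.)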

J-D

On Wed, Mar 3, 2010 at 10:52 AM, Michael Segel
<mi...@hotmail.com> wrote:
>
> Hey!
>
> Thanks for the responses.
> It looks like the patch I was pointed to may solve the issue.
>
> We've had some network latency issues. Again the 50ms was something I found quickly in the logs and if I had a failure after turning on all of the debugging, I think I could have drilled down to the issue.
>
> I don't manage the DNS setup, so I can't say what's 'strange' or different.
> The only thing that I know we did was set up a CNAME alias to the NameNode and JobTracker to make it easier to 'hide' the cloud and then give developers an easy-to-remember name to point to. I don't think that should cause it, although it could be something in how they set up their reverse DNS. If the patch works, I'll be happy.
>
> Now for the $64,000 question.
> Any pointers on how to apply the patch?
> I'm just used to pulling the distro from the website...
>
> Thanks again!
>
> -Mike
>
>
>> From: jlist@streamy.com
>> To: hbase-user@hadoop.apache.org
>> Subject: RE: Trying to understand HBase/ZooKeeper Logs
>> Date: Wed, 3 Mar 2010 10:21:10 -0800
>>
>> What version of HBase are you running?  There were some recent fixes related
>> to DNS issues causing regionservers to check in to the master as a different
>> name.  Anything strange about the network or DNS setup of your cluster?
>>
>> ZooKeeper is sensitive to pauses and network latency, as is any
>> fault-tolerant distributed system.  ZK and HBase must determine when
>> something has "failed", and the primary way is that it has not responded
>> within some period of time.  50ms is negligible from a fault-detection
>> standpoint, but 50 seconds is not.
>>
>> -----Original Message-----
>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>> Sent: Wednesday, March 03, 2010 9:29 AM
>> To: hbase-user@hadoop.apache.org
>> Subject: RE: Trying to understand HBase/ZooKeeper Logs
>>
>>
>>
>>
>> > Date: Wed, 3 Mar 2010 09:17:06 -0800
>> > From: phunt@apache.org
>> > To: hbase-user@hadoop.apache.org
>> > Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> [SNIP]
>> > There are a few issues involved with the ping time:
>> >
>> > 1) the network (obv :-) )
>> > 2) the zk server - if the server is highly loaded the pings may take
>> > longer. The heartbeat is also a "health check" that the client is doing
>> > against the server (as much as it is a "health check" for the server
>> > that the client is still live). The HB is routed "all the way" through
>> > the ZK server, ie through the processing pipeline. So if the server were
>> > stalled it would not respond immediately (vs say reading the HB at the
>> > thread that reads data from the client). You can see the min/max/avg
>> > request latencies on the zk server by using the "stat" four-letter word. See
>> > the ZK admin docs on this http://bit.ly/dglVld
>> > 3) the zk client - clients can only process HB responses if they are
>> > running. Say the JVM GC runs in blocking mode, this will block all
>> > client threads (incl the zk client thread) and the HB response will sit
>> > until the GC is finished. This is why HBase RSs typically use very very
>> > large (from our, zk, perspective) session timeouts.
>> >
>> > 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>> >
>> > I can't shed direct light on this (ie what's the problem in hbase that
>> > could cause your issue). I'll let jd/stack comment on that.
>> >
>> > Patrick
>> >
>>
>> Thanks for the quick response.
>>
>> I'm trying to track down the issue of why we're getting a lot of 'partial'
>> failures. Unfortunately this is currently a lot like watching a pot boil.
>> :-(
>>
>> What I am calling a 'partial failure' is that the region servers are
>> spawning second or even third instances where only the last one appears to
>> be live.
>>
>> From what I can tell, there's a spike of network activity that
>> causes one of the processes to think that there is something wrong and spawn
>> a new instance.
>>
>> Is this a good description?
>>
>> Because some of the failures occur late at night with no load on the system,
>> I suspect that we have issues with the network but I can't definitively say.
>>
>> Which process is the most sensitive to network latency issues?
>>
>> Sorry, still relatively new to HBase and I'm trying to track down a nasty
>> issue that causes HBase to fail on an almost regular basis. I think it's a
>> networking issue, but I can't be sure.
>>
>> Thx
>>
>> -Mike
>>
>>
>>
>>
>>

RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.
Hey!

Thanks for the responses.
It looks like the patch I was pointed to may solve the issue.

We've had some network latency issues. Again the 50ms was something I found quickly in the logs and if I had a failure after turning on all of the debugging, I think I could have drilled down to the issue.

I don't manage the DNS setup, so I can't say what's 'strange' or different. 
The only thing that I know we did was set up a CNAME alias to the NameNode and JobTracker to make it easier to 'hide' the cloud and then give developers an easy-to-remember name to point to. I don't think that should cause it, although it could be something in how they set up their reverse DNS. If the patch works, I'll be happy.

Now for the $64,000 question.
Any pointers on how to apply the patch? 
I'm just used to pulling the distro from the website...

Thanks again!

-Mike


> From: jlist@streamy.com
> To: hbase-user@hadoop.apache.org
> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> Date: Wed, 3 Mar 2010 10:21:10 -0800
> 
> What version of HBase are you running?  There were some recent fixes related
> to DNS issues causing regionservers to check in to the master as a different
> name.  Anything strange about the network or DNS setup of your cluster?
> 
> ZooKeeper is sensitive to pauses and network latency, as is any
> fault-tolerant distributed system.  ZK and HBase must determine when
> something has "failed", and the primary way is that it has not responded
> within some period of time.  50ms is negligible from a fault-detection
> standpoint, but 50 seconds is not.
> 
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com] 
> Sent: Wednesday, March 03, 2010 9:29 AM
> To: hbase-user@hadoop.apache.org
> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> 
> 
> 
> 
> > Date: Wed, 3 Mar 2010 09:17:06 -0800
> > From: phunt@apache.org
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
> > There are a few issues involved with the ping time:
> > 
> > 1) the network (obv :-) )
> > 2) the zk server - if the server is highly loaded the pings may take 
> > longer. The heartbeat is also a "health check" that the client is doing 
> > against the server (as much as it is a "health check" for the server 
> > that the client is still live). The HB is routed "all the way" through 
> > the ZK server, ie through the processing pipeline. So if the server were 
> > stalled it would not respond immediately (vs say reading the HB at the 
> > thread that reads data from the client). You can see the min/max/avg 
> > request latencies on the zk server by using the "stat" four-letter word. See
> > the ZK admin docs on this http://bit.ly/dglVld
> > 3) the zk client - clients can only process HB responses if they are 
> > running. Say the JVM GC runs in blocking mode, this will block all 
> > client threads (incl the zk client thread) and the HB response will sit 
> > until the GC is finished. This is why HBase RSs typically use very very 
> > large (from our, zk, perspective) session timeouts.
> > 
> > 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> > 
> > I can't shed direct light on this (ie what's the problem in hbase that
> > could cause your issue). I'll let jd/stack comment on that.
> > 
> > Patrick
> > 
> 
> Thanks for the quick response.
> 
> I'm trying to track down the issue of why we're getting a lot of 'partial'
> failures. Unfortunately this is currently a lot like watching a pot boil.
> :-( 
> 
> What I am calling a 'partial failure' is that the region servers are
> spawning second or even third instances where only the last one appears to
> be live.
> 
> From what I can tell, there's a spike of network activity that
> causes one of the processes to think that there is something wrong and spawn
> a new instance.
> 
> Is this a good description?
> 
> Because some of the failures occur late at night with no load on the system,
> I suspect that we have issues with the network but I can't definitively say.
> 
> Which process is the most sensitive to network latency issues?
> 
> Sorry, still relatively new to HBase and I'm trying to track down a nasty
> issue that causes HBase to fail on an almost regular basis. I think it's a
> networking issue, but I can't be sure.
> 
> Thx
> 
> -Mike
> 
> 
> 
> 

RE: Trying to understand HBase/ZooKeeper Logs

Posted by Jonathan Gray <jl...@streamy.com>.
What version of HBase are you running?  There were some recent fixes related
to DNS issues causing regionservers to check in to the master as a different
name.  Anything strange about the network or DNS setup of your cluster?

ZooKeeper is sensitive to pauses and network latency, as is any
fault-tolerant distributed system.  ZK and HBase must determine when
something has "failed", and the primary way is that it has not responded
within some period of time.  50ms is negligible from a fault-detection
standpoint, but 50 seconds is not.

-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Wednesday, March 03, 2010 9:29 AM
To: hbase-user@hadoop.apache.org
Subject: RE: Trying to understand HBase/ZooKeeper Logs




> Date: Wed, 3 Mar 2010 09:17:06 -0800
> From: phunt@apache.org
> To: hbase-user@hadoop.apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
[SNIP]
> There are a few issues involved with the ping time:
> 
> 1) the network (obv :-) )
> 2) the zk server - if the server is highly loaded the pings may take 
> longer. The heartbeat is also a "health check" that the client is doing 
> against the server (as much as it is a "health check" for the server 
> that the client is still live). The HB is routed "all the way" through 
> the ZK server, ie through the processing pipeline. So if the server were 
> stalled it would not respond immediately (vs say reading the HB at the 
> thread that reads data from the client). You can see the min/max/avg 
> request latencies on the zk server by using the "stat" four-letter word. See
> the ZK admin docs on this http://bit.ly/dglVld
> 3) the zk client - clients can only process HB responses if they are 
> running. Say the JVM GC runs in blocking mode, this will block all 
> client threads (incl the zk client thread) and the HB response will sit 
> until the GC is finished. This is why HBase RSs typically use very very 
> large (from our, zk, perspective) session timeouts.
> 
> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> 
> I can't shed direct light on this (ie what's the problem in hbase that
> could cause your issue). I'll let jd/stack comment on that.
> 
> Patrick
> 

Thanks for the quick response.

I'm trying to track down the issue of why we're getting a lot of 'partial'
failures. Unfortunately this is currently a lot like watching a pot boil.
:-( 

What I am calling a 'partial failure' is that the region servers are
spawning second or even third instances where only the last one appears to
be live.

From what I can tell, there's a spike of network activity that
causes one of the processes to think that there is something wrong and spawn
a new instance.

Is this a good description?

Because some of the failures occur late at night with no load on the system,
I suspect that we have issues with the network but I can't definitively say.

Which process is the most sensitive to network latency issues?

Sorry, still relatively new to HBase and I'm trying to track down a nasty
issue that causes HBase to fail on an almost regular basis. I think it's a
networking issue, but I can't be sure.

Thx

-Mike




 		 	   		  


RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.


> Date: Wed, 3 Mar 2010 09:17:06 -0800
> From: phunt@apache.org
> To: hbase-user@hadoop.apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
[SNIP]
> There are a few issues involved with the ping time:
> 
> 1) the network (obv :-) )
> 2) the zk server - if the server is highly loaded the pings may take 
> longer. The heartbeat is also a "health check" that the client is doing 
> against the server (as much as it is a "health check" for the server 
> that the client is still live). The HB is routed "all the way" through 
> the ZK server, ie through the processing pipeline. So if the server were 
> stalled it would not respond immediately (vs say reading the HB at the 
> thread that reads data from the client). You can see the min/max/avg 
> request latencies on the zk server by using the "stat" four-letter word. See
> the ZK admin docs on this http://bit.ly/dglVld
> 3) the zk client - clients can only process HB responses if they are 
> running. Say the JVM GC runs in blocking mode, this will block all 
> client threads (incl the zk client thread) and the HB response will sit 
> until the GC is finished. This is why HBase RSs typically use very very 
> large (from our, zk, perspective) session timeouts.
> 
> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> 
> I can't shed direct light on this (ie what's the problem in hbase that
> could cause your issue). I'll let jd/stack comment on that.
> 
> Patrick
> 

Thanks for the quick response.

I'm trying to track down the issue of why we're getting a lot of 'partial' failures. Unfortunately this is currently a lot like watching a pot boil. :-( 

What I am calling a 'partial failure' is that the region servers are spawning second or even third instances where only the last one appears to be live.

From what I can tell, there's a spike of network activity that causes one of the processes to think that there is something wrong and spawn a new instance.

Is this a good description?

Because some of the failures occur late at night with no load on the system, I suspect that we have issues with the network but I can't definitively say.

Which process is the most sensitive to network latency issues?

Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't be sure.

Thx

-Mike




 		 	   		  

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Patrick Hunt <ph...@apache.org>.
Michael Segel wrote:
> Hi,
> 
> I'm trying to debug an issue where I am getting 'partial' failures. For some reason the region servers seem to end up with multiple 'live' server instances on a node. (We start with 3 servers and the next morning we see 4, 5 or 6 servers, where a single node has multiple servers listed as 'live'.) Yet if you do a list or a scan on a table, an exception gets thrown. (The next time we have a failure I'll include the exception....)
> 
> I've set all of the logging to DEBUG, so I should be picking up as much information as possible.
> 
> The master log shows the following:
> 2010-03-02 20:05:40,712 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2010-03-02 20:05:45,000 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 3ms
> 2010-03-02 20:06:05,032 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 35ms
> 2010-03-02 20:06:24,998 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 0ms
> 2010-03-02 20:06:39,563 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.6666666666666667
> 2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.8.237.230:60020, regionname: -ROOT-,,0, startKey: <>}
> 2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.8.237.232:60020, regionname: .META.,,1, startKey: <>}
> 
> (Hopefully this formats ok...)
> 
> I'm trying to understand what I'm seeing.
> Am I correct in saying that this is the master node pinging the ZooKeeper leader, as a way of maintaining a heartbeat to check that ZooKeeper is alive?

Yes, ZK clients (hbase region servers) maintain persistent tcp 
connections to the ZK server. Heartbeats are used to maintain the 
liveness of the ZK session for that client. This is done by the ZK 
client lib, not by hbase directly.
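
If you want to see those sessions from the server side, the four-letter
words are handy here too; for example (the hostname is an example, 2181
is the default client port):

   % echo dump | nc zk1.example.com 2181

Run against the current ZK leader, "dump" lists the outstanding sessions
and their ephemeral nodes, so you can tell whether a region server really
holds one session or several.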

> 
> On the region servers, every node shows roughly the following:
> 2010-03-03 09:31:52,086 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=2.9179459MB (3059688), Free=352.64456MB (369774616), Max=355.5625MB (372834304), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
> 2010-03-03 09:31:52,222 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
> 2010-03-03 09:32:12,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
> 2010-03-03 09:32:32,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
> 
> 
> Going through the logs, I see 0-1ms response times from the region servers to ZooKeeper.
> 
> I'm trying to track down why I'm having partial failures.
> That is, on a region server node, I see multiple 'live' server instances, where only one is actually alive.
> (This problem is intermittent and I haven't seen a failure [yet] since I turned on the debugging.)
> 
> Is it normal to see pings as long as 50ms when the master pings ZooKeeper?

There are a few issues involved with the ping time:

1) the network (obv :-) )
2) the zk server - if the server is highly loaded the pings may take 
longer. The heartbeat is also a "health check" that the client is doing 
against the server (as much as it is a "health check" for the server 
that the client is still live). The HB is routed "all the way" through 
the ZK server, ie through the processing pipeline. So if the server were 
stalled it would not respond immediately (vs say reading the HB at the 
thread that reads data from the client). You can see the min/max/avg 
request latencies on the zk server by using the "stat" four-letter word 
(example after this list). See 
the ZK admin docs on this http://bit.ly/dglVld
3) the zk client - clients can only process HB responses if they are 
running. Say the JVM GC runs in blocking mode, this will block all 
client threads (incl the zk client thread) and the HB response will sit 
until the GC is finished. This is why HBase RSs typically use very very 
large (from our, zk, perspective) session timeouts.

50ms is not long btw. I believe that RS are using >> 30sec timeouts.

I can't shed direct light on this (ie what's the problem in hbase that 
could cause your issue). I'll let jd/stack comment on that.

Patrick

