Posted to user@hbase.apache.org by Michael Segel <mi...@hotmail.com> on 2010/03/03 17:43:47 UTC

Trying to understand HBase/ZooKeeper Logs

Hi,

I'm trying to debug an issue where I am getting 'partial' failures. For some reason the region servers seem to end up with multiple 'live' server instances on a node. (We start with 3 servers and the next morning we see 4, 5 or 6 servers, where a single node has multiple servers listed as 'live'.) Yet if you do a list or a scan on a table, an exception gets thrown. (The next time we have a failure I'll include the exception....)

I've set all of the logging to DEBUG, so I should be picking up as much information as possible.

The master log shows the following:
2010-03-02 20:05:40,712 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-03-02 20:05:45,000 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 3ms
2010-03-02 20:06:05,032 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 35ms
2010-03-02 20:06:24,998 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 0ms
2010-03-02 20:06:39,563 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.6666666666666667
2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.8.237.230:60020, regionname: -ROOT-,,0, startKey: <>}
2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.8.237.232:60020, regionname: .META.,,1, startKey: <>}

(Hopefully this formats ok...)

I'm trying to understand what I'm seeing.
Am I correct in saying that this is the master node pinging the ZooKeeper leader, as a way of maintaining a heartbeat to check that ZooKeeper is alive?

On the region servers, every node shows roughly the following:
2010-03-03 09:31:52,086 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=2.9179459MB (3059688), Free=352.64456MB (369774616), Max=355.5625MB (372834304), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-03-03 09:31:52,222 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
2010-03-03 09:32:12,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
2010-03-03 09:32:32,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms


Going through the logs, I see 0-1ms response times from the region servers to ZooKeeper.

I'm trying to track down why I'm having partial failures.
That is, on a region server node, I see multiple 'live' server instances, where only one is actually alive.
(This problem is intermittent and I haven't seen a failure [yet] since I turned on the debugging.)

Is it normal to see pings as long as 50ms when the master pings ZooKeeper?

Thx

-Mike

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Patrick Hunt <ph...@apache.org>.
Also check the ZK server logs and see if you notice any session 
expirations (esp during this timeframe). "grep -i expir <zk server logs>"
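
For example (the log path is just an illustration; point it at wherever
your ZK servers actually write their logs):

   % grep -i expir /var/log/zookeeper/zookeeper.log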

Patrick

Jean-Daniel Cryans wrote:
> Michael,
> 
> Grep your master log for "Received report from unknown server" and if
> you do find it, it means that you have DNS flapping. This may explain
> why you see a "new instance" which in this case would be the master
> registering the region server a second or third time. The patch in
> this jira fixes the issue:
> https://issues.apache.org/jira/browse/HBASE-2174
> 
> J-D
> 
> On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <mi...@hotmail.com> wrote:
>>
>>
>>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>>> From: phunt@apache.org
>>> To: hbase-user@hadoop.apache.org
>>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> [SNIP]
>>> There are a few issues involved with the ping time:
>>>
>>> 1) the network (obv :-) )
>>> 2) the zk server - if the server is highly loaded the pings may take
>>> longer. The heartbeat is also a "health check" that the client is doing
>>> against the server (as much as it is a "health check" for the server
>>> that the client is still live). The HB is routed "all the way" through
>>> the ZK server, ie through the processing pipeline. So if the server were
>>> stalled it would not respond immediately (vs say reading the HB at the
>>> thread that reads data from the client). You can see the min/max/avg
>>> request latencies on the zk server by using the "stat" four-letter word. See
>>> the ZK admin docs on this http://bit.ly/dglVld
>>> 3) the zk client - clients can only process HB responses if they are
>>> running. Say the JVM GC runs in blocking mode, this will block all
>>> client threads (incl the zk client thread) and the HB response will sit
>>> until the GC is finished. This is why HBase RSs typically use very very
>>> large (from our, zk, perspective) session timeouts.
>>>
>>> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>>>
>>> I can't shed direct light on this (ie what's the problem in hbase that
>>> could cause your issue). I'll let jd/stack comment on that.
>>>
>>> Patrick
>>>
>> Thanks for the quick response.
>>
>> I'm trying to track down the issue of why we're getting a lot of 'partial' failures. Unfortunately this is currently a lot like watching a pot boil. :-(
>>
>> What I am calling a 'partial failure' is that the region servers are spawning second or even third instances where only the last one appears to be live.
>>
>> From what I can tell, there's a spike of network activity that causes one of the processes to think that there is something wrong and spawn a new instance.
>>
>> Is this a good description?
>>
>> Because some of the failures occur late at night with no load on the system, I suspect that we have issues with the network but I can't definitively say.
>>
>> Which process is the most sensitive to network latency issues?
>>
>> Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't be sure.
>>
>> Thx
>>
>> -Mike

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Michael,

Grep your master log for "Received report from unknown server" and if
you do find it, it means that you have DNS flapping. This may explain
why you see a "new instance" which in this case would be the master
registering the region server a second or third time. The patch in
this jira fixes the issue:
https://issues.apache.org/jira/browse/HBASE-2174
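
A quick way to check for both symptoms (the log path and hostname below
are examples only):

   % grep 'Received report from unknown server' /var/log/hbase/hbase-master-*.log

and then compare forward and reverse DNS on each region server, e.g.:

   % host rs1.example.com
   % host 10.8.237.230

If the reverse lookup of a region server's IP doesn't return the name it
registered with, that's the flapping.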

J-D

On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <mi...@hotmail.com> wrote:
>
>
>
>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>> From: phunt@apache.org
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
>> There are a few issues involved with the ping time:
>>
>> 1) the network (obv :-) )
>> 2) the zk server - if the server is highly loaded the pings may take
>> longer. The heartbeat is also a "health check" that the client is doing
>> against the server (as much as it is a "health check" for the server
>> that the client is still live). The HB is routed "all the way" through
>> the ZK server, ie through the processing pipeline. So if the server were
>> stalled it would not respond immediately (vs say reading the HB at the
>> thread that reads data from the client). You can see the min/max/avg
>> request latencies on the zk server by using the "stat" four-letter word. See
>> the ZK admin docs on this http://bit.ly/dglVld
>> 3) the zk client - clients can only process HB responses if they are
>> running. Say the JVM GC runs in blocking mode, this will block all
>> client threads (incl the zk client thread) and the HB response will sit
>> until the GC is finished. This is why HBase RSs typically use very very
>> large (from our, zk, perspective) session timeouts.
>>
>> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>>
>> I can't shed direct light on this (ie what's the problem in hbase that
>> could cause your issue). I'll let jd/stack comment on that.
>>
>> Patrick
>>
>
> Thanks for the quick response.
>
> I'm trying to track down the issue of why we're getting a lot of 'partial' failures. Unfortunately this is currently a lot like watching a pot boil. :-(
>
> What I am calling a 'partial failure' is that the region servers are spawning second or even third instances where only the last one appears to be live.
>
> From what I can tell, there's a spike of network activity that causes one of the processes to think that there is something wrong and spawn a new instance.
>
> Is this a good description?
>
> Because some of the failures occur late at night with no load on the system, I suspect that we have issues with the network but I can't definitively say.
>
> Which process is the most sensitive to network latency issues?
>
> Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't be sure.
>
> Thx
>
> -Mike

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Chad Metcalf <ch...@cloudera.com>.
The src RPMs we package are always available. In this case it's in the
same repo.

http://archive.cloudera.com/redhat/cdh/contrib/hbase-0.20-0.20.3-1.cloudera.src.rpm

Cheers
Chad

On Wed, Mar 3, 2010 at 3:19 PM, Andrew Purtell <ap...@apache.org> wrote:

> I built the HBase RPMs for Cloudera. Just for future reference if someone
> needs patched
> versions of those RPMs, it's easy enough for me to spin them for you. Just
> drop me a
> note.
>
> And/or you may want to send a note to Cloudera explaining your needs.
>
> I put together a version of Cloudera-ized HBase with HBASE-2174 and
> HBASE-2180 applied
> and put it here:
>
>    https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
>
> Seems like a generally useful thing. To use:
>
>   % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
>   % cd /usr/src/redhat/SPECS
>   % rpmbuild -bb --target noarch hbase.spec
>
> and install or update using the resulting HBase binary RPMs in
> /usr/src/redhat/RPMS/noarch.
>
>   - Andy
>
>
>
> ----- Original Message ----
> > From: Michael Segel <mi...@hotmail.com>
> > To: hbase-user@hadoop.apache.org
> > Sent: Thu, March 4, 2010 3:08:49 AM
> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >
> >
> > Thanks.
> >
> > Cloudera's release doesn't ship with the source code, but luckily we have
> > the source when we wanted to test 20.3 code.
> >
> > Thanks again!
>
>
>
>
>

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Stack <st...@duboce.net>.
Want to paste the complaint?  Maybe someone like Chad has seen it before?
Thanks Michael
St.Ack

On Fri, Mar 5, 2010 at 7:34 AM, Michael Segel <mi...@hotmail.com> wrote:
>
> Yes,
>
> I could find lzo-devel and brought it down, but it complained about a dependency,
> which I thought was weird.
>
> Unfortunately I'm running out of time on this... got other fun stuff to do.
>
>
>
>> Date: Thu, 4 Mar 2010 20:17:31 -0800
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> From: stack@duboce.net
>> To: hbase-user@hadoop.apache.org
>>
>> If you look for lzo-devel up in the usual repos, can you not find it?
>> St.Ack
>>
>> On Thu, Mar 4, 2010 at 8:04 PM, Michael Segel <mi...@hotmail.com> wrote:
>> >
>> > Ok,
>> >
>> > Ran in to a little problem....
>> > # rpmbuild -bb --target noarch hbase.spec
>> > Building target platforms: noarch
>> > Building for target noarch
>> > error: Failed build dependencies:
>> >        lzo-devel is needed by hbase-0.20-0.20.3-2.cloudera.noarch
>> >        /usr/bin/git is needed by hbase-0.20-0.20.3-2.cloudera.noarch
>> >
>> > So I did download and set up git, but the lzo-devel is a bit problematic.
>> >
>> > Is it possible to get the RPM from you, or if you know of a good work around, I'd appreciate it.
>> >
>> > Thx
>> >
>> > -Mike
>> >
>> >
>> >> Date: Wed, 3 Mar 2010 15:19:58 -0800
>> >> From: apurtell@apache.org
>> >> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> >> To: hbase-user@hadoop.apache.org
>> >> CC: chad@cloudera.com
>> >>
>> >> I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
>> >> versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
>> >> note.
>> >>
>> >> And/or you may want to send a note to Cloudera explaining your needs.
>> >>
>> >> I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
>> >> and put it here:
>> >>
>> >>     https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
>> >>
>> >> Seems like a generally useful thing. To use:
>> >>
>> >>    % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
>> >>    % cd /usr/src/redhat/SPECS
>> >>    % rpmbuild -bb --target noarch hbase.spec
>> >>
>> >> and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
>> >>
>> >>    - Andy
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> > From: Michael Segel <mi...@hotmail.com>
>> >> > To: hbase-user@hadoop.apache.org
>> >> > Sent: Thu, March 4, 2010 3:08:49 AM
>> >> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
>> >> >
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Cloudera's release doesn't ship with the source code, but luckily we have the
>> >> > source when we wanted to test 20.3 code.
>> >> >
>> >> > Thanks again!
>> >>
>> >>
>> >>
>> >>
>> >

RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.
Yes,

I could find lzo-devel and brought it down, but it complained about a dependency,
which I thought was weird.

Unfortunately I'm running out of time on this... got other fun stuff to do.



> Date: Thu, 4 Mar 2010 20:17:31 -0800
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> From: stack@duboce.net
> To: hbase-user@hadoop.apache.org
> 
> If you look for lzo-devel up in the usual repos, can you not find it?
> St.Ack
> 
> On Thu, Mar 4, 2010 at 8:04 PM, Michael Segel <mi...@hotmail.com> wrote:
> >
> > Ok,
> >
> > Ran in to a little problem....
> > # rpmbuild -bb --target noarch hbase.spec
> > Building target platforms: noarch
> > Building for target noarch
> > error: Failed build dependencies:
> >        lzo-devel is needed by hbase-0.20-0.20.3-2.cloudera.noarch
> >        /usr/bin/git is needed by hbase-0.20-0.20.3-2.cloudera.noarch
> >
> > So I did download and set up git, but the lzo-devel is a bit problematic.
> >
> > Is it possible to get the RPM from you, or if you know of a good workaround, I'd appreciate it.
> >
> > Thx
> >
> > -Mike
> >
> >
> >> Date: Wed, 3 Mar 2010 15:19:58 -0800
> >> From: apurtell@apache.org
> >> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> >> To: hbase-user@hadoop.apache.org
> >> CC: chad@cloudera.com
> >>
> >> I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
> >> versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
> >> note.
> >>
> >> And/or you may want to send a note to Cloudera explaining your needs.
> >>
> >> I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
> >> and put it here:
> >>
> >>     https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
> >>
> >> Seems like a generally useful thing. To use:
> >>
> >>    % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
> >>    % cd /usr/src/redhat/SPECS
> >>    % rpmbuild -bb --target noarch hbase.spec
> >>
> >> and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
> >>
> >>    - Andy
> >>
> >>
> >>
> >> ----- Original Message ----
> >> > From: Michael Segel <mi...@hotmail.com>
> >> > To: hbase-user@hadoop.apache.org
> >> > Sent: Thu, March 4, 2010 3:08:49 AM
> >> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >> >
> >> >
> >> > Thanks.
> >> >
> >> > Cloudera's release doesn't ship with the source code, but luckily we have the
> >> > source when we wanted to test 20.3 code.
> >> >
> >> > Thanks again!
> >>
> >>
> >>
> >>
> >

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Stack <st...@duboce.net>.
If you look for lzo-devel up in the usual repos, can you not find it?
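
For example, on a RHEL/CentOS box something like this should turn it up
(a sketch; assumes an EPEL-style extras repo is configured):

   % yum search lzo
   % sudo yum install lzo lzo-devel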
St.Ack

On Thu, Mar 4, 2010 at 8:04 PM, Michael Segel <mi...@hotmail.com> wrote:
>
> Ok,
>
> Ran in to a little problem....
> # rpmbuild -bb --target noarch hbase.spec
> Building target platforms: noarch
> Building for target noarch
> error: Failed build dependencies:
>        lzo-devel is needed by hbase-0.20-0.20.3-2.cloudera.noarch
>        /usr/bin/git is needed by hbase-0.20-0.20.3-2.cloudera.noarch
>
> So I did download and set up git, but the lzo-devel is a bit problematic.
>
> Is it possible to get the RPM from you, or if you know of a good workaround, I'd appreciate it.
>
> Thx
>
> -Mike
>
>
>> Date: Wed, 3 Mar 2010 15:19:58 -0800
>> From: apurtell@apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> To: hbase-user@hadoop.apache.org
>> CC: chad@cloudera.com
>>
>> I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
>> versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
>> note.
>>
>> And/or you may want to send a note to Cloudera explaining your needs.
>>
>> I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
>> and put it here:
>>
>>     https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
>>
>> Seems like a generally useful thing. To use:
>>
>>    % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
>>    % cd /usr/src/redhat/SPECS
>>    % rpmbuild -bb --target noarch hbase.spec
>>
>> and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
>>
>>    - Andy
>>
>>
>>
>> ----- Original Message ----
>> > From: Michael Segel <mi...@hotmail.com>
>> > To: hbase-user@hadoop.apache.org
>> > Sent: Thu, March 4, 2010 3:08:49 AM
>> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
>> >
>> >
>> > Thanks.
>> >
>> > Cloudera's release doesn't ship with the source code, but luckily we have the
>> > source when we wanted to test 20.3 code.
>> >
>> > Thanks again!
>>
>>
>>
>>
>

RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.
Ok,

Ran in to a little problem....
# rpmbuild -bb --target noarch hbase.spec
Building target platforms: noarch
Building for target noarch
error: Failed build dependencies:
        lzo-devel is needed by hbase-0.20-0.20.3-2.cloudera.noarch
        /usr/bin/git is needed by hbase-0.20-0.20.3-2.cloudera.noarch

So I did download and set up git, but the lzo-devel is a bit problematic.

Is it possible to get the RPM from you, or if you know of a good workaround, I'd appreciate it.

Thx

-Mike


> Date: Wed, 3 Mar 2010 15:19:58 -0800
> From: apurtell@apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> To: hbase-user@hadoop.apache.org
> CC: chad@cloudera.com
> 
> I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
> versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
> note.
> 
> And/or you may want to send a note to Cloudera explaining your needs.
> 
> I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
> and put it here: 
> 
>     https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm
> 
> Seems like a generally useful thing. To use:
> 
>    % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
>    % cd /usr/src/redhat/SPECS
>    % rpmbuild -bb --target noarch hbase.spec
> 
> and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
> 
>    - Andy
> 
> 
> 
> ----- Original Message ----
> > From: Michael Segel <mi...@hotmail.com>
> > To: hbase-user@hadoop.apache.org
> > Sent: Thu, March 4, 2010 3:08:49 AM
> > Subject: RE: Trying to understand HBase/ZooKeeper Logs
> > 
> > 
> > Thanks.
> > 
> > Cloudera's release doesn't ship with the source code, but luckily we have the 
> > source when we wanted to test 20.3 code.
> > 
> > Thanks again!
> 
> 
>       
> 
 		 	   		  

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Andrew Purtell <ap...@apache.org>.
I built the HBase RPMs for Cloudera. Just for future reference if someone needs patched
versions of those RPMs, it's easy enough for me to spin them for you. Just drop me a
note.

And/or you may want to send a note to Cloudera explaining your needs.

I put together a version of Cloudera-ized HBase with HBASE-2174 and HBASE-2180 applied
and put it here: 

    https://hbase.s3.amazonaws.com/cdh/hbase-0.20-0.20.3-2.cloudera.src.rpm

Seems like a generally useful thing. To use:

   % rpm -Uvh hbase-0.20-0.20.3-2.cloudera.src.rpm
   % cd /usr/src/redhat/SPECS
   % rpmbuild -bb --target noarch hbase.spec

and install or update using the resulting HBase binary RPMs in /usr/src/redhat/RPMS/noarch.
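
For example (exact file names depend on what the spec produces; adjust to
whatever actually lands in RPMS/noarch):

   % rpm -Uvh /usr/src/redhat/RPMS/noarch/hbase-0.20-*.noarch.rpm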

   - Andy



----- Original Message ----
> From: Michael Segel <mi...@hotmail.com>
> To: hbase-user@hadoop.apache.org
> Sent: Thu, March 4, 2010 3:08:49 AM
> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> 
> 
> Thanks.
> 
> Cloudera's release doesn't ship with the source code, but luckily we have the 
> source when we wanted to test 20.3 code.
> 
> Thanks again!


      


RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.
Thanks.

Cloudera's release doesn't ship with the source code, but luckily we have the source when we wanted to test 20.3 code.

Thanks again!


> Date: Wed, 3 Mar 2010 10:54:26 -0800
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> From: jdcryans@apache.org
> To: hbase-user@hadoop.apache.org
> 
> So get the patch in your hbase root, on linux do: wget
> https://issues.apache.org/jira/secure/attachment/12436659/HBASE-2174_0.20.3.patch
> 
> then run: patch -p0 < HBASE-2174_0.20.3.patch
> 
> finally compile: ant tar
> 
> The new tar will be in build/
> 
> J-D
> 
> On Wed, Mar 3, 2010 at 10:52 AM, Michael Segel
> <mi...@hotmail.com> wrote:
> >
> > Hey!
> >
> > Thanks for the responses.
> > It looks like the patch I was pointed to may solve the issue.
> >
> > We've had some network latency issues. Again the 50ms was something I found quickly in the logs and if I had a failure after turning on all of the debugging, I think I could have drilled down to the issue.
> >
> > I don't manage the DNS setup, so I can't say what's 'strange' or different.
> > The only thing that I know we did was set up a CNAME alias to the NameNode and JobTracker to make it easier to 'hide' the cloud and then give developers an easy-to-remember name to point to. I don't think that should cause it, although it could be something in how they set up their reverse DNS. If the patch works, I'll be happy.
> >
> > Now for the $64,000 question.
> > Any pointers on how to apply the patch?
> > I'm just used to pulling the distro from the website...
> >
> > Thanks again!
> >
> > -Mike
> >
> >
> >> From: jlist@streamy.com
> >> To: hbase-user@hadoop.apache.org
> >> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >> Date: Wed, 3 Mar 2010 10:21:10 -0800
> >>
> >> What version of HBase are you running?  There were some recent fixes related
> >> to DNS issues causing regionservers to check in to the master as a different
> >> name.  Anything strange about the network or DNS setup of your cluster?
> >>
> >> ZooKeeper is sensitive to pauses and network latency, as is any
> >> fault-tolerant distributed system.  ZK and HBase must determine when
> >> something has "failed", and the primary way is that it has not responded
> >> within some period of time.  50ms is negligible from a fault-detection
> >> standpoint, but 50 seconds is not.
> >>
> >> -----Original Message-----
> >> From: Michael Segel [mailto:michael_segel@hotmail.com]
> >> Sent: Wednesday, March 03, 2010 9:29 AM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> >>
> >>
> >>
> >>
> >> > Date: Wed, 3 Mar 2010 09:17:06 -0800
> >> > From: phunt@apache.org
> >> > To: hbase-user@hadoop.apache.org
> >> > Subject: Re: Trying to understand HBase/ZooKeeper Logs
> >> [SNIP]
> >> > There are a few issues involved with the ping time:
> >> >
> >> > 1) the network (obv :-) )
> >> > 2) the zk server - if the server is highly loaded the pings may take
> >> > longer. The heartbeat is also a "health check" that the client is doing
> >> > against the server (as much as it is a "health check" for the server
> >> > that the client is still live). The HB is routed "all the way" through
> >> > the ZK server, ie through the processing pipeline. So if the server were
> >> > stalled it would not respond immediately (vs say reading the HB at the
> >> > thread that reads data from the client). You can see the min/max/avg
> >> > request latencies on the zk server by using the "stat" four-letter word. See
> >> > the ZK admin docs on this http://bit.ly/dglVld
> >> > 3) the zk client - clients can only process HB responses if they are
> >> > running. Say the JVM GC runs in blocking mode, this will block all
> >> > client threads (incl the zk client thread) and the HB response will sit
> >> > until the GC is finished. This is why HBase RSs typically use very very
> >> > large (from our, zk, perspective) session timeouts.
> >> >
> >> > 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> >> >
> >> > I can't shed direct light on this (ie what's the problem in hbase that
> >> > could cause your issue). I'll let jd/stack comment on that.
> >> >
> >> > Patrick
> >> >
> >>
> >> Thanks for the quick response.
> >>
> >> I'm trying to track down the issue of why we're getting a lot of 'partial'
> >> failures. Unfortunately this is currently a lot like watching a pot boil.
> >> :-(
> >>
> >> What I am calling a 'partial failure' is that the region servers are
> >> spawning second or even third instances where only the last one appears to
> >> be live.
> >>
> >> From what I can tell, there's a spike of network activity that
> >> causes one of the processes to think that there is something wrong and spawn
> >> a new instance.
> >>
> >> Is this a good description?
> >>
> >> Because some of the failures occur late at night with no load on the system,
> >> I suspect that we have issues with the network but I can't definitively say.
> >>
> >> Which process is the most sensitive to network latency issues?
> >>
> >> Sorry, still relatively new to HBase and I'm trying to track down a nasty
> >> issue that causes HBase to fail on an almost regular basis. I think it's a
> >> networking issue, but I can't be sure.
> >>
> >> Thx
> >>
> >> -Mike
> >>
> >>
> >>
> >>
> >>

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Jean-Daniel Cryans <jd...@apache.org>.
So get the patch in your hbase root, on linux do: wget
https://issues.apache.org/jira/secure/attachment/12436659/HBASE-2174_0.20.3.patch

then run: patch -p0 < HBASE-2174_0.20.3.patch

finally compile: ant tar

The new tar will be in build/
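
(If you want to check that it applies cleanly before touching anything,
GNU patch has a dry-run mode:

   % patch -p0 --dry-run < HBASE-2174_0.20.3.patch

and only patch for real once that reports no failed hunks.)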

J-D

On Wed, Mar 3, 2010 at 10:52 AM, Michael Segel
<mi...@hotmail.com> wrote:
>
> Hey!
>
> Thanks for the responses.
> It looks like the patch I was pointed to may solve the issue.
>
> We've had some network latency issues. Again the 50ms was something I found quickly in the logs and if I had a failure after turning on all of the debugging, I think I could have drilled down to the issue.
>
> I don't manage the DNS setup, so I can't say what's 'strange' or different.
> The only thing that I know we did was set up a CNAME alias to the NameNode and JobTracker to make it easier to 'hide' the cloud and then give developers an easy-to-remember name to point to. I don't think that should cause it, although it could be something in how they set up their reverse DNS. If the patch works, I'll be happy.
>
> Now for the $64,000 question.
> Any pointers on how to apply the patch?
> I'm just used to pulling the distro from the website...
>
> Thanks again!
>
> -Mike
>
>
>> From: jlist@streamy.com
>> To: hbase-user@hadoop.apache.org
>> Subject: RE: Trying to understand HBase/ZooKeeper Logs
>> Date: Wed, 3 Mar 2010 10:21:10 -0800
>>
>> What version of HBase are you running?  There were some recent fixes related
>> to DNS issues causing regionservers to check in to the master as a different
>> name.  Anything strange about the network or DNS setup of your cluster?
>>
>> ZooKeeper is sensitive to pauses and network latency, as is any
>> fault-tolerant distributed system.  ZK and HBase must determine when
>> something has "failed", and the primary way is that it has not responded
>> within some period of time.  50ms is negligible from a fault-detection
>> standpoint, but 50 seconds is not.
>>
>> -----Original Message-----
>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>> Sent: Wednesday, March 03, 2010 9:29 AM
>> To: hbase-user@hadoop.apache.org
>> Subject: RE: Trying to understand HBase/ZooKeeper Logs
>>
>>
>>
>>
>> > Date: Wed, 3 Mar 2010 09:17:06 -0800
>> > From: phunt@apache.org
>> > To: hbase-user@hadoop.apache.org
>> > Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> [SNIP]
>> > There are a few issues involved with the ping time:
>> >
>> > 1) the network (obv :-) )
>> > 2) the zk server - if the server is highly loaded the pings may take
>> > longer. The heartbeat is also a "health check" that the client is doing
>> > against the server (as much as it is a "health check" for the server
>> > that the client is still live). The HB is routed "all the way" through
>> > the ZK server, ie through the processing pipeline. So if the server were
>> > stalled it would not respond immediately (vs say reading the HB at the
>> > thread that reads data from the client). You can see the min/max/avg
>> > request latencies on the zk server by using the "stat" four-letter word. See
>> > the ZK admin docs on this http://bit.ly/dglVld
>> > 3) the zk client - clients can only process HB responses if they are
>> > running. Say the JVM GC runs in blocking mode, this will block all
>> > client threads (incl the zk client thread) and the HB response will sit
>> > until the GC is finished. This is why HBase RSs typically use very very
>> > large (from our, zk, perspective) session timeouts.
>> >
>> > 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>> >
>> > I can't shed direct light on this (ie what's the problem in hbase that
>> > could cause your issue). I'll let jd/stack comment on that.
>> >
>> > Patrick
>> >
>>
>> Thanks for the quick response.
>>
>> I'm trying to track down the issue of why we're getting a lot of 'partial'
>> failures. Unfortunately this is currently a lot like watching a pot boil.
>> :-(
>>
>> What I am calling a 'partial failure' is that the region servers are
>> spawning second or even third instances where only the last one appears to
>> be live.
>>
>> From what I can tell, there's a spike of network activity that
>> causes one of the processes to think that there is something wrong and spawn
>> a new instance.
>>
>> Is this a good description?
>>
>> Because some of the failures occur late at night with no load on the system,
>> I suspect that we have issues with the network but I can't definitively say.
>>
>> Which process is the most sensitive to network latency issues?
>>
>> Sorry, still relatively new to HBase and I'm trying to track down a nasty
>> issue that causes HBase to fail on an almost regular basis. I think it's a
>> networking issue, but I can't be sure.
>>
>> Thx
>>
>> -Mike
>>
>>
>>
>>
>>

RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.
Hey!

Thanks for the responses.
It looks like the patch I was pointed to may solve the issue.

We've had some network latency issues. Again the 50ms was something I found quickly in the logs and if I had a failure after turning on all of the debugging, I think I could have drilled down to the issue.

I don't manage the DNS setup, so I can't say what's 'strange' or different. 
The only thing that I know we did was set up a CNAME alias to the NameNode and JobTracker to make it easier to 'hide' the cloud and then give developers an easy-to-remember name to point to. I don't think that should cause it, although it could be something in how they set up their reverse DNS. If the patch works, I'll be happy.

Now for the $64,000 question.
Any pointers on how to apply the patch? 
I'm just used to pulling the distro from the website...

Thanks again!

-Mike


> From: jlist@streamy.com
> To: hbase-user@hadoop.apache.org
> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> Date: Wed, 3 Mar 2010 10:21:10 -0800
> 
> What version of HBase are you running?  There were some recent fixes related
> to DNS issues causing regionservers to check in to the master as a different
> name.  Anything strange about the network or DNS setup of your cluster?
> 
> ZooKeeper is sensitive to pauses and network latency, as is any
> fault-tolerant distributed system.  ZK and HBase must determine when
> something has "failed", and the primary way is that it has not responded
> within some period of time.  50ms is negligible from a fault-detection
> standpoint, but 50 seconds is not.
> 
> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com] 
> Sent: Wednesday, March 03, 2010 9:29 AM
> To: hbase-user@hadoop.apache.org
> Subject: RE: Trying to understand HBase/ZooKeeper Logs
> 
> 
> 
> 
> > Date: Wed, 3 Mar 2010 09:17:06 -0800
> > From: phunt@apache.org
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
> > There are a few issues involved with the ping time:
> > 
> > 1) the network (obv :-) )
> > 2) the zk server - if the server is highly loaded the pings may take 
> > longer. The heartbeat is also a "health check" that the client is doing 
> > against the server (as much as it is a "health check" for the server 
> > that the client is still live). The HB is routed "all the way" through 
> > the ZK server, ie through the processing pipeline. So if the server were 
> > stalled it would not respond immediately (vs say reading the HB at the 
> > thread that reads data from the client). You can see the min/max/avg 
> > request latencies on the zk server by using the "stat" four-letter word. See
> > the ZK admin docs on this http://bit.ly/dglVld
> > 3) the zk client - clients can only process HB responses if they are 
> > running. Say the JVM GC runs in blocking mode, this will block all 
> > client threads (incl the zk client thread) and the HB response will sit 
> > until the GC is finished. This is why HBase RSs typically use very very 
> > large (from our, zk, perspective) session timeouts.
> > 
> > 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> > 
> > I can't shed direct light on this (ie what's the problem in hbase that
> > could cause your issue). I'll let jd/stack comment on that.
> > 
> > Patrick
> > 
> 
> Thanks for the quick response.
> 
> I'm trying to track down the issue of why we're getting a lot of 'partial'
> failures. Unfortunately this is currently a lot like watching a pot boil.
> :-( 
> 
> What I am calling a 'partial failure' is that the region servers are
> spawning second or even third instances where only the last one appears to
> be live.
> 
> From what I can tell, there's a spike of network activity that
> causes one of the processes to think that there is something wrong and spawn
> a new instance.
> 
> Is this a good description?
> 
> Because some of the failures occur late at night with no load on the system,
> I suspect that we have issues with the network but I can't definitively say.
> 
> Which process is the most sensitive to network latency issues?
> 
> Sorry, still relatively new to HBase and I'm trying to track down a nasty
> issue that causes HBase to fail on an almost regular basis. I think it's a
> networking issue, but I can't be sure.
> 
> Thx
> 
> -Mike
> 
> 
> 
> 

RE: Trying to understand HBase/ZooKeeper Logs

Posted by Jonathan Gray <jl...@streamy.com>.
What version of HBase are you running?  There were some recent fixes related
to DNS issues causing regionservers to check in to the master as a different
name.  Anything strange about the network or DNS setup of your cluster?

ZooKeeper is sensitive to pauses and network latency, as is any
fault-tolerant distributed system.  ZK and HBase must determine when
something has "failed", and the primary way is that it has not responded
within some period of time.  50ms is negligible from a fault-detection
standpoint, but 50 seconds is not.

-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Wednesday, March 03, 2010 9:29 AM
To: hbase-user@hadoop.apache.org
Subject: RE: Trying to understand HBase/ZooKeeper Logs




> Date: Wed, 3 Mar 2010 09:17:06 -0800
> From: phunt@apache.org
> To: hbase-user@hadoop.apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
[SNIP]
> There are a few issues involved with the ping time:
> 
> 1) the network (obv :-) )
> 2) the zk server - if the server is highly loaded the pings may take 
> longer. The heartbeat is also a "health check" that the client is doing 
> against the server (as much as it is a "health check" for the server 
> that the client is still live). The HB is routed "all the way" through 
> the ZK server, ie through the processing pipeline. So if the server were 
> stalled it would not respond immediately (vs say reading the HB at the 
> thread that reads data from the client). You can see the min/max/avg 
> request latencies on the zk server by using the "stat" four-letter word. See
> the ZK admin docs on this http://bit.ly/dglVld
> 3) the zk client - clients can only process HB responses if they are 
> running. Say the JVM GC runs in blocking mode, this will block all 
> client threads (incl the zk client thread) and the HB response will sit 
> until the GC is finished. This is why HBase RSs typically use very very 
> large (from our, zk, perspective) session timeouts.
> 
> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> 
> I can't shed direct light on this (ie what's the problem in hbase that
> could cause your issue). I'll let jd/stack comment on that.
> 
> Patrick
> 

Thanks for the quick response.

I'm trying to track down the issue of why we're getting a lot of 'partial'
failures. Unfortunately this is currently a lot like watching a pot boil.
:-( 

What I am calling a 'partial failure' is that the region servers are
spawning second or even third instances where only the last one appears to
be live.

From what I can tell, there's a spike of network activity that
causes one of the processes to think that there is something wrong and spawn
a new instance.

Is this a good description?

Because some of the failures occur late at night with no load on the system,
I suspect that we have issues with the network but I can't definitively say.

Which process is the most sensitive to network latency issues?

Sorry, still relatively new to HBase and I'm trying to track down a nasty
issue that causes HBase to fail on an almost regular basis. I think it's a
networking issue, but I can't be sure.

Thx

-Mike




 		 	   		  


RE: Trying to understand HBase/ZooKeeper Logs

Posted by Michael Segel <mi...@hotmail.com>.


> Date: Wed, 3 Mar 2010 09:17:06 -0800
> From: phunt@apache.org
> To: hbase-user@hadoop.apache.org
> Subject: Re: Trying to understand HBase/ZooKeeper Logs
[SNIP]
> There are a few issues involved with the ping time:
> 
> 1) the network (obv :-) )
> 2) the zk server - if the server is highly loaded the pings may take 
> longer. The heartbeat is also a "health check" that the client is doing 
> against the server (as much as it is a "health check" for the server 
> that the client is still live). The HB is routed "all the way" through 
> the ZK server, ie through the processing pipeline. So if the server were 
> stalled it would not respond immediately (vs say reading the HB at the 
> thread that reads data from the client). You can see the min/max/avg 
> request latencies on the zk server by using the "stat" four-letter word. See
> the ZK admin docs on this http://bit.ly/dglVld
> 3) the zk client - clients can only process HB responses if they are 
> running. Say the JVM GC runs in blocking mode, this will block all 
> client threads (incl the zk client thread) and the HB response will sit 
> until the GC is finished. This is why HBase RSs typically use very very 
> large (from our, zk, perspective) session timeouts.
> 
> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
> 
> I can't shed direct light on this (ie what's the problem in hbase that
> could cause your issue). I'll let jd/stack comment on that.
> 
> Patrick
> 

Thanks for the quick response.

I'm trying to track down the issue of why we're getting a lot of 'partial' failures. Unfortunately this is currently a lot like watching a pot boil. :-( 

What I am calling a 'partial failure' is that the region servers are spawning second or even third instances where only the last one appears to be live.

From what I can tell, there's a spike of network activity that causes one of the processes to think that there is something wrong and spawn a new instance.

Is this a good description?

Because some of the failures occur late at night with no load on the system, I suspect that we have issues with the network but I can't definitively say.

Which process is the most sensitive to network latency issues?

Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't be sure.

Thx

-Mike




 		 	   		  

Re: Trying to understand HBase/ZooKeeper Logs

Posted by Patrick Hunt <ph...@apache.org>.
Michael Segel wrote:
> Hi,
> 
> I'm trying to debug an issue where I am getting 'partial' failures. For some reason the region servers seem to end up with multiple 'live' server instances on a node. (We start with 3 servers and the next morning we see 4, 5 or 6 servers, where a single node has multiple servers listed as 'live'.) Yet if you do a list or a scan on a table, an exception gets thrown. (The next time we have a failure I'll include the exception....)
> 
> I've set all of the logging to DEBUG, so I should be picking up as much information as possible.
> 
> The master log shows the following:
> 2010-03-02 20:05:40,712 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2010-03-02 20:05:45,000 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 3ms
> 2010-03-02 20:06:05,032 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 35ms
> 2010-03-02 20:06:24,998 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x2720dcdb350000 after 0ms
> 2010-03-02 20:06:39,563 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.6666666666666667
> 2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.8.237.230:60020, regionname: -ROOT-,,0, startKey: <>}
> 2010-03-02 20:06:40,705 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.8.237.232:60020, regionname: .META.,,1, startKey: <>}
> 
> (Hopefully this formats ok...)
> 
> I'm trying to understand what I'm seeing.
> Am I correct in saying that this is the master node pinging the ZooKeeper leader, as a way of maintaining a heartbeat to check that ZooKeeper is alive?

Yes, ZK clients (hbase region servers) maintain persistent tcp 
connections to the ZK server. Heartbeats are used to maintain the 
liveness of the ZK session for that client. This is done by the ZK 
client lib, not by hbase directly.
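
If you want to see those sessions from the server side, the four-letter
words are handy here too; for example (the hostname is an example, 2181
is the default client port):

   % echo dump | nc zk1.example.com 2181

Run against the current ZK leader, "dump" lists the outstanding sessions
and their ephemeral nodes, so you can tell whether a region server really
holds one session or several.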

> 
> On the region servers, every node shows roughly the following:
> 2010-03-03 09:31:52,086 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=2.9179459MB (3059688), Free=352.64456MB (369774616), Max=355.5625MB (372834304), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
> 2010-03-03 09:31:52,222 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
> 2010-03-03 09:32:12,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
> 2010-03-03 09:32:32,223 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x12722bb961d0001 after 0ms
> 
> 
> Going through the logs, I see 0-1ms response times from the region servers to ZooKeeper.
> 
> I'm trying to track down why I'm having partial failures.
> That is, on a region server node, I see multiple 'live' server instances, where only one is actually alive.
> (This problem is intermittent and I haven't seen a failure [yet] since I turned on the debugging.)
> 
> Is it normal to see pings as long as 50ms when the master pings ZooKeeper?

There are a few issues involved with the ping time:

1) the network (obv :-) )
2) the zk server - if the server is highly loaded the pings may take 
longer. The heartbeat is also a "health check" that the client is doing 
against the server (as much as it is a "health check" for the server 
that the client is still live). The HB is routed "all the way" through 
the ZK server, ie through the processing pipeline. So if the server were 
stalled it would not respond immediately (vs say reading the HB at the 
thread that reads data from the client). You can see the min/max/avg 
request latencies on the zk server by using the "stat" four-letter word 
(example after this list). See 
the ZK admin docs on this http://bit.ly/dglVld
3) the zk client - clients can only process HB responses if they are 
running. Say the JVM GC runs in blocking mode, this will block all 
client threads (incl the zk client thread) and the HB response will sit 
until the GC is finished. This is why HBase RSs typically use very very 
large (from our, zk, perspective) session timeouts.

50ms is not long btw. I believe that RS are using >> 30sec timeouts.

I can't shed direct light on this (ie what's the problem in hbase that 
could cause your issue). I'll let jd/stack comment on that.

Patrick

