Posted to user@hbase.apache.org by Kazuki Ohta <ka...@gmail.com> on 2011/04/20 20:41:30 UTC

massive zk expirations under heavy network load

Hi,

I'm now running CDH3u0 on a 16-node cluster (hdp0-hdp15).
The configuration is below.

hdp0: zk + master + region + nn + dn + jt + tt
hdp1: zk + master + region + snn + dn + tt
hdp2: zk + region + dn + tt
hdp3 to hdp15: region + dn + tt

Usually it works really well. But once a user submits a MapReduce
job that requires massive network transfer in the shuffle phase,
the master gets a zk session timeout exception and fails over to
another master.

The problem is that the shuffle traffic saturates the switch, so
important zk packets are not delivered in time.

Even Ganglia monitoring seems to stop at that time. And mr task
attempts also get zk session timeouts and die all at once (about
100 tasks die at the same time; input and output are both hbase).

This is a potential problem when running MapReduce jobs alongside
HBase. Does anyone know a good solution for this phenomenon?

Of course I should isolate the hbase-master from the task tracker.
That would avoid the hbase-master failover problem, but it cannot
prevent the mr tasks from getting zk session expirations all at the
same time.
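
As a rough way to check whether the zk heartbeats are really being starved
(just a sketch using standard tools; hdp0 is one of our zk hosts), I can run
the following from a region server node while the shuffle is in flight:

$ ping -i 0.2 hdp0    # RTT spikes of seconds, or outright losses, during
                      # the shuffle would line up with the expirations
$ netstat -i          # watch whether the RX-DRP / TX-DRP columns climb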

Thanks
Kazuki

-- 
--------------------------------------------------
Kazuki Ohta: http://kzk9.net/

Re: massive zk expirations under heavy network load

Posted by Ted Dunning <td...@maprtech.com>.
This is your problem.  Sounds like a very deficient switch.

On Wed, Apr 20, 2011 at 11:41 AM, Kazuki Ohta <ka...@gmail.com> wrote:

> The problem is that the shuffle traffic saturates the switch, so
> important zk packets are not delivered in time.
>

Re: massive zk expirations under heavy network load

Posted by Kazuki Ohta <ka...@gmail.com>.
Hi, Todd

Thx for replying in Japanese :-) lspci shows the following hardware.

$ lspci  | grep Ether
01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

> http://ark.intel.com/Product.aspx?id=32209

Regards
Kazuki

2011/5/19 Todd Lipcon <to...@cloudera.com>:
> こんにちは大田さん :) (Hello, Ohta-san; still practicing Japanese over here)
>
> Could you paste the output from "lspci" to the list? It would be
> useful to know which particular hardware you had this problem with, so
> we can watch out for it and know to upgrade e1000e.
>
> Thanks
> -Todd
>
> On Wed, May 18, 2011 at 9:02 PM, Kazuki Ohta <ka...@gmail.com> wrote:
>> Hi, all
>>
>> Finally got my cluster stable by upgrading the network driver instead of
>> changing the switch.
>>
>> We were using the e1000e driver on CentOS 5.5; upgrading to e1000e-1.3.10a.tar.gz,
>> the most recent version, dramatically reduced the # of dropped packets under
>> heavy load.
>>
>>> http://bit.ly/lTJsV1
>>
>> Thanks for the help!
>> Kazuki
>>
>> On Thu, Apr 21, 2011 at 11:24 AM, Kazuki Ohta <ka...@gmail.com> wrote:
>>> Hi, All
>>>
>>> Thanks for the helpful comments!
>>> Nice to hear that this rarely happens in other environments.
>>>
>>> Actually, I've already changed the configuration so that no tasks run on the
>>> master node, but the same problem happened.
>>>
>>> So, as a first step, we will upgrade the switch. I'll report back on whether
>>> that fixes the problem.
>>>
>>> Thanks
>>> Kazuki
>>>
>>> On Thu, Apr 21, 2011 at 5:32 AM, Gary Helmling <gh...@gmail.com> wrote:
>>>>> I'm now running CDH3u0 on a 16-node cluster (hdp0-hdp15).
>>>>> The configuration is below.
>>>>>
>>>>> hdp0: zk + master + region + nn + dn + jt + tt
>>>>> hdp1: zk + master + region + snn + dn + tt
>>>>> hdp2: zk + region + dn + tt
>>>>> hdp3 to hdp15: region + dn + tt
>>>>>
>>>>>
>>>> I would also look at the memory configuration for your servers and the
>>>> amount of heap allocated to each process.  Is it possible hdp0 is swapping
>>>> when running a MR job?  Swapping will cause big headaches and is often a
>>>> culprit for zk session timeouts.
>>>>
>>>> Between the 7 processes it has plus any child tasks started, it's not hard
>>>> to picture overcommitting memory.
>>>>
>>>> Regardless of whether the core problem lies in network hardware or here, I
>>>> would remove the region server, data node, and task tracker processes from
>>>> hdp0 and hdp1 for smoother operation.
>>>>
>>>> --gh
>>>>
>>>
>>>
>>>
>>> --
>>> --------------------------------------------------
>>> Kazuki Ohta: http://kzk9.net/
>>> CTO at Preferred Infrastructure: http://preferred.jp/
>>>
>>
>>
>>
>> --
>> --------------------------------------------------
>> Kazuki Ohta: http://kzk9.net/
>> CTO at Preferred Infrastructure: http://preferred.jp/
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
--------------------------------------------------
Kazuki Ohta: http://kzk9.net/
CTO at Preferred Infrastructure: http://preferred.jp/

Re: massive zk expirations under heavy network load

Posted by Todd Lipcon <to...@cloudera.com>.
こんにちは大田さん :) (Hello, Ohta-san; still practicing Japanese over here)

Could you paste the output from "lspci" to the list? It would be
useful to know which particular hardware you had this problem with, so
we can watch out for it and know to upgrade e1000e.

Thanks
-Todd

On Wed, May 18, 2011 at 9:02 PM, Kazuki Ohta <ka...@gmail.com> wrote:
> Hi, all
>
> Finally got my cluster stable by upgrading the network driver instead of
> changing the switch.
>
> We were using the e1000e driver on CentOS 5.5; upgrading to e1000e-1.3.10a.tar.gz,
> the most recent version, dramatically reduced the # of dropped packets under
> heavy load.
>
>> http://bit.ly/lTJsV1
>
> Thanks for the help!
> Kazuki
>
> On Thu, Apr 21, 2011 at 11:24 AM, Kazuki Ohta <ka...@gmail.com> wrote:
>> Hi, All
>>
>> Thanks for the helpful comments!
>> Nice to hear that this rarely happens in other environments.
>>
>> Actually, I've already changed the configuration so that no tasks run on the
>> master node, but the same problem happened.
>>
>> So, as a first step, we will upgrade the switch. I'll report back on whether
>> that fixes the problem.
>>
>> Thanks
>> Kazuki
>>
>> On Thu, Apr 21, 2011 at 5:32 AM, Gary Helmling <gh...@gmail.com> wrote:
>>>> I'm now running CDH3u0 on a 16-node cluster (hdp0-hdp15).
>>>> The configuration is below.
>>>>
>>>> hdp0: zk + master + region + nn + dn + jt + tt
>>>> hdp1: zk + master + region + snn + dn + tt
>>>> hdp2: zk + region + dn + tt
>>>> hdp3 to hdp15: region + dn + tt
>>>>
>>>>
>>> I would also look at the memory configuration for your servers and the
>>> amount of heap allocated to each process.  Is it possible hdp0 is swapping
>>> when running a MR job?  Swapping will cause big headaches and is often a
>>> culprit for zk session timeouts.
>>>
>>> Between the 7 processes it has plus any child tasks started, it's not hard
>>> to picture overcommitting memory.
>>>
>>> Regardless of whether the core problem lies in network hardware or here, I
>>> would remove the region server, data node, and task tracker processes from
>>> hdp0 and hdp1 for smoother operation.
>>>
>>> --gh
>>>
>>
>>
>>
>> --
>> --------------------------------------------------
>> Kazuki Ohta: http://kzk9.net/
>> CTO at Preferred Infrastructure: http://preferred.jp/
>>
>
>
>
> --
> --------------------------------------------------
> Kazuki Ohta: http://kzk9.net/
> CTO at Preferred Infrastructure: http://preferred.jp/
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: massive zk expirations under heavy network load

Posted by Kazuki Ohta <ka...@gmail.com>.
Hi, all

Finally got my cluster stable by upgrading the network driver instead of
changing the switch.

We were using the e1000e driver on CentOS 5.5; upgrading to e1000e-1.3.10a.tar.gz,
the most recent version, dramatically reduced the # of dropped packets under
heavy load.

> http://bit.ly/lTJsV1
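
For anyone wanting to check their own NICs, a rough sketch of the commands
involved (interface name assumed to be eth0; counter names vary by driver,
so the grep pattern is only approximate):

$ ethtool -i eth0                          # driver name (e1000e) and version
$ ethtool -S eth0 | grep -iE 'drop|miss'   # NIC/driver drop and miss counters
$ ifconfig eth0 | grep -i dropped          # kernel-side RX/TX drop counts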

Thanks for the help!
Kazuki

On Thu, Apr 21, 2011 at 11:24 AM, Kazuki Ohta <ka...@gmail.com> wrote:
> Hi, All
>
> Thanks for the helpful comments!
> Nice to hear that this rarely happens in other environments.
>
> Actually, I've already changed the configuration so that no tasks run on the
> master node, but the same problem happened.
>
> So, as a first step, we will upgrade the switch. I'll report back on whether
> that fixes the problem.
>
> Thanks
> Kazuki
>
> On Thu, Apr 21, 2011 at 5:32 AM, Gary Helmling <gh...@gmail.com> wrote:
>>> I'm now running CDH3u0 on a 16-node cluster (hdp0-hdp15).
>>> The configuration is below.
>>>
>>> hdp0: zk + master + region + nn + dn + jt + tt
>>> hdp1: zk + master + region + snn + dn + tt
>>> hdp2: zk + region + dn + tt
>>> hdp3 to hdp15: region + dn + tt
>>>
>>>
>> I would also look at the memory configuration for your servers and the
>> amount of heap allocated to each process.  Is it possible hdp0 is swapping
>> when running a MR job?  Swapping will cause big headaches and is often a
>> culprit for zk session timeouts.
>>
>> Between the 7 processes it has plus any child tasks started, it's not hard
>> to picture overcommitting memory.
>>
>> Regardless of whether the core problem lies in network hardware or here, I
>> would remove the region server, data node, and task tracker processes from
>> hdp0 and hdp1 for smoother operation.
>>
>> --gh
>>
>
>
>
> --
> --------------------------------------------------
> Kazuki Ohta: http://kzk9.net/
> CTO at Preferred Infrastructure: http://preferred.jp/
>



-- 
--------------------------------------------------
Kazuki Ohta: http://kzk9.net/
CTO at Preferred Infrastructure: http://preferred.jp/

Re: massive zk expirations under heavy network load

Posted by Kazuki Ohta <ka...@gmail.com>.
Hi, All

Thanks for the helpful comments!
Nice to hear that this rarely happens in other environments.

Actually, I've already changed the configuration so that no tasks run on the
master node, but the same problem happened.

So, as a first step, we will upgrade the switch. I'll report back on whether
that fixes the problem.

Thanks
Kazuki

On Thu, Apr 21, 2011 at 5:32 AM, Gary Helmling <gh...@gmail.com> wrote:
>> I'm now running CDH3u0 on a 16-node cluster (hdp0-hdp15).
>> The configuration is below.
>>
>> hdp0: zk + master + region + nn + dn + jt + tt
>> hdp1: zk + master + region + snn + dn + tt
>> hdp2: zk + region + dn + tt
>> hdp3 to hdp15: region + dn + tt
>>
>>
> I would also look at the memory configuration for your servers and the
> amount of heap allocated to each process.  Is it possible hdp0 is swapping
> when running a MR job?  Swapping will cause big headaches and is often a
> culprit for zk session timeouts.
>
> Between the 7 processes it has plus any child tasks started, it's not hard
> to picture overcommitting memory.
>
> Regardless of whether the core problem lies in network hardware or here, I
> would remove the region server, data node, and task tracker processes from
> hdp0 and hdp1 for smoother operation.
>
> --gh
>



-- 
--------------------------------------------------
Kazuki Ohta: http://kzk9.net/
CTO at Preferred Infrastructure: http://preferred.jp/

Re: massive zk expirations under heavy network load

Posted by Gary Helmling <gh...@gmail.com>.
> I'm now running CDH3u0 on a 16-node cluster (hdp0-hdp15).
> The configuration is below.
>
> hdp0: zk + master + region + nn + dn + jt + tt
> hdp1: zk + master + region + snn + dn + tt
> hdp2: zk + region + dn + tt
> hdp3 to hdp15: region + dn + tt
>
>
I would also look at the memory configuration for your servers and the
amount of heap allocated to each process.  Is it possible hdp0 is swapping
when running a MR job?  Swapping will cause big headaches and is often a
culprit for zk session timeouts.

Between the 7 processes it has plus any child tasks started, it's not hard
to picture overcommitting memory.
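
A quick way to check is with standard Linux tools (nothing HBase-specific
here; run these on hdp0 while a MR job is active):

$ free -m                 # is any swap in use at all?
$ vmstat 5                # watch the si/so columns; sustained nonzero values
                          # mean the box is actively swapping
$ sysctl vm.swappiness    # many HBase installs turn this down toward 0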

Regardless of whether the core problem lies in network hardware or here, I
would remove the region server, data node, and task tracker processes from
hdp0 and hdp1 for smoother operation.

--gh

Re: massive zk expirations under heavy network load

Posted by Andrew Purtell <ap...@apache.org>.
Kazuki-san,

Setting the ZK timeout to a large value will stop the expirations, but of course it may not provide sufficiently fast failure detection for your use case.
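
For example, something like this in hbase-site.xml (the value here is only
illustrative, and note that the ZK server's maxSessionTimeout caps whatever
the client requests):

  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>  <!-- 2 minutes, in milliseconds -->
  </property>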

However, if even Ganglia stops working during a large MapReduce job, I think you need to question the adequacy of the network hardware.

   - Andy

> From: Kazuki Ohta <ka...@gmail.com>
> Subject: massive zk expirations under heavy network load
> To: user@hbase.apache.org
> Cc: kazuki.ohta@gmail.com
> Date: Wednesday, April 20, 2011, 11:41 AM
> Hi,
> 
> I'm now running CDH3u0 on a 16-node cluster (hdp0-hdp15).
> The configuration is below.
> 
> hdp0: zk + master + region + nn + dn + jt + tt
> hdp1: zk + master + region + snn + dn + tt
> hdp2: zk + region + dn + tt
> hdp3 to hdp15: region + dn + tt
> 
> Usually it works really well. But once a user submits a MapReduce
> job that requires massive network transfer in the shuffle phase,
> the master gets a zk session timeout exception and fails over to
> another master.
> 
> The problem is that the shuffle traffic saturates the switch, so
> important zk packets are not delivered in time.
> 
> Even Ganglia monitoring seems to stop at that time. And mr task
> attempts also get zk session timeouts and die all at once (about
> 100 tasks die at the same time; input and output are both hbase).
> 
> This is a potential problem when running MapReduce jobs alongside
> HBase. Does anyone know a good solution for this phenomenon?