Posted to hdfs-user@hadoop.apache.org by Eremikhin Alexey <a....@corp.badoo.com> on 2013/05/24 15:10:04 UTC

Please help me with heartbeat storm

Hi all,
I have a 29-server Hadoop cluster in an almost default configuration.
After installing Hadoop 1.0.4 I've noticed that the JT and some TTs waste CPU.
I started stracing their behaviour and found that some TTs send heartbeats
at an unlimited rate, meaning hundreds per second.

Restarting the daemon fixes it, but even the simplest Hive MR job brings the
issue back.

Here is the filtered strace of the heartbeating process:

hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep 6065 | 
grep write


[pid  6065] 13:07:34.801106 write(70, 
"\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30", 
284) = 284
[pid  6065] 13:07:34.807968 write(70, 
"\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31", 
284 <unfinished ...>
[pid  6065] 13:07:34.808080 <... write resumed> ) = 284
[pid  6065] 13:07:34.814473 write(70, 
"\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32", 
284 <unfinished ...>
[pid  6065] 13:07:34.814595 <... write resumed> ) = 284
[pid  6065] 13:07:34.820960 write(70, 
"\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33", 
284 <unfinished ...>


Please help me to stop this storming 8(


Re: Please help me with heartbeat storm

Posted by Eremikhin Alexey <a....@corp.badoo.com>.
Hi Roland

Here is my conf:
SLES11 SP1
hadoop 1.0.4
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

It seems the only thing our setups have in common is the hadoop version 8)

On 25.05.2013 19:44, Roland von Herget wrote:
> Hi Alexey,
>
> I don't know the solution to this problem, but I can second this, I'm 
> seeing nearly the same:
> My TaskTrackers are flooding the JobTracker with heartbeats, this 
> starts after the first mapred job and can be repaired by restarting 
> the TaskTracker.
> The TT nodes have high system cpu usage stats, the JT is not suffering 
> from this.
>
> my environment:
> debian 6.0.7
> hadoop 1.0.4
> java version "1.7.0_15"
> Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
>
> What's your environment?
>
> --Roland
>
>
> On Fri, May 24, 2013 at 3:10 PM, Eremikhin Alexey 
> <a.eremihin@corp.badoo.com <ma...@corp.badoo.com>> wrote:
>
>     Hi all,
>     I have 29 servers hadoop cluster in almost default configuration.
>     After installing Hadoop 1.0.4 I've noticed that JT and some TT
>     waste CPU.
>     I started stracing its behaviour and found that some TT send
>     heartbeats in an unlimited ways.
>     It means hundreds in a second.
>
>     Daemon restart solves the issue, but even easiest Hive MR returns
>     issue back.
>
>     Here is the filtered strace of heartbeating process
>
>     hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep
>     6065 | grep write
>
>
>     [pid  6065] 13:07:34.801106 write(70,
>     "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>     <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>     284) = 284
>     [pid  6065] 13:07:34.807968 write(70,
>     "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>     <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>     284 <unfinished ...>
>     [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>     [pid  6065] 13:07:34.814473 write(70,
>     "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>     <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>     284 <unfinished ...>
>     [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>     [pid  6065] 13:07:34.820960 write(70,
>     "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>     <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>     284 <unfinished ...>
>
>
>     Please help me to stop this storming 8(
>
>


Re: Please help me with heartbeat storm

Posted by Roland von Herget <ro...@gmail.com>.
Hi Alexey,

I don't know the solution to this problem, but I can second it; I'm seeing
nearly the same thing:
My TaskTrackers are flooding the JobTracker with heartbeats. It starts after
the first mapred job and can be cleared by restarting the TaskTracker.
The TT nodes have high system CPU usage; the JT is not suffering from this.

my environment:
debian 6.0.7
hadoop 1.0.4
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

What's your environment?

--Roland


On Fri, May 24, 2013 at 3:10 PM, Eremikhin Alexey <a.eremihin@corp.badoo.com
> wrote:

> Hi all,
> I have 29 servers hadoop cluster in almost default configuration.
> After installing Hadoop 1.0.4 I've noticed that JT and some TT waste CPU.
> I started stracing its behaviour and found that some TT send heartbeats in
> an unlimited ways.
> It means hundreds in a second.
>
> Daemon restart solves the issue, but even easiest Hive MR returns issue
> back.
>
> Here is the filtered strace of heartbeating process
>
> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep 6065 |
> grep write
>
>
> [pid  6065] 13:07:34.801106 write(70,
> "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
> 284) = 284
> [pid  6065] 13:07:34.807968 write(70,
> "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
> 284 <unfinished ...>
> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
> [pid  6065] 13:07:34.814473 write(70,
> "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
> 284 <unfinished ...>
> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
> [pid  6065] 13:07:34.820960 write(70,
> "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
> 284 <unfinished ...>
>
>
> Please help me to stop this storming 8(
>
>

Re: Please help me with heartbeat storm

Posted by Eremikhin Alexey <a....@corp.badoo.com>.
The same fix has helped me too.
Thanks a lot!!

On 30.05.2013 17:00, Roland von Herget wrote:
> Hi Philippe,
>
> thanks a lot, that's the solution. I've disable 
> *mapreduce.tasktracker.outofband.heartbeat* and now everything is fine!
>
> Thanks again,
> Roland
>
>
> On Wed, May 29, 2013 at 4:00 PM, Philippe Signoret 
> <philippe.signoret@gmail.com <ma...@gmail.com>> wrote:
>
>     This might be relevant:
>     https://issues.apache.org/jira/browse/MAPREDUCE-4478
>
>         "There are two configuration items to control the
>         TaskTracker's heartbeat interval. One is
>         *mapreduce.tasktracker.outofband.heartbeat*. The other
>         is *mapreduce.tasktracker.outofband.heartbeat.damper*. If we
>         set *mapreduce.tasktracker.outofband.heartbeat* with true and
>         set *mapreduce.tasktracker.outofband.heartbeat.damper* with
>         default value (1000000), TaskTracker may send heartbeat
>         without any interval."
>
>
>     Philippe
>
>     -------------------------------
>     *Philippe Signoret*
>
>
>     On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan
>     <rajesh.balamohan@gmail.com <ma...@gmail.com>>
>     wrote:
>
>         Default value of CLUSTER_INCREMENT is 100. Math.max(1000*
>         29/100, 3000) = 3000 always. This is the reason why you are
>         seeing so many heartbeats. *You might want to set it to 1 or
>         5.* This would increase the time taken to send the heartbeat
>         from TT to JT.
>
>
>         ~Rajesh.B
>
>
>         On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey
>         <a.eremihin@corp.badoo.com <ma...@corp.badoo.com>>
>         wrote:
>
>             Hi!
>
>             Tried 5 seconds. Fewer nodes get into the storm, but some
>             still do.
>             Additionally, an update of the ntp service helped a little.
>
>             Initially almost 50% got into storming on each MR job. But
>             after the ntp update and increasing the heartbeat to 5
>             seconds, the level is around 10%.
>
>
>             On 26/05/13 10:43, murali adireddy wrote:
>>             Hi ,
>>
>>             Just try this one.
>>
>>             in the file "hdfs-site.xml" try to add the below property
>>             "dfs.heartbeat.interval" and value  in seconds.
>>
>>             Default value is '3' seconds. In your case increase value.
>>
>>             <property>
>>              <name>dfs.heartbeat.interval</name>
>>              <value>3</value>
>>             </property>
>>
>>             You can find more properties and default values in the
>>             below link.
>>
>>             http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>
>>
>>             Please let me know if the above solution worked for you.
>>
>>
>>
>>
>>             On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey
>>             <a.eremihin@corp.badoo.com
>>             <ma...@corp.badoo.com>> wrote:
>>
>>                 Hi all,
>>                 I have 29 servers hadoop cluster in almost default
>>                 configuration.
>>                 After installing Hadoop 1.0.4 I've noticed that JT
>>                 and some TT waste CPU.
>>                 I started stracing its behaviour and found that some
>>                 TT send heartbeats in an unlimited ways.
>>                 It means hundreds in a second.
>>
>>                 Daemon restart solves the issue, but even easiest
>>                 Hive MR returns issue back.
>>
>>                 Here is the filtered strace of heartbeating process
>>
>>                 hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032
>>                 2>&1  | grep 6065 | grep write
>>
>>
>>                 [pid  6065] 13:07:34.801106 write(70,
>>                 "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>>                 <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>>                 284) = 284
>>                 [pid  6065] 13:07:34.807968 write(70,
>>                 "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>>                 <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>>                 284 <unfinished ...>
>>                 [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>>                 [pid  6065] 13:07:34.814473 write(70,
>>                 "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>>                 <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>>                 284 <unfinished ...>
>>                 [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>>                 [pid  6065] 13:07:34.820960 write(70,
>>                 "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>>                 <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>>                 284 <unfinished ...>
>>
>>
>>                 Please help me to stop this storming 8(
>>
>>
>
>
>
>
>         -- 
>         ~Rajesh.B
>
>
>


Re: Please help me with heartbeat storm

Posted by Roland von Herget <ro...@gmail.com>.
Hi Philippe,

thanks a lot, that's the solution. I've disabled *
mapreduce.tasktracker.outofband.heartbeat* and now everything is fine!

Thanks again,
Roland
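
For reference, a minimal sketch of that change as it might look in
mapred-site.xml on the TaskTracker nodes (the property name is the one
quoted from MAPREDUCE-4478 below; the comments and the restart step are
assumptions, so verify them against your own mapred-default.xml):

<!-- Sketch only: disable out-of-band heartbeats on each TaskTracker,
     then restart the TaskTracker daemon so the setting takes effect. -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>false</value>
</property>
<!-- Per the MAPREDUCE-4478 text quoted below, the storm needs both
     outofband.heartbeat set to true and the default
     mapreduce.tasktracker.outofband.heartbeat.damper value (1000000),
     so with the property above set to false the damper can stay at its
     default. -->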


On Wed, May 29, 2013 at 4:00 PM, Philippe Signoret <
philippe.signoret@gmail.com> wrote:

> This might be relevant:
> https://issues.apache.org/jira/browse/MAPREDUCE-4478
>
> "There are two configuration items to control the TaskTracker's heartbeat
> interval. One is *mapreduce.tasktracker.outofband.heartbeat*. The other is
> *mapreduce.tasktracker.outofband.heartbeat.damper*. If we set *
> mapreduce.tasktracker.outofband.heartbeat* with true and set *
> mapreduce.tasktracker.outofband.heartbeat.damper* with default value
> (1000000), TaskTracker may send heartbeat without any interval."
>
>
> Philippe
>
> -------------------------------
> *Philippe Signoret*
>
>
> On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan <
> rajesh.balamohan@gmail.com> wrote:
>
>> Default value of CLUSTER_INCREMENT is 100. Math.max(1000* 29/100, 3000)
>> = 3000 always. This is the reason why you are seeing so many heartbeats.
>> *You might want to set it to 1 or 5.* This would increase the time taken
>> to send the heartbeat from TT to JT.
>>
>>
>> ~Rajesh.B
>>
>>
>> On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey <
>> a.eremihin@corp.badoo.com> wrote:
>>
>>>  Hi!
>>>
>>> Tried 5 seconds. Fewer nodes get into the storm, but some still do.
>>> Additionally, an update of the ntp service helped a little.
>>>
>>> Initially almost 50% got into storming on each MR job. But after the ntp
>>> update and increasing the heartbeat to 5 seconds, the level is around 10%.
>>>
>>>
>>> On 26/05/13 10:43, murali adireddy wrote:
>>>
>>> Hi ,
>>>
>>>  Just try this one.
>>>
>>>  in the file "hdfs-site.xml" try to add the below property
>>> "dfs.heartbeat.interval" and value  in seconds.
>>>
>>>  Default value is '3' seconds. In your case increase value.
>>>
>>>  <property>
>>>  <name>dfs.heartbeat.interval</name>
>>>  <value>3</value>
>>> </property>
>>>
>>>  You can find more properties and default values in the below link.
>>>
>>>
>>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>>
>>>
>>>  Please let me know if the above solution worked for you.
>>>
>>>
>>>
>>>
>>>  On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <
>>> a.eremihin@corp.badoo.com> wrote:
>>>
>>>> Hi all,
>>>> I have 29 servers hadoop cluster in almost default configuration.
>>>> After installing Hadoop 1.0.4 I've noticed that JT and some TT waste
>>>> CPU.
>>>> I started stracing its behaviour and found that some TT send heartbeats
>>>> in an unlimited ways.
>>>> It means hundreds in a second.
>>>>
>>>> Daemon restart solves the issue, but even easiest Hive MR returns issue
>>>> back.
>>>>
>>>> Here is the filtered strace of heartbeating process
>>>>
>>>> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep 6065 |
>>>> grep write
>>>>
>>>>
>>>> [pid  6065] 13:07:34.801106 write(70,
>>>> "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>>>> 284) = 284
>>>> [pid  6065] 13:07:34.807968 write(70,
>>>> "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>>>> 284 <unfinished ...>
>>>> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>>>> [pid  6065] 13:07:34.814473 write(70,
>>>> "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>>>> 284 <unfinished ...>
>>>> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>>>> [pid  6065] 13:07:34.820960 write(70,
>>>> "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>>>> 284 <unfinished ...>
>>>>
>>>>
>>>> Please help me to stop this storming 8(
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> ~Rajesh.B
>>
>
>

Re: Please help me with heartbeat storm

Posted by Philippe Signoret <ph...@gmail.com>.
This might be relevant: https://issues.apache.org/jira/browse/MAPREDUCE-4478

"There are two configuration items to control the TaskTracker's heartbeat
interval. One is *mapreduce.tasktracker.outofband.heartbeat*. The other is
*mapreduce.tasktracker.outofband.heartbeat.damper*. If we set
*mapreduce.tasktracker.outofband.heartbeat* with true and set
*mapreduce.tasktracker.outofband.heartbeat.damper* with default value
(1000000), TaskTracker may send heartbeat without any interval."


Philippe

-------------------------------
*Philippe Signoret*
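
Per the JIRA text quoted above, the storm-triggering combination can be avoided
by turning the out-of-band heartbeat off on the TaskTrackers. A minimal
mapred-site.xml sketch (property name taken from the quote; the value shown
simply disables the feature):

<!-- mapred-site.xml on each TaskTracker: turn off out-of-band heartbeats -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>false</value>
</property>

If the out-of-band heartbeat is kept enabled, the damper is presumably the
value to adjust instead, since the quote singles out its default of 1000000 as
the setting under which heartbeats go out without any interval.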


On Tue, May 28, 2013 at 4:44 AM, Rajesh Balamohan <
rajesh.balamohan@gmail.com> wrote:

> Default value of CLUSTER_INCREMENT is 100. Math.max(1000* 29/100, 3000) =
> 3000 always. This is the reason why you are seeing so many heartbeats. *You
> might want to set it to 1 or 5.* This would increase the time taken to
> send the heartbeat from TT to JT.
>
>
> ~Rajesh.B
>
>
> On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey <
> a.eremihin@corp.badoo.com> wrote:
>
>>  Hi!
>>
>> Tried 5 seconds. Less number of nodes get into storm, but still they do.
>> Additionaly update of ntp service helped a little.
>>
>> Initially almost 50% got into storming each MR job. But after ntp update
>> and and increasing heart-beatto 5 seconds level is around 10%.
>>
>>
>> On 26/05/13 10:43, murali adireddy wrote:
>>
>> Hi ,
>>
>>  Just try this one.
>>
>>  in the file "hdfs-site.xml" try to add the below property
>> "dfs.heartbeat.interval" and value  in seconds.
>>
>>  Default value is '3' seconds. In your case increase value.
>>
>>  <property>
>>  <name>dfs.heartbeat.interval</name>
>>  <value>3</value>
>> </property>
>>
>>  You can find more properties and default values in the below link.
>>
>>
>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>>
>>
>>  Please let me know is the above solution worked for you ..?
>>
>>
>>
>>
>>  On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <
>> a.eremihin@corp.badoo.com> wrote:
>>
>>> Hi all,
>>> I have 29 servers hadoop cluster in almost default configuration.
>>> After installing Hadoop 1.0.4 I've noticed that JT and some TT waste CPU.
>>> I started stracing its behaviour and found that some TT send heartbeats
>>> in an unlimited ways.
>>> It means hundreds in a second.
>>>
>>> Daemon restart solves the issue, but even easiest Hive MR returns issue
>>> back.
>>>
>>> Here is the filtered strace of heartbeating process
>>>
>>> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep 6065 |
>>> grep write
>>>
>>>
>>> [pid  6065] 13:07:34.801106 write(70,
>>> "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>>> 284) = 284
>>> [pid  6065] 13:07:34.807968 write(70,
>>> "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>>> 284 <unfinished ...>
>>> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>>> [pid  6065] 13:07:34.814473 write(70,
>>> "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>>> 284 <unfinished ...>
>>> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>>> [pid  6065] 13:07:34.820960 write(70,
>>> "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>>> 284 <unfinished ...>
>>>
>>>
>>> Please help me to stop this storming 8(
>>>
>>>
>>
>>
>
>
> --
> ~Rajesh.B
>

Re: Please help me with heartbeat storm

Posted by Rajesh Balamohan <ra...@gmail.com>.
The default value of CLUSTER_INCREMENT is 100, so Math.max(1000 * 29 / 100, 3000) =
3000 always. This is the reason why you are seeing so many heartbeats. *You
might want to set it to 1 or 5.* This would increase the interval between
heartbeats sent from the TT to the JT.


~Rajesh.B
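
To make the arithmetic concrete, the calculation described above works out as
follows for a 29-node cluster. This is only a sketch of the formula as quoted,
not the verbatim Hadoop 1.0.4 JobTracker code, and the tunable presumably
corresponds to mapred.heartbeats.in.second:

// HeartbeatIntervalSketch.java -- illustrative only, not the Hadoop source.
public class HeartbeatIntervalSketch {
    public static void main(String[] args) {
        int clusterSize = 29;            // TaskTrackers in this cluster
        int clusterIncrement = 100;      // default value mentioned above
        int minIntervalMs = 3000;        // 3-second floor mentioned above
        int intervalMs = Math.max(1000 * clusterSize / clusterIncrement, minIntervalMs);
        System.out.println(intervalMs);  // 3000 ms with the defaults
        // Lowering clusterIncrement to 1 gives Math.max(29000, 3000) = 29000 ms,
        // i.e. a 29-second regular heartbeat interval, as suggested above.
    }
}

Note that this regular interval is separate from the out-of-band heartbeat
discussed elsewhere in the thread, which is what removes the pause between
heartbeats entirely.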


On Mon, May 27, 2013 at 2:12 PM, Eremikhin Alexey <a.eremihin@corp.badoo.com
> wrote:

>  Hi!
>
> Tried 5 seconds. Less number of nodes get into storm, but still they do.
> Additionaly update of ntp service helped a little.
>
> Initially almost 50% got into storming each MR job. But after ntp update
> and and increasing heart-beatto 5 seconds level is around 10%.
>
>
> On 26/05/13 10:43, murali adireddy wrote:
>
> Hi ,
>
>  Just try this one.
>
>  in the file "hdfs-site.xml" try to add the below property
> "dfs.heartbeat.interval" and value  in seconds.
>
>  Default value is '3' seconds. In your case increase value.
>
>  <property>
>  <name>dfs.heartbeat.interval</name>
>  <value>3</value>
> </property>
>
>  You can find more properties and default values in the below link.
>
>
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>
>
>  Please let me know is the above solution worked for you ..?
>
>
>
>
>  On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <
> a.eremihin@corp.badoo.com> wrote:
>
>> Hi all,
>> I have 29 servers hadoop cluster in almost default configuration.
>> After installing Hadoop 1.0.4 I've noticed that JT and some TT waste CPU.
>> I started stracing its behaviour and found that some TT send heartbeats
>> in an unlimited ways.
>> It means hundreds in a second.
>>
>> Daemon restart solves the issue, but even easiest Hive MR returns issue
>> back.
>>
>> Here is the filtered strace of heartbeating process
>>
>> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep 6065 |
>> grep write
>>
>>
>> [pid  6065] 13:07:34.801106 write(70,
>> "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>> 284) = 284
>> [pid  6065] 13:07:34.807968 write(70,
>> "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>> 284 <unfinished ...>
>> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>> [pid  6065] 13:07:34.814473 write(70,
>> "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>> 284 <unfinished ...>
>> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>> [pid  6065] 13:07:34.820960 write(70,
>> "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/
>> 127.0.0.1:52355\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>> 284 <unfinished ...>
>>
>>
>> Please help me to stop this storming 8(
>>
>>
>
>


-- 
~Rajesh.B

Re: Please help me with heartbeat storm

Posted by Eremikhin Alexey <a....@corp.badoo.com>.
Hi!

Tried 5 seconds. Fewer nodes get into the storm, but some still do.
Additionally, updating the ntp service helped a little.

Initially almost 50% got into storming on each MR job. But after the ntp update
and increasing the heartbeat to 5 seconds, the level is around 10%.
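
For reference, the 5-second value tried here presumably corresponds to raising
the setting suggested in the quoted mail below, e.g. in hdfs-site.xml (a sketch
only; note that dfs.heartbeat.interval controls the DataNode-to-NameNode
heartbeat, which is separate from the TaskTracker heartbeats captured in the
strace output):

<property>
  <name>dfs.heartbeat.interval</name>
  <value>5</value>
</property>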


On 26/05/13 10:43, murali adireddy wrote:
> Hi ,
>
> Just try this one.
>
> in the file "hdfs-site.xml" try to add the below property 
> "dfs.heartbeat.interval" and value  in seconds.
>
> Default value is '3' seconds. In your case increase value.
>
> <property>
>  <name>dfs.heartbeat.interval</name>
>  <value>3</value>
> </property>
>
> You can find more properties and default values in the below link.
>
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
>
>
> Please let me know is the above solution worked for you ..?
>
>
>
>
> On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey 
> <a.eremihin@corp.badoo.com <ma...@corp.badoo.com>> wrote:
>
>     Hi all,
>     I have 29 servers hadoop cluster in almost default configuration.
>     After installing Hadoop 1.0.4 I've noticed that JT and some TT
>     waste CPU.
>     I started stracing its behaviour and found that some TT send
>     heartbeats in an unlimited ways.
>     It means hundreds in a second.
>
>     Daemon restart solves the issue, but even easiest Hive MR returns
>     issue back.
>
>     Here is the filtered strace of heartbeating process
>
>     hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep
>     6065 | grep write
>
>
>     [pid  6065] 13:07:34.801106 write(70,
>     "\0\0\1\30\0:\316N\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>     <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\300\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\30",
>     284) = 284
>     [pid  6065] 13:07:34.807968 write(70,
>     "\0\0\1\30\0:\316O\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>     <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\312\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\31",
>     284 <unfinished ...>
>     [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
>     [pid  6065] 13:07:34.814473 write(70,
>     "\0\0\1\30\0:\316P\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>     <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\32",
>     284 <unfinished ...>
>     [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
>     [pid  6065] 13:07:34.820960 write(70,
>     "\0\0\1\30\0:\316Q\0\theartbeat\0\0\0\5\0*org.apache.hadoop.mapred.TaskTrackerStatus\0*org.apache.hadoop.mapred.TaskTrackerStatus.tracker_hadoop9.mlan:localhost/127.0.0.1:52355
>     <http://127.0.0.1:52355>\fhadoop9.mlan\0\0\303\214\0\0\0\0\0\0\0\2\0\0\0\2\213\1\367\373\200\0\214\367\223\220\0\213\1\341p\220\0\214\341\351\200\0\377\377\213\6\243\253\200\0\214q\r\33\336\215$\205\266\4B\16\333n\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\7boolean\0\0\7boolean\0\0\7boolean\1\0\5short\316\33",
>     284 <unfinished ...>
>
>
>     Please help me to stop this storming 8(
>
>


Re: Please help me with heartbeat storm

Posted by murali adireddy <mu...@gmail.com>.
Hi,

Just try this one.

In the file "hdfs-site.xml", try adding the property "dfs.heartbeat.interval" 
with a value in seconds.

The default value is 3 seconds. In your case, increase it.

<property>
 <name>dfs.heartbeat.interval</name>
 <value>3</value>
</property>
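
For example, a sketch of an increased setting (the value 10 here is only an 
illustrative number, not a documented recommendation):

<property>
 <name>dfs.heartbeat.interval</name>
 <value>10</value>
</property>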

You can find more properties and their default values at the link below.

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml


Please let me know if the above solution worked for you.




On Fri, May 24, 2013 at 6:40 PM, Eremikhin Alexey <a.eremihin@corp.badoo.com
> wrote:

> Hi all,
> I have 29 servers hadoop cluster in almost default configuration.
> After installing Hadoop 1.0.4 I've noticed that JT and some TT waste CPU.
> I started stracing its behaviour and found that some TT send heartbeats in
> an unlimited ways.
> It means hundreds in a second.
>
> Daemon restart solves the issue, but even easiest Hive MR returns issue
> back.
>
> Here is the filtered strace of heartbeating process
>
> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep 6065 |
> grep write
>
>
> [pid  6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\**
> theartbeat\0\0\0\5\0*org.**apache.hadoop.mapred.**TaskTrackerStatus\0*org.
> **apache.hadoop.mapred.**TaskTrackerStatus.tracker_**
> hadoop9.mlan:localhost/127.0.**0.1:52355 <http://127.0.0.1:52355>
> \fhadoop9.mlan\0\0\**303\214\0\0\0\0\0\0\0\2\0\0\0\**
> 2\213\1\367\373\200\0\214\367\**223\220\0\213\1\341p\220\0\**
> 214\341\351\200\0\377\377\213\**6\243\253\200\0\214q\r\33\300\**
> 215$\205\266\4B\16\333n\0\0\0\**0\1\0\0\0\0\0\0\0\0\0\0\**
> 7boolean\0\0\7boolean\0\0\**7boolean\1\0\5short\316\30", 284) = 284
> [pid  6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\**
> theartbeat\0\0\0\5\0*org.**apache.hadoop.mapred.**TaskTrackerStatus\0*org.
> **apache.hadoop.mapred.**TaskTrackerStatus.tracker_**
> hadoop9.mlan:localhost/127.0.**0.1:52355 <http://127.0.0.1:52355>
> \fhadoop9.mlan\0\0\**303\214\0\0\0\0\0\0\0\2\0\0\0\**
> 2\213\1\367\373\200\0\214\367\**223\220\0\213\1\341p\220\0\**
> 214\341\351\200\0\377\377\213\**6\243\253\200\0\214q\r\33\312\**
> 215$\205\266\4B\16\333n\0\0\0\**0\1\0\0\0\0\0\0\0\0\0\0\**
> 7boolean\0\0\7boolean\0\0\**7boolean\1\0\5short\316\31", 284 <unfinished
> ...>
> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
> [pid  6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\**
> theartbeat\0\0\0\5\0*org.**apache.hadoop.mapred.**TaskTrackerStatus\0*org.
> **apache.hadoop.mapred.**TaskTrackerStatus.tracker_**
> hadoop9.mlan:localhost/127.0.**0.1:52355 <http://127.0.0.1:52355>
> \fhadoop9.mlan\0\0\**303\214\0\0\0\0\0\0\0\2\0\0\0\**
> 2\213\1\367\373\200\0\214\367\**223\220\0\213\1\341p\220\0\**
> 214\341\351\200\0\377\377\213\**6\243\253\200\0\214q\r\33\336\**
> 215$\205\266\4B\16\333n\0\0\0\**0\1\0\0\0\0\0\0\0\0\0\0\**
> 7boolean\0\0\7boolean\0\0\**7boolean\1\0\5short\316\32", 284 <unfinished
> ...>
> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
> [pid  6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\**
> theartbeat\0\0\0\5\0*org.**apache.hadoop.mapred.**TaskTrackerStatus\0*org.
> **apache.hadoop.mapred.**TaskTrackerStatus.tracker_**
> hadoop9.mlan:localhost/127.0.**0.1:52355 <http://127.0.0.1:52355>
> \fhadoop9.mlan\0\0\**303\214\0\0\0\0\0\0\0\2\0\0\0\**
> 2\213\1\367\373\200\0\214\367\**223\220\0\213\1\341p\220\0\**
> 214\341\351\200\0\377\377\213\**6\243\253\200\0\214q\r\33\336\**
> 215$\205\266\4B\16\333n\0\0\0\**0\1\0\0\0\0\0\0\0\0\0\0\**
> 7boolean\0\0\7boolean\0\0\**7boolean\1\0\5short\316\33", 284 <unfinished
> ...>
>
>
> Please help me to stop this storming 8(
>
>

Re: Please help me with heartbeat storm

Posted by Roland von Herget <ro...@gmail.com>.
Hi Alexey,

I don't know the solution to this problem, but I can second it; I'm
seeing nearly the same thing:
My TaskTrackers are flooding the JobTracker with heartbeats. This starts
after the first mapred job and can be fixed by restarting the
TaskTracker.
The TT nodes show high system CPU usage; the JT is not suffering from
this.
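
In case it helps, the restart that clears it is just the stock daemon script, 
something like this (the path depends on the installation, and it should be 
run as the user that owns the TaskTracker process):

$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker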

my environment:
debian 6.0.7
hadoop 1.0.4
java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

What's your environment?

--Roland


On Fri, May 24, 2013 at 3:10 PM, Eremikhin Alexey <a.eremihin@corp.badoo.com
> wrote:

> Hi all,
> I have 29 servers hadoop cluster in almost default configuration.
> After installing Hadoop 1.0.4 I've noticed that JT and some TT waste CPU.
> I started stracing its behaviour and found that some TT send heartbeats in
> an unlimited ways.
> It means hundreds in a second.
>
> Daemon restart solves the issue, but even easiest Hive MR returns issue
> back.
>
> Here is the filtered strace of heartbeating process
>
> hadoop9.mlan:~$ sudo strace -tt -f -s 10000 -p 6032 2>&1  | grep 6065 |
> grep write
>
>
> [pid  6065] 13:07:34.801106 write(70, "\0\0\1\30\0:\316N\0\**
> theartbeat\0\0\0\5\0*org.**apache.hadoop.mapred.**TaskTrackerStatus\0*org.
> **apache.hadoop.mapred.**TaskTrackerStatus.tracker_**
> hadoop9.mlan:localhost/127.0.**0.1:52355 <http://127.0.0.1:52355>
> \fhadoop9.mlan\0\0\**303\214\0\0\0\0\0\0\0\2\0\0\0\**
> 2\213\1\367\373\200\0\214\367\**223\220\0\213\1\341p\220\0\**
> 214\341\351\200\0\377\377\213\**6\243\253\200\0\214q\r\33\300\**
> 215$\205\266\4B\16\333n\0\0\0\**0\1\0\0\0\0\0\0\0\0\0\0\**
> 7boolean\0\0\7boolean\0\0\**7boolean\1\0\5short\316\30", 284) = 284
> [pid  6065] 13:07:34.807968 write(70, "\0\0\1\30\0:\316O\0\**
> theartbeat\0\0\0\5\0*org.**apache.hadoop.mapred.**TaskTrackerStatus\0*org.
> **apache.hadoop.mapred.**TaskTrackerStatus.tracker_**
> hadoop9.mlan:localhost/127.0.**0.1:52355 <http://127.0.0.1:52355>
> \fhadoop9.mlan\0\0\**303\214\0\0\0\0\0\0\0\2\0\0\0\**
> 2\213\1\367\373\200\0\214\367\**223\220\0\213\1\341p\220\0\**
> 214\341\351\200\0\377\377\213\**6\243\253\200\0\214q\r\33\312\**
> 215$\205\266\4B\16\333n\0\0\0\**0\1\0\0\0\0\0\0\0\0\0\0\**
> 7boolean\0\0\7boolean\0\0\**7boolean\1\0\5short\316\31", 284 <unfinished
> ...>
> [pid  6065] 13:07:34.808080 <... write resumed> ) = 284
> [pid  6065] 13:07:34.814473 write(70, "\0\0\1\30\0:\316P\0\**
> theartbeat\0\0\0\5\0*org.**apache.hadoop.mapred.**TaskTrackerStatus\0*org.
> **apache.hadoop.mapred.**TaskTrackerStatus.tracker_**
> hadoop9.mlan:localhost/127.0.**0.1:52355 <http://127.0.0.1:52355>
> \fhadoop9.mlan\0\0\**303\214\0\0\0\0\0\0\0\2\0\0\0\**
> 2\213\1\367\373\200\0\214\367\**223\220\0\213\1\341p\220\0\**
> 214\341\351\200\0\377\377\213\**6\243\253\200\0\214q\r\33\336\**
> 215$\205\266\4B\16\333n\0\0\0\**0\1\0\0\0\0\0\0\0\0\0\0\**
> 7boolean\0\0\7boolean\0\0\**7boolean\1\0\5short\316\32", 284 <unfinished
> ...>
> [pid  6065] 13:07:34.814595 <... write resumed> ) = 284
> [pid  6065] 13:07:34.820960 write(70, "\0\0\1\30\0:\316Q\0\**
> theartbeat\0\0\0\5\0*org.**apache.hadoop.mapred.**TaskTrackerStatus\0*org.
> **apache.hadoop.mapred.**TaskTrackerStatus.tracker_**
> hadoop9.mlan:localhost/127.0.**0.1:52355 <http://127.0.0.1:52355>
> \fhadoop9.mlan\0\0\**303\214\0\0\0\0\0\0\0\2\0\0\0\**
> 2\213\1\367\373\200\0\214\367\**223\220\0\213\1\341p\220\0\**
> 214\341\351\200\0\377\377\213\**6\243\253\200\0\214q\r\33\336\**
> 215$\205\266\4B\16\333n\0\0\0\**0\1\0\0\0\0\0\0\0\0\0\0\**
> 7boolean\0\0\7boolean\0\0\**7boolean\1\0\5short\316\33", 284 <unfinished
> ...>
>
>
> Please help me to stop this storming 8(
>
>
