Posted to mapreduce-user@hadoop.apache.org by Florin Dinu <fd...@rice.edu> on 2013/02/06 20:46:47 UTC

TaskTracker heartbeats not sent at the configured intervals

Hello everyone,

I've been encountering the following problem for some time now and it is
really slowing down my work. I would appreciate any help you guys can
provide. I am using Hadoop 1.0.3.

I configured the TaskTrackers to send heartbeats to the
JobTracker every 1 second. Most of the time the heartbeats are sent
as configured. Sometimes, though, there is a big gap between two
heartbeats sent by a TaskTracker. This gap can be as long as 30 seconds,
but it is usually on the order of 10 to 15 seconds. When this happens,
the TaskTracker log usually shows a matching gap: nothing is printed
for those 10-30 seconds.
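
A minimal sketch of how such gaps could be picked out of a list of
heartbeat send times, for example timestamps parsed from the log. The
class HeartbeatGaps, its method names, and the 3000 ms threshold are
hypothetical and not part of Hadoop; the sample timestamps are made up
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Hadoop): flags long gaps between heartbeats. */
final class HeartbeatGaps {

    /** One detected gap: when the preceding heartbeat was sent and how long the gap was. */
    static final class Gap {
        final long startMillis;
        final long lengthMillis;

        Gap(long startMillis, long lengthMillis) {
            this.startMillis = startMillis;
            this.lengthMillis = lengthMillis;
        }
    }

    /**
     * Returns every interval between consecutive send times that is
     * strictly longer than thresholdMillis. Input must be in send order.
     */
    static List<Gap> findGaps(long[] sendTimesMillis, long thresholdMillis) {
        List<Gap> gaps = new ArrayList<Gap>();
        for (int i = 1; i < sendTimesMillis.length; i++) {
            long delta = sendTimesMillis[i] - sendTimesMillis[i - 1];
            if (delta > thresholdMillis) {
                gaps.add(new Gap(sendTimesMillis[i - 1], delta));
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Made-up send times: 1 s apart, with one 12 s gap after t=2000.
        long[] times = {0, 1000, 2000, 14000, 15000};
        for (Gap g : findGaps(times, 3000)) {
            System.out.println("gap of " + g.lengthMillis + " ms after t=" + g.startMillis);
        }
    }
}
```

Running the same check over the timestamps of the surrounding log lines
would show whether the whole TaskTracker log goes quiet during a gap or
only the heartbeat thread does.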


I added some print statements to the offerService and transmitHeartBeat
methods in TaskTracker.java. Often the last print statement that I see
before the big gap is the one immediately preceding the entry to a
synchronized block. I was not able to pin this down to any particular
synchronized block: given enough runs, the big gaps appear in several
places in the code. The TaskTracker thread seems to just wait at the
entrance to those synchronized blocks and is not able to get to the code
that actually sends the heartbeat. This makes me think that perhaps the
locks are not always released correctly.
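
One way to see which thread holds the monitor that a blocked thread is
waiting for is the standard java.lang.management API. The sketch below
is a self-contained toy reproduction, not TaskTracker code: the class
BlockedHeartbeatDemo, the thread names "slow-task" and "heartbeat", and
the lock object are all hypothetical stand-ins. A thread dump taken
during a gap (for example with the JDK's jstack tool) exposes the same
kind of information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Toy reproduction (not Hadoop code): one thread blocks on a monitor held by another. */
final class BlockedHeartbeatDemo {

    /** Describes the monitor a BLOCKED thread waits for and its owner; null if not blocked. */
    static String describeBlocker(Thread t) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(t.getId());
        if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
            return null;
        }
        return info.getThreadName() + " is BLOCKED on " + info.getLockName()
                + " held by " + info.getLockOwnerName();
    }

    static String runDemo() {
        final Object trackerLock = new Object();      // stand-in for a shared monitor
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        // Holds the monitor until told to release it.
        Thread task = new Thread(() -> {
            synchronized (trackerLock) {
                lockHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "slow-task");

        // Needs the same monitor before it could "send a heartbeat".
        Thread heartbeat = new Thread(() -> {
            synchronized (trackerLock) {
                // the heartbeat would be sent here
            }
        }, "heartbeat");

        try {
            task.start();
            lockHeld.await();
            heartbeat.start();
            // Wait until the heartbeat thread is actually stuck on the monitor.
            while (heartbeat.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            String report = describeBlocker(heartbeat);
            release.countDown();
            task.join();
            heartbeat.join();
            return report;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

If the owner reported during a real gap is a live thread doing slow work
inside the synchronized block, the lock is being held for a long time
rather than not being released.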


By running many experiments I also noticed that this problem seems to
appear more often when the number of concurrent tasks running on a node
is larger, perhaps because more task threads mean more locking and
unlocking.

Before switching to Hadoop 1.0.3 I used version 0.21.0, which showed
the same problem far more often than 1.0.3 does.


Have you guys seen this before?
Do you know what could be causing this behavior?

Thank you so much
Florin Dinu
Rice University