Posted to common-user@hadoop.apache.org by Meng Mao <me...@gmail.com> on 2008/06/04 19:04:46 UTC

checking per-node health (jobs, tasks, failures)?

I'm trying to implement Nagios health monitoring of a Hadoop grid.
If anyone has general tips to share, those would be welcome, too.
For those who don't know, Nagios is monitoring software that organizes and
manages checking of services.

As best I know, the easiest, most decoupled way to monitor the grid is to
use a script to parse the jobtracker and tasktracker JSPs that are served
when the Hadoop instance is running.
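For illustration, here's the shape such a check could take, sketched in Python rather than Perl (the host name, port, and the marker string it greps for are assumptions — 50060 was the usual TaskTracker web UI port in that era, and 0/2/3 are the standard Nagios OK/CRITICAL/UNKNOWN plugin exit codes):

```python
# Sketch of a Nagios-style check that scrapes a Hadoop web UI page.
# The host/port below are assumptions; adjust to your cluster.
import re
import sys
import urllib.request

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes


def classify(html):
    """Call the page healthy if the daemon's banner text is present."""
    if re.search(r"Hadoop", html):
        return OK
    return CRITICAL


def check(url):
    """Fetch the status JSP; an unreachable daemon counts as CRITICAL."""
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode(
            "utf-8", "replace")
    except OSError:
        return CRITICAL  # daemon down, port closed, or host unreachable
    return classify(html)


# usage (from Nagios, the exit code is the service state):
#   sys.exit(check("http://datanode01:50060/tasktracker.jsp"))
```

The same script pointed at port 50030's jobtracker.jsp would cover the job-side check, which is where the per-node problem described below comes in.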

My original implementation was a single script that pointed at the two JSPs
on the primary namenode. However, this led to serious performance hangups
from Nagios bombarding the primary node with frequent checks. To fix this,
I'd like to distribute the script to each Hadoop datanode, so that Nagios
polls each node directly instead of always going through the primary node
and making it do all of the work for the whole grid.
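Spreading the polling out might be declared along these lines in the Nagios object config (a hedged sketch — the host name, service template, and NRPE command are placeholders, and it assumes the check script is installed on each node and exposed via NRPE):

```
define service{
        use                     generic-service
        host_name               datanode01
        service_description     Hadoop TaskTracker
        check_command           check_nrpe!check_tasktracker
        }
```

One such service per datanode means each host answers only for its own tasktracker, so the master no longer absorbs every probe.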

The problem is job info. I can't think of a way to ask a datanode for it,
since a datanode doesn't serve jobtracker.jsp; only the namenode serves
that JSP.

Is there 1) a better way to get this info? (I'm scripting in Perl, so
writing a custom jar to find things out would be rather convoluted.) Or 2) a
straightforward way to get job status from a namenode directly?

Thanks!