Posted to common-user@hadoop.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2011/01/03 11:22:30 UTC

monit? daemontools? jsvc? something else?

Hello,

I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use 
tools like monit and daemontools (and a few other ones) to revive their 
Hadoop processes when they die.

Questions:
1. Is one of these tools better than others for Hadoop?
2. Is there a tool the community recommends?
3. Does anyone know if the future version of CDH will ship with one of these? 
Current CDH3 doesn't include them, so when a process dies you have to manually 
restart it.
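
[For reference, process supervision with monit is usually a short stanza per daemon. The pidfile path and init script names below are assumptions that vary by install; this is a minimal sketch, not a recommended production config:]

```
check process datanode with pidfile /var/run/hadoop/hadoop-datanode.pid
    start program = "/etc/init.d/hadoop-datanode start"
    stop program  = "/etc/init.d/hadoop-datanode stop"
    if 5 restarts within 5 cycles then timeout
```

The `timeout` rule makes monit give up on a flapping process instead of restarting it forever, which partly addresses the "restarts can hide real problems" concern raised later in this thread.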

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/


Re: monit? daemontools? jsvc? something else?

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 6, 2011, at 12:39 AM, Otis Gospodnetic wrote:
>>    In the case of Hadoop,  no.  There has usually been at least a core dump, 
>> message in syslog,  message in datanode log, etc, etc.   [You *do* have cores 
>> enabled,  right?]
> 
> Hm, "cores enabled".... what do you mean by that?  Are you referring to JVM heap 
> dump -XX JVM argument (-XX:+HeapDumpOnOutOfMemoryError)?  If not, I'm all 
> eyes/ears!

	I'm talking about system-level core dumps, i.e., ulimit -c and friends.  [I'm much more of a systems programmer than a Java guy, so ... ] You can definitely write Java code that will make the JVM crash due to misuse of threading libraries.  There are also CPU, kernel, and BIOS bugs that I've seen cause the JVM to crash. Usually jstack or a core will lead the way to patching the system to work around these issues. 
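[A minimal sketch of what "cores enabled" means in practice. The limits.conf placement is the usual convention but exact paths depend on the distro and how the daemons are launched:]

```shell
# Show the current core-file size limit for this shell; "0" means core
# dumps are disabled, "unlimited" means full cores will be written.
ulimit -c

# Raise the soft limit so a crashing process can write a core file.
# For Hadoop daemons this normally goes in the daemon's init script or
# /etc/security/limits.conf so the JVM inherits it at startup.
ulimit -c unlimited

# On Linux, this shows where and how the kernel names core files
# (e.g. "core" in the cwd, or a pattern like /var/cores/core.%e.%p).
cat /proc/sys/kernel/core_pattern 2>/dev/null || true
```

With cores enabled, a JVM that dies "just because" usually leaves either a core file or an hs_err_pid*.log behind, which is what makes the post-mortem possible.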

> 
>>    We also have in place a monitor that checks  the # of active nodes.  If it 
>> falls below a certain percentage, then we get  alerted and check on them en 
>> masse.   Worrying about one or two nodes going  down probably means you need 
>> more nodes. :D
>> 
> 
> That's probably right. :)
> So what do you use for monitoring the # of active nodes?

	We currently have a custom plugin for Nagios that screen scrapes the NN and JT web UI.  When a certain percentage of nodes dies, we get alerted that we need to take a look and start bringing stuff back up.  [We used the same approach at Y!, so it does work at scale.]

	I'm hoping to replace this (and Ganglia) with something better over the next year.... ;)
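[The plugin itself isn't public; purely as an illustration of the threshold logic described above, here is a hypothetical shell sketch. The node counts would really be scraped from the NN/JT web UIs (e.g. with curl); here they are plain parameters:]

```shell
# check_nodes LIVE TOTAL THRESHOLD_PCT
# Prints a Nagios-style status line and returns 2 (CRITICAL) when the
# percentage of live nodes falls below the threshold, 0 (OK) otherwise.
check_nodes() {
    live=$1; total=$2; threshold=$3
    pct=$(( live * 100 / total ))
    if [ "$pct" -lt "$threshold" ]; then
        echo "CRITICAL: only ${pct}% of ${total} nodes alive"
        return 2    # Nagios convention: 2 = CRITICAL
    fi
    echo "OK: ${pct}% of ${total} nodes alive"
    return 0
}

# Example: 7 of 10 nodes alive against a 90% threshold trips the alert.
check_nodes 7 10 90 || echo "(would page the on-call here)"
```

The point of the percentage threshold is exactly what Allen describes: one dead node is noise, a chunk of the cluster going dark is the event worth waking up for.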

Re: monit? daemontools? jsvc? something else?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,


----- Original Message ----
> From: Allen Wittenauer <aw...@linkedin.com>

> > You guys never have JVM die "just because"? I just had a DN's JVM die the
> > other day "just because and with no obvious cause".  Restarting it brought it
> > back to life, everything recovered smoothly.  Had some automated tool done the
> > restart for me, I'd be even happier.
> 
>     In the case of Hadoop, no.  There has usually been at least a core dump,
> message in syslog, message in datanode log, etc, etc.   [You *do* have cores
> enabled, right?]

Hm, "cores enabled".... what do you mean by that?  Are you referring to the JVM 
heap dump argument (-XX:+HeapDumpOnOutOfMemoryError)?  If not, I'm all 
eyes/ears!

>     We also have in place a monitor that checks the # of active nodes.  If it
> falls below a certain percentage, then we get alerted and check on them en
> masse.   Worrying about one or two nodes going down probably means you need
> more nodes. :D
> 

That's probably right. :)
So what do you use for monitoring the # of active nodes?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/


Re: monit? daemontools? jsvc? something else?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
So Allen, what do you use to monitor those processes/nodes?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: Allen Wittenauer <aw...@linkedin.com>
> To: "<co...@hadoop.apache.org>" <co...@hadoop.apache.org>
> Sent: Wed, January 5, 2011 11:54:22 PM
> Subject: Re: monit? daemontools? jsvc? something else?
> 
> 
> On Jan 5, 2011, at 7:57 PM, Lance Norskog wrote:
> 
> > Isn't this what  Ganglia is for?
> > 
> 
>     No.
> 
>      Ganglia does metrics, not monitoring.
> 
> 
> > On 1/5/11, Allen  Wittenauer <aw...@linkedin.com>  wrote:
> >> 
> >> On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic  wrote:
> >> 
> >>> Ah, more manual work! :(
> >>> 
> >>> You guys never have JVM die "just because"? I just had a DN's JVM die the
> >>> other day "just because and with no obvious cause".  Restarting it brought it
> >>> back to life, everything recovered smoothly.  Had some automated tool done the
> >>> restart for me, I'd be even happier.
> >> 
> >>     In the case of Hadoop, no.  There has usually been at least a core dump,
> >> message in syslog, message in datanode log, etc, etc.   [You *do* have cores
> >> enabled, right?]
> >> 
> >>     We also have in place a monitor that checks the # of active nodes.  If it
> >> falls below a certain percentage, then we get alerted and check on them en
> >> masse.   Worrying about one or two nodes going down probably means you need
> >> more nodes. :D
> >> 
> >> 
> > 
> > 
> > -- 
> > Lance Norskog
> > goksron@gmail.com
> 
> 

Re: monit? daemontools? jsvc? something else?

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 5, 2011, at 7:57 PM, Lance Norskog wrote:

> Isn't this what Ganglia is for?
> 

	No.

	Ganglia does metrics, not monitoring.


> On 1/5/11, Allen Wittenauer <aw...@linkedin.com> wrote:
>> 
>> On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote:
>> 
>>> Ah, more manual work! :(
>>> 
>>> You guys never have JVM die "just because"? I just had a DN's JVM die the
>>> other day "just because and with no obvious cause".  Restarting it brought
>>> it
>>> back to life, everything recovered smoothly.  Had some automated tool done
>>> the
>>> restart for me, I'd be even happier.
>> 
>> 	In the case of Hadoop, no.  There has usually been at least a core dump,
>> message in syslog, message in datanode log, etc, etc.   [You *do* have cores
>> enabled, right?]
>> 
>> 	We also have in place a monitor that checks the # of active nodes.  If it
>> falls below a certain percentage, then we get alerted and check on them en
>> masse.   Worrying about one or two nodes going down probably means you need
>> more nodes. :D
>> 
>> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com


Re: monit? daemontools? jsvc? something else?

Posted by Lance Norskog <go...@gmail.com>.
Isn't this what Ganglia is for?

On 1/5/11, Allen Wittenauer <aw...@linkedin.com> wrote:
>
> On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote:
>
>> Ah, more manual work! :(
>>
>> You guys never have JVM die "just because"? I just had a DN's JVM die the
>> other day "just because and with no obvious cause".  Restarting it brought
>> it
>> back to life, everything recovered smoothly.  Had some automated tool done
>> the
>> restart for me, I'd be even happier.
>
> 	In the case of Hadoop, no.  There has usually been at least a core dump,
> message in syslog, message in datanode log, etc, etc.   [You *do* have cores
> enabled, right?]
>
> 	We also have in place a monitor that checks the # of active nodes.  If it
> falls below a certain percentage, then we get alerted and check on them en
> masse.   Worrying about one or two nodes going down probably means you need
> more nodes. :D
>
>


-- 
Lance Norskog
goksron@gmail.com

Re: monit? daemontools? jsvc? something else?

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote:

> Ah, more manual work! :(
> 
> You guys never have JVM die "just because"? I just had a DN's JVM die the 
> other day "just because and with no obvious cause".  Restarting it brought it 
> back to life, everything recovered smoothly.  Had some automated tool done the 
> restart for me, I'd be even happier.

	In the case of Hadoop, no.  There has usually been at least a core dump, message in syslog, message in datanode log, etc, etc.   [You *do* have cores enabled, right?]

	We also have in place a monitor that checks the # of active nodes.  If it falls below a certain percentage, then we get alerted and check on them en masse.   Worrying about one or two nodes going down probably means you need more nodes. :D


Re: monit? daemontools? jsvc? something else?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Ah, more manual work! :(

You guys never have a JVM die "just because"?  I just had a DN's JVM die the 
other day "just because and with no obvious cause".  Restarting it brought it 
back to life, everything recovered smoothly.  Had some automated tool done the 
restart for me, I'd be even happier.

But I'll have to take your advice. :(

Does anyone else have a different opinion?
Actually, is anyone using any such tools and *not* seeing problems when 
they kick in and do their job of restarting dead processes?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: Brian Bockelman <bb...@cse.unl.edu>
> To: common-user@hadoop.apache.org
> Sent: Tue, January 4, 2011 8:43:46 AM
> Subject: Re: monit? daemontools? jsvc? something else?
> 
> I'll second this opinion.  Although there are some tools in life that need to
> be actively managed like this (and even then, sometimes management tools can be
> set to be too aggressive, making a bad situation terrible), HDFS is not one.
> 
> If the JVM dies, you likely need a human brain to log in and figure out what's
> wrong - or just keep that node dead.
> 
> Brian
> 
> On Jan 3,  2011, at 10:40 PM, Allen Wittenauer wrote:
> 
> > 
> > On Jan 3, 2011,  at 2:22 AM, Otis Gospodnetic wrote:
> >> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use
> >> tools like monit and daemontools (and a few other ones) to revive their
> >> Hadoop processes when they die.
> >> 
> > 
> >     I'm not a fan of doing this for Hadoop processes, even TaskTrackers and
> > DataNodes.  The processes generally die for a reason, usually indicating that
> > something is wrong with the box.  Restarting those processes may potentially
> > hide issues.
> 
> 

Re: monit? daemontools? jsvc? something else?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
I'll second this opinion.  Although there are some tools in life that need to be actively managed like this (and even then, sometimes management tools can be set to be too aggressive, making a bad situation terrible), HDFS is not one.

If the JVM dies, you likely need a human brain to log in and figure out what's wrong - or just keep that node dead.

Brian

On Jan 3, 2011, at 10:40 PM, Allen Wittenauer wrote:

> 
> On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
>> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use 
>> tools like monit and daemontools (and a few other ones) to revive their 
>> Hadoop processes when they die.
>> 
> 
> 	I'm not a fan of doing this for Hadoop processes, even TaskTrackers and DataNodes.  The processes generally die for a reason, usually indicating that something is wrong with the box.  Restarting those processes may potentially hide issues.


Re: monit? daemontools? jsvc? something else?

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use 
> tools like monit and daemontools (and a few other ones) to revive their 
> Hadoop processes when they die.
> 

	I'm not a fan of doing this for Hadoop processes, even TaskTrackers and DataNodes.  The processes generally die for a reason, usually indicating that something is wrong with the box.  Restarting those processes may potentially hide issues.