Posted to common-user@hadoop.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2011/01/03 11:22:30 UTC
monit? daemontools? jsvc? something else?
Hello,
I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use
tools like monit and daemontools (and a few other ones) to revive their
Hadoop processes when they die.
Questions:
1. Is one of these tools better than others for Hadoop?
2. Is there a tool the community recommends?
3. Does anyone know if a future version of CDH will ship with one of these?
The current CDH3 doesn't include them, so when a process dies you have to restart
it manually.
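For concreteness, the sort of setup I have in mind is a monit stanza roughly like
this (the process name, pidfile, and init-script paths below are just placeholders,
not from any actual CDH install; adjust for your own layout):

```
check process hadoop-datanode with pidfile /var/run/hadoop/datanode.pid
  start program = "/etc/init.d/hadoop-datanode start"
  stop program  = "/etc/init.d/hadoop-datanode stop"
  # give up (rather than flap forever) if the process keeps dying
  if 5 restarts within 5 cycles then timeout
```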
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
Re: monit? daemontools? jsvc? something else?
Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 6, 2011, at 12:39 AM, Otis Gospodnetic wrote:
>> In the case of Hadoop, no. There has usually been at least a core dump,
>> message in syslog, message in datanode log, etc, etc. [You *do* have cores
>> enabled, right?]
>
> Hm, "cores enabled".... what do you mean by that? Are you referring to the JVM
> heap-dump -XX argument (-XX:+HeapDumpOnOutOfMemoryError)? If not, I'm all
> eyes/ears!
I'm talking about system-level core dumps, i.e., ulimit -c and friends. [I'm much more of a systems programmer than a Java guy, so ...] You can definitely write Java code that will make the JVM crash due to misuse of threading libraries. There are also CPU, kernel, and BIOS bugs that I've seen cause the JVM to crash. Usually jstack or a core will lead the way to patching the system to work around these issues.
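To illustrate "cores enabled": here is the programmatic counterpart of ulimit -c, sketched in Python (Unix-only; for the Hadoop daemons themselves you would raise the limit in their startup scripts, and whether a core actually lands also depends on the kernel's core_pattern and the directory the daemon runs in):

```python
import resource

# Read the current core-dump size limit for this process
# (what "ulimit -c" reports in the shell).
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core soft limit:", soft, "hard limit:", hard)

# A soft limit of 0 means no core file is written when the process
# crashes. Raise the soft limit to the hard limit so a crashing
# child (e.g. a JVM launched from here) can leave a core behind.
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
assert soft == hard
```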
>
>> We also have in place a monitor that checks the # of active nodes. If it
>> falls below a certain percentage, then we get alerted and check on them en
>> masse. Worrying about one or two nodes going down probably means you need
>> more nodes. :D
>>
>
> That's probably right. :)
> So what do you use for monitoring the # of active nodes?
We currently have a custom plugin for Nagios that screen scrapes the NN and JT web UI. When a certain percentage of nodes dies, we get alerted that we need to take a look and start bringing stuff back up. [We used the same approach at Y!, so it does work at scale.]
I'm hoping to replace this (and Ganglia) with something better over the next year.... ;)
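The threshold logic behind a check like that boils down to something like the following sketch (not the actual plugin; in practice the live/total counts would come from scraping the NN or JT web UI, and the 90%/75% thresholds here are made-up defaults):

```python
def live_node_status(live, total, warn_pct=90.0, crit_pct=75.0):
    """Map a live-node count to a Nagios-style status string.

    Returns "OK", "WARNING", or "CRITICAL" depending on what
    percentage of the cluster's nodes are still reporting in.
    """
    if total <= 0:
        return "CRITICAL"
    pct = 100.0 * live / total
    if pct < crit_pct:
        return "CRITICAL"
    if pct < warn_pct:
        return "WARNING"
    return "OK"

# 98 of 100 DataNodes alive is business as usual; 80 warrants a
# look; 50 means wake somebody up.
assert live_node_status(98, 100) == "OK"
assert live_node_status(80, 100) == "WARNING"
assert live_node_status(50, 100) == "CRITICAL"
```

Alerting on a percentage rather than on individual nodes is what makes this workable at scale: one or two dead nodes stay below the noise floor, as suggested above.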
Re: monit? daemontools? jsvc? something else?
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,
----- Original Message ----
> From: Allen Wittenauer <aw...@linkedin.com>
> > You guys never have JVM die "just because"? I just had a DN's JVM die the
> > other day "just because and with no obvious cause". Restarting it brought it
> > back to life, everything recovered smoothly. Had some automated tool done the
> > restart for me, I'd be even happier.
>
> In the case of Hadoop, no. There has usually been at least a core dump,
> message in syslog, message in datanode log, etc, etc. [You *do* have cores
> enabled, right?]
Hm, "cores enabled".... what do you mean by that? Are you referring to the JVM
heap-dump -XX argument (-XX:+HeapDumpOnOutOfMemoryError)? If not, I'm all
eyes/ears!
> We also have in place a monitor that checks the # of active nodes. If it
> falls below a certain percentage, then we get alerted and check on them en
> masse. Worrying about one or two nodes going down probably means you need
> more nodes. :D
>
That's probably right. :)
So what do you use for monitoring the # of active nodes?
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
Re: monit? daemontools? jsvc? something else?
Posted by Otis Gospodnetic <ot...@yahoo.com>.
So Allen, what do you use to monitor those processes/nodes?
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
----- Original Message ----
> From: Allen Wittenauer <aw...@linkedin.com>
> To: "<co...@hadoop.apache.org>" <co...@hadoop.apache.org>
> Sent: Wed, January 5, 2011 11:54:22 PM
> Subject: Re: monit? daemontools? jsvc? something else?
>
>
> On Jan 5, 2011, at 7:57 PM, Lance Norskog wrote:
>
> > Isn't this what Ganglia is for?
> >
>
> No.
>
> Ganglia does metrics, not monitoring.
>
>
> > On 1/5/11, Allen Wittenauer <aw...@linkedin.com> wrote:
> >>
> >> On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote:
> >>
> >>> Ah, more manual work! :(
> >>>
> >>> You guys never have JVM die "just because"? I just had a DN's JVM die the
> >>> other day "just because and with no obvious cause". Restarting it brought it
> >>> back to life, everything recovered smoothly. Had some automated tool done the
> >>> restart for me, I'd be even happier.
> >>
> >> In the case of Hadoop, no. There has usually been at least a core dump,
> >> message in syslog, message in datanode log, etc, etc. [You *do* have cores
> >> enabled, right?]
> >>
> >> We also have in place a monitor that checks the # of active nodes. If it
> >> falls below a certain percentage, then we get alerted and check on them en
> >> masse. Worrying about one or two nodes going down probably means you need
> >> more nodes. :D
> >>
> >>
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
>
>
Re: monit? daemontools? jsvc? something else?
Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 5, 2011, at 7:57 PM, Lance Norskog wrote:
> Isn't this what Ganglia is for?
>
No.
Ganglia does metrics, not monitoring.
> On 1/5/11, Allen Wittenauer <aw...@linkedin.com> wrote:
>>
>> On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote:
>>
>>> Ah, more manual work! :(
>>>
>>> You guys never have JVM die "just because"? I just had a DN's JVM die the
>>> other day "just because and with no obvious cause". Restarting it brought
>>> it
>>> back to life, everything recovered smoothly. Had some automated tool done
>>> the
>>> restart for me, I'd be even happier.
>>
>> In the case of Hadoop, no. There has usually been at least a core dump,
>> message in syslog, message in datanode log, etc, etc. [You *do* have cores
>> enabled, right?]
>>
>> We also have in place a monitor that checks the # of active nodes. If it
>> falls below a certain percentage, then we get alerted and check on them en
>> masse. Worrying about one or two nodes going down probably means you need
>> more nodes. :D
>>
>>
>
>
> --
> Lance Norskog
> goksron@gmail.com
Re: monit? daemontools? jsvc? something else?
Posted by Lance Norskog <go...@gmail.com>.
Isn't this what Ganglia is for?
On 1/5/11, Allen Wittenauer <aw...@linkedin.com> wrote:
>
> On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote:
>
>> Ah, more manual work! :(
>>
>> You guys never have JVM die "just because"? I just had a DN's JVM die the
>> other day "just because and with no obvious cause". Restarting it brought
>> it
>> back to life, everything recovered smoothly. Had some automated tool done
>> the
>> restart for me, I'd be even happier.
>
> In the case of Hadoop, no. There has usually been at least a core dump,
> message in syslog, message in datanode log, etc, etc. [You *do* have cores
> enabled, right?]
>
> We also have in place a monitor that checks the # of active nodes. If it
> falls below a certain percentage, then we get alerted and check on them en
> masse. Worrying about one or two nodes going down probably means you need
> more nodes. :D
>
>
--
Lance Norskog
goksron@gmail.com
Re: monit? daemontools? jsvc? something else?
Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 4, 2011, at 10:29 PM, Otis Gospodnetic wrote:
> Ah, more manual work! :(
>
> You guys never have JVM die "just because"? I just had a DN's JVM die the
> other day "just because and with no obvious cause". Restarting it brought it
> back to life, everything recovered smoothly. Had some automated tool done the
> restart for me, I'd be even happier.
In the case of Hadoop, no. There has usually been at least a core dump, message in syslog, message in datanode log, etc, etc. [You *do* have cores enabled, right?]
We also have in place a monitor that checks the # of active nodes. If it falls below a certain percentage, then we get alerted and check on them en masse. Worrying about one or two nodes going down probably means you need more nodes. :D
Re: monit? daemontools? jsvc? something else?
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Ah, more manual work! :(
You guys never have JVM die "just because"? I just had a DN's JVM die the
other day "just because and with no obvious cause". Restarting it brought it
back to life, everything recovered smoothly. Had some automated tool done the
restart for me, I'd be even happier.
But I'll have to take your advice. :(
Does anyone else have a different opinion?
Actually, is anyone out there using any such tools and *not* seeing problems when
they kick in and do their job of restarting dead processes?
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
----- Original Message ----
> From: Brian Bockelman <bb...@cse.unl.edu>
> To: common-user@hadoop.apache.org
> Sent: Tue, January 4, 2011 8:43:46 AM
> Subject: Re: monit? daemontools? jsvc? something else?
>
> I'll second this opinion. Although there are some tools in life that need to
> be actively managed like this (and even then, sometimes management tools can be
> set to be too aggressive, making a bad situation terrible), HDFS is not one.
>
> If the JVM dies, you likely need a human brain to log in and figure out what's
> wrong - or just keep that node dead.
>
> Brian
>
> On Jan 3, 2011, at 10:40 PM, Allen Wittenauer wrote:
>
> >
> > On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
> >> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use
> >> tools like monit and daemontools (and a few other ones) to revive their
> >> Hadoop processes when they die.
> >>
> >
> > I'm not a fan of doing this for Hadoop processes, even TaskTrackers and
> > DataNodes. The processes generally die for a reason, usually indicating that
> > something is wrong with the box. Restarting those processes may potentially
> > hide issues.
>
>
Re: monit? daemontools? jsvc? something else?
Posted by Brian Bockelman <bb...@cse.unl.edu>.
I'll second this opinion. Although there are some tools in life that need to be actively managed like this (and even then, sometimes management tools can be set to be too aggressive, making a bad situation terrible), HDFS is not one.
If the JVM dies, you likely need a human brain to log in and figure out what's wrong - or just keep that node dead.
Brian
On Jan 3, 2011, at 10:40 PM, Allen Wittenauer wrote:
>
> On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
>> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use
>> tools like monit and daemontools (and a few other ones) to revive their
>> Hadoop processes when they die.
>>
>
> I'm not a fan of doing this for Hadoop processes, even TaskTrackers and DataNodes. The processes generally die for a reason, usually indicating that something is wrong with the box. Restarting those processes may potentially hide issues.
Re: monit? daemontools? jsvc? something else?
Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do* use
> tools like monit and daemontools (and a few other ones) to revive their
> Hadoop processes when they die.
>
I'm not a fan of doing this for Hadoop processes, even TaskTrackers and DataNodes. The processes generally die for a reason, usually indicating that something is wrong with the box. Restarting those processes may potentially hide issues.