Posted to user@flume.apache.org by Matthew Rathbone <ma...@foursquare.com> on 2011/08/26 17:03:59 UTC

Flume Master Issues

Hey all,

We're having totally unpredictable issues with our flume master installation lately. Here's what happened to us last night / today:

YESTERDAY
Yesterday we added 8 new nodes to flume. They got set up fine, and the configs were registered.
A few hours later the master totally stops responding to anything (web/shell/nodes). I don't find out until this morning.

TODAY
I try to stop it using the init script. That doesn't do anything: it continues to run, but stays unresponsive.
I kill -9 the flume processes and remove the pid file, figuring I can just start it again.

Now the master won't start: "master already running on pid=<non-existent-pid>".
When I finally get it to start (by changing the pid directory), it goes unresponsive again.
I restart it, and it does the same.
I stop all flume-nodes and restart it, and it looks good. I start the flume nodes, and it goes unresponsive again.
I restart it, and this time it works.
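
A hang like this can sit unnoticed for hours. As a minimal sketch (not part of the flume tooling), a periodic TCP probe could flag it sooner. The host and port are placeholders to fill in for your own master, not values from this thread. A process that is hung but still accepting connections would pass this check, so treat it as a coarse signal only.

```python
import socket


def is_responsive(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    `host` and `port` are placeholders for wherever the master listens;
    they are assumptions, not values taken from the thread.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from cron or a monitoring agent, a False result would be the cue to look at the master before the morning.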


The only log statement above INFO level that I can see is this:
2011-08-26 14:38:34,527 WARN com.cloudera.flume.agent.FlumeNode: Unable to load output format plugin class - Class not found


but I don't think that's causing the issues.
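
For anyone wanting to pull everything above INFO out of a large log, a small filter helps. This is only a sketch: it assumes every line starts with a timestamp and level laid out like the WARN line above, and the level names and their ordering are the usual log4j ones, which is an assumption about this install.

```python
import re

# Assumed log4j-style level ordering; adjust if the install uses other names.
LEVELS = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4, "FATAL": 5}

# Matches e.g. "2011-08-26 14:38:34,527 WARN com.cloudera..." and captures the level.
LINE_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+) ")


def lines_above_info(lines):
    """Yield log lines whose level is strictly above INFO.

    Lines that do not match the expected layout (stack traces,
    continuation lines) are skipped.
    """
    for line in lines:
        match = LINE_RE.match(line)
        if match and LEVELS.get(match.group(1), -1) > LEVELS["INFO"]:
            yield line.rstrip("\n")
```

Passing an open file object works, since the function only iterates over lines.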


I do have a flume-node running on the same machine. Could there be some sort of race condition happening?
Has anyone else seen behavior like this?
Any idea how to fix it?

Hoping someone can shed some light on this, I'm really not sure what's going on.

Thanks all 

-- 
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matthew@foursquare.com | @rathboma (http://twitter.com/rathboma) | 4sq (http://foursquare.com/rathboma)



Re: Flume Master Issues

Posted by Matthew Rathbone <ma...@foursquare.com>.
I've made sure all of our flume machines are on the same version, but that doesn't seem to help.

Whenever I start a new node, it takes down the master. This happens every single time. I have to restart everything in a very specific order to make sure it starts working again.

It's probably something to do with my use of the rpcSource, which causes issues in other areas too. 




On Sunday, August 28, 2011 at 9:36 AM, Bao Thai Ngo wrote:

> Mike,
> 
> I had the same problem with flume master. Try to remove flume and its init script at Master machine, then re-install flume master again. Just remember to save your configuration first.
> 
> Good luck.
> 
> ~Thai
> 


Re: Flume Master Issues

Posted by Bao Thai Ngo <ba...@gmail.com>.
Mike,

I had the same problem with flume master. Try removing flume and its init
script on the master machine, then re-install the flume master. Just remember
to save your configuration first.

Good luck.

~Thai

On Fri, Aug 26, 2011 at 10:55 PM, Mike <mi...@gmail.com> wrote:

> I'd also ensure that all nodes/masters/collectors/etc are using the
> precise same build of flume.
>

Re: Flume Master Issues

Posted by Mike <mi...@gmail.com>.
I'd also ensure that all nodes/masters/collectors/etc are using the
precise same build of flume.
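
A quick way to spot a mismatch once you have collected a build string from each machine is sketched below. How you collect the strings is up to you and not shown; the host names and build strings in the usage note are made up for illustration.

```python
from collections import defaultdict


def group_by_build(builds):
    """Map each build string to the sorted list of hosts reporting it.

    `builds` is a dict of host name -> build string, gathered however
    suits your setup (this sketch does not collect them).
    """
    groups = defaultdict(list)
    for host, build in builds.items():
        groups[build].append(host)
    return {build: sorted(hosts) for build, hosts in groups.items()}


def mismatched(builds):
    """Return the build groups if more than one build is in use, else an empty dict."""
    groups = group_by_build(builds)
    return groups if len(groups) > 1 else {}
```

For example, `mismatched({"master": "build-a", "node1": "build-b"})` returns both groups, while a fleet on a single build returns `{}`.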

On Fri, Aug 26, 2011 at 11:53 AM, Matthew Rathbone
<ma...@foursquare.com> wrote:
> Ah, I'm seeing this on single-master mode :-/. Anywhere else you think I
> could look for useful debugging output?

Re: Flume Master Issues

Posted by Matthew Rathbone <ma...@foursquare.com>.
Ah, I'm seeing this on single-master mode :-/. Anywhere else you think I could look for useful debugging output?




On Friday, August 26, 2011 at 10:34 AM, Mike wrote:

> I did - but that was when we were testing multi-master mode, and since
> it's not fully matured yet, I've gone back to a single master.
> 


Re: Flume Master Issues

Posted by Mike <mi...@gmail.com>.
I did - but that was when we were testing multi-master mode, and since
it's not fully matured yet, I've gone back to a single master.

On Fri, Aug 26, 2011 at 11:32 AM, Matthew Rathbone
<ma...@foursquare.com> wrote:
> You're right, there's another pid file there, that's crazy.
> Have you experienced the unresponsiveness thing too?

Re: Flume Master Issues

Posted by Matthew Rathbone <ma...@foursquare.com>.
You're right, there's another pid file there. That's crazy.

Have you experienced the unresponsiveness thing too? 




On Friday, August 26, 2011 at 10:17 AM, Mike wrote:

> I recall a similar problem I had with this.
> 
> It ended up being another pid-style file dropped somewhere else.
> 
> /var/run/flume/flume-flume-master.pid
> /tmp/flumemaster.pid
> 
> See if those are still around once all the flume procs are dead.
> 
> -M
> 


Re: Flume Master Issues

Posted by Mike <mi...@gmail.com>.
I recall a similar problem I had with this.

It ended up being another pid-style file dropped somewhere else.

/var/run/flume/flume-flume-master.pid
/tmp/flumemaster.pid

See if those are still around once all the flume procs are dead.

-M
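
The check described above can be sketched in a few lines. This assumes a POSIX system and that each pid file holds a single integer, which is an assumption about flume's files rather than something confirmed in this thread. The two paths in the usage note are the ones listed above.

```python
import os


def pid_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, never a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stale_pid_files(paths):
    """Return the paths that exist but do not name a running process."""
    stale = []
    for path in paths:
        try:
            with open(path) as handle:
                pid = int(handle.read().strip())
        except FileNotFoundError:
            continue  # no file, nothing to clean up
        except ValueError:
            stale.append(path)  # contents are not a pid (assumed format violated)
            continue
        if not pid_alive(pid):
            stale.append(path)
    return stale
```

Calling `stale_pid_files(["/var/run/flume/flume-flume-master.pid", "/tmp/flumemaster.pid"])` after killing the flume processes lists whichever of the two files were left behind.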
