You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flume.apache.org by Juhani Connolly <ju...@cyberagent.co.jp> on 2013/01/17 04:45:01 UTC

EventDrivenSource and dead threads

I came upon an issue with ScribeSource,  though it's theoretically 
applicable to any EventDrivenSource whose event generating thread(s) 
die. Simple put, sending a bad packet to the thrift(scribe protocol) 
port will result in it trying to allocate space for some arbitrarily 
large packet resulting in an OOMException which kills the 
thread(incidentally I thought this would be an issue in avro too, but it 
throws an exception before making excessive allocation requests).

As far as flume is concerned, the component is still alive. stop() was 
never called, so even monitoring the component state using jmx will not 
notice anything wrong. This situation occurs from user error, but there 
is potential for other errors leaving a zombie component. I think it 
would be more user friendly to be able to recover from such errors.

I'm thinking of adding a StatusPollable interface that 
EventDrivenSources can optionally implement(because we can't change the 
interface without a version change). If implemented, the 
EventDrivenSourceRunner would schedule a regular poll to check the 
state. Upon failure it could either call stop() to signal it broke. With 
autoRestartPolicy, the source would then get restarted by its supervisor.

Would appreciate any opinions before I put together a patch/post an issue.

Re: EventDrivenSource and dead threads

Posted by Brock Noland <br...@cloudera.com>.

Hi,

Yeah I'd prefer to avoid overriding getLifeCycle to do this since
getters are typically lightweight. As such, I feel it violates the
general contract behind getters.

In regards to channels/sinks/pollable sources, I don't think they need
this behavior as sinks and poll-able sources both have threads polling
them at regular intervals. We have two robust channels memory and file
which don't require this behavior as well. Given this, I feel like
this is limited to EventDrivenSources which don't have a thread
polling them.

On second thought, FLUME-1630 got rid of one layer of threads in the
flume agent, perhaps adding a single threaded scheduled executor to
EventDrivenSource is not a bad solution.

Brock

On Wed, Jan 16, 2013 at 11:02 PM, Connor Woodson <cw...@gmail.com> wrote:
> What I was trying to say is that the ScribeSource should not be responsible
> for restarting itself (just from my understanding of your idea; it breaks
> the existing paradigm as components should not have to control their own
> lifecycle). Going from Brock's link, I feel the most dynamic solution would
> be possible to just add in a lifecyle state RESTART and place it in that
> switch statement; when that state is reached, it tries to stop then restart
> the component and then sets the desired state to START (or STOP if it
> couldn't start it again, or set error=true if it couldn't stop it).
>
> And in a way to prevent overriding getLifecycleState to return RESTART,
> there could either be an inherited function from AbstractSource to call for
> a restart, there could maybe be an option to set it to RESTART on crash, or
> something else.
>
> I'll admit though that I know little about the lifecycle system, so I have
> no idea if this idea is any better.
>
> - Connor
>
>
> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>> Sink, Source and Channel all extend LifecycleAware, so the function is
>> available to all components already.
>>
>> I was more questioning whether it was reasonable to start including logic
>> to determine the state. That being said, I think the precedent of just
>> returning the state set by start/stop is more one of habit, so on further
>> thought I don't see it as being unreasonable. I'm going to give fixing
>> ScribeSource with it a poke.
>>
>> As to new lifecycle states, I took a pass at reworking the lifecycle model
>> with guavas service implementation in the past, but it took some very
>> significant changes and didn't get the momentum/interest necessary to keep
>> working on it. That ticket is here https://issues.apache.org/**
>> jira/browse/FLUME-966 <https://issues.apache.org/jira/browse/FLUME-966> .
>> Brock might be working on it now though the issue doesn't appear to have
>> had attention since october.
>>
>>
>> On 01/17/2013 03:01 PM, Connor Woodson wrote:
>>
>>> Why limit it to the sources? If there is going to be a change to one
>>> component's lifecycle, then I see no reason not to change every
>>> component's
>>> lifecycle.
>>>
>>> Sinks and Channels could very well have this problem; so what about giving
>>> each LifecycleAware component a takePulse method (or something); or to
>>> avoid creating a new method, add a new lifecycle state 'crashed' or such
>>> which, when detected, causes a restart of the component. Then components
>>> would just need to override getLifecycleState (if this method is polled
>>> regularly; I don't know. maybe use a listener for when there's a state
>>> change) to detect if it has crashed/needs to be restarted.
>>>
>>> Just my thoughts,
>>>
>>> - Connor
>>>
>>>
>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
>>> juhani_connolly@cyberagent.co.**jp <ju...@cyberagent.co.jp>>
>>> wrote:
>>>
>>>  Hmm, overriding the implementation of getLifecycleState provided by
>>>> AbstractSource could work. It would be going against the convention that
>>>> has been maintained in all other components(that I can think of)
>>>>
>>>>
>>>> On 01/17/2013 01:20 PM, Brock Noland wrote:
>>>>
>>>>  Hi,
>>>>>
>>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet
>>>>> another thread. An alternative which wouldn't require another thread
>>>>> would be to check the optional interface in the supervisor,
>>>>> approximately here:
>>>>>
>>>>> https://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**>
>>>>> core/src/main/java/org/apache/****flume/lifecycle/**
>>>>> LifecycleSupervisor.java#L240<**https://github.com/apache/**
>>>>> flume/blob/trunk/flume-ng-**core/src/main/java/org/apache/**
>>>>> flume/lifecycle/**LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>
>>>>> >
>>>>>
>>>>>
>>>>> However, I am not sold on the supervisor being the best place to fix
>>>>> this as I am not sure that other lifecycle components would need this.
>>>>>
>>>>>
>>>>>
>>>>> Brock
>>>>>
>>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
>>>>> <juhani_connolly@cyberagent.****co.jp <http://co.jp> <
>>>>> juhani_connolly@cyberagent.**co.jp <ju...@cyberagent.co.jp>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>>  I came upon an issue with ScribeSource,  though it's theoretically
>>>>>> applicable to any EventDrivenSource whose event generating thread(s)
>>>>>> die.
>>>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port
>>>>>> will
>>>>>> result in it trying to allocate space for some arbitrarily large packet
>>>>>> resulting in an OOMException which kills the thread(incidentally I
>>>>>> thought
>>>>>> this would be an issue in avro too, but it throws an exception before
>>>>>> making
>>>>>> excessive allocation requests).
>>>>>>
>>>>>> As far as flume is concerned, the component is still alive. stop() was
>>>>>> never
>>>>>> called, so even monitoring the component state using jmx will not
>>>>>> notice
>>>>>> anything wrong. This situation occurs from user error, but there is
>>>>>> potential for other errors leaving a zombie component. I think it would
>>>>>> be
>>>>>> more user friendly to be able to recover from such errors.
>>>>>>
>>>>>> I'm thinking of adding a StatusPollable interface that
>>>>>> EventDrivenSources
>>>>>> can optionally implement(because we can't change the interface without
>>>>>> a
>>>>>> version change). If implemented, the EventDrivenSourceRunner would
>>>>>> schedule
>>>>>> a regular poll to check the state. Upon failure it could either call
>>>>>> stop()
>>>>>> to signal it broke. With autoRestartPolicy, the source would then get
>>>>>> restarted by its supervisor.
>>>>>>
>>>>>> Would appreciate any opinions before I put together a patch/post an
>>>>>> issue.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: EventDrivenSource and dead threads

Posted by Connor Woodson <cw...@gmail.com>.

I don't know if this is within the scope of this change, but sinks
sometimes will need to be restarted; for instance, I had an HDFS sink crash
from an Out of Memory error (caused by the JIRA I filed; I wasn't planning
on using a large heap) and it doesn't automatically restart; I don't know
if there's a nice way to detect of a sink/source has crashed, but if so it
would be nice to have them auto-restart when they go down.

- Connor


On Thu, Jan 17, 2013 at 12:27 AM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

> What I described isn't really taking control of lifecycle.
>
> What would happen is:
> - source start. State=start
> - OOM exception happens
> - getLifecycleState called, override calls server.isServing(). This is a
> lightweight call. If it returns false, it either calls stop() or local
> lifeCycleState is set to ERROR (I'm not actually sure how this is handled).
> The latter is probably the correct call in this case, but I'm not entirely
> sure if the supervisor will ever restart it then.
>
> The supervisor would still be doing any restarting. I guess it's really a
> matter of who detects the bad state(getLifecycleState or a scheduled
> runnable) and what that call is allowed to do(switch state to ERROR or call
> stop()). When the source breaks, changing its status to accurately reflect
> this is not breaking the paradigm(though calling stop() may be)
>
> It looks to me like switching the state to ERROR should be sufficient...
> Monitor should then try to start the source again. However not calling
> stop() before that may cause resource leakage? Could someone confirm the
> behavior in regards to ERROR?
>
>
> On 01/17/2013 04:02 PM, Connor Woodson wrote:
>
>> What I was trying to say is that the ScribeSource should not be
>> responsible
>> for restarting itself (just from my understanding of your idea; it breaks
>> the existing paradigm as components should not have to control their own
>> lifecycle). Going from Brock's link, I feel the most dynamic solution
>> would
>> be possible to just add in a lifecyle state RESTART and place it in that
>> switch statement; when that state is reached, it tries to stop then
>> restart
>> the component and then sets the desired state to START (or STOP if it
>> couldn't start it again, or set error=true if it couldn't stop it).
>>
>> And in a way to prevent overriding getLifecycleState to return RESTART,
>> there could either be an inherited function from AbstractSource to call
>> for
>> a restart, there could maybe be an option to set it to RESTART on crash,
>> or
>> something else.
>>
>> I'll admit though that I know little about the lifecycle system, so I have
>> no idea if this idea is any better.
>>
>> - Connor
>>
>>
>> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <
>> juhani_connolly@cyberagent.co.**jp <ju...@cyberagent.co.jp>>
>> wrote:
>>
>>  Sink, Source and Channel all extend LifecycleAware, so the function is
>>> available to all components already.
>>>
>>> I was more questioning whether it was reasonable to start including logic
>>> to determine the state. That being said, I think the precedent of just
>>> returning the state set by start/stop is more one of habit, so on further
>>> thought I don't see it as being unreasonable. I'm going to give fixing
>>> ScribeSource with it a poke.
>>>
>>> As to new lifecycle states, I took a pass at reworking the lifecycle
>>> model
>>> with guavas service implementation in the past, but it took some very
>>> significant changes and didn't get the momentum/interest necessary to
>>> keep
>>> working on it. That ticket is here https://issues.apache.org/**
>>> jira/browse/FLUME-966 <https://issues.apache.org/**jira/browse/FLUME-966<https://issues.apache.org/jira/browse/FLUME-966>>
>>> .
>>>
>>> Brock might be working on it now though the issue doesn't appear to have
>>> had attention since october.
>>>
>>>
>>> On 01/17/2013 03:01 PM, Connor Woodson wrote:
>>>
>>>  Why limit it to the sources? If there is going to be a change to one
>>>> component's lifecycle, then I see no reason not to change every
>>>> component's
>>>> lifecycle.
>>>>
>>>> Sinks and Channels could very well have this problem; so what about
>>>> giving
>>>> each LifecycleAware component a takePulse method (or something); or to
>>>> avoid creating a new method, add a new lifecycle state 'crashed' or such
>>>> which, when detected, causes a restart of the component. Then components
>>>> would just need to override getLifecycleState (if this method is polled
>>>> regularly; I don't know. maybe use a listener for when there's a state
>>>> change) to detect if it has crashed/needs to be restarted.
>>>>
>>>> Just my thoughts,
>>>>
>>>> - Connor
>>>>
>>>>
>>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
>>>> juhani_connolly@cyberagent.co.****jp <juhani_connolly@cyberagent.**
>>>> co.jp <ju...@cyberagent.co.jp>>>
>>>>
>>>> wrote:
>>>>
>>>>   Hmm, overriding the implementation of getLifecycleState provided by
>>>>
>>>>> AbstractSource could work. It would be going against the convention
>>>>> that
>>>>> has been maintained in all other components(that I can think of)
>>>>>
>>>>>
>>>>> On 01/17/2013 01:20 PM, Brock Noland wrote:
>>>>>
>>>>>   Hi,
>>>>>
>>>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet
>>>>>> another thread. An alternative which wouldn't require another thread
>>>>>> would be to check the optional interface in the supervisor,
>>>>>> approximately here:
>>>>>>
>>>>>> https://github.com/apache/******flume/blob/trunk/flume-ng-**<https://github.com/apache/****flume/blob/trunk/flume-ng-**>
>>>>>> <h**ttps://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**>
>>>>>> >
>>>>>> core/src/main/java/org/apache/******flume/lifecycle/**
>>>>>> LifecycleSupervisor.java#L240<****https://github.com/apache/**
>>>>>>
>>>>>> flume/blob/trunk/flume-ng-****core/src/main/java/org/apache/****
>>>>>> flume/lifecycle/****LifecycleSupervisor.java#L240<**
>>>>>> https://github.com/apache/**flume/blob/trunk/flume-ng-**
>>>>>> core/src/main/java/org/apache/**flume/lifecycle/**
>>>>>> LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>
>>>>>> >
>>>>>>
>>>>>> However, I am not sold on the supervisor being the best place to fix
>>>>>> this as I am not sure that other lifecycle components would need this.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Brock
>>>>>>
>>>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
>>>>>> <juhani_connolly@cyberagent.******co.jp <http://co.jp> <
>>>>>>
>>>>>> juhani_connolly@cyberagent.**c**o.jp <http://co.jp> <
>>>>>> juhani_connolly@cyberagent.**co.jp <ju...@cyberagent.co.jp>
>>>>>> >>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>   I came upon an issue with ScribeSource,  though it's theoretically
>>>>>>
>>>>>>> applicable to any EventDrivenSource whose event generating thread(s)
>>>>>>> die.
>>>>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port
>>>>>>> will
>>>>>>> result in it trying to allocate space for some arbitrarily large
>>>>>>> packet
>>>>>>> resulting in an OOMException which kills the thread(incidentally I
>>>>>>> thought
>>>>>>> this would be an issue in avro too, but it throws an exception before
>>>>>>> making
>>>>>>> excessive allocation requests).
>>>>>>>
>>>>>>> As far as flume is concerned, the component is still alive. stop()
>>>>>>> was
>>>>>>> never
>>>>>>> called, so even monitoring the component state using jmx will not
>>>>>>> notice
>>>>>>> anything wrong. This situation occurs from user error, but there is
>>>>>>> potential for other errors leaving a zombie component. I think it
>>>>>>> would
>>>>>>> be
>>>>>>> more user friendly to be able to recover from such errors.
>>>>>>>
>>>>>>> I'm thinking of adding a StatusPollable interface that
>>>>>>> EventDrivenSources
>>>>>>> can optionally implement(because we can't change the interface
>>>>>>> without
>>>>>>> a
>>>>>>> version change). If implemented, the EventDrivenSourceRunner would
>>>>>>> schedule
>>>>>>> a regular poll to check the state. Upon failure it could either call
>>>>>>> stop()
>>>>>>> to signal it broke. With autoRestartPolicy, the source would then get
>>>>>>> restarted by its supervisor.
>>>>>>>
>>>>>>> Would appreciate any opinions before I put together a patch/post an
>>>>>>> issue.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>

Re: EventDrivenSource and dead threads

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.

What I described isn't really taking control of lifecycle.

What would happen is:
- source start. State=start
- OOM exception happens
- getLifecycleState called, override calls server.isServing(). This is a 
lightweight call. If it returns false, it either calls stop() or local 
lifeCycleState is set to ERROR (I'm not actually sure how this is 
handled). The latter is probably the correct call in this case, but I'm 
not entirely sure if the supervisor will ever restart it then.

The supervisor would still be doing any restarting. I guess it's really 
a matter of who detects the bad state(getLifecycleState or a scheduled 
runnable) and what that call is allowed to do(switch state to ERROR or 
call stop()). When the source breaks, changing its status to accurately 
reflect this is not breaking the paradigm(though calling stop() may be)

It looks to me like switching the state to ERROR should be sufficient... 
Monitor should then try to start the source again. However not calling 
stop() before that may cause resource leakage? Could someone confirm the 
behavior in regards to ERROR?

On 01/17/2013 04:02 PM, Connor Woodson wrote:
> What I was trying to say is that the ScribeSource should not be responsible
> for restarting itself (just from my understanding of your idea; it breaks
> the existing paradigm as components should not have to control their own
> lifecycle). Going from Brock's link, I feel the most dynamic solution would
> be possible to just add in a lifecyle state RESTART and place it in that
> switch statement; when that state is reached, it tries to stop then restart
> the component and then sets the desired state to START (or STOP if it
> couldn't start it again, or set error=true if it couldn't stop it).
>
> And in a way to prevent overriding getLifecycleState to return RESTART,
> there could either be an inherited function from AbstractSource to call for
> a restart, there could maybe be an option to set it to RESTART on crash, or
> something else.
>
> I'll admit though that I know little about the lifecycle system, so I have
> no idea if this idea is any better.
>
> - Connor
>
>
> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>> Sink, Source and Channel all extend LifecycleAware, so the function is
>> available to all components already.
>>
>> I was more questioning whether it was reasonable to start including logic
>> to determine the state. That being said, I think the precedent of just
>> returning the state set by start/stop is more one of habit, so on further
>> thought I don't see it as being unreasonable. I'm going to give fixing
>> ScribeSource with it a poke.
>>
>> As to new lifecycle states, I took a pass at reworking the lifecycle model
>> with guavas service implementation in the past, but it took some very
>> significant changes and didn't get the momentum/interest necessary to keep
>> working on it. That ticket is here https://issues.apache.org/**
>> jira/browse/FLUME-966 <https://issues.apache.org/jira/browse/FLUME-966> .
>> Brock might be working on it now though the issue doesn't appear to have
>> had attention since october.
>>
>>
>> On 01/17/2013 03:01 PM, Connor Woodson wrote:
>>
>>> Why limit it to the sources? If there is going to be a change to one
>>> component's lifecycle, then I see no reason not to change every
>>> component's
>>> lifecycle.
>>>
>>> Sinks and Channels could very well have this problem; so what about giving
>>> each LifecycleAware component a takePulse method (or something); or to
>>> avoid creating a new method, add a new lifecycle state 'crashed' or such
>>> which, when detected, causes a restart of the component. Then components
>>> would just need to override getLifecycleState (if this method is polled
>>> regularly; I don't know. maybe use a listener for when there's a state
>>> change) to detect if it has crashed/needs to be restarted.
>>>
>>> Just my thoughts,
>>>
>>> - Connor
>>>
>>>
>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
>>> juhani_connolly@cyberagent.co.**jp <ju...@cyberagent.co.jp>>
>>> wrote:
>>>
>>>   Hmm, overriding the implementation of getLifecycleState provided by
>>>> AbstractSource could work. It would be going against the convention that
>>>> has been maintained in all other components(that I can think of)
>>>>
>>>>
>>>> On 01/17/2013 01:20 PM, Brock Noland wrote:
>>>>
>>>>   Hi,
>>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet
>>>>> another thread. An alternative which wouldn't require another thread
>>>>> would be to check the optional interface in the supervisor,
>>>>> approximately here:
>>>>>
>>>>> https://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**>
>>>>> core/src/main/java/org/apache/****flume/lifecycle/**
>>>>> LifecycleSupervisor.java#L240<**https://github.com/apache/**
>>>>> flume/blob/trunk/flume-ng-**core/src/main/java/org/apache/**
>>>>> flume/lifecycle/**LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>
>>>>>
>>>>> However, I am not sold on the supervisor being the best place to fix
>>>>> this as I am not sure that other lifecycle components would need this.
>>>>>
>>>>>
>>>>>
>>>>> Brock
>>>>>
>>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
>>>>> <juhani_connolly@cyberagent.****co.jp <http://co.jp> <
>>>>> juhani_connolly@cyberagent.**co.jp <ju...@cyberagent.co.jp>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>>   I came upon an issue with ScribeSource,  though it's theoretically
>>>>>> applicable to any EventDrivenSource whose event generating thread(s)
>>>>>> die.
>>>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port
>>>>>> will
>>>>>> result in it trying to allocate space for some arbitrarily large packet
>>>>>> resulting in an OOMException which kills the thread(incidentally I
>>>>>> thought
>>>>>> this would be an issue in avro too, but it throws an exception before
>>>>>> making
>>>>>> excessive allocation requests).
>>>>>>
>>>>>> As far as flume is concerned, the component is still alive. stop() was
>>>>>> never
>>>>>> called, so even monitoring the component state using jmx will not
>>>>>> notice
>>>>>> anything wrong. This situation occurs from user error, but there is
>>>>>> potential for other errors leaving a zombie component. I think it would
>>>>>> be
>>>>>> more user friendly to be able to recover from such errors.
>>>>>>
>>>>>> I'm thinking of adding a StatusPollable interface that
>>>>>> EventDrivenSources
>>>>>> can optionally implement(because we can't change the interface without
>>>>>> a
>>>>>> version change). If implemented, the EventDrivenSourceRunner would
>>>>>> schedule
>>>>>> a regular poll to check the state. Upon failure it could either call
>>>>>> stop()
>>>>>> to signal it broke. With autoRestartPolicy, the source would then get
>>>>>> restarted by its supervisor.
>>>>>>
>>>>>> Would appreciate any opinions before I put together a patch/post an
>>>>>> issue.
>>>>>>
>>>>>>
>>>>>

Re: EventDrivenSource and dead threads

Posted by Connor Woodson <cw...@gmail.com>.

What I was trying to say is that the ScribeSource should not be responsible
for restarting itself (just from my understanding of your idea; it breaks
the existing paradigm as components should not have to control their own
lifecycle). Going from Brock's link, I feel the most dynamic solution would
be possible to just add in a lifecyle state RESTART and place it in that
switch statement; when that state is reached, it tries to stop then restart
the component and then sets the desired state to START (or STOP if it
couldn't start it again, or set error=true if it couldn't stop it).

And in a way to prevent overriding getLifecycleState to return RESTART,
there could either be an inherited function from AbstractSource to call for
a restart, there could maybe be an option to set it to RESTART on crash, or
something else.

I'll admit though that I know little about the lifecycle system, so I have
no idea if this idea is any better.

- Connor


On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

> Sink, Source and Channel all extend LifecycleAware, so the function is
> available to all components already.
>
> I was more questioning whether it was reasonable to start including logic
> to determine the state. That being said, I think the precedent of just
> returning the state set by start/stop is more one of habit, so on further
> thought I don't see it as being unreasonable. I'm going to give fixing
> ScribeSource with it a poke.
>
> As to new lifecycle states, I took a pass at reworking the lifecycle model
> with guavas service implementation in the past, but it took some very
> significant changes and didn't get the momentum/interest necessary to keep
> working on it. That ticket is here https://issues.apache.org/**
> jira/browse/FLUME-966 <https://issues.apache.org/jira/browse/FLUME-966> .
> Brock might be working on it now though the issue doesn't appear to have
> had attention since october.
>
>
> On 01/17/2013 03:01 PM, Connor Woodson wrote:
>
>> Why limit it to the sources? If there is going to be a change to one
>> component's lifecycle, then I see no reason not to change every
>> component's
>> lifecycle.
>>
>> Sinks and Channels could very well have this problem; so what about giving
>> each LifecycleAware component a takePulse method (or something); or to
>> avoid creating a new method, add a new lifecycle state 'crashed' or such
>> which, when detected, causes a restart of the component. Then components
>> would just need to override getLifecycleState (if this method is polled
>> regularly; I don't know. maybe use a listener for when there's a state
>> change) to detect if it has crashed/needs to be restarted.
>>
>> Just my thoughts,
>>
>> - Connor
>>
>>
>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
>> juhani_connolly@cyberagent.co.**jp <ju...@cyberagent.co.jp>>
>> wrote:
>>
>>  Hmm, overriding the implementation of getLifecycleState provided by
>>> AbstractSource could work. It would be going against the convention that
>>> has been maintained in all other components(that I can think of)
>>>
>>>
>>> On 01/17/2013 01:20 PM, Brock Noland wrote:
>>>
>>>  Hi,
>>>>
>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet
>>>> another thread. An alternative which wouldn't require another thread
>>>> would be to check the optional interface in the supervisor,
>>>> approximately here:
>>>>
>>>> https://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**>
>>>> core/src/main/java/org/apache/****flume/lifecycle/**
>>>> LifecycleSupervisor.java#L240<**https://github.com/apache/**
>>>> flume/blob/trunk/flume-ng-**core/src/main/java/org/apache/**
>>>> flume/lifecycle/**LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>
>>>> >
>>>>
>>>>
>>>> However, I am not sold on the supervisor being the best place to fix
>>>> this as I am not sure that other lifecycle components would need this.
>>>>
>>>>
>>>>
>>>> Brock
>>>>
>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
>>>> <juhani_connolly@cyberagent.****co.jp <http://co.jp> <
>>>> juhani_connolly@cyberagent.**co.jp <ju...@cyberagent.co.jp>>>
>>>>
>>>> wrote:
>>>>
>>>>  I came upon an issue with ScribeSource,  though it's theoretically
>>>>> applicable to any EventDrivenSource whose event generating thread(s)
>>>>> die.
>>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port
>>>>> will
>>>>> result in it trying to allocate space for some arbitrarily large packet
>>>>> resulting in an OOMException which kills the thread(incidentally I
>>>>> thought
>>>>> this would be an issue in avro too, but it throws an exception before
>>>>> making
>>>>> excessive allocation requests).
>>>>>
>>>>> As far as flume is concerned, the component is still alive. stop() was
>>>>> never
>>>>> called, so even monitoring the component state using jmx will not
>>>>> notice
>>>>> anything wrong. This situation occurs from user error, but there is
>>>>> potential for other errors leaving a zombie component. I think it would
>>>>> be
>>>>> more user friendly to be able to recover from such errors.
>>>>>
>>>>> I'm thinking of adding a StatusPollable interface that
>>>>> EventDrivenSources
>>>>> can optionally implement(because we can't change the interface without
>>>>> a
>>>>> version change). If implemented, the EventDrivenSourceRunner would
>>>>> schedule
>>>>> a regular poll to check the state. Upon failure it could either call
>>>>> stop()
>>>>> to signal it broke. With autoRestartPolicy, the source would then get
>>>>> restarted by its supervisor.
>>>>>
>>>>> Would appreciate any opinions before I put together a patch/post an
>>>>> issue.
>>>>>
>>>>>
>>>>
>>>>
>

Re: EventDrivenSource and dead threads

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.

Sink, Source and Channel all extend LifecycleAware, so the function is 
available to all components already.

I was more questioning whether it was reasonable to start including 
logic to determine the state. That being said, I think the precedent of 
just returning the state set by start/stop is more one of habit, so on 
further thought I don't see it as being unreasonable. I'm going to give 
fixing ScribeSource with it a poke.

As to new lifecycle states, I took a pass at reworking the lifecycle 
model with guavas service implementation in the past, but it took some 
very significant changes and didn't get the momentum/interest necessary 
to keep working on it. That ticket is here 
https://issues.apache.org/jira/browse/FLUME-966 . Brock might be working 
on it now though the issue doesn't appear to have had attention since 
october.

On 01/17/2013 03:01 PM, Connor Woodson wrote:
> Why limit it to the sources? If there is going to be a change to one
> component's lifecycle, then I see no reason not to change every component's
> lifecycle.
>
> Sinks and Channels could very well have this problem; so what about giving
> each LifecycleAware component a takePulse method (or something); or to
> avoid creating a new method, add a new lifecycle state 'crashed' or such
> which, when detected, causes a restart of the component. Then components
> would just need to override getLifecycleState (if this method is polled
> regularly; I don't know. maybe use a listener for when there's a state
> change) to detect if it has crashed/needs to be restarted.
>
> Just my thoughts,
>
> - Connor
>
>
> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>> Hmm, overriding the implementation of getLifecycleState provided by
>> AbstractSource could work. It would be going against the convention that
>> has been maintained in all other components(that I can think of)
>>
>>
>> On 01/17/2013 01:20 PM, Brock Noland wrote:
>>
>>> Hi,
>>>
>>> Yes I can definitely see the issue. It sucks that we'd have to add yet
>>> another thread. An alternative which wouldn't require another thread
>>> would be to check the optional interface in the supervisor,
>>> approximately here:
>>>
>>> https://github.com/apache/**flume/blob/trunk/flume-ng-**
>>> core/src/main/java/org/apache/**flume/lifecycle/**
>>> LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>
>>>
>>> However, I am not sold on the supervisor being the best place to fix
>>> this as I am not sure that other lifecycle components would need this.
>>>
>>>
>>>
>>> Brock
>>>
>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
>>> <juhani_connolly@cyberagent.**co.jp <ju...@cyberagent.co.jp>>
>>> wrote:
>>>
>>>> I came upon an issue with ScribeSource,  though it's theoretically
>>>> applicable to any EventDrivenSource whose event generating thread(s) die.
>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port will
>>>> result in it trying to allocate space for some arbitrarily large packet
>>>> resulting in an OOMException which kills the thread(incidentally I
>>>> thought
>>>> this would be an issue in avro too, but it throws an exception before
>>>> making
>>>> excessive allocation requests).
>>>>
>>>> As far as flume is concerned, the component is still alive. stop() was
>>>> never
>>>> called, so even monitoring the component state using jmx will not notice
>>>> anything wrong. This situation occurs from user error, but there is
>>>> potential for other errors leaving a zombie component. I think it would
>>>> be
>>>> more user friendly to be able to recover from such errors.
>>>>
>>>> I'm thinking of adding a StatusPollable interface that EventDrivenSources
>>>> can optionally implement(because we can't change the interface without a
>>>> version change). If implemented, the EventDrivenSourceRunner would
>>>> schedule
>>>> a regular poll to check the state. Upon failure it could either call
>>>> stop()
>>>> to signal it broke. With autoRestartPolicy, the source would then get
>>>> restarted by its supervisor.
>>>>
>>>> Would appreciate any opinions before I put together a patch/post an
>>>> issue.
>>>>
>>>
>>>

Re: EventDrivenSource and dead threads

Posted by Connor Woodson <cw...@gmail.com>.

Why limit it to the sources? If there is going to be a change to one
component's lifecycle, then I see no reason not to change every component's
lifecycle.

Sinks and Channels could very well have this problem; so what about giving
each LifecycleAware component a takePulse method (or something); or to
avoid creating a new method, add a new lifecycle state 'crashed' or such
which, when detected, causes a restart of the component. Then components
would just need to override getLifecycleState (if this method is polled
regularly; I don't know. maybe use a listener for when there's a state
change) to detect if it has crashed/needs to be restarted.

Just my thoughts,

- Connor


On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

> Hmm, overriding the implementation of getLifecycleState provided by
> AbstractSource could work. It would be going against the convention that
> has been maintained in all other components(that I can think of)
>
>
> On 01/17/2013 01:20 PM, Brock Noland wrote:
>
>> Hi,
>>
>> Yes I can definitely see the issue. It sucks that we'd have to add yet
>> another thread. An alternative which wouldn't require another thread
>> would be to check the optional interface in the supervisor,
>> approximately here:
>>
>> https://github.com/apache/**flume/blob/trunk/flume-ng-**
>> core/src/main/java/org/apache/**flume/lifecycle/**
>> LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>
>>
>> However, I am not sold on the supervisor being the best place to fix
>> this as I am not sure that other lifecycle components would need this.
>>
>>
>>
>> Brock
>>
>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
>> <juhani_connolly@cyberagent.**co.jp <ju...@cyberagent.co.jp>>
>> wrote:
>>
>>> I came upon an issue with ScribeSource,  though it's theoretically
>>> applicable to any EventDrivenSource whose event generating thread(s) die.
>>> Simple put, sending a bad packet to the thrift(scribe protocol) port will
>>> result in it trying to allocate space for some arbitrarily large packet
>>> resulting in an OOMException which kills the thread(incidentally I
>>> thought
>>> this would be an issue in avro too, but it throws an exception before
>>> making
>>> excessive allocation requests).
>>>
>>> As far as flume is concerned, the component is still alive. stop() was
>>> never
>>> called, so even monitoring the component state using jmx will not notice
>>> anything wrong. This situation occurs from user error, but there is
>>> potential for other errors leaving a zombie component. I think it would
>>> be
>>> more user friendly to be able to recover from such errors.
>>>
>>> I'm thinking of adding a StatusPollable interface that EventDrivenSources
>>> can optionally implement(because we can't change the interface without a
>>> version change). If implemented, the EventDrivenSourceRunner would
>>> schedule
>>> a regular poll to check the state. Upon failure it could either call
>>> stop()
>>> to signal it broke. With autoRestartPolicy, the source would then get
>>> restarted by its supervisor.
>>>
>>> Would appreciate any opinions before I put together a patch/post an
>>> issue.
>>>
>>
>>
>>
>

Re: EventDrivenSource and dead threads

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.

Hmm, overriding the implementation of getLifecycleState provided by 
AbstractSource could work. It would be going against the convention that 
has been maintained in all other components(that I can think of)

On 01/17/2013 01:20 PM, Brock Noland wrote:
> Hi,
>
> Yes I can definitely see the issue. It sucks that we'd have to add yet
> another thread. An alternative which wouldn't require another thread
> would be to check the optional interface in the supervisor,
> approximately here:
>
> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240
>
> However, I am not sold on the supervisor being the best place to fix
> this as I am not sure that other lifecycle components would need this.
>
>
>
> Brock
>
> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
> <ju...@cyberagent.co.jp> wrote:
>> I came upon an issue with ScribeSource,  though it's theoretically
>> applicable to any EventDrivenSource whose event generating thread(s) die.
>> Simple put, sending a bad packet to the thrift(scribe protocol) port will
>> result in it trying to allocate space for some arbitrarily large packet
>> resulting in an OOMException which kills the thread(incidentally I thought
>> this would be an issue in avro too, but it throws an exception before making
>> excessive allocation requests).
>>
>> As far as flume is concerned, the component is still alive. stop() was never
>> called, so even monitoring the component state using jmx will not notice
>> anything wrong. This situation occurs from user error, but there is
>> potential for other errors leaving a zombie component. I think it would be
>> more user friendly to be able to recover from such errors.
>>
>> I'm thinking of adding a StatusPollable interface that EventDrivenSources
>> can optionally implement(because we can't change the interface without a
>> version change). If implemented, the EventDrivenSourceRunner would schedule
>> a regular poll to check the state. Upon failure it could either call stop()
>> to signal it broke. With autoRestartPolicy, the source would then get
>> restarted by its supervisor.
>>
>> Would appreciate any opinions before I put together a patch/post an issue.
>
>

Re: EventDrivenSource and dead threads

Posted by Brock Noland <br...@cloudera.com>.

Hi,

Yes I can definitely see the issue. It sucks that we'd have to add yet
another thread. An alternative which wouldn't require another thread
would be to check the optional interface in the supervisor,
approximately here:

https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240

However, I am not sold on the supervisor being the best place to fix
this as I am not sure that other lifecycle components would need this.



Brock

On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
<ju...@cyberagent.co.jp> wrote:
> I came upon an issue with ScribeSource,  though it's theoretically
> applicable to any EventDrivenSource whose event generating thread(s) die.
> Simple put, sending a bad packet to the thrift(scribe protocol) port will
> result in it trying to allocate space for some arbitrarily large packet
> resulting in an OOMException which kills the thread(incidentally I thought
> this would be an issue in avro too, but it throws an exception before making
> excessive allocation requests).
>
> As far as flume is concerned, the component is still alive. stop() was never
> called, so even monitoring the component state using jmx will not notice
> anything wrong. This situation occurs from user error, but there is
> potential for other errors leaving a zombie component. I think it would be
> more user friendly to be able to recover from such errors.
>
> I'm thinking of adding a StatusPollable interface that EventDrivenSources
> can optionally implement(because we can't change the interface without a
> version change). If implemented, the EventDrivenSourceRunner would schedule
> a regular poll to check the state. Upon failure it could either call stop()
> to signal it broke. With autoRestartPolicy, the source would then get
> restarted by its supervisor.
>
> Would appreciate any opinions before I put together a patch/post an issue.



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/