Posted to dev@flume.apache.org by Brock Noland <br...@cloudera.com> on 2012/02/26 18:44:17 UTC

Source/Channel/Sink Behavior Standardization

Hello,

This might be something for the developer guide or it might be
somewhere and I just missed it.  I feel like we should set down some
expectations in regards to:

1) Source behavior when:
  a) Channel put fails
  b) Source started but is unable to obtain new events for some reason
2) Channel behavior when:
  a) Channel capacity exceeded
  b) take when channel is empty
3) Sink behavior when:
  a) Channel take returns null
  b) Sink cannot write to the downstream location

This comes about when I noticed some inconsistencies.  For example, a
take in MemoryChannel blocks for a few seconds by default and
JDBCChannel does not (FLUME-998). Combined with HDFSEvent sink, this
causes tremendous amounts of CPU consumption. Also, currently if HDFS
is unavailable for a period, flume needs to be restarted (FLUME-985).
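
To make the CPU issue concrete, here is a minimal sketch (not the actual
HDFSEventSink, just an illustration against the public Sink/Channel/Transaction
API) of the behavior that avoids the spin: commit the empty transaction and
return BACKOFF when take() yields null, so the runner can sleep instead of
polling in a tight loop.

  import org.apache.flume.Channel;
  import org.apache.flume.Event;
  import org.apache.flume.EventDeliveryException;
  import org.apache.flume.Transaction;
  import org.apache.flume.sink.AbstractSink;

  public class BackoffOnEmptySink extends AbstractSink {
    @Override
    public Status process() throws EventDeliveryException {
      Channel channel = getChannel();
      Transaction tx = channel.getTransaction();
      tx.begin();
      try {
        Event event = channel.take();
        if (event == null) {
          tx.commit();
          return Status.BACKOFF;   // nothing queued: let the runner sleep
        }
        // ... deliver the event downstream here ...
        tx.commit();
        return Status.READY;
      } catch (Exception e) {
        tx.rollback();
        throw new EventDeliveryException("Failed to deliver event", e);
      } finally {
        tx.close();
      }
    }
  }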

My general thoughts are based on experience working with JMS-based services.

1) Source/Channel/Sink should not require a restart when up or down
stream services are restarted or become temporarily unavailable.
2) Channel capacity being exceeded should not lead to sources dying
and thus requiring a flume restart. This will happen when downstream
destinations slow down for various reasons.

Brock

-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: Source/Channel/Sink Behavior Standardization

Posted by Eric Sammer <es...@cloudera.com>.
I haven't gotten a chance to fully digest this thread but here's the cheat
sheet:

* It's tempting to make channel ops blocking and event-driven upon event
arrival. This creates a nasty error handling / timed event competition[1]
case that OG choked on constantly. Until there's an obvious solution (I
have some ideas) I'd prefer to leave it.
* Even if we did this, you'd potentially destroy performance (make it
worse) because you'd fire immediately upon event arrival, which can reduce
batch sizes; RPC accounts for significantly more overhead on a steady
stream of events than sleeping while events accumulate and polling again in
a second (or 500ms, whatever it is). We're aiming for global throughput,
not immediate latency compensation at the micro-level (although I'm happy
to revisit that; people will *not* be happy with the performance, though -
I did a bunch of early testing here). See the batching sketch after this
list.
* Standardization is good.
* JDBCChannel can block (due to transaction semantics; those block
potentially indefinitely).
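
A minimal sketch of that batching point (a hypothetical helper, not Flume's
actual sink code): drain up to a batch size inside one transaction, so a
single write/RPC covers many events instead of firing per event.

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.flume.Channel;
  import org.apache.flume.Event;
  import org.apache.flume.Transaction;

  public class BatchTakeSketch {
    // Takes up to batchSize events in a single transaction and returns them.
    public List<Event> takeBatch(Channel channel, int batchSize) {
      List<Event> batch = new ArrayList<Event>(batchSize);
      Transaction tx = channel.getTransaction();
      tx.begin();
      try {
        for (int i = 0; i < batchSize; i++) {
          Event e = channel.take();
          if (e == null) {
            break;               // channel drained for now; ship what we have
          }
          batch.add(e);
        }
        // ... a single downstream write/RPC for the whole batch would go here ...
        tx.commit();
        return batch;
      } catch (RuntimeException ex) {
        tx.rollback();
        throw ex;
      } finally {
        tx.close();
      }
    }
  }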

[1] The "timed event competition" problem is, for instance, situations such
as rolling file handles while blocking on events. In OG, the HDFS sink
would block on event arrival but it desperately wanted to rotate the file
handle every N seconds. If there was a period of N+1 seconds with no
events, the code descended into thread interruption voodoo that never
worked right. Now, one can deal with this by using intricate bit flipping
(i.e. keeping state about the number of events received since last poll and
skipping rotation if nothing has happened) but this tends to kick the can
down the road on errors. For instance, if a DN failed and the connection to
HDFS becomes invalid, the fact that you didn't rotate means you didn't
catch the error when it happened (or close to it) and instead, catch it N+M
seconds later when you do receive an event and immediately catch an
exception on write, blah blah blah. It's all very yucky.

I promise to give this more attention. This is just because I skimmed and
saw "why not make it blocking" and I panicked. :)

On Mon, Feb 27, 2012 at 6:51 PM, Arvind Prabhakar <ar...@cloudera.com> wrote:

> On Mon, Feb 27, 2012 at 6:33 PM, Brock Noland <br...@cloudera.com> wrote:
>
> > Hi,
> >
> > Thank you everyone for your comments! My comments are inline.
> >
> > On Tue, Feb 28, 2012 at 1:13 AM, Arvind Prabhakar <ar...@apache.org>
> > wrote:
> > > My bad for not chiming in on this thread soon enough. When we laid out
> > > the initial architecture, the following assumptions were made and I
> > > still think that most of them are valid:
> > >
> > > 1. Sources doing put() on channel should relay back any exceptions
> > > they receive from the channel. They should not die or become invalid
> > > due to this. If they do, it is more of a bug in source implementation.
> > >
> > > 2. Channels must respect capacity. This is vital for operators to
> > > ensure that they can size a system without overwhelming it. Both mem
> > > and jdbc channels support size specification at this time.
> >
> > yes with a recent commit it looks like MemoryChannel was changed to
> > work like JDBCChannel and throws an exception if full.
> >
> > >
> > > 3. Channels should never block. This is to ensure that there is no
> > > scope of threads deadlocking within the agent due to bugs or invalid
> > > state of the system. The chosen alternative to blocking was the notion
> > > of the sink runner which will honor backoff strategy when necessary.
> > > Consequently the implementation of sink should send the correct signal
> > > to the runner in case it is not able to take events from the channel
> > > or deliver events to the downstream destination.
> >
> > MemoryChannel.take blocks:
> >
> >      if(!queueStored.tryAcquire(keepAlive, TimeUnit.SECONDS)) {
> >
>
> This is perhaps needed to fine tune the memory channel throughput in a
> highly concurrent system. For example, if multiple threads are
> reading/writing to the channel, it is possible that the channel may not get
> a handle on the necessary locks in every request - hence a basic timeout
> mechanism would help in ensuring reliability of the channel. I do think
> though that the default of 3 seconds is a bit longish and should instead be
> more like 100ms or something.
>
>
> >
> > JDBCChannel does not. Meaning that JDBCChannel + HDFSEventSink
> > consumes an entire CPU when no events are flowing through the system.
> > It sounds like HDFSEventSink should be returning BACKOFF when the
> > channel returns null?
> >
>
> Correct. The idea is that any logic common to flow
> handling/throttling should be implemented by the runner and not by the
> individual sink or channel. That will likely not be as sophisticated as
> what one can do within a specialized implementation, but certainly serves
> the purpose and keeps things uniform and simple.
>
>
> >
> > My main concern with this strategy is that SinkRunner is going to
> > sleep for some number of seconds regardless of events being pushed to
> > the channel. Users that have very spiky flows have two options:
> >
> > 1) Turn the sleep down considerably which burns unneeded CPU
> > 2) Size the channel large enough to handle the largest spike during
> > the specified sleep time
> >
> > Shouldn't sink runner be event driven and start processing events as
> > soon as they arrive on the channel?
> >
>
> This does make a lot of sense. However, one of the immediate trade offs of
> the flume architecture was to specifically sacrifice blocking behavior in
> favor of implementing simple semantics. It is very possible that as flume
> gets adopted in the field we realize that this is more of a pressing need
> than what we expected it to be, in which case it will surely get prioritized.
>
> Thanks,
> Arvind
>
>
> >
> > Cheers!
> > Brock
> >
> > >
> > > At some point in time, when we have the basic implementation of Flume
> > > working in production to validate all of these semantics, we can start
> > > discussion on how best these semantics can change to accommodate any
> > > new findings that we discover in the field.
> > >
> > > Thanks,
> > > Arvind Prabhakar
> > >
> > > On Mon, Feb 27, 2012 at 11:30 AM, Prasad Mujumdar <
> prasadm@cloudera.com>
> > wrote:
> > >>  IMO the blocking vs wait time should be an attribute of the flow and
> > not
> > >> individual component. Perhaps each source/sink/channel should make it
> > >> configurable (with consistent default) so that it can be tweaked
> per
> > the
> > >> use case. The common attributes like timeout, capacity can be standard
> > >> configurations that each component should support wherever possible.
> > >>
> > >> @Brock, I will try to include the relevant conclusions of this
> > discussion
> > >> in the dev guide.
> > >>
> > >> thanks
> > >> Prasad
> > >>
> > >>
> > >> On Mon, Feb 27, 2012 at 7:35 AM, Peter Newcomb <peter@walmartlabs.com
> > >wrote:
> > >>
> > >>> Juhani, FWIW I agree with most of what you described, based on my
> > reading
> > >>> and use of the codebase.  Brock, I agree that these things are not
> yet
> > >>> adequately documented--especially in terms of Javadocs for the main
> > >>> interfaces: Source, Channel, and Sink.  Also, there is enough
> variation
> > >>> among the various implementations of these interfaces to lead to
> > ambiguous
> > >>> interpretation.
> > >>>
> > >>> One thing I wanted to comment on specifically is Juhani's statement
> > about
> > >>> channel capacity:
> > >>>
> > >>> > Channels:
> > >>> > - Only memory channels have a capacity, but when that is exceeded
> > >>> > ChannelException seems a clearcut reaction
> > >>>
> > >>> Before your recent refactoring of MemoryChannel, put() would block
> > >>> indefinitely if the queue was at capacity--are you suggesting that
> > this was
> > >>> incorrect behavior that should not be allowed?  Or just that any such
> > >>> blocking should have a finite duration (similar to take()
> keep-alive),
> > and
> > >>> throw ChannelException upon timeout?
> > >>>
> > >>> Also, other channels may well have implicit capacities, for instance
> > >>> available space in a database or filesystem partition, though I agree
> > that
> > >>> ChannelException would be appropriate in those cases.
> > >>>
> > >>> -peter
> > >>>
> >
> >
> >
> > --
> > Apache MRUnit - Unit testing MapReduce -
> > http://incubator.apache.org/mrunit/
> >
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com

Re: Source/Channel/Sink Behavior Standardization

Posted by Arvind Prabhakar <ar...@cloudera.com>.
On Mon, Feb 27, 2012 at 6:33 PM, Brock Noland <br...@cloudera.com> wrote:

> Hi,
>
> Thank you everyone for your comments! My comments are inline.
>
> On Tue, Feb 28, 2012 at 1:13 AM, Arvind Prabhakar <ar...@apache.org>
> wrote:
> > My bad for not chiming in on this thread soon enough. When we laid out
> > the initial architecture, the following assumptions were made and I
> > still think that most of them are valid:
> >
> > 1. Sources doing put() on channel should relay back any exceptions
> > they receive from the channel. They should not die or become invalid
> > due to this. If they do, it is more of a bug in source implementation.
> >
> > 2. Channels must respect capacity. This is vital for operators to
> > ensure that they can size a system without overwhelming it. Both mem
> > and jdbc channels support size specification at this time.
>
> yes with a recent commit it looks like MemoryChannel was changed to
> work like JDBCChannel and throws an exception if full.
>
> >
> > 3. Channels should never block. This is to ensure that there is no
> > scope of threads deadlocking within the agent due to bugs or invalid
> > state of the system. The chosen alternative to blocking was the notion
> > of the sink runner which will honor backoff strategy when necessary.
> > Consequently the implementation of sink should send the correct signal
> > to the runner in case it is not able to take events from the channel
> > or deliver events to the downstream destination.
>
> MemoryChannel.take blocks:
>
>      if(!queueStored.tryAcquire(keepAlive, TimeUnit.SECONDS)) {
>

This is perhaps needed to fine tune the memory channel throughput in a
highly concurrent system. For example, if multiple threads are
reading/writing to the channel, it is possible that the channel may not get
a handle on the necessary locks in every request - hence a basic timeout
mechanism would help in ensuring reliability of the channel. I do think
though that the default of 3 seconds is a bit longish and should instead be
more like 100ms or something.
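
A rough sketch of what such a bounded wait inside take() looks like, assuming
(hypothetically) a semaphore tracks available events and a configurable
keep-alive bounds the wait; this is not the real MemoryChannel source, just
the shape of the mechanism:

  import java.util.Queue;
  import java.util.concurrent.ConcurrentLinkedQueue;
  import java.util.concurrent.Semaphore;
  import java.util.concurrent.TimeUnit;
  import org.apache.flume.Event;

  public class BoundedTakeSketch {
    private final Semaphore queueStored = new Semaphore(0);    // events available
    private final Queue<Event> queue = new ConcurrentLinkedQueue<Event>();
    private final long keepAliveMillis = 100;                  // e.g. 100ms instead of 3s

    public Event take() throws InterruptedException {
      // Wait at most keepAliveMillis for an event, then give up and return null
      // so the caller can back off instead of blocking indefinitely.
      if (!queueStored.tryAcquire(keepAliveMillis, TimeUnit.MILLISECONDS)) {
        return null;
      }
      return queue.poll();
    }

    public void put(Event event) {
      queue.offer(event);
      queueStored.release();     // signal that one more event is available
    }
  }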


>
> JDBCChannel does not. Meaning that JDBCChannel + HDFSEventSink
> consumes an entire CPU when no events are flowing through the system.
> It sounds like HDFSEventSink should be returning BACKOFF when the
> channel returns null?
>

Correct. The idea is that any logic common to flow
handling/throttling should be implemented by the runner and not by the
individual sink or channel. That will likely not be as sophisticated as
what one can do within a specialized implementation, but certainly serves
the purpose and keeps things uniform and simple.


>
> My main concern with this strategy is that SinkRunner is going to
> sleep for some number of seconds regardless of events being pushed to
> the channel. Users that have very spiky flows have two options:
>
> 1) Turn the sleep down considerably which burns unneeded CPU
> 2) Size the channel large enough to handle the largest spike during
> the specified sleep time
>
> Shouldn't sink runner be event driven and start processing events as
> soon as they arrive on the channel?
>

This does make a lot of sense. However, one of the immediate trade offs of
the flume architecture was to specifically sacrifice blocking behavior in
favor of implementing simple semantics. It is very possible that as flume
gets adopted in the field we realize that this is more of a pressing need
than what we expected it to be, in which case it will surely get prioritized.

Thanks,
Arvind


>
> Cheers!
> Brock
>
> >
> > At some point in time, when we have the basic implementation of Flume
> > working in production to validate all of these semantics, we can start
> > discussion on how best these semantics can change to accommodate any
> > new findings that we discover in the field.
> >
> > Thanks,
> > Arvind Prabhakar
> >
> > On Mon, Feb 27, 2012 at 11:30 AM, Prasad Mujumdar <pr...@cloudera.com>
> wrote:
> >>  IMO the blocking vs wait time should be an attribute of the flow and
> not
> >> individual component. Perhaps each source/sink/channel should make it
> >> configurable (with consistent default) so that it can be tweaked per
> the
> >> use case. The common attributes like timeout, capacity can be standard
> >> configurations that each component should support wherever possible.
> >>
> >> @Brock, I will try to include the relevant conclusions of this
> discussion
> >> in the dev guide.
> >>
> >> thanks
> >> Prasad
> >>
> >>
> >> On Mon, Feb 27, 2012 at 7:35 AM, Peter Newcomb <peter@walmartlabs.com
> >wrote:
> >>
> >>> Juhani, FWIW I agree with most of what you described, based on my
> reading
> >>> and use of the codebase.  Brock, I agree that these things are not yet
> >>> adequately documented--especially in terms of Javadocs for the main
> >>> interfaces: Source, Channel, and Sink.  Also, there is enough variation
> >>> among the various implementations of these interfaces to lead to
> ambiguous
> >>> interpretation.
> >>>
> >>> One thing I wanted to comment on specifically is Juhani's statement
> about
> >>> channel capacity:
> >>>
> >>> > Channels:
> >>> > - Only memory channels have a capacity, but when that is exceeded
> >>> > ChannelException seems a clearcut reaction
> >>>
> >>> Before your recent refactoring of MemoryChannel, put() would block
> >>> indefinitely if the queue was at capacity--are you suggesting that
> this was
> >>> incorrect behavior that should not be allowed?  Or just that any such
> >>> blocking should have a finite duration (similar to take() keep-alive),
> and
> >>> throw ChannelException upon timeout?
> >>>
> >>> Also, other channels may well have implicit capacities, for instance
> >>> available space in a database or filesystem partition, though I agree
> that
> >>> ChannelException would be appropriate in those cases.
> >>>
> >>> -peter
> >>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce -
> http://incubator.apache.org/mrunit/
>

Re: Source/Channel/Sink Behavior Standardization

Posted by Brock Noland <br...@cloudera.com>.
Hi,

Thank you everyone for your comments! My comments are inline.

On Tue, Feb 28, 2012 at 1:13 AM, Arvind Prabhakar <ar...@apache.org> wrote:
> My bad for not chiming in on this thread soon enough. When we laid out
> the initial architecture, the following assumptions were made and I
> still think that most of them are valid:
>
> 1. Sources doing put() on channel should relay back any exceptions
> they receive from the channel. They should not die or become invalid
> due to this. If they do, it is more of a bug in source implementation.
>
> 2. Channels must respect capacity. This is vital for operators to
> ensure that they can size a system without overwhelming it. Both mem
> and jdbc channels support size specification at this time.

yes with a recent commit it looks like MemoryChannel was changed to
work like JDBCChannel and throws an exception if full.

>
> 3. Channels should never block. This is to ensure that there is no
> scope of threads deadlocking within the agent due to bugs or invalid
> state of the system. The chosen alternative to blocking was the notion
> of the sink runner which will honor backoff strategy when necessary.
> Consequently the implementation of sink should send the correct signal
> to the runner in case it is not able to take events from the channel
> or deliver events to the downstream destination.

MemoryChannel.take blocks:

      if(!queueStored.tryAcquire(keepAlive, TimeUnit.SECONDS)) {

JDBCChannel does not. Meaning that JDBCChannel + HDFSEventSink
consumes an entire CPU when no events are flowing through the system.
It sounds like HDFSEventSink should be returning BACKOFF when the
channel returns null?

My main concern with this strategy is that SinkRunner is going to
sleep for some number of seconds regardless of events being pushed to
the channel. Users that have very spiky flows have two options:

1) Turn the sleep down considerably which burns unneeded CPU
2) Size the channel large enough to handle the largest spike during
the specified sleep time

Shouldn't sink runner be event driven and start processing events as
soon as they arrive on the channel?
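
For context, the polling trade-off looks roughly like this (a sketch, not the
actual SinkRunner; backoffMillis is a made-up knob): the loop sleeps a fixed
period whenever the sink reports BACKOFF, regardless of when new events land
in the channel.

  import org.apache.flume.EventDeliveryException;
  import org.apache.flume.Sink;

  public class PollingRunnerSketch implements Runnable {
    private final Sink sink;
    private final long backoffMillis;        // hypothetical fixed sleep between polls
    private volatile boolean running = true;

    public PollingRunnerSketch(Sink sink, long backoffMillis) {
      this.sink = sink;
      this.backoffMillis = backoffMillis;
    }

    public void run() {
      while (running) {
        try {
          if (sink.process() == Sink.Status.BACKOFF) {
            // A spike arriving right after this point waits out the full sleep.
            Thread.sleep(backoffMillis);
          }
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          running = false;
        } catch (EventDeliveryException ede) {
          // Delivery failed; back off before retrying rather than spinning.
          try {
            Thread.sleep(backoffMillis);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            running = false;
          }
        }
      }
    }

    public void stop() {
      running = false;
    }
  }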

Cheers!
Brock

>
> At some point in time, when we have the basic implementation of Flume
> working in production to validate all of these semantics, we can start
> discussion on how best these semantics can change to accommodate any
> new findings that we discover in the field.
>
> Thanks,
> Arvind Prabhakar
>
> On Mon, Feb 27, 2012 at 11:30 AM, Prasad Mujumdar <pr...@cloudera.com> wrote:
>>  IMO the blocking vs wait time should be an attribute of the flow and not
>> individual component. Perhaps each source/sink/channel should make it
>> configurable (with consistent default) so that it can be tweaked per the
>> use case. The common attributes like timeout, capacity can be standard
>> configurations that each component should support wherever possible.
>>
>> @Brock, I will try to include the relevant conclusions of this discussion
>> in the dev guide.
>>
>> thanks
>> Prasad
>>
>>
>> On Mon, Feb 27, 2012 at 7:35 AM, Peter Newcomb <pe...@walmartlabs.com> wrote:
>>
>>> Juhani, FWIW I agree with most of what you described, based on my reading
>>> and use of the codebase.  Brock, I agree that these things are not yet
>>> adequately documented--especially in terms of Javadocs for the main
>>> interfaces: Source, Channel, and Sink.  Also, there is enough variation
>>> among the various implementations of these interfaces to lead to ambiguous
>>> interpretation.
>>>
>>> One thing I wanted to comment on specifically is Juhani's statement about
>>> channel capacity:
>>>
>>> > Channels:
>>> > - Only memory channels have a capacity, but when that is exceeded
>>> > ChannelException seems a clearcut reaction
>>>
>>> Before your recent refactoring of MemoryChannel, put() would block
>>> indefinitely if the queue was at capacity--are you suggesting that this was
>>> incorrect behavior that should not be allowed?  Or just that any such
>>> blocking should have a finite duration (similar to take() keep-alive), and
>>> throw ChannelException upon timeout?
>>>
>>> Also, other channels may well have implicit capacities, for instance
>>> available space in a database or filesystem partition, though I agree that
>>> ChannelException would be appropriate in those cases.
>>>
>>> -peter
>>>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: Source/Channel/Sink Behavior Standardization

Posted by Arvind Prabhakar <ar...@apache.org>.
My bad for not chiming in on this thread soon enough. When we laid out
the initial architecture, the following assumptions were made and I
still think that most of them are valid:

1. Sources doing put() on channel should relay back any exceptions
they receive from the channel. They should not die or become invalid
due to this. If they do, it is more of a bug in source implementation.

2. Channels must respect capacity. This is vital for operators to
ensure that they can size a system without overwhelming it. Both mem
and jdbc channels support size specification at this time.

3. Channels should never block. This is to ensure that there is no
scope of threads deadlocking within the agent due to bugs or invalid
state of the system. The chosen alternative to blocking was the notion
of the sink runner which will honor backoff strategy when necessary.
Consequently the implementation of sink should send the correct signal
to the runner in case it is not able to take events from the channel
or deliver events to the downstream destination.
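
For point 1 above, a sketch of the put path a source might use (illustrative
only, not a prescribed implementation): wrap put() in a transaction, roll back
on ChannelException, and rethrow so the upstream caller can retry; the source
itself neither dies nor silently drops the event.

  import org.apache.flume.Channel;
  import org.apache.flume.ChannelException;
  import org.apache.flume.Event;
  import org.apache.flume.Transaction;

  public class RelayingPutSketch {
    public void deliver(Channel channel, Event event) {
      Transaction tx = channel.getTransaction();
      tx.begin();
      try {
        channel.put(event);   // may throw ChannelException, e.g. capacity exceeded
        tx.commit();
      } catch (ChannelException ce) {
        tx.rollback();
        throw ce;             // relay the failure upstream instead of dying
      } finally {
        tx.close();
      }
    }
  }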

At some point in time, when we have the basic implementation of Flume
working in production to validate all of these semantics, we can start
discussion on how best these semantics can change to accommodate any
new findings that we discover in the field.

Thanks,
Arvind Prabhakar

On Mon, Feb 27, 2012 at 11:30 AM, Prasad Mujumdar <pr...@cloudera.com> wrote:
>  IMO the blocking vs wait time should be an attribute of the flow and not
> individual component. Perhaps each source/sink/channel should make it
> configurable (with consistent default) so that it can be tweaked per the
> use case. The common attributes like timeout, capacity can be standard
> configurations that each component should support wherever possible.
>
> @Brock, I will try to include the relevant conclusions of this discussion
> in the dev guide.
>
> thanks
> Prasad
>
>
> On Mon, Feb 27, 2012 at 7:35 AM, Peter Newcomb <pe...@walmartlabs.com> wrote:
>
>> Juhani, FWIW I agree with most of what you described, based on my reading
>> and use of the codebase.  Brock, I agree that these things are not yet
>> adequately documented--especially in terms of Javadocs for the main
>> interfaces: Source, Channel, and Sink.  Also, there is enough variation
>> among the various implementations of these interfaces to lead to ambiguous
>> interpretation.
>>
>> One thing I wanted to comment on specifically is Juhani's statement about
>> channel capacity:
>>
>> > Channels:
>> > - Only memory channels have a capacity, but when that is exceeded
>> > ChannelException seems a clearcut reaction
>>
>> Before your recent refactoring of MemoryChannel, put() would block
>> indefinitely if the queue was at capacity--are you suggesting that this was
>> incorrect behavior that should not be allowed?  Or just that any such
>> blocking should have a finite duration (similar to take() keep-alive), and
>> throw ChannelException upon timeout?
>>
>> Also, other channels may well have implicit capacities, for instance
>> available space in a database or filesystem partition, though I agree that
>> ChannelException would be appropriate in those cases.
>>
>> -peter
>>

Re: Source/Channel/Sink Behavior Standardization

Posted by Prasad Mujumdar <pr...@cloudera.com>.
  IMO the blocking vs wait time should be an attribute of the flow and not
of an individual component. Perhaps each source/sink/channel should make it
configurable (with a consistent default) so that it can be tweaked per the
use case. Common attributes like timeout and capacity can be standard
configurations that each component should support wherever possible.
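
A sketch of what per-component configurability could look like using Flume's
Configurable/Context hooks; the property names "timeout" and "capacity" and
their defaults here are illustrative, not established keys:

  import org.apache.flume.Context;
  import org.apache.flume.conf.Configurable;

  public class ConfigurableTimeoutSketch implements Configurable {
    private long timeoutMillis;
    private int capacity;

    public void configure(Context context) {
      // Fall back to consistent defaults when the flow does not override them.
      timeoutMillis = context.getLong("timeout", 100L);
      capacity = context.getInteger("capacity", 100);
    }
  }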

@Brock, I will try to include the relevant conclusions of this discussion
in the dev guide.

thanks
Prasad


On Mon, Feb 27, 2012 at 7:35 AM, Peter Newcomb <pe...@walmartlabs.com> wrote:

> Juhani, FWIW I agree with most of what you described, based on my reading
> and use of the codebase.  Brock, I agree that these things are not yet
> adequately documented--especially in terms of Javadocs for the main
> interfaces: Source, Channel, and Sink.  Also, there is enough variation
> among the various implementations of these interfaces to lead to ambiguous
> interpretation.
>
> One thing I wanted to comment on specifically is Juhani's statement about
> channel capacity:
>
> > Channels:
> > - Only memory channels have a capacity, but when that is exceeded
> > ChannelException seems a clearcut reaction
>
> Before your recent refactoring of MemoryChannel, put() would block
> indefinitely if the queue was at capacity--are you suggesting that this was
> incorrect behavior that should not be allowed?  Or just that any such
> blocking should have a finite duration (similar to take() keep-alive), and
> throw ChannelException upon timeout?
>
> Also, other channels may well have implicit capacities, for instance
> available space in a database or filesystem partition, though I agree that
> ChannelException would be appropriate in those cases.
>
> -peter
>

RE: Source/Channel/Sink Behavior Standardization

Posted by Peter Newcomb <pe...@walmartlabs.com>.
Juhani, FWIW I agree with most of what you described, based on my reading and use of the codebase.  Brock, I agree that these things are not yet adequately documented--especially in terms of Javadocs for the main interfaces: Source, Channel, and Sink.  Also, there is enough variation among the various implementations of these interfaces to lead to ambiguous interpretation.

One thing I wanted to comment on specifically is Juhani's statement about channel capacity:

> Channels:
> - Only memory channels have a capacity, but when that is exceeded 
> ChannelException seems a clearcut reaction

Before your recent refactoring of MemoryChannel, put() would block indefinitely if the queue was at capacity--are you suggesting that this was incorrect behavior that should not be allowed?  Or just that any such blocking should have a finite duration (similar to take() keep-alive), and throw ChannelException upon timeout?
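
One way the second option could look: put() waits a bounded time for free
capacity and throws ChannelException on timeout rather than blocking
indefinitely. Purely a sketch of the semantics, not MemoryChannel's actual
implementation.

  import java.util.Queue;
  import java.util.concurrent.ConcurrentLinkedQueue;
  import java.util.concurrent.Semaphore;
  import java.util.concurrent.TimeUnit;
  import org.apache.flume.ChannelException;
  import org.apache.flume.Event;

  public class BoundedPutSketch {
    private final Semaphore freeSlots;                         // remaining capacity
    private final Queue<Event> queue = new ConcurrentLinkedQueue<Event>();
    private final long keepAliveMillis;

    public BoundedPutSketch(int capacity, long keepAliveMillis) {
      this.freeSlots = new Semaphore(capacity);
      this.keepAliveMillis = keepAliveMillis;
    }

    public void put(Event event) {
      try {
        // Wait up to keepAliveMillis for a free slot, then fail fast.
        if (!freeSlots.tryAcquire(keepAliveMillis, TimeUnit.MILLISECONDS)) {
          throw new ChannelException("Channel at capacity, put timed out");
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new ChannelException("Interrupted while waiting for capacity", ie);
      }
      queue.offer(event);
    }
  }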

Also, other channels may well have implicit capacities, for instance available space in a database or filesystem partition, though I agree that ChannelException would be appropriate in those cases.

-peter

Re: Source/Channel/Sink Behavior Standardization

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
Hi,

On 02/27/2012 02:44 AM, Brock Noland wrote:
> Hello,
>
> This might be something for the developer guide or it might be
> somewhere and I just missed it.  I feel like we should set down some
> expectations in regards to:
>
> 1) Source behavior when:
>    a) Channel put fails
>    b) Source started but is unable to obtain new events for some reason
> 2) Channel behavior when:
>    a) Channel capacity exceeded
>    b) take when channel is empty
> 3) Sink behavior when:
>    a) Channel take returns null
>    b) Sink cannot write to the downstream location
Totally agree. There is little consistency in implementations right
now, and part of the problem is that some of the interfaces aren't
documented. We should probably have a JIRA to document the sink and source
interfaces, including failure patterns.

My take on your issues:

Sources:
- Channel put failure is pretty clearcut: failure should be returned,
and the previous agent should roll back the transaction
- Inability to obtain events should probably be logged at a high level

Channels:
- Only memory channels have a capacity, but when that is exceeded 
ChannelException seems a clearcut reaction
- I think that blocking takes are certainly preferable. That said, I 
believe that it is more important that a sink return backoff when no 
data was processed.

Sinks:
- Ready if data was processed.
- Backoff if no data was processed/the sink needs breathing space.
- Roll back and back off if the downstream write failed
- Throw EventDeliveryException if the sink has a serious problem that
puts it out of commission (this would result in failover or removal from
balancing). This would be for cases where it is suspected that
downstream is unavailable long-term (e.g. avrosink has repeatedly failed
X times in a row). A rough process() skeleton along these lines is
sketched below.
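
A hedged sketch of that status protocol (the downstream write, the failure
counter, and the threshold are all made up for illustration):

  import org.apache.flume.Channel;
  import org.apache.flume.Event;
  import org.apache.flume.EventDeliveryException;
  import org.apache.flume.Transaction;
  import org.apache.flume.sink.AbstractSink;

  public class StatusProtocolSinkSketch extends AbstractSink {
    private static final int MAX_CONSECUTIVE_FAILURES = 5;   // hypothetical threshold
    private int consecutiveFailures = 0;

    @Override
    public Status process() throws EventDeliveryException {
      Channel channel = getChannel();
      Transaction tx = channel.getTransaction();
      tx.begin();
      try {
        Event event = channel.take();
        if (event == null) {
          tx.commit();
          return Status.BACKOFF;              // no data processed: back off
        }
        writeDownstream(event);               // hypothetical downstream write
        tx.commit();
        consecutiveFailures = 0;
        return Status.READY;                  // data processed: ready for more
      } catch (Exception e) {
        tx.rollback();                        // downstream write failed: roll back
        consecutiveFailures++;
        if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
          // Looks unavailable long-term: signal failover / removal from balancing.
          throw new EventDeliveryException("Downstream unavailable", e);
        }
        return Status.BACKOFF;                // transient failure: just back off
      } finally {
        tx.close();
      }
    }

    private void writeDownstream(Event event) throws Exception {
      // Placeholder for the real downstream write (HDFS, Avro RPC, etc.).
    }
  }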

I tried to kick off some discussion about this with regard to sink
failover too. When developing the failover sink processor I assumed that
a failed sink would throw SinkDeliveryException; see
https://issues.apache.org/jira/browse/FLUME-981
> This comes about when I noticed some inconsistencies.  For example, a
> take in MemoryChannel blocks for a few seconds by default and
> JDBCChannel does not (FLUME-998). Combined with HDFSEvent sink, this
> causes tremendous amounts of CPU consumption. Also, currently if HDFS
> is unavailable for a period, flume needs to be restarted (FLUME-985).
>
> My general thoughts are based on experience working with JMS-based services.
>
> 1) Source/Channel/Sink should not require a restart when up or down
> stream services are restarted or become temporarily unavailable.
> 2) Channel capacity being exceeded should not lead to sources dying
> and thus requiring a flume restart. This will happen when downstream
> destinations slow down for various reasons.
What would be a preferable alternative? For sources with an upstream, 
they should be able to signal upstream that the transaction needs to be 
rolled back. Other than that though, throwing away data that couldn't be 
delivered is the only possibility with a plain channel? Hopefully we can 
do something like buffers in scribed.

> Brock
>


Re: Source/Channel/Sink Behavior Standardization

Posted by Brock Noland <br...@cloudera.com>.
Good discussion on some of the reasons we should standardize this can be found at:

https://issues.apache.org/jira/browse/FLUME-998

Brock

On Sun, Feb 26, 2012 at 11:14 PM, Brock Noland <br...@cloudera.com> wrote:
> Hello,
>
> This might be something for the developer guide or it might be
> somewhere and I just missed it.  I feel like we should set down some
> expectations in regards to:
>
> 1) Source behavior when:
>  a) Channel put fails
>  b) Source started but is unable to obtain new events for some reason
> 2) Channel behavior when:
>  a) Channel capacity exceeded
>  b) take when channel is empty
> 3) Sink behavior when:
>  a) Channel take returns null
>  b) Sink cannot write to the downstream location
>
> This comes about when I noticed some inconsistencies.  For example, a
> take in MemoryChannel blocks for a few seconds by default and
> JDBCChannel does not (FLUME-998). Combined with HDFSEvent sink, this
> causes tremendous amounts of CPU consumption. Also, currently if HDFS
> is unavailable for a period, flume needs to be restarted (FLUME-985).
>
> My general thoughts are based on experience working with JMS-based services.
>
> 1) Source/Channel/Sink should not require a restart when up or down
> stream services are restarted or become temporarily unavailable.
> 2) Channel capacity being exceeded should not lead to sources dying
> and thus requiring a flume restart. This will happen when downstream
> destinations slow down for various reasons.
>
> Brock
>
> --
> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/