Posted to dev@flume.apache.org by Patrick Wendell <pw...@gmail.com> on 2012/07/11 08:12:07 UTC

Questions about Batching in Flume

Hi All,

Most streaming systems have built-in support for batching since it
often offers major performance benefits in terms of throughput.

I'm a little confused about the state of batching in Flume today. It
looks like a ChannelProcessor can process a batch of events within one
transaction, but internally this just calls Channel.put() several
times.
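
Concretely, by batching within a transaction I mean the usual
Channel/Transaction pattern below (a rough sketch; 'channel' and
'batch' stand in for a configured channel and a list of events):

Transaction tx = channel.getTransaction();
tx.begin();
try {
  for (Event event : batch) {
    channel.put(event);   // one put per event, same transaction
  }
  tx.commit();            // ideally the only point that must hit disk
} catch (Exception e) {
  tx.rollback();
  throw new RuntimeException(e);
} finally {
  tx.close();
}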

As far as I can tell, both of the durable channels (JDBC and File)
actually flush to disk in some fashion whenever there is a doPut(). It
seems to me like it makes sense to buffer all of those puts in memory
and only flush them once per transaction. Otherwise, isn't the benefit
of batching put()'s within a transaction lost?
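
What I'd expect instead is roughly the sketch below. None of these
names are real Flume classes (the real extension points are the
doPut()/doCommit() hooks on BasicTransactionSemantics); it just shows
the buffer-then-flush shape I have in mind:

class BufferedTxnSketch {
  private final List<Event> buffer = new ArrayList<Event>();
  private final EventLog log;   // imaginary append-only log

  BufferedTxnSketch(EventLog log) { this.log = log; }

  void doPut(Event event) {
    buffer.add(event);          // hold the event on the heap for now
  }

  void doCommit() throws IOException {
    for (Event e : buffer) {
      log.write(e);             // single write pass at commit time...
    }
    log.sync();                 // ...and one fsync per transaction
    buffer.clear();
  }
}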

I think I might be missing something here; any pointers are appreciated.

- Patrick

Re: Questions about Batching in Flume

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Folks,

The hole in my thinking was, as Brock pointed out, that the
FileChannel doesn't actually sync() until a commit; I misread the code
while skimming it. So it does allow batching within a transaction, as
desired.

The JDBC channel, however, looks like it persists events on every
put() rather than at transaction boundaries:

-- JdbcChannel.java --
@Override
public void put(Event event) throws ChannelException {
  // hands the event straight to the provider, which appears to
  // persist it to the database immediately rather than buffering it
  getProvider().persistEvent(getName(), event);
}
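
If that is right, commit-time batching on the JDBC side could look
roughly like the standard PreparedStatement idiom below (purely
illustrative; the table and column names are invented, and this is not
the actual provider code):

PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO fl_event (channel_name, payload) VALUES (?, ?)");
for (Event e : pending) {
  ps.setString(1, channelName);
  ps.setBytes(2, e.getBody());
  ps.addBatch();              // queue the insert client-side
}
ps.executeBatch();            // one round trip for the whole batch
connection.commit();          // aligned with the channel's commit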

Am I wrong on this one as well?

- Patrick

On Wed, Jul 11, 2012 at 2:12 AM, Juhani Connolly
<ju...@cyberagent.co.jp> wrote:
> I think some of my earlier speculation may have led to this
> misunderstanding? I can confirm, after changing the exec source, that the
> puts/takes themselves are not the bottleneck, and that
> performance is fine so long as the number of transactions is not too
> large (as each transaction commit will cause an fsync).
>
> An option for the channel to store x events on the heap before flushing
> could be interesting, though it would void any delivery guarantees made. I
> do not think this is necessarily a bad thing so long as it is documented (and
> people who want everything committed can request flushing the buffer on
> every commit).
>
>
> On 07/11/2012 04:05 PM, Brock Noland wrote:
>>
>> What leads you to that conclusion about FC? (I am curious in case there is
>> something I am unaware of.) This is where a Put ends up being written and
>> there is no flush until a commit.
>>
>>
>> https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java#L165
>>
>> Brock
>>
>> On Wed, Jul 11, 2012 at 7:12 AM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Most streaming systems have built-in support for batching since it
>>> often offers major performance benefits in terms of throughput.
>>>
>>> I'm a little confused about the state of batching in Flume today. It
>>> looks like a ChannelProcessor can process a batch of events within one
>>> transaction, but internally this just calls Channel.put() several
>>> times.
>>>
>>> As far as I can tell, both of the durable channels (JDBC and File)
>>> actually flush to disk in some fashion whenever there is a doPut(). It
>>> seems to me like it makes sense to buffer all of those puts in memory
>>> and only flush them once per transaction. Otherwise, isn't the benefit
>>> of batching put()'s within a transaction lost?
>>>
>>> I think I might be missing something here; any pointers are appreciated.
>>>
>>> - Patrick
>>>
>>
>>
>
>

Re: Questions about Batching in Flume

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
I think some of my earlier speculation may have led to this 
misunderstanding? I can confirm, after changing the exec source, that 
the puts/takes themselves are not the bottleneck, and that 
performance is fine so long as the number of transactions is not too 
large (as each transaction commit will cause an fsync).

An option for the channel to store x events on the heap before flushing 
could be interesting, though it would void any delivery guarantees made. 
I do not think this is necessarily a bad thing so long as it is 
documented (and people who want everything committed can request 
flushing the buffer on every commit).
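
As a sketch of what I mean (hypothetical; the flag and the surrounding
names are invented, nothing like this exists in Flume today):

void commit(List<Event> batch) throws IOException {
  heapBuffer.addAll(batch);       // hold events in memory
  if (forceSyncOnCommit || heapBuffer.size() >= maxBufferedEvents) {
    for (Event e : heapBuffer) {
      log.write(e);
    }
    log.sync();                   // fsync only when the policy demands it
    heapBuffer.clear();
  }                               // else: fast, but not yet durable
}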

On 07/11/2012 04:05 PM, Brock Noland wrote:
> What leads you to that conclusion about FC? (I am curious in case there is
> something I am unaware of.) This is where a Put ends up being written and
> there is no flush until a commit.
>
> https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java#L165
>
> Brock
>
> On Wed, Jul 11, 2012 at 7:12 AM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hi All,
>>
>> Most streaming systems have built-in support for batching since it
>> often offers major performance benefits in terms of throughput.
>>
>> I'm a little confused about the state of batching in Flume today. It
>> looks like a ChannelProcessor can process a batch of events within one
>> transaction, but internally this just calls Channel.put() several
>> times.
>>
>> As far as I can tell, both of the durable channels (JDBC and File)
>> actually flush to disk in some fashion whenever there is a doPut(). It
>> seems to me like it makes sense to buffer all of those puts in memory
>> and only flush them once per transaction. Otherwise, isn't the benefit
>> of batching put()'s within a transaction lost?
>>
>> I think I might be missing something here; any pointers are appreciated.
>>
>> - Patrick
>>
>
>



Re: Questions about Batching in Flume

Posted by Brock Noland <br...@cloudera.com>.
What leads you to that conclusion about FC? (I am curious in case there is
something I am unaware of.) This is where a Put ends up being written and
there is no flush until a commit.

https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java#L165
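
To illustrate the write-vs-force distinction with plain java.nio (not
LogFile's actual code, just the underlying idea):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

class WriteThenSync {
  // write() hands bytes to the OS page cache; force() is the fsync
  // that makes everything written so far durable on disk.
  static void putThenCommit(RandomAccessFile raf, ByteBuffer record)
      throws IOException {
    raf.getChannel().write(record);  // put: cheap, buffered by the OS
    // ... more puts can accumulate here without touching the disk ...
    raf.getChannel().force(false);   // commit: one fsync for the batch
  }
}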

Brock

On Wed, Jul 11, 2012 at 7:12 AM, Patrick Wendell <pw...@gmail.com> wrote:

> Hi All,
>
> Most streaming systems have built-in support for batching since it
> often offers major performance benefits in terms of throughput.
>
> I'm a little confused about the state of batching in Flume today. It
> looks like a ChannelProcessor can process a batch of events within one
> transaction, but internally this just calls Channel.put() several
> times.
>
> As far as I can tell, both of the durable channels (JDBC and File)
> actually flush to disk in some fashion whenever there is a doPut(). It
> seems to me like it makes sense to buffer all of those puts in memory
> and only flush them once per transaction. Otherwise, isn't the benefit
> of batching put()'s within a transaction lost?
>
> I think I might be missing something here; any pointers are appreciated.
>
> - Patrick
>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/