Posted to dev@flume.apache.org by Patrick Wendell <pw...@gmail.com> on 2012/07/11 08:12:07 UTC
Questions about Batching in Flume
Hi All,
Most streaming systems have built-in support for batching since it
often offers major performance benefits in terms of throughput.
I'm a little confused about the state of batching in Flume today. It
looks like a ChannelProcessor can process a batch of events within one
transaction, but internally this just calls Channel.put() several
times.
As far as I can tell, both of the durable channels (JDBC and File)
actually flush to disk in some fashion whenever there is a doPut(). It
seems to me like it makes sense to buffer all of those puts in memory
and only flush them once per transaction. Otherwise, isn't the benefit
of batching put()'s within a transaction lost?
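To make the concern concrete, here is a toy comparison (class and method names are invented for illustration, not Flume's actual API) of a channel that flushes on every put() versus one that buffers puts and flushes once per transaction commit:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two behaviors under discussion. A real durable channel
// would write to disk; here "flush" is just counted so the difference in
// per-batch cost is visible.
public class FlushCounting {
    interface Channel {
        void put(String event);
        void commit();
        int flushes();
    }

    // Flushes on every put(): N events cost N durable writes.
    static class FlushPerPut implements Channel {
        int flushes = 0;
        public void put(String event) { flushes++; }  // durable write per event
        public void commit() { }                      // nothing left to do
        public int flushes() { return flushes; }
    }

    // Buffers puts in memory; one flush per transaction commit.
    static class FlushPerCommit implements Channel {
        final List<String> buffer = new ArrayList<>();
        int flushes = 0;
        public void put(String event) { buffer.add(event); }  // in-memory only
        public void commit() { buffer.clear(); flushes++; }   // one flush per batch
        public int flushes() { return flushes; }
    }

    static int run(Channel ch, int batchSize) {
        for (int i = 0; i < batchSize; i++) ch.put("event-" + i);
        ch.commit();
        return ch.flushes();
    }

    public static void main(String[] args) {
        System.out.println("per-put flushes:    " + run(new FlushPerPut(), 100));
        System.out.println("per-commit flushes: " + run(new FlushPerCommit(), 100));
    }
}
```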
I think I might be missing something here, any pointers are appreciated.
- Patrick
Re: Questions about Batching in Flume
Posted by Patrick Wendell <pw...@gmail.com>.
Hey Folks,
So the hole in my thinking was, as Brock pointed out, that the
FileChannel doesn't actually sync() until a commit. I misread the code
while looking at it quickly. So it does allow batching within a
transaction as desired.
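That behavior can be sketched with plain java.nio (a simplified stand-in for illustration, not the actual LogFile code): each put() writes to the log file immediately, but the write only lands in the OS page cache; the expensive force()/fsync is deferred to commit().

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of a write-ahead log that batches durability per transaction:
// puts are written eagerly, but fsync happens once, at commit.
public class WalSketch implements AutoCloseable {
    private final FileChannel log;

    public WalSketch(Path path) throws IOException {
        log = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    public void put(String event) throws IOException {
        // Write goes to the OS page cache; no fsync yet.
        log.write(ByteBuffer.wrap((event + "\n").getBytes(StandardCharsets.UTF_8)));
    }

    public void commit() throws IOException {
        log.force(false);  // one fsync per transaction, not per event
    }

    @Override
    public void close() throws IOException { log.close(); }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("wal", ".log");
        try (WalSketch wal = new WalSketch(p)) {
            for (int i = 0; i < 10; i++) wal.put("event-" + i);
            wal.commit();  // whole batch made durable in one force()
        }
        System.out.println(Files.readAllLines(p).size() + " events on disk");
    }
}
```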
The JDBC channel, however, looks like it persists events on every
put() rather than at transaction boundaries:
-- JdbcChannel.java --
@Override
public void put(Event event) throws ChannelException {
    getProvider().persistEvent(getName(), event);
}
Am I wrong on this one as well?
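For comparison, a transaction-scoped variant could stage puts in memory and persist them in a single provider call at commit. The following is a hypothetical sketch: the persistEvents batch method and the fake provider are assumptions for illustration, not part of the real JDBC channel code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical rework of the per-put persistence above: put() only stages
// the event, and one batched provider call happens per transaction commit.
public class BatchingJdbcChannelSketch {
    /** Stand-in for the JDBC provider; counts database round trips. */
    static class FakeProvider {
        int roundTrips = 0;
        void persistEvents(String channelName, List<String> events) {
            roundTrips++;  // imagine one batched INSERT + commit here
        }
    }

    private final FakeProvider provider = new FakeProvider();
    private final List<String> staged = new ArrayList<>();

    public void put(String event) {
        staged.add(event);  // no database I/O here
    }

    public void commitTransaction() {
        provider.persistEvents("channel-1", staged);  // one round trip per tx
        staged.clear();
    }

    public int roundTrips() { return provider.roundTrips; }

    public static void main(String[] args) {
        BatchingJdbcChannelSketch ch = new BatchingJdbcChannelSketch();
        for (int i = 0; i < 50; i++) ch.put("event-" + i);
        ch.commitTransaction();
        System.out.println("round trips: " + ch.roundTrips());
    }
}
```

With a real database, the staged events would map naturally onto PreparedStatement.addBatch()/executeBatch() inside one JDBC transaction.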
- Patrick
On Wed, Jul 11, 2012 at 2:12 AM, Juhani Connolly
<ju...@cyberagent.co.jp> wrote:
> I think some of my earlier speculation may have led to this
> misunderstanding? I can confirm, after changing the exec source, that the
> puts/takes themselves are not generating the bottleneck, and that
> performance is fine so long as the number of transactions is not too
> large (as each transaction commit will cause an fsync).
>
> An option for the channel to store x events on the heap before flushing
> could be interesting, though it would void any guarantee that deliveries
> are made. I do not think this is necessarily a bad thing so long as it is
> documented (and people who want everything committed can request flushing
> the buffer every commit).
>
>
> On 07/11/2012 04:05 PM, Brock Noland wrote:
>>
>> What leads you to that conclusion about FC? (I am curious in case there is
>> something I am unaware of.) This is where a Put ends up being written and
>> there is no flush until a commit.
>>
>>
>> https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java#L165
>>
>> Brock
>>
>> On Wed, Jul 11, 2012 at 7:12 AM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Most streaming systems have built-in support for batching since it
>>> often offers major performance benefits in terms of throughput.
>>>
>>> I'm a little confused about the state of batching in Flume today. It
>>> looks like a ChannelProcessor can process a batch of events within one
>>> transaction, but internally this just calls Channel.put() several
>>> times.
>>>
>>> As far as I can tell, both of the durable channels (JDBC and File)
>>> actually flush to disk in some fashion whenever there is a doPut(). It
>>> seems to me like it makes sense to buffer all of those puts in memory
>>> and only flush them once per transaction. Otherwise, isn't the benefit
>>> of batching put()'s within a transaction lost?
>>>
>>> I think I might be missing something here, any pointers are appreciated.
>>>
>>> - Patrick
>>>
>>
>>
>
>
Re: Questions about Batching in Flume
Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
I think some of my earlier speculation may have led to this
misunderstanding? I can confirm, after changing the exec source, that the
puts/takes themselves are not generating the bottleneck, and that
performance is fine so long as the number of transactions is not too
large (as each transaction commit will cause an fsync).
An option for the channel to store x events on the heap before flushing
could be interesting, though it would void any guarantee that deliveries
are made. I do not think this is necessarily a bad thing so long as it is
documented (and people who want everything committed can request flushing
the buffer every commit).
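The buffer-x-events option could be sketched as follows (all names are invented for illustration; setting bufferSize to 0 restores a flush on every commit, for users who want every commit durable):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of buffering committed events on the heap and only forcing them
// to disk once the buffer exceeds a configured size. Durability of a
// commit is deferred until the next flush, which is the trade-off noted above.
public class HeapBufferedCommits {
    private final int bufferSize;
    private final Consumer<List<String>> flusher;  // stand-in for the fsync path
    private final List<String> pending = new ArrayList<>();

    public HeapBufferedCommits(int bufferSize, Consumer<List<String>> flusher) {
        this.bufferSize = bufferSize;
        this.flusher = flusher;
    }

    /** Called at transaction commit with that transaction's events. */
    public void commit(List<String> events) {
        pending.addAll(events);
        if (pending.size() > bufferSize) flush();  // bufferSize = 0: flush every commit
    }

    /** Force out whatever is still on the heap (e.g. at shutdown). */
    public void flush() {
        if (!pending.isEmpty()) {
            flusher.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> flushSizes = new ArrayList<>();
        HeapBufferedCommits ch = new HeapBufferedCommits(100, b -> flushSizes.add(b.size()));
        for (int tx = 0; tx < 5; tx++) ch.commit(List.of("a", "b", "c"));  // 15 events, 0 flushes
        ch.flush();  // one flush of all 15 buffered events
        System.out.println(flushSizes);
    }
}
```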
On 07/11/2012 04:05 PM, Brock Noland wrote:
> What leads you to that conclusion about FC? (I am curious in case there is
> something I am unaware of.) This is where a Put ends up being written and
> there is no flush until a commit.
>
> https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java#L165
>
> Brock
>
> On Wed, Jul 11, 2012 at 7:12 AM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hi All,
>>
>> Most streaming systems have built-in support for batching since it
>> often offers major performance benefits in terms of throughput.
>>
>> I'm a little confused about the state of batching in Flume today. It
>> looks like a ChannelProcessor can process a batch of events within one
>> transaction, but internally this just calls Channel.put() several
>> times.
>>
>> As far as I can tell, both of the durable channels (JDBC and File)
>> actually flush to disk in some fashion whenever there is a doPut(). It
>> seems to me like it makes sense to buffer all of those puts in memory
>> and only flush them once per transaction. Otherwise, isn't the benefit
>> of batching put()'s within a transaction lost?
>>
>> I think I might be missing something here, any pointers are appreciated.
>>
>> - Patrick
>>
>
>
Re: Questions about Batching in Flume
Posted by Brock Noland <br...@cloudera.com>.
What leads you to that conclusion about FC? (I am curious in case there is
something I am unaware of.) This is where a Put ends up being written and
there is no flush until a commit.
https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java#L165
Brock
On Wed, Jul 11, 2012 at 7:12 AM, Patrick Wendell <pw...@gmail.com> wrote:
> Hi All,
>
> Most streaming systems have built-in support for batching since it
> often offers major performance benefits in terms of throughput.
>
> I'm a little confused about the state of batching in Flume today. It
> looks like a ChannelProcessor can process a batch of events within one
> transaction, but internally this just calls Channel.put() several
> times.
>
> As far as I can tell, both of the durable channels (JDBC and File)
> actually flush to disk in some fashion whenever there is a doPut(). It
> seems to me like it makes sense to buffer all of those puts in memory
> and only flush them once per transaction. Otherwise, isn't the benefit
> of batching put()'s within a transaction lost?
>
> I think I might be missing something here, any pointers are appreciated.
>
> - Patrick
>
--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/