Posted to dev@flume.apache.org by "Juhani Connolly (JIRA)" <ji...@apache.org> on 2012/07/10 11:48:35 UTC

[jira] [Created] (FLUME-1361) Add event batching to ExecSource

Juhani Connolly created FLUME-1361:
--------------------------------------

             Summary: Add event batching to ExecSource
                 Key: FLUME-1361
                 URL: https://issues.apache.org/jira/browse/FLUME-1361
             Project: Flume
          Issue Type: Improvement
            Reporter: Juhani Connolly
            Assignee: Juhani Connolly


Add a configuration option for the number of items to send to the channel in a single transaction.

This will help a lot with FileChannel, which needs to fsync on every commit.
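
As a rough illustration of the idea (not the actual patch), the sketch below buffers incoming lines and hands them off in groups, so that one commit covers many events. The class LineBatcher, its batchSize parameter and the commit callback are hypothetical names; the callback only stands in for a single channel transaction and does not use any Flume API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of source-side batching; not Flume's ExecSource code. */
class LineBatcher {
    private final int batchSize;
    // Stand-in for "put these events in the channel in one transaction" (assumption).
    private final Consumer<List<String>> commit;
    private final List<String> buffer = new ArrayList<>();

    LineBatcher(int batchSize, Consumer<List<String>> commit) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.commit = commit;
    }

    /** Buffers one line and commits once the buffer holds batchSize lines. */
    void add(String line) {
        buffer.add(line);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Commits whatever is buffered, e.g. when the tailed process exits. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        commit.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Counts the commits needed to deliver `lines` lines at the given batch size. */
    static int countCommits(int lines, int batchSize) {
        int[] commits = {0};
        LineBatcher batcher = new LineBatcher(batchSize, batch -> commits[0]++);
        for (int i = 0; i < lines; i++) {
            batcher.add("line " + i);
        }
        batcher.flush();
        return commits[0];
    }

    public static void main(String[] args) {
        System.out.println(countCommits(650, 1));  // prints 650
        System.out.println(countCommits(650, 20)); // prints 33
    }
}
```

A real source would probably also want to flush on a timer, so that a slow trickle of lines is not held back until the buffer fills; that is left out here to keep the sketch short.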

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (FLUME-1361) Add event batching to ExecSource

Posted by "Juhani Connolly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410602#comment-13410602 ] 

Juhani Connolly commented on FLUME-1361:
----------------------------------------

With a setup of:

Exec source tailing tomcat logs
Sending to file channel
Which is drained by an avro sink

With the current implementation of FileChannel and a single disk (so the checkpoint and data dirs are both on the same disk), we were getting only 10 events/sec throughput. What I have gathered from other discussions, and my own assumptions that follow from them (please correct me if this is wrong), is that this is because each commit triggers an fsync, which in turn triggers at least 2 seeks (one for the data dir, one for the checkpoint dir) plus seeks for everything else recently written to disk (e.g. tomcat logs). On a system with 2-3 disks dedicated exclusively to Flume, the writes would be sequential and probably not a problem.

With this patch, we were getting the full throughput of our live logs (amounting to 650ish events per second per server). I have yet to test what the maximum is, but regardless, it solves what I believe will be a very common use case (a tailing exec source feeding a file channel).
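
The two figures above can be related with a back-of-the-envelope model. It assumes that the cost of a commit is dominated by the fsync and does not grow with the number of events in the batch; that is an assumption for illustration, not something measured in this thread. Under it, 10 events/sec at one event per commit implies roughly 100 ms per commit, and keeping up with about 650 events/sec needs a batch of at least 65 events. The class and method names are hypothetical.

```java
/** Hypothetical fixed-cost-per-commit model; not a measurement of FileChannel. */
class CommitCostModel {
    /** Events per second if every commit takes commitMillis and carries batchSize events. */
    static double eventsPerSecond(double commitMillis, int batchSize) {
        return batchSize * 1000.0 / commitMillis;
    }

    /** Smallest batch size that reaches the target rate under the same model. */
    static int minBatchSize(double commitMillis, double targetEventsPerSecond) {
        return (int) Math.ceil(targetEventsPerSecond * commitMillis / 1000.0);
    }

    public static void main(String[] args) {
        // ~100 ms per commit is inferred from "10 events/sec" with one event per commit.
        System.out.println(eventsPerSecond(100.0, 1));  // prints 10.0
        System.out.println(minBatchSize(100.0, 650.0)); // prints 65
    }
}
```

In practice the per-commit cost will vary with disk load, so the model only gives a starting point for choosing a batch size.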

Apparently the review requests no longer get auto-linked... added a link to the review request... I'll fix up the docs tomorrow once I get back to my work computer
                

[jira] [Comment Edited] (FLUME-1361) Add event batching to ExecSource

Posted by "Patrick Wendell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410624#comment-13410624 ] 

Patrick Wendell edited comment on FLUME-1361 at 7/10/12 5:52 PM:
-----------------------------------------------------------------

Hey Juhani,

Yep - you've got it. The ideal setup for a FileChannel would be either:

1) Using a dedicated disk for Flume and flushing to disk on every event.
or
2) Using a shared disk for Flume and batching disk syncs to prevent excess seeking.

The first case is similar to using a WAL: frequent seeks but a dedicated disk, so you can get high throughput. If you try to use FileChannel with a shared disk and you are syncing on every event, throughput is going to be bad.

So I'd expect adding batching to give better throughput, and it sounds like it does.

One question is whether batching should happen as part of the source or if it should be a first-order feature of the channel, since people will have this problem with other types of sources (e.g. syslog source) whenever they want to do durable writes at high throughput.
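
To make the difference between syncing on every event and batching syncs concrete, here is a minimal sketch using plain java.nio. It is not Flume's FileChannel implementation: SyncBatchingDemo, writeWithSync and syncEvery are hypothetical names, and the temp file merely stands in for a data file on a shared disk. The method forces data to disk once every syncEvery records and returns how many syncs were issued.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo of per-event vs batched disk syncs; not Flume code. */
class SyncBatchingDemo {
    /** Appends `records` lines to a temp file, syncing every `syncEvery` records. */
    static int writeWithSync(int records, int syncEvery) {
        if (syncEvery < 1) {
            throw new IllegalArgumentException("syncEvery must be >= 1");
        }
        int syncs = 0;
        Path file = null;
        try {
            file = Files.createTempFile("sync-demo", ".log");
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                for (int i = 1; i <= records; i++) {
                    byte[] line = ("event " + i + "\n").getBytes(StandardCharsets.UTF_8);
                    ByteBuffer buf = ByteBuffer.wrap(line);
                    while (buf.hasRemaining()) {
                        ch.write(buf);
                    }
                    if (i % syncEvery == 0) {
                        ch.force(true); // the expensive step: flush data and metadata to disk
                        syncs++;
                    }
                }
                if (records % syncEvery != 0) {
                    ch.force(true); // sync the final partial batch
                    syncs++;
                }
            }
            return syncs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (file != null) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temp file
                }
            }
        }
    }

    public static void main(String[] args) {
        for (int syncEvery : new int[] {1, 20}) {
            long start = System.nanoTime();
            int syncs = writeWithSync(200, syncEvery);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("syncEvery=" + syncEvery + " syncs=" + syncs + " ms=" + millis);
        }
    }
}
```

The elapsed times printed by main depend entirely on the disk and on what else is writing to it, so no particular numbers are claimed here; only the sync counts (200 versus 10 for 200 records) are fixed.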
                

[jira] [Commented] (FLUME-1361) Add event batching to ExecSource

Posted by "Patrick Wendell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410738#comment-13410738 ] 

Patrick Wendell commented on FLUME-1361:
----------------------------------------

I think it's fine to have this batching in the exec source as a short term fix.

Even if we add batching as a core component of Flume, people might still want this anyway to batch the source at a different granularity.
                

[jira] [Commented] (FLUME-1361) Add event batching to ExecSource

Posted by "Patrick Wendell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410580#comment-13410580 ] 

Patrick Wendell commented on FLUME-1361:
----------------------------------------

Hey Juhani,

If you could share any performance improvement you get from this (even roughly) that would be great.

I was looking at:
https://issues.apache.org/jira/browse/FLUME-1339

with Hari, but my instinct is that adding event batching is really what you want for this, not necessarily building a standalone client.

- Patrick
                

[jira] [Commented] (FLUME-1361) Add event batching to ExecSource

Posted by "Juhani Connolly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410643#comment-13410643 ] 

Juhani Connolly commented on FLUME-1361:
----------------------------------------

It would be nice to see batching as part of the channel, and I've mentioned it on the mailing list before. I did it this way because we needed it now, it is simple, and doing it channel-side looks a lot more awkward and gives less control. Anyway, it's 3am here, so I'm off to sleep; I'll address the comments on the review first thing tomorrow.
                

[jira] [Updated] (FLUME-1361) Add event batching to ExecSource

Posted by "Juhani Connolly (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juhani Connolly updated FLUME-1361:
-----------------------------------

    Attachment: FLUME-1361-2.patch
    