
Posted to user@flume.apache.org by Michael Diamant <di...@gmail.com> on 2014/09/08 21:58:32 UTC

Enabling file channel backup checkpoint causes significant disk IO at start-up

My team uses Flume 1.4.0 packaged with CDH5.0.2 via an embedded agent to
write to a file channel.  From a previous thread started by my colleague,
"FileChannel Replays consistently take a long time" and associated issue,
https://issues.apache.org/jira/browse/FLUME-2450, it was suggested to use a
backup checkpoint directory to avoid lengthy replays.  When I enabled the
backup checkpoint directory, I observed via iotop near 100% IO by my
application with the embedded agent.  This level of IO persists for about
30 seconds, rendering the application unusable during that time.

For comparison, I monitored via iotop with the backup checkpoint disabled.
IO activity then lasts for at most several seconds.  That is, there is a
qualitative difference when enabling the backup checkpoint directory.
I also tried deleting the existing checkpoint and data directories to
start with a clean slate.  The results of that experiment are in line
with the observations above.

Is this expected behavior when using a backup checkpoint directory?  Is
there any way to reduce the amount of IO?  I would appreciate feedback
and insights, because the current behavior is untenable for a production
environment.

Thank you,
Michael
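
For readers reproducing this setup, the configuration described above might
look roughly like the sketch below. It is an illustration under assumptions,
not the poster's actual code: the directory paths and the class and method
names are hypothetical, and it only builds the property map that would be
handed to an embedded agent, without depending on Flume itself. The keys
follow the embedded agent's channel.* convention and the file channel's
checkpointDir, dataDirs, useDualCheckpoints, and backupCheckpointDir
properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BackupCheckpointConfigSketch {

    // Builds file channel properties with a backup checkpoint directory enabled.
    // baseDir is a hypothetical root directory chosen for this illustration.
    static Map<String, String> fileChannelProperties(String baseDir) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("channel.type", "file");
        props.put("channel.checkpointDir", baseDir + "/checkpoint");
        props.put("channel.dataDirs", baseDir + "/data");
        // The backup checkpoint is only used when dual checkpoints are switched on.
        props.put("channel.useDualCheckpoints", "true");
        props.put("channel.backupCheckpointDir", baseDir + "/checkpoint-backup");
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting properties, one per line.
        fileChannelProperties("/var/lib/myapp/flume")
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Note that the backup checkpoint directory has to be distinct from both the
checkpoint directory and the data directories, which is why the sketch gives
each its own subdirectory.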

Re: Enabling file channel backup checkpoint causes significant disk IO at start-up

Posted by Abraham Fine <ab...@brightroll.com>.

Hi-

I'm the author of the backup checkpoint compression patch.

We backported it to 1.4 and are running it in production without a problem.

Abe

-- 
Abraham Fine | Software Engineer
(516) 567-2535
BrightRoll, Inc. | Smart Video Advertising | www.brightroll.com

On Mon, Sep 8, 2014 at 1:59 PM, Gary Malouf <ma...@gmail.com> wrote:

> Hi Hari,
>
> I'm a colleague of Michael's, if we are in need of a few of these patches,
> would you recommend we do our own custom build?
>
> Separate from Apache's release cycle, would these patches get included in
> the next CDH build that includes Flume?  (Not sure what the schedule of
> that is...)
>
> Thanks,
>
> Gary

Re: Enabling file channel backup checkpoint causes significant disk IO at start-up

Posted by Hari Shreedharan <hs...@cloudera.com>.

If it is urgent, you could do your own build (make sure you use the correct
build profile etc.). CDH (and most other vendor releases) usually include
the patches they select up to their code freeze - in most cases, ones that
don't break compatibility. I am not sure about the exact dates, though.

On Mon, Sep 8, 2014 at 1:59 PM, Gary Malouf <ma...@gmail.com> wrote:

> Hi Hari,
>
> I'm a colleague of Michael's, if we are in need of a few of these patches,
> would you recommend we do our own custom build?
>
> Separate from Apache's release cycle, would these patches get included in
> the next CDH build that includes Flume?  (Not sure what the schedule of
> that is...)
>
> Thanks,
>
> Gary

Re: Enabling file channel backup checkpoint causes significant disk IO at start-up

Posted by Gary Malouf <ma...@gmail.com>.

Hi Hari,

I'm a colleague of Michael's.  If we need a few of these patches, would
you recommend we do our own custom build?

Separate from Apache's release cycle, would these patches get included in
the next CDH build that includes Flume?  (Not sure what the schedule of
that is...)

Thanks,

Gary


On Mon, Sep 8, 2014 at 4:55 PM, Hari Shreedharan <hs...@cloudera.com>
wrote:

> Flume releases are once every few months - since we just had one a couple
> of months back, I don't think there will be one happening right away.

Re: Enabling file channel backup checkpoint causes significant disk IO at start-up

Posted by Hari Shreedharan <hs...@cloudera.com>.

Flume releases are once every few months - since we just had one a 
couple of months back, I don't think there will be one happening right 
away.

Michael Diamant wrote:
>
> Hari, thank you for your quick reply. A follow-up question to help me
> figure out how best to proceed on my end: Can you provide an estimate
> as to when the next Flume release will occur?

Re: Enabling file channel backup checkpoint causes significant disk IO at start-up

Posted by Michael Diamant <di...@gmail.com>.

Hari, thank you for your quick reply.  A follow-up question to help me
figure out how best to proceed on my end:  Can you provide an estimate as
to when the next Flume release will occur?


On Mon, Sep 8, 2014 at 4:07 PM, Hari Shreedharan <hs...@cloudera.com>
wrote:

> This patch should address the issue, if enabled:
> https://git-wip-us.apache.org/repos/asf?p=flume.git;a=commitdiff;h=69fd6b3ad5e5b9ae6f1293b3d8e57ed57fd6701c;hp=f15f20785262ac3cb3e35c2a12e669b7a836d35f
>
> It will be part of the next Flume release (or CDH5.2.0).
>
> --
>
> Thanks,
> Hari

Re: Enabling file channel backup checkpoint causes significant disk IO at start-up

Posted by Hari Shreedharan <hs...@cloudera.com>.

This patch should address the issue, if enabled: 
https://git-wip-us.apache.org/repos/asf?p=flume.git;a=commitdiff;h=69fd6b3ad5e5b9ae6f1293b3d8e57ed57fd6701c;hp=f15f20785262ac3cb3e35c2a12e669b7a836d35f

It will be part of the next Flume release (or CDH5.2.0).

-- 

Thanks,
Hari
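
Assuming the patch referenced above is the backup checkpoint compression
change mentioned elsewhere in this thread, enabling it would presumably
amount to one more channel property. The sketch below is only an
illustration: the key name compressBackupCheckpoint is an assumption that
should be verified against the Flume build in use, and the class and method
names are hypothetical. It builds a property map and does not depend on
Flume itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CompressBackupCheckpointSketch {

    // Assumed property name; confirm it exists in the Flume build in use.
    static final String COMPRESS_KEY = "channel.compressBackupCheckpoint";

    // Returns a copy of the given channel properties with compression switched on.
    // Compression applies to the backup checkpoint, so dual checkpoints must be enabled.
    static Map<String, String> withCompressedBackup(Map<String, String> channelProps) {
        if (!"true".equals(channelProps.get("channel.useDualCheckpoints"))) {
            throw new IllegalArgumentException("channel.useDualCheckpoints must be true");
        }
        Map<String, String> copy = new LinkedHashMap<>(channelProps);
        copy.put(COMPRESS_KEY, "true");
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("channel.type", "file");
        props.put("channel.useDualCheckpoints", "true");
        // Hypothetical path used only for this illustration.
        props.put("channel.backupCheckpointDir", "/var/lib/myapp/flume/checkpoint-backup");
        withCompressedBackup(props)
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Keeping the flag in a separate step makes it easy to compare start-up IO
with and without it on the same set of directories.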

