Posted to user@flume.apache.org by Gary Malouf <ma...@gmail.com> on 2014/08/20 05:55:11 UTC
FileChannel Replays consistently take a long time
We are repeatedly running into cases where replays of a file channel
going to HDFS take an eternity.
I've read this thread
<http://mail-archives.apache.org/mod_mbox/flume-dev/201306.mbox/%3CCAHbPYVbmeD6pKZkDADmyaW_gC_p7cQDefPsycWKnkY72TFiAuQ@mail.gmail.com%3E>,
but I am just not convinced that our checkpoints are constantly being
corrupted.
We are seeing messages such as:
20 Aug 2014 03:52:26,849 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:57) - Reading checkpoint metadata from /opt/flume/brq/ch1/checkpoint/checkpoint.meta
How can it be that this takes so long?
Re: FileChannel Replays consistently take a long time
Posted by Gary Malouf <ma...@gmail.com>.
Hi Hari,
I wanted to provide more details on the issues we are seeing as there may
be better actions we can take.
As mentioned earlier, we are on the Flume 1.4 release packaged with the
CDH 5.0.2 distribution. We've been in production with Flume for well over
a year, but in the last month or two have seen more significant issues as
our traffic has ramped up.
We have 2 sets of Flume 'clusters/configurations' - each instance has an
agent with 2 or more source-channel-sink combinations. We use file
channels for all of them because they allow us to buffer when downstream
issues occur (usually when replacing HDFS nodes - we will touch on this at
the bottom). All of the channels use capacities of 100000000 items (again,
using Flume as a buffer during downtime).
For reference, we have a 12 node HDFS cluster that these are writing to
with a replication factor of 3.
An example channel configuration:
agent-1.channels.trdbuy-bid-req-ch1.type = file
agent-1.channels.trdbuy-bid-req-ch1.checkpointDir = /opt/flume/trdbuy-req-ch1/checkpoint
agent-1.channels.trdbuy-bid-req-ch1.dataDirs = /opt/flume/trdbuy-req-ch1/data
agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000
We use HDFS sinks for everything; an example configuration:
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.channel = trdbuy-bid-req-ch1
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.type = hdfs
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.path = hdfs://nn-01:8020/flume/trading/buying/bidrequests/yr=%Y/mo=%m/d=%d/
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.filePrefix = %{host}s1
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.batchSize = 10000
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.rollInterval = 3600
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.rollCount = 0
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.rollSize = 0
agent-1.sinks.hdfs-trdbuy-bid-req-sink1.hdfs.idleTimeout = 1800
We are running into a few distinct problems:
1) Losing a single datanode causes the Flume instances under higher volume
(approximately 6k messages/second) to be unable to write (others seem to
continue to hum along). They do not recover until we restart the
instance. I think this exception is related:
20 Aug 2014 01:16:23,942 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated:82) - Unexpected error while checking replication factor
java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.flume.sink.hdfs.AbstractHDFSWriter.getNumCurrentReplicas(AbstractHDFSWriter.java:147)
        at org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:68)
        at org.apache.flume.sink.hdfs.BucketWriter.shouldRotate(BucketWriter.java:505)
        at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:440)
        at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:401)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[10.9.178.151:50010, 10.144.197.136:50010], original=[10.9.178.151:50010, 10.144.197.136:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:960)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1026)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1174)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2) Stopping a namenode (as expected) causes the Flume instances to stop
writing as well. When the namenode comes back up, Flume does not recover
either; it again needs a restart.
3) After restarting Flume in the case of either (1) or (2), it seems to
take an eternity (anywhere from 5-30 minutes, depending on the amount of
data in the channel) to open the source ports for listening. During this
time a single CPU on the machine is cranking, but no data gets copied to
HDFS. Once the ports finally open, the data built up in the channel
starts to copy and the catch-up process begins.
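The IOException in (1) points at the HDFS client's datanode-replacement policy: once a write pipeline loses a node and no replacement datanode is available, the DEFAULT policy fails the write rather than continuing with the surviving nodes. One commonly discussed mitigation, sketched below for the client-side hdfs-site.xml (these are standard HDFS client properties, but choosing NEVER trades durability for availability and should be weighed against your cluster's needs):

```xml
<!-- hdfs-site.xml on the Flume (HDFS client) side -->
<property>
  <!-- Allow the client to react to datanode failures in a write pipeline -->
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <!-- DEFAULT fails the write when no replacement datanode can be found;
       NEVER keeps writing to the remaining nodes in the pipeline, at the
       cost of temporarily under-replicated blocks -->
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>
```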
This last point has caused us to consider other logging solutions, as our
system is stuck waiting for Flume to sort itself out. The embedded agents
(on the client/writer side) have protected us to date, but they too are
subject to the same backup issues with file channels.
Thank you for any help around this, if nothing else this informs other
potential users of problems we have been seeing.
Gary
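On point (3), replay time scales with the checkpoint and queue state that must be rebuilt at startup, and the FileChannel exposes a few knobs that can shorten recovery. A hedged sketch (property names as documented for the Flume 1.4 FileChannel; the values are illustrative, not tested recommendations):

```
# Checkpoint frequently so less of the data log needs replaying (milliseconds)
agent-1.channels.trdbuy-bid-req-ch1.checkpointInterval = 30000
# Keep a backup checkpoint so a corrupt primary does not force a full replay
agent-1.channels.trdbuy-bid-req-ch1.useDualCheckpoints = true
agent-1.channels.trdbuy-bid-req-ch1.backupCheckpointDir = /opt/flume/trdbuy-req-ch1/checkpoint-backup
# Replay without the internal queue (faster, but uses more heap); off by default
agent-1.channels.trdbuy-bid-req-ch1.use-fast-replay = true
```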
On Wed, Aug 20, 2014 at 12:50 AM, Gary Malouf <ma...@gmail.com> wrote:
Fwd: FileChannel Replays consistently take a long time
Posted by Gary Malouf <ma...@gmail.com>.
We run on SSD ephemeral storage.
You can see from the timestamps below how long the replay ended up taking:
20 Aug 2014 04:05:20,520 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.Log.replay:348) - Replay started
20 Aug 2014 04:05:20,532 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:348) - Replay started
20 Aug 2014 04:05:20,534 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.Log.replay:360) - Found NextFileID 0, from []
20 Aug 2014 04:05:20,544 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.Log.replay:360) - Found NextFileID 619, from [/opt/flume/trdbuy-bid-req-ch1/data/log-613, /opt/flume/trdbuy-bid-req-ch1/data/log-615, /opt/flume/trdbuy-bid-req-ch1/data/log-618, /opt/flume/trdbuy-bid-req-ch1/data/log-616, /opt/flume/trdbuy-bid-req-ch1/data/log-617, /opt/flume/trdbuy-bid-req-ch1/data/log-612, /opt/flume/trdbuy-bid-req-ch1/data/log-619, /opt/flume/trdbuy-bid-req-ch1/data/log-614]
20 Aug 2014 04:05:20,552 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:53) - Starting up with /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint and /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint.meta
20 Aug 2014 04:05:20,553 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.EventQueueBackingStoreFileV3.<init>:57) - Reading checkpoint metadata from /opt/flume/trdbuy-bid-req-ch1/checkpoint/checkpoint.meta
20 Aug 2014 04:36:36,024 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.FlumeEventQueue.<init>:114) - QueueSet population inserting 22678800 took 1874849
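That "QueueSet population" line quantifies the replay cost. Assuming the logged figures are an event count and milliseconds (an assumption, but one consistent with the 04:05:20 to 04:36:36 wall-clock gap in the surrounding lines), the queue rebuild alone ran for about 31 minutes at roughly 12k inserts/second:

```python
# Back-of-envelope check on the FlumeEventQueue log line above.
# ASSUMPTION: 22678800 is an event count and 1874849 is milliseconds.
events = 22_678_800
elapsed_ms = 1_874_849

elapsed_min = elapsed_ms / 60_000.0       # ~31.2 minutes
rate = events / (elapsed_ms / 1000.0)     # ~12,096 inserts/second

print(f"{elapsed_min:.1f} min at {rate:,.0f} inserts/s")
```

The ~31 minutes matches the gap between the 04:05:20 "Reading checkpoint metadata" line and the 04:36:36 queue-population line, which supports the milliseconds reading.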
---------- Forwarded message ----------
From: Hari Shreedharan <hs...@cloudera.com>
Date: Wed, Aug 20, 2014 at 12:42 AM
Subject: Re: FileChannel Replays consistently take a long time
To: "user@flume.apache.org" <us...@flume.apache.org>
Are you running on EBS or ephemeral storage? I have seen IO being slow on
AWS when EBS with provisioned IO is not used. This might be what you are
seeing.
Also what do you see as checkpoint size when the channel starts up?
On Tue, Aug 19, 2014 at 9:18 PM, Gary Malouf <ma...@gmail.com> wrote:
Re: FileChannel Replays consistently take a long time
Posted by Hari Shreedharan <hs...@cloudera.com>.
Are you running on EBS or ephemeral storage? I have seen I/O be slow on
AWS when EBS with provisioned IOPS is not used. This might be what you are
seeing.
Also, what do you see as the checkpoint size when the channel starts up?
On Tue, Aug 19, 2014 at 9:18 PM, Gary Malouf <ma...@gmail.com> wrote:
Re: FileChannel Replays consistently take a long time
Posted by Gary Malouf <ma...@gmail.com>.
To be clear, we have this load spread across 3 EC2 instances running Flume,
so we are asking each instance individually to handle 3.3k (5k)
messages/second. With 16GB of data in the channel, I would have expected
the replay to be faster.
On Wed, Aug 20, 2014 at 12:12 AM, Gary Malouf <ma...@gmail.com> wrote:
Re: FileChannel Replays consistently take a long time
Posted by Gary Malouf <ma...@gmail.com>.
Our capacity setting is:
agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000
Our current channel size cannot be accessed because the channel is still in
this odd 'replay' mode. There are no logs, but the CPU is cranking on the
Flume node and the avro source ports have not yet opened. The pattern we
see is that after anywhere from 15-30 minutes, the ports magically open and
we can continue.
We set the capacity this high because we are logging around 10k
messages/second and did not want to lose any data during brief
interruptions.
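A capacity of 100,000,000 may itself be a large part of the slow startup. To my understanding, the FileChannel's memory-mapped checkpoint allocates a fixed-size slot per unit of configured capacity (roughly 8 bytes each; the exact per-slot size and header overhead are an assumption here), so the file scales with the configured capacity rather than with the actual backlog:

```python
# Rough estimate of the checkpoint file size implied by channel capacity.
# ASSUMPTION: ~8 bytes per capacity slot in the memory-mapped checkpoint;
# exact layout and header overhead vary by Flume version.
capacity = 100_000_000
slot_bytes = 8

size_mib = capacity * slot_bytes / 2**20   # ~763 MiB
print(f"~{size_mib:.0f} MiB checkpoint for capacity {capacity:,}")
```

If that estimate is in the right ballpark, every restart has to read and validate most of a gigabyte of checkpoint state before the ports open; sizing capacity to the real buffering requirement would shrink both the checkpoint and the replay.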
On Wed, Aug 20, 2014 at 12:02 AM, Hari Shreedharan <hshreedharan@cloudera.com> wrote:
Re: FileChannel Replays consistently take a long time
Posted by Hari Shreedharan <hs...@cloudera.com>.
How large is your channel (and how long does it take to replay?)
Re: FileChannel Replays consistently take a long time
Posted by Gary Malouf <ma...@gmail.com>.
For the record, we are using Flume 1.4.0 packaged with CDH5.0.2
On Tue, Aug 19, 2014 at 11:55 PM, Gary Malouf <ma...@gmail.com> wrote: