Posted to user@flume.apache.org by Sagar Mehta <sa...@gmail.com> on 2013/03/04 23:42:33 UTC

Re: Flume Ng replaying events when the source is idle

Guys,

Yes, this issue was also seen with the memory channel. In fact, when we
moved to the file-based channel, we initially thought this issue wouldn't
occur since it stores checkpoints.

Anyways below are all files for collector110 [whose source didn't receive
any events] and you can see all the replays below. I have attached the
corresponding flume log file for the same day.

hadoop@jobtracker301:/home/smehta$ hls
/ngpipes-raw-logs/2013-03-03/*/collector110* |  head -5
-rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:20
/ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362270044367.gz
-rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:51
/ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362271875065.gz
-rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:21
/ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362273675770.gz
-rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:51
/ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362275476474.gz
-rw-r--r--   3 hadoop supergroup       1594 2013-03-03 02:20
/ngpipes-raw-logs/2013-03-03/0200/collector110.ngpipes.sac.ngmoco.com.1362277246704.gz

Also in the attached flume log, you can see the replay stuff I'm talking
about - Please note the source received no events during this time.

sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "Queue Size
after replay" | head
2013-03-03 00:20:44,355 [lifecycleSupervisor-1-3] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
[channel=channel1]
2013-03-03 00:20:44,356 [lifecycleSupervisor-1-4] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
[channel=channel2]
2013-03-03 00:51:14,571 [lifecycleSupervisor-1-7] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
[channel=channel2]
2013-03-03 00:51:14,577 [lifecycleSupervisor-1-1] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
[channel=channel1]
2013-03-03 01:21:15,276 [lifecycleSupervisor-1-8] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
[channel=channel2]
2013-03-03 01:21:15,281 [lifecycleSupervisor-1-7] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
[channel=channel1]
2013-03-03 01:51:15,979 [lifecycleSupervisor-1-9] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
[channel=channel2]
2013-03-03 01:51:15,985 [lifecycleSupervisor-1-5] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
[channel=channel1]
2013-03-03 02:20:46,697 [lifecycleSupervisor-1-2] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
[channel=channel1]
2013-03-03 02:20:46,697 [lifecycleSupervisor-1-8] INFO
 org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
[channel=channel2]

As for the contents of the file, yes they are exactly the same 10 lines of
events replayed over and over - I checked that.

Let me know if you guys have any insights into this or if this is a bug in
Flume Ng.

Sagar


On Thu, Feb 28, 2013 at 2:59 PM, Hari Shreedharan <hshreedharan@cloudera.com
> wrote:

>  Can you also send the flume agent logs? Did you check the contents of the
> files?
>
> --
> Hari Shreedharan
>
> On Thursday, February 28, 2013 at 2:43 PM, Roshan Naik wrote:
>
> would you be able to verify if the same problem can be reproduced by
> using the memory channel instead in a test setup?
>
>
> On Wed, Feb 27, 2013 at 11:37 AM, Sagar Mehta <sa...@gmail.com>wrote:
>
> Hi Guys,
>
> I'm using Flume-Ng and it is working pretty well except for a weird
> situation I observed lately. In essence, I'm using an exec source that does
> tail -F on a logfile, with two HDFS sinks and a file channel.
>
> However I have observed that when the source [ logfile of a jetty based
> collector] is idle - that is no new events are pushed to the logFile,
> FlumeNg seems to replay the same set of events.
>
> For example, collector110 received no events for 2 consecutive hours and
> below are the corresponding Flume written files at the HDFS sink
>
> hadoop@jobtracker301:/home/hadoop/sagar$ hls
> /ngpipes-raw-logs/2013-02-27/1400/collector110*
> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:20
> /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:50
> /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
>
> hadoop@jobtracker301:/home/hadoop/sagar$ hls
> /ngpipes-raw-logs/2013-02-27/1500/collector110*
> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:20
> /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:50
> /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
>
> hadoop@jobtracker301:/home/hadoop/sagar$ md5sum *
> c7360ef5c8deaee3ce9f4c92e9d9be63
>  collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
> c7360ef5c8deaee3ce9f4c92e9d9be63
>  collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
> c7360ef5c8deaee3ce9f4c92e9d9be63
>  collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
> c7360ef5c8deaee3ce9f4c92e9d9be63
>  collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
>
> As you can see above the md5sums match.
>
> I'm using a File channel which has checkpoints, so I'm not sure what is
> going on. Btw looks like the difference in timestamps of the two replays is
> exactly 30 mins.
>
> *Is this a known bug or am I missing something?*
>
> *Below is my Flume config file*
>
> smehta@collector110:/opt/flume/conf$ cat hdfs.conf
> # An hdfs sink to write events to the hdfs on the test cluster
> # A memory based channel to connect the above source and sink
>
> # Name the components on this agent
> collector110.sources = source1
> collector110.sinks = sink1 sink2
> collector110.channels = channel1 channel2
>
> # Configure the source
> collector110.sources.source1.type = exec
> collector110.sources.source1.command = tail -F /opt/jetty/logFile.log
>
> # Configure the interceptors
> collector110.sources.source1.interceptors = TimestampInterceptor
> HostInterceptor
>
> # We use the Timestamp interceptor to get timestamps of when flume
> receives events
> # This is used for figuring out the bucket to which an event goes
> collector110.sources.source1.interceptors.TimestampInterceptor.type =
> timestamp
>
> # We use the Host interceptor to populate the host header with the fully
> qualified domain name of the collector.
> # That way we know which file in the sink represents which collector.
> collector110.sources.source1.interceptors.HostInterceptor.type =
> org.apache.flume.interceptor.HostInterceptor$Builder
> collector110.sources.source1.interceptors.HostInterceptor.preserveExisting
> = false
> collector110.sources.source1.interceptors.HostInterceptor.useIP = false
> collector110.sources.source1.interceptors.HostInterceptor.hostHeader = host
>
>
> # Configure the sink
>
> collector110.sinks.sink1.type = hdfs
>
> # Configure the bucketing
> collector110.sinks.sink1.hdfs.path=hdfs://
> namenode3001.ngpipes.milp.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00
>
> # Prefix the file with the source so that we know where the events in the
> file came from
> collector110.sinks.sink1.hdfs.filePrefix = %{host}
>
> # We roll the flume output file based on time interval - currently every 5
> minutes
> collector110.sinks.sink1.hdfs.rollSize = 0
> collector110.sinks.sink1.hdfs.rollCount = 0
> collector110.sinks.sink1.hdfs.rollInterval = 300
>
> #gzip compression related settings
> collector110.sinks.sink1.hdfs.codeC = gzip
> collector110.sinks.sink1.hdfs.fileType = CompressedStream
> collector110.sinks.sink1.hdfs.fileSuffix = .gz
>
> # Configure the sink
>
> collector110.sinks.sink2.type = hdfs
>
> # Configure the bucketing
> collector110.sinks.sink2.hdfs.path=hdfs://
> namenode5001.ngpipes.sac.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00
>
> # Prefix the file with the source so that we know where the events in the
> file came from
> collector110.sinks.sink2.hdfs.filePrefix = %{host}
>
> # We roll the flume output file based on time interval - currently every 5
> minutes
> collector110.sinks.sink2.hdfs.rollSize = 0
> collector110.sinks.sink2.hdfs.rollCount = 0
> collector110.sinks.sink2.hdfs.rollInterval = 300
>
> #gzip compression related settings
> collector110.sinks.sink2.hdfs.codeC = gzip
> collector110.sinks.sink2.hdfs.fileType = CompressedStream
> collector110.sinks.sink2.hdfs.fileSuffix = .gz
>
> # Configure the channel that connects the source to the sink
>
> # Use a channel which buffers events in filesystem
> collector110.channels.channel1.type = file
> collector110.channels.channel1.checkpointDir =
> /data/flume_data/channel1/checkpoint
> collector110.channels.channel1.dataDirs = /data/flume_data/channel1/data
>
> # Use a channel which buffers events in filesystem
> collector110.channels.channel2.type = file
> collector110.channels.channel2.checkpointDir =
> /data/flume_data/channel2/checkpoint
> collector110.channels.channel2.dataDirs = /data/flume_data/channel2/data
>
> # Bind the source and sink to the channel configured above
> collector110.sources.source1.channels = channel1 channel2
> collector110.sinks.sink1.channel = channel1
> collector110.sinks.sink2.channel = channel2
>
> Sagar
>
>
>
>
>

Re: Flume Ng replaying events when the source is idle

Posted by Sagar Mehta <sa...@gmail.com>.
Hi Guys,

So I disabled puppet on one of the boxes and, yes, there have been no replays.

So essentially, to summarize -

   - Puppet was rewriting the flume config file [expanding a template] every
   30 mins.
   - This caused Flume to reload the configuration.
   - As a result, the exec source was re-executed, which caused the replay
   of events since the logFile had not changed. [From Flume's point of view
   these are still new events :)]

*@Mike - Your suggestion totally makes sense and I'm going to try that now
in a test environment. Do you recommend it for production use?*
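
For reference, here is a rough sketch of the change I plan to test. Only the
exec source command changes, and this assumes GNU tail, where -n 0 makes it
start following from the end of the file instead of first printing its last
10 lines:

# Configure the source - start tailing from the end of the file so that a
# config reload does not re-emit the tail of an idle logFile
collector110.sources.source1.type = exec
collector110.sources.source1.command = tail -F -n 0 /opt/jetty/logFile.log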

Sagar

On Mon, Mar 4, 2013 at 7:10 PM, Mike Percy <mp...@apache.org> wrote:

> Sagar,
> Just try "tail -F" on the same file over and over on the command line. It
> will display the last few lines.
>
> If you want to avoid this, try "tail -F -n 0 filename" and you should not
> see this. Every time you reload your configuration file, the specified
> command is re-executed by the source.
>
> Regards,
> Mike
>
>
>
> On Mon, Mar 4, 2013 at 4:13 PM, Hari Shreedharan <
> hshreedharan@cloudera.com> wrote:
>
>>  Flume will reload the configuration file every time it is modified.
>> Since puppet rewrites it, Flume reloads it. The events are probably
>> replayed because of the transactions being incomplete or something like
>> that. File Channel will not replay the events if they have been completely
>> persisted to HDFS and the transaction closed. If puppet does not rewrite the
>> config file, do you see this issue?
>>
>> --
>> Hari Shreedharan
>>
>> On Monday, March 4, 2013 at 3:06 PM, Sagar Mehta wrote:
>>
>> I think we found the issue - not sure if this is the root cause, but it
>> looks highly correlated.
>>
>> So we manage configs using puppet, which currently runs in cron mode with
>> the following configuration:
>>
>> ## puppetrun Cron Job
>> 20,50 * * * *  root sleep $((RANDOM\%60)) > /dev/null 2>&1; puppet agent
>> --onetime --no-daemonize --logdest syslog > /dev/null 2>&1
>>
>>  *Note - the times at which puppet is run along with the time-stamps in
>> the listing below.*
>>
>> Also after combing through flume logs, we noticed Flume is reloading the
>> configuration after every puppet run
>>
>> sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "reloading" |
>> head -5
>> 2013-03-03 00:20:44,174 [conf-file-poller-0] INFO
>>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
>> Reloading configuration file:/opt/flume/conf/hdfs.conf
>> 2013-03-03 00:51:14,374 [conf-file-poller-0] INFO
>>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
>> Reloading configuration file:/opt/flume/conf/hdfs.conf
>> 2013-03-03 01:21:15,072 [conf-file-poller-0] INFO
>>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
>> Reloading configuration file:/opt/flume/conf/hdfs.conf
>> 2013-03-03 01:51:15,778 [conf-file-poller-0] INFO
>>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
>> Reloading configuration file:/opt/flume/conf/hdfs.conf
>> 2013-03-03 02:20:46,481 [conf-file-poller-0] INFO
>>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
>> Reloading configuration file:/opt/flume/conf/hdfs.conf
>>
>> With our current setup, the flume config file,
>> namely /opt/flume/conf/hdfs.conf, is re-written after every puppet run due
>> to variable interpolation in the template.
>>
>>  *We are still not sure what is causing Flume to reload the config file,
>> and even if the file is reloaded, why are the same events getting replayed?
>> [The state should be saved somewhere on disk - that's what the file channel
>> is for, I thought.]*
>>
>> Any pointers/insights appreciated.
>>
>> Sagar
>>
>>
>> On Mon, Mar 4, 2013 at 2:42 PM, Sagar Mehta <sa...@gmail.com> wrote:
>>
>> Guys,
>>
>> Yes, this issue was also seen with the memory channel. In fact, when we
>> moved to the file-based channel, we initially thought this issue wouldn't
>> occur since it stores checkpoints.
>>
>> Anyways below are all files for collector110 [whose source didn't receive
>> any events] and you can see all the replays below. I have attached the
>> corresponding flume log file for the same day.
>>
>> hadoop@jobtracker301:/home/smehta$ hls
>> /ngpipes-raw-logs/2013-03-03/*/collector110* |  head -5
>> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:20
>> /ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362270044367.gz
>> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:51
>> /ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362271875065.gz
>> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:21
>> /ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362273675770.gz
>> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:51
>> /ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362275476474.gz
>> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 02:20
>> /ngpipes-raw-logs/2013-03-03/0200/collector110.ngpipes.sac.ngmoco.com.1362277246704.gz
>>
>> Also in the attached flume log, you can see the replay stuff I'm talking
>> about - Please note the source received no events during this time.
>>
>> sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "Queue Size
>> after replay" | head
>> 2013-03-03 00:20:44,355 [lifecycleSupervisor-1-3] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
>> [channel=channel1]
>> 2013-03-03 00:20:44,356 [lifecycleSupervisor-1-4] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
>> [channel=channel2]
>> 2013-03-03 00:51:14,571 [lifecycleSupervisor-1-7] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
>> [channel=channel2]
>> 2013-03-03 00:51:14,577 [lifecycleSupervisor-1-1] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
>> [channel=channel1]
>> 2013-03-03 01:21:15,276 [lifecycleSupervisor-1-8] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
>> [channel=channel2]
>> 2013-03-03 01:21:15,281 [lifecycleSupervisor-1-7] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
>> [channel=channel1]
>> 2013-03-03 01:51:15,979 [lifecycleSupervisor-1-9] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
>> [channel=channel2]
>> 2013-03-03 01:51:15,985 [lifecycleSupervisor-1-5] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
>> [channel=channel1]
>> 2013-03-03 02:20:46,697 [lifecycleSupervisor-1-2] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
>> [channel=channel1]
>> 2013-03-03 02:20:46,697 [lifecycleSupervisor-1-8] INFO
>>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
>> [channel=channel2]
>>
>> As for the contents of the file, yes they are exactly the same 10 lines
>> of events replayed over and over - I checked that.
>>
>> Let me know if you guys have any insights into this or if this is a bug
>> in Flume Ng.
>>
>> Sagar
>>
>>
>> On Thu, Feb 28, 2013 at 2:59 PM, Hari Shreedharan <
>> hshreedharan@cloudera.com> wrote:
>>
>>  Can you also send the flume agent logs? Did you check the contents of the
>> files?
>>
>> --
>> Hari Shreedharan
>>
>> On Thursday, February 28, 2013 at 2:43 PM, Roshan Naik wrote:
>>
>> would you be able to verify if the same problem can be reproduced by
>> using the memory channel instead in a test setup?
>>
>>
>> On Wed, Feb 27, 2013 at 11:37 AM, Sagar Mehta <sa...@gmail.com>wrote:
>>
>> Hi Guys,
>>
>> I'm using Flume-Ng and it is working pretty well except for a weird
>> situation I observed lately. In essence, I'm using an exec source that does
>> tail -F on a logfile, with two HDFS sinks and a file channel.
>>
>> However I have observed that when the source [ logfile of a jetty based
>> collector] is idle - that is no new events are pushed to the logFile,
>> FlumeNg seems to replay the same set of events.
>>
>> For example, collector110 received no events for 2 consecutive hours and
>> below are the corresponding Flume written files at the HDFS sink
>>
>> hadoop@jobtracker301:/home/hadoop/sagar$ hls
>> /ngpipes-raw-logs/2013-02-27/1400/collector110*
>> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:20
>> /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
>> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:50
>> /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
>>
>> hadoop@jobtracker301:/home/hadoop/sagar$ hls
>> /ngpipes-raw-logs/2013-02-27/1500/collector110*
>> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:20
>> /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
>> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:50
>> /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
>>
>> hadoop@jobtracker301:/home/hadoop/sagar$ md5sum *
>> c7360ef5c8deaee3ce9f4c92e9d9be63
>>  collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
>> c7360ef5c8deaee3ce9f4c92e9d9be63
>>  collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
>> c7360ef5c8deaee3ce9f4c92e9d9be63
>>  collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
>> c7360ef5c8deaee3ce9f4c92e9d9be63
>>  collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
>>
>> As you can see above the md5sums match.
>>
>> I'm using a File channel which has checkpoints, so I'm not sure what is
>> going on. Btw looks like the difference in timestamps of the two replays is
>> exactly 30 mins.
>>
>> *Is this a known bug or am I missing something?*
>>
>> *Below is my Flume config file*
>>
>> smehta@collector110:/opt/flume/conf$ cat hdfs.conf
>> # An hdfs sink to write events to the hdfs on the test cluster
>> # A memory based channel to connect the above source and sink
>>
>> # Name the components on this agent
>> collector110.sources = source1
>> collector110.sinks = sink1 sink2
>> collector110.channels = channel1 channel2
>>
>> # Configure the source
>> collector110.sources.source1.type = exec
>> collector110.sources.source1.command = tail -F /opt/jetty/logFile.log
>>
>> # Configure the interceptors
>> collector110.sources.source1.interceptors = TimestampInterceptor
>> HostInterceptor
>>
>> # We use the Timestamp interceptor to get timestamps of when flume
>> receives events
>> # This is used for figuring out the bucket to which an event goes
>> collector110.sources.source1.interceptors.TimestampInterceptor.type =
>> timestamp
>>
>> # We use the Host interceptor to populate the host header with the fully
>> qualified domain name of the collector.
>> # That way we know which file in the sink represents which collector.
>> collector110.sources.source1.interceptors.HostInterceptor.type =
>> org.apache.flume.interceptor.HostInterceptor$Builder
>> collector110.sources.source1.interceptors.HostInterceptor.preserveExisting
>> = false
>> collector110.sources.source1.interceptors.HostInterceptor.useIP = false
>> collector110.sources.source1.interceptors.HostInterceptor.hostHeader =
>> host
>>
>>
>> # Configure the sink
>>
>> collector110.sinks.sink1.type = hdfs
>>
>> # Configure the bucketing
>> collector110.sinks.sink1.hdfs.path=hdfs://
>> namenode3001.ngpipes.milp.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00
>>
>> # Prefix the file with the source so that we know where the events in the
>> file came from
>> collector110.sinks.sink1.hdfs.filePrefix = %{host}
>>
>> # We roll the flume output file based on time interval - currently every
>> 5 minutes
>> collector110.sinks.sink1.hdfs.rollSize = 0
>> collector110.sinks.sink1.hdfs.rollCount = 0
>> collector110.sinks.sink1.hdfs.rollInterval = 300
>>
>> #gzip compression related settings
>> collector110.sinks.sink1.hdfs.codeC = gzip
>> collector110.sinks.sink1.hdfs.fileType = CompressedStream
>> collector110.sinks.sink1.hdfs.fileSuffix = .gz
>>
>> # Configure the sink
>>
>> collector110.sinks.sink2.type = hdfs
>>
>> # Configure the bucketing
>> collector110.sinks.sink2.hdfs.path=hdfs://
>> namenode5001.ngpipes.sac.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00
>>
>> # Prefix the file with the source so that we know where the events in the
>> file came from
>> collector110.sinks.sink2.hdfs.filePrefix = %{host}
>>
>> # We roll the flume output file based on time interval - currently every
>> 5 minutes
>> collector110.sinks.sink2.hdfs.rollSize = 0
>> collector110.sinks.sink2.hdfs.rollCount = 0
>> collector110.sinks.sink2.hdfs.rollInterval = 300
>>
>> #gzip compression related settings
>> collector110.sinks.sink2.hdfs.codeC = gzip
>> collector110.sinks.sink2.hdfs.fileType = CompressedStream
>> collector110.sinks.sink2.hdfs.fileSuffix = .gz
>>
>> # Configure the channel that connects the source to the sink
>>
>> # Use a channel which buffers events in filesystem
>> collector110.channels.channel1.type = file
>> collector110.channels.channel1.checkpointDir =
>> /data/flume_data/channel1/checkpoint
>> collector110.channels.channel1.dataDirs = /data/flume_data/channel1/data
>>
>> # Use a channel which buffers events in filesystem
>> collector110.channels.channel2.type = file
>> collector110.channels.channel2.checkpointDir =
>> /data/flume_data/channel2/checkpoint
>> collector110.channels.channel2.dataDirs = /data/flume_data/channel2/data
>>
>> # Bind the source and sink to the channel configured above
>> collector110.sources.source1.channels = channel1 channel2
>> collector110.sinks.sink1.channel = channel1
>> collector110.sinks.sink2.channel = channel2
>>
>> Sagar
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: Flume Ng replaying events when the source is idle

Posted by Mike Percy <mp...@apache.org>.
Sagar,
Just try "tail -F" on the same file over and over on the command line. It
will display the last few lines.

If you want to avoid this, try "tail -F -n 0 filename" and you should not
see this. Every time you reload your configuration file, the specified
command is re-executed by the source.
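
For example (assuming GNU tail - by default it prints the last 10 lines of
the file before following, while -n 0 suppresses that and only emits lines
appended after it starts):

  tail -F /opt/jetty/logFile.log        # prints the last 10 lines each time it starts
  tail -F -n 0 /opt/jetty/logFile.log   # prints nothing until new lines are appended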

Regards,
Mike



On Mon, Mar 4, 2013 at 4:13 PM, Hari Shreedharan
<hs...@cloudera.com>wrote:

>  Flume will reload the configuration file every time it is modified. Since
> puppet rewrites it, Flume reloads it. The events are probably replayed
> because of the transactions being incomplete or something like that. File
> Channel will not replay the events if they have been completely persisted
> to HDFS and the transaction closed. If puppet does not rewrite the config file,
> do you see this issue?
>
> --
> Hari Shreedharan
>
> On Monday, March 4, 2013 at 3:06 PM, Sagar Mehta wrote:
>
> I think we found the issue - not sure if this is the root cause, but it
> looks highly correlated.
>
> So we manage configs using puppet, which currently runs in cron mode with
> the following configuration:
>
> ## puppetrun Cron Job
> 20,50 * * * *  root sleep $((RANDOM\%60)) > /dev/null 2>&1; puppet agent
> --onetime --no-daemonize --logdest syslog > /dev/null 2>&1
>
>  *Note - the times at which puppet is run along with the time-stamps in
> the listing below.*
>
> Also after combing through flume logs, we noticed Flume is reloading the
> configuration after every puppet run
>
> sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "reloading" |
> head -5
> 2013-03-03 00:20:44,174 [conf-file-poller-0] INFO
>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
> Reloading configuration file:/opt/flume/conf/hdfs.conf
> 2013-03-03 00:51:14,374 [conf-file-poller-0] INFO
>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
> Reloading configuration file:/opt/flume/conf/hdfs.conf
> 2013-03-03 01:21:15,072 [conf-file-poller-0] INFO
>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
> Reloading configuration file:/opt/flume/conf/hdfs.conf
> 2013-03-03 01:51:15,778 [conf-file-poller-0] INFO
>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
> Reloading configuration file:/opt/flume/conf/hdfs.conf
> 2013-03-03 02:20:46,481 [conf-file-poller-0] INFO
>  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
> Reloading configuration file:/opt/flume/conf/hdfs.conf
>
> With our current setup, the flume config file,
> namely /opt/flume/conf/hdfs.conf, is re-written after every puppet run due
> to variable interpolation in the template.
>
>  *We are still not sure what is causing Flume to reload the config file,
> and even if the file is reloaded, why are the same events getting replayed?
> [The state should be saved somewhere on disk - that's what the file channel
> is for, I thought.]*
>
> Any pointers/insights appreciated.
>
> Sagar
>
>
> On Mon, Mar 4, 2013 at 2:42 PM, Sagar Mehta <sa...@gmail.com> wrote:
>
> Guys,
>
> Yes, this issue was also seen with the memory channel. In fact, when we
> moved to the file-based channel, we initially thought this issue wouldn't
> occur since it stores checkpoints.
>
> Anyways below are all files for collector110 [whose source didn't receive
> any events] and you can see all the replays below. I have attached the
> corresponding flume log file for the same day.
>
> hadoop@jobtracker301:/home/smehta$ hls
> /ngpipes-raw-logs/2013-03-03/*/collector110* |  head -5
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:20
> /ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362270044367.gz
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:51
> /ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362271875065.gz
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:21
> /ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362273675770.gz
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:51
> /ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362275476474.gz
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 02:20
> /ngpipes-raw-logs/2013-03-03/0200/collector110.ngpipes.sac.ngmoco.com.1362277246704.gz
>
> Also in the attached flume log, you can see the replay stuff I'm talking
> about - Please note the source received no events during this time.
>
> sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "Queue Size
> after replay" | head
> 2013-03-03 00:20:44,355 [lifecycleSupervisor-1-3] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 00:20:44,356 [lifecycleSupervisor-1-4] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel2]
> 2013-03-03 00:51:14,571 [lifecycleSupervisor-1-7] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
> [channel=channel2]
> 2013-03-03 00:51:14,577 [lifecycleSupervisor-1-1] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 01:21:15,276 [lifecycleSupervisor-1-8] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
> [channel=channel2]
> 2013-03-03 01:21:15,281 [lifecycleSupervisor-1-7] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 01:51:15,979 [lifecycleSupervisor-1-9] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
> [channel=channel2]
> 2013-03-03 01:51:15,985 [lifecycleSupervisor-1-5] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 02:20:46,697 [lifecycleSupervisor-1-2] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 02:20:46,697 [lifecycleSupervisor-1-8] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel2]
>
> As for the contents of the file, yes they are exactly the same 10 lines of
> events replayed over and over - I checked that.
>
> Let me know if you guys have any insights into this or if this is a bug in
> Flume Ng.
>
> Sagar
>
>
> On Thu, Feb 28, 2013 at 2:59 PM, Hari Shreedharan <
> hshreedharan@cloudera.com> wrote:
>
>  Can you also send the flume agent logs? Did you check the contents of the
> files?
>
> --
> Hari Shreedharan
>
> On Thursday, February 28, 2013 at 2:43 PM, Roshan Naik wrote:
>
> would you be able to verify if the same problem can be reproduced by
> using the memory channel instead in a test setup?
>
>
> On Wed, Feb 27, 2013 at 11:37 AM, Sagar Mehta <sa...@gmail.com>wrote:
>
> Hi Guys,
>
> I'm using Flume-Ng and it is working pretty well except for a weird
> situation I observed lately. In essence, I'm using an exec source that does
> tail -F on a logfile, with two HDFS sinks and a file channel.
>
> However I have observed that when the source [ logfile of a jetty based
> collector] is idle - that is no new events are pushed to the logFile,
> FlumeNg seems to replay the same set of events.
>
> For example, collector110 received no events for 2 consecutive hours and
> below are the corresponding Flume written files at the HDFS sink
>
> hadoop@jobtracker301:/home/hadoop/sagar$ hls
> /ngpipes-raw-logs/2013-02-27/1400/collector110*
> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:20
> /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:50
> /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
>
> hadoop@jobtracker301:/home/hadoop/sagar$ hls
> /ngpipes-raw-logs/2013-02-27/1500/collector110*
> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:20
> /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:50
> /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
>
> hadoop@jobtracker301:/home/hadoop/sagar$ md5sum *
> c7360ef5c8deaee3ce9f4c92e9d9be63
>  collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
> c7360ef5c8deaee3ce9f4c92e9d9be63
>  collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
> c7360ef5c8deaee3ce9f4c92e9d9be63
>  collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
> c7360ef5c8deaee3ce9f4c92e9d9be63
>  collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
>
> As you can see above the md5sums match.
>
> I'm using a File channel which has checkpoints, so I'm not sure what is
> going on. Btw looks like the difference in timestamps of the two replays is
> exactly 30 mins.
>
> *Is this a known bug or am I missing something?*
>
> *Below is my Flume config file*
>
> smehta@collector110:/opt/flume/conf$ cat hdfs.conf
> # An hdfs sink to write events to the hdfs on the test cluster
> # A memory based channel to connect the above source and sink
>
> # Name the components on this agent
> collector110.sources = source1
> collector110.sinks = sink1 sink2
> collector110.channels = channel1 channel2
>
> # Configure the source
> collector110.sources.source1.type = exec
> collector110.sources.source1.command = tail -F /opt/jetty/logFile.log
>
> # Configure the interceptors
> collector110.sources.source1.interceptors = TimestampInterceptor
> HostInterceptor
>
> # We use the Timestamp interceptor to get timestamps of when flume
> receives events
> # This is used for figuring out the bucket to which an event goes
> collector110.sources.source1.interceptors.TimestampInterceptor.type =
> timestamp
>
> # We use the Host interceptor to populate the host header with the fully
> qualified domain name of the collector.
> # That way we know which file in the sink represents which collector.
> collector110.sources.source1.interceptors.HostInterceptor.type =
> org.apache.flume.interceptor.HostInterceptor$Builder
> collector110.sources.source1.interceptors.HostInterceptor.preserveExisting
> = false
> collector110.sources.source1.interceptors.HostInterceptor.useIP = false
> collector110.sources.source1.interceptors.HostInterceptor.hostHeader = host
>
>
> # Configure the sink
>
> collector110.sinks.sink1.type = hdfs
>
> # Configure the bucketing
> collector110.sinks.sink1.hdfs.path=hdfs://
> namenode3001.ngpipes.milp.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00
>
> # Prefix the file with the source so that we know where the events in the
> file came from
> collector110.sinks.sink1.hdfs.filePrefix = %{host}
>
> # We roll the flume output file based on time interval - currently every 5
> minutes
> collector110.sinks.sink1.hdfs.rollSize = 0
> collector110.sinks.sink1.hdfs.rollCount = 0
> collector110.sinks.sink1.hdfs.rollInterval = 300
>
> #gzip compression related settings
> collector110.sinks.sink1.hdfs.codeC = gzip
> collector110.sinks.sink1.hdfs.fileType = CompressedStream
> collector110.sinks.sink1.hdfs.fileSuffix = .gz
>
> # Configure the sink
>
> collector110.sinks.sink2.type = hdfs
>
> # Configure the bucketing
> collector110.sinks.sink2.hdfs.path=hdfs://
> namenode5001.ngpipes.sac.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00
>
> # Prefix the file with the source so that we know where the events in the
> file came from
> collector110.sinks.sink2.hdfs.filePrefix = %{host}
>
> # We roll the flume output file based on time interval - currently every 5
> minutes
> collector110.sinks.sink2.hdfs.rollSize = 0
> collector110.sinks.sink2.hdfs.rollCount = 0
> collector110.sinks.sink2.hdfs.rollInterval = 300
>
> #gzip compression related settings
> collector110.sinks.sink2.hdfs.codeC = gzip
> collector110.sinks.sink2.hdfs.fileType = CompressedStream
> collector110.sinks.sink2.hdfs.fileSuffix = .gz
>
> # Configure the channel that connects the source to the sink
>
> # Use a channel which buffers events in filesystem
> collector110.channels.channel1.type = file
> collector110.channels.channel1.checkpointDir =
> /data/flume_data/channel1/checkpoint
> collector110.channels.channel1.dataDirs = /data/flume_data/channel1/data
>
> # Use a channel which buffers events in filesystem
> collector110.channels.channel2.type = file
> collector110.channels.channel2.checkpointDir =
> /data/flume_data/channel2/checkpoint
> collector110.channels.channel2.dataDirs = /data/flume_data/channel2/data
>
> # Bind the source and sink to the channel configured above
> collector110.sources.source1.channels = channel1 channel2
> collector110.sinks.sink1.channel = channel1
> collector110.sinks.sink2.channel = channel2
>
> Sagar
>
>
>
>
>
>
>
>

Re: Flume Ng replaying events when the source is idle

Posted by Hari Shreedharan <hs...@cloudera.com>.
Flume will reload the configuration file every time it is modified. Since puppet rewrites it, Flume reloads it. The events are probably replayed because of the transactions being incomplete or something like that. File Channel will not replay the events if they have been completely persisted to HDFS and the transaction closed. If puppet does not rewrite the config file, do you see this issue?

-- 
Hari Shreedharan


On Monday, March 4, 2013 at 3:06 PM, Sagar Mehta wrote:

> I think we found the issue - not sure if this is the root cause, but it looks highly correlated.
> 
> So we manage configs using puppet, which currently runs in cron mode with the following configuration:
> 
> ## puppetrun Cron Job
> 20,50 * * * *  root sleep $((RANDOM\%60)) > /dev/null 2>&1; puppet agent --onetime --no-daemonize --logdest syslog > /dev/null 2>&1
> 
> Note - the times at which puppet is run along with the time-stamps in the listing below.
> 
> Also after combing through flume logs, we noticed Flume is reloading the configuration after every puppet run 
> 
> sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "reloading" | head -5
> 2013-03-03 00:20:44,174 [conf-file-poller-0] INFO  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider - Reloading configuration file:/opt/flume/conf/hdfs.conf
> 2013-03-03 00:51:14,374 [conf-file-poller-0] INFO  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider - Reloading configuration file:/opt/flume/conf/hdfs.conf
> 2013-03-03 01:21:15,072 [conf-file-poller-0] INFO  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider - Reloading configuration file:/opt/flume/conf/hdfs.conf
> 2013-03-03 01:51:15,778 [conf-file-poller-0] INFO  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider - Reloading configuration file:/opt/flume/conf/hdfs.conf
> 2013-03-03 02:20:46,481 [conf-file-poller-0] INFO  org.apache.flume.conf.properties.PropertiesFileConfigurationProvider - Reloading configuration file:/opt/flume/conf/hdfs.conf
> 
> 
> With our current setup, the flume config file, namely /opt/flume/conf/hdfs.conf, is re-written after every puppet run due to variable interpolation in the template.
> 
> We are still not sure what is causing Flume to reload the config file, and even if the file is reloaded, why are the same events getting replayed? [The state should be saved somewhere on disk - that's what the file channel is for, I thought.]
> 
> Any pointers/insights appreciated.
> 
> Sagar
> 
> 
> On Mon, Mar 4, 2013 at 2:42 PM, Sagar Mehta <sagarmehta@gmail.com (mailto:sagarmehta@gmail.com)> wrote:
> > Guys,
> > 
> > Yes, this issue was also seen with the memory channel. In fact, when we moved to the file-based channel, we initially thought this issue wouldn't occur since it stores checkpoints.
> > 
> > Anyways below are all files for collector110 [whose source didn't receive any events] and you can see all the replays below. I have attached the corresponding flume log file for the same day.
> > 
> > hadoop@jobtracker301:/home/smehta$ hls /ngpipes-raw-logs/2013-03-03/*/collector110* |  head -5
> > -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:20 /ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362270044367.gz
> > -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:51 /ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362271875065.gz
> > -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:21 /ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362273675770.gz
> > -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:51 /ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362275476474.gz
> > -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 02:20 /ngpipes-raw-logs/2013-03-03/0200/collector110.ngpipes.sac.ngmoco.com.1362277246704.gz
> > 
> > 
> > Also in the attached flume log, you can see the replay stuff I'm talking about - Please note the source received no events during this time.
> > 
> > sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "Queue Size after replay" | head  
> > 2013-03-03 00:20:44,355 [lifecycleSupervisor-1-3] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10 [channel=channel1]
> > 2013-03-03 00:20:44,356 [lifecycleSupervisor-1-4] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10 [channel=channel2]
> > 2013-03-03 00:51:14,571 [lifecycleSupervisor-1-7] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0 [channel=channel2]
> > 2013-03-03 00:51:14,577 [lifecycleSupervisor-1-1] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10 [channel=channel1]
> > 2013-03-03 01:21:15,276 [lifecycleSupervisor-1-8] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0 [channel=channel2]
> > 2013-03-03 01:21:15,281 [lifecycleSupervisor-1-7] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10 [channel=channel1]
> > 2013-03-03 01:51:15,979 [lifecycleSupervisor-1-9] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0 [channel=channel2]
> > 2013-03-03 01:51:15,985 [lifecycleSupervisor-1-5] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10 [channel=channel1]
> > 2013-03-03 02:20:46,697 [lifecycleSupervisor-1-2] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10 [channel=channel1]
> > 2013-03-03 02:20:46,697 [lifecycleSupervisor-1-8] INFO  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10 [channel=channel2]
> > 
> > 
> > As for the contents of the file, yes they are exactly the same 10 lines of events replayed over and over - I checked that.
> > 
> > Let me know if you guys have any insights into this or if this is a bug in Flume Ng. 
> > 
> > Sagar
> > 
> > 
> > On Thu, Feb 28, 2013 at 2:59 PM, Hari Shreedharan <hshreedharan@cloudera.com (mailto:hshreedharan@cloudera.com)> wrote:
> > > Can you also send the flume agent logs? Did you check the contents of the files?
> > > 
> > > -- 
> > > Hari Shreedharan
> > > 
> > > 
> > > On Thursday, February 28, 2013 at 2:43 PM, Roshan Naik wrote:
> > > 
> > > > would you be able to verify if the same problem can be reproduced by using the memory channel instead in a test setup?
> > > > 
> > > > 
> > > > On Wed, Feb 27, 2013 at 11:37 AM, Sagar Mehta <sagarmehta@gmail.com (mailto:sagarmehta@gmail.com)> wrote:
> > > > > Hi Guys,
> > > > > 
> > > > > I'm using Flume-Ng and it is working pretty well except for a weird situation I observed lately. In essence, I'm using an exec source that does tail -F on a logfile, with two HDFS sinks and a file channel.
> > > > > 
> > > > > However I have observed that when the source [ logfile of a jetty based collector] is idle - that is no new events are pushed to the logFile, FlumeNg seems to replay the same set of events.
> > > > > 
> > > > > For example, collector110 received no events for 2 consecutive hours and below are the corresponding Flume written files at the HDFS sink
> > > > > 
> > > > > hadoop@jobtracker301:/home/hadoop/sagar$ hls /ngpipes-raw-logs/2013-02-27/1400/collector110* 
> > > > > -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:20 /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
> > > > > -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:50 /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
> > > > > 
> > > > > hadoop@jobtracker301:/home/hadoop/sagar$ hls /ngpipes-raw-logs/2013-02-27/1500/collector110*
> > > > > -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:20 /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
> > > > > -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:50 /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
> > > > > 
> > > > > hadoop@jobtracker301:/home/hadoop/sagar$ md5sum * 
> > > > > c7360ef5c8deaee3ce9f4c92e9d9be63  collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
> > > > > c7360ef5c8deaee3ce9f4c92e9d9be63  collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
> > > > > c7360ef5c8deaee3ce9f4c92e9d9be63  collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
> > > > > c7360ef5c8deaee3ce9f4c92e9d9be63  collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
> > > > > 
> > > > > 
> > > > > As you can see above the md5sums match.
> > > > > 
> > > > > I'm using a File channel which has checkpoints, so I'm not sure what is going on. Btw looks like the difference in timestamps of the two replays is exactly 30 mins. 
> > > > > 
> > > > > Is this a known bug or am I missing something?
> > > > > 
> > > > > Below is my Flume config file
> > > > > 
> > > > > smehta@collector110:/opt/flume/conf$ cat hdfs.conf 
> > > > > # An hdfs sink to write events to the hdfs on the test cluster
> > > > > # A memory based channel to connect the above source and sink
> > > > > 
> > > > > # Name the components on this agent
> > > > > collector110.sources = source1
> > > > > collector110.sinks = sink1 sink2
> > > > > collector110.channels = channel1 channel2
> > > > > 
> > > > > # Configure the source
> > > > > collector110.sources.source1.type = exec
> > > > > collector110.sources.source1.command = tail -F /opt/jetty/logFile.log
> > > > > 
> > > > > # Configure the interceptors
> > > > > collector110.sources.source1.interceptors = TimestampInterceptor HostInterceptor
> > > > > 
> > > > > # We use the Timestamp interceptor to get timestamps of when flume receives events 
> > > > > # This is used for figuring out the bucket to which an event goes
> > > > > collector110.sources.source1.interceptors.TimestampInterceptor.type = timestamp
> > > > > 
> > > > > # We use the Host interceptor to populate the host header with the fully qualified domain name of the collector. 
> > > > > # That way we know which file in the sink represents which collector.
> > > > > collector110.sources.source1.interceptors.HostInterceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
> > > > > collector110.sources.source1.interceptors.HostInterceptor.preserveExisting = false
> > > > > collector110.sources.source1.interceptors.HostInterceptor.useIP = false
> > > > > collector110.sources.source1.interceptors.HostInterceptor.hostHeader = host
> > > > > 
> > > > > 
> > > > > # Configure the sink 
> > > > > 
> > > > > collector110.sinks.sink1.type = hdfs
> > > > > 
> > > > > # Configure the bucketing
> > > > > collector110.sinks.sink1.hdfs.path=hdfs://namenode3001.ngpipes.milp.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00 (http://namenode3001.ngpipes.milp.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00)
> > > > > 
> > > > > # Prefix the file with the source so that we know where the events in the file came from
> > > > > collector110.sinks.sink1.hdfs.filePrefix = %{host}
> > > > > 
> > > > > # We roll the flume output file based on time interval - currently every 5 minutes 
> > > > > collector110.sinks.sink1.hdfs.rollSize = 0
> > > > > collector110.sinks.sink1.hdfs.rollCount = 0
> > > > > collector110.sinks.sink1.hdfs.rollInterval = 300
> > > > > 
> > > > > #gzip compression related settings 
> > > > > collector110.sinks.sink1.hdfs.codeC = gzip
> > > > > collector110.sinks.sink1.hdfs.fileType = CompressedStream
> > > > > collector110.sinks.sink1.hdfs.fileSuffix = .gz
> > > > > 
> > > > > # Configure the sink 
> > > > > 
> > > > > collector110.sinks.sink2.type = hdfs
> > > > > 
> > > > > # Configure the bucketing
> > > > > collector110.sinks.sink2.hdfs.path=hdfs://namenode5001.ngpipes.sac.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00 (http://namenode5001.ngpipes.sac.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00)
> > > > > 
> > > > > # Prefix the file with the source so that we know where the events in the file came from
> > > > > collector110.sinks.sink2.hdfs.filePrefix = %{host}
> > > > > 
> > > > > # We roll the flume output file based on time interval - currently every 5 minutes 
> > > > > collector110.sinks.sink2.hdfs.rollSize = 0
> > > > > collector110.sinks.sink2.hdfs.rollCount = 0
> > > > > collector110.sinks.sink2.hdfs.rollInterval = 300
> > > > > 
> > > > > #gzip compression related settings 
> > > > > collector110.sinks.sink2.hdfs.codeC = gzip
> > > > > collector110.sinks.sink2.hdfs.fileType = CompressedStream
> > > > > collector110.sinks.sink2.hdfs.fileSuffix = .gz
> > > > > 
> > > > > # Configure the channel that connects the source to the sink 
> > > > > 
> > > > > # Use a channel which buffers events in filesystem
> > > > > collector110.channels.channel1.type = file
> > > > > collector110.channels.channel1.checkpointDir = /data/flume_data/channel1/checkpoint
> > > > > collector110.channels.channel1.dataDirs = /data/flume_data/channel1/data
> > > > > 
> > > > > # Use a channel which buffers events in filesystem
> > > > > collector110.channels.channel2.type = file
> > > > > collector110.channels.channel2.checkpointDir = /data/flume_data/channel2/checkpoint
> > > > > collector110.channels.channel2.dataDirs = /data/flume_data/channel2/data
> > > > > 
> > > > > # Bind the source and sink to the channel configured above
> > > > > collector110.sources.source1.channels = channel1 channel2
> > > > > collector110.sinks.sink1.channel = channel1
> > > > > collector110.sinks.sink2.channel = channel2
> > > > >  
> > > > > Sagar
> > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 


Re: Flume Ng replaying events when the source is idle

Posted by Sagar Mehta <sa...@gmail.com>.
I think we found the issue - not sure if this is the root cause, but it
looks highly correlated.

So we manage configs using puppet, which currently runs in cron mode with
the following configuration:

## puppetrun Cron Job
20,50 * * * *  root sleep $((RANDOM\%60)) > /dev/null 2>&1; puppet agent
--onetime --no-daemonize --logdest syslog > /dev/null 2>&1

*Note - the times at which puppet is run along with the time-stamps in the
listing below.*

Also after combing through flume logs, we noticed Flume is reloading the
configuration after every puppet run

sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "reloading" |
head -5
2013-03-03 00:20:44,174 [conf-file-poller-0] INFO
 org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
Reloading configuration file:/opt/flume/conf/hdfs.conf
2013-03-03 00:51:14,374 [conf-file-poller-0] INFO
 org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
Reloading configuration file:/opt/flume/conf/hdfs.conf
2013-03-03 01:21:15,072 [conf-file-poller-0] INFO
 org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
Reloading configuration file:/opt/flume/conf/hdfs.conf
2013-03-03 01:51:15,778 [conf-file-poller-0] INFO
 org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
Reloading configuration file:/opt/flume/conf/hdfs.conf
2013-03-03 02:20:46,481 [conf-file-poller-0] INFO
 org.apache.flume.conf.properties.PropertiesFileConfigurationProvider -
Reloading configuration file:/opt/flume/conf/hdfs.conf

With our current setup, the flume config file,
namely /opt/flume/conf/hdfs.conf, is re-written after every puppet run due
to variable interpolation in the template.
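
One simple way to confirm this (a sketch of a check, assuming GNU coreutils)
would be to record the config file's checksum and modification time and
compare them across a puppet run:

md5sum /opt/flume/conf/hdfs.conf
stat -c '%y' /opt/flume/conf/hdfs.conf

If either changes after a puppet run even though nothing was intentionally
reconfigured, then puppet is effectively rewriting the file every time,
which would line up with the "Reloading configuration" entries above.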

*We are still not sure what is causing Flume to reload the config file, and
even if the file is reloaded, why are the same events getting replayed? [The
state should be saved somewhere on disk - that's what the file channel is
for, I thought.]*

Any pointers/insights appreciated.

Sagar


On Mon, Mar 4, 2013 at 2:42 PM, Sagar Mehta <sa...@gmail.com> wrote:

> Guys,
>
> Yes, this issue was also seen with the memory channel. In fact, when we
> moved to the file-based channel, we initially thought this issue wouldn't
> occur since it stores checkpoints.
>
> Anyways below are all files for collector110 [whose source didn't receive
> any events] and you can see all the replays below. I have attached the
> corresponding flume log file for the same day.
>
> hadoop@jobtracker301:/home/smehta$ hls
> /ngpipes-raw-logs/2013-03-03/*/collector110* |  head -5
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:20
> /ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362270044367.gz
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 00:51
> /ngpipes-raw-logs/2013-03-03/0000/collector110.ngpipes.sac.ngmoco.com.1362271875065.gz
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:21
> /ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362273675770.gz
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 01:51
> /ngpipes-raw-logs/2013-03-03/0100/collector110.ngpipes.sac.ngmoco.com.1362275476474.gz
> -rw-r--r--   3 hadoop supergroup       1594 2013-03-03 02:20
> /ngpipes-raw-logs/2013-03-03/0200/collector110.ngpipes.sac.ngmoco.com.1362277246704.gz
>
> Also in the attached flume log, you can see the replay stuff I'm talking
> about - Please note the source received no events during this time.
>
> sagar@drspock ~/temp $ cat flume.log.2013-03-03 | egrep -i "Queue Size
> after replay" | head
> 2013-03-03 00:20:44,355 [lifecycleSupervisor-1-3] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 00:20:44,356 [lifecycleSupervisor-1-4] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel2]
> 2013-03-03 00:51:14,571 [lifecycleSupervisor-1-7] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
> [channel=channel2]
> 2013-03-03 00:51:14,577 [lifecycleSupervisor-1-1] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 01:21:15,276 [lifecycleSupervisor-1-8] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
> [channel=channel2]
> 2013-03-03 01:21:15,281 [lifecycleSupervisor-1-7] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 01:51:15,979 [lifecycleSupervisor-1-9] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 0
> [channel=channel2]
> 2013-03-03 01:51:15,985 [lifecycleSupervisor-1-5] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 02:20:46,697 [lifecycleSupervisor-1-2] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel1]
> 2013-03-03 02:20:46,697 [lifecycleSupervisor-1-8] INFO
>  org.apache.flume.channel.file.FileChannel - Queue Size after replay: 10
> [channel=channel2]
>
> As for the contents of the file, yes they are exactly the same 10 lines of
> events replayed over and over - I checked that.
>
> Let me know if you guys have any insights into this or if this is a bug in
> Flume Ng.
>
> Sagar
>
>
> On Thu, Feb 28, 2013 at 2:59 PM, Hari Shreedharan <
> hshreedharan@cloudera.com> wrote:
>
>>  Can you also send the flume agent logs? Did you check the contents of the
>> files?
>>
>> --
>> Hari Shreedharan
>>
>> On Thursday, February 28, 2013 at 2:43 PM, Roshan Naik wrote:
>>
>> would you be able to verify if the same problem can be reproduced by
>> using the memory channel instead in a test setup?
>>
>>
>> On Wed, Feb 27, 2013 at 11:37 AM, Sagar Mehta <sa...@gmail.com>wrote:
>>
>> Hi Guys,
>>
>> I'm using Flume-Ng and it is working pretty well except for a weird
>> situation I observed lately. In essence, I'm using an exec source that does
>> tail -F on a logfile, with two HDFS sinks and a file channel.
>>
>> However I have observed that when the source [ logfile of a jetty based
>> collector] is idle - that is no new events are pushed to the logFile,
>> FlumeNg seems to replay the same set of events.
>>
>> For example, collector110 received no events for 2 consecutive hours and
>> below are the corresponding Flume written files at the HDFS sink
>>
>> hadoop@jobtracker301:/home/hadoop/sagar$ hls
>> /ngpipes-raw-logs/2013-02-27/1400/collector110*
>> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:20
>> /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
>> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 14:50
>> /ngpipes-raw-logs/2013-02-27/1400/collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
>>
>> hadoop@jobtracker301:/home/hadoop/sagar$ hls
>> /ngpipes-raw-logs/2013-02-27/1500/collector110*
>> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:20
>> /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
>> -rw-r--r--   3 hadoop supergroup        441 2013-02-27 15:50
>> /ngpipes-raw-logs/2013-02-27/1500/collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
>>
>> hadoop@jobtracker301:/home/hadoop/sagar$ md5sum *
>> c7360ef5c8deaee3ce9f4c92e9d9be63
>>  collector110.ngpipes.sac.ngmoco.com.1361974853210.gz
>> c7360ef5c8deaee3ce9f4c92e9d9be63
>>  collector110.ngpipes.sac.ngmoco.com.1361976653432.gz
>> c7360ef5c8deaee3ce9f4c92e9d9be63
>>  collector110.ngpipes.sac.ngmoco.com.1361978454123.gz
>> c7360ef5c8deaee3ce9f4c92e9d9be63
>>  collector110.ngpipes.sac.ngmoco.com.1361980254338.gz
>>
>> As you can see above the md5sums match.
>>
>> I'm using a File channel which has checkpoints, so I'm not sure what is
>> going on. Btw looks like the difference in timestamps of the two replays is
>> exactly 30 mins.
>>
>> *Is this a known bug or am I missing something?*
>>
>> *Below is my Flume config file*
>>
>> smehta@collector110:/opt/flume/conf$ cat hdfs.conf
>> # An hdfs sink to write events to the hdfs on the test cluster
>> # A memory based channel to connect the above source and sink
>>
>> # Name the components on this agent
>> collector110.sources = source1
>> collector110.sinks = sink1 sink2
>> collector110.channels = channel1 channel2
>>
>> # Configure the source
>> collector110.sources.source1.type = exec
>> collector110.sources.source1.command = tail -F /opt/jetty/logFile.log
>>
>> # Configure the interceptors
>> collector110.sources.source1.interceptors = TimestampInterceptor
>> HostInterceptor
>>
>> # We use the Timestamp interceptor to get timestamps of when flume
>> receives events
>> # This is used for figuring out the bucket to which an event goes
>> collector110.sources.source1.interceptors.TimestampInterceptor.type =
>> timestamp
>>
>> # We use the Host interceptor to populate the host header with the fully
>> qualified domain name of the collector.
>> # That way we know which file in the sink represents which collector.
>> collector110.sources.source1.interceptors.HostInterceptor.type =
>> org.apache.flume.interceptor.HostInterceptor$Builder
>> collector110.sources.source1.interceptors.HostInterceptor.preserveExisting
>> = false
>> collector110.sources.source1.interceptors.HostInterceptor.useIP = false
>> collector110.sources.source1.interceptors.HostInterceptor.hostHeader =
>> host
>>
>>
>> # Configure the sink
>>
>> collector110.sinks.sink1.type = hdfs
>>
>> # Configure the bucketing
>> collector110.sinks.sink1.hdfs.path=hdfs://
>> namenode3001.ngpipes.milp.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00
>>
>> # Prefix the file with the source so that we know where the events in the
>> file came from
>> collector110.sinks.sink1.hdfs.filePrefix = %{host}
>>
>> # We roll the flume output file based on time interval - currently every
>> 5 minutes
>> collector110.sinks.sink1.hdfs.rollSize = 0
>> collector110.sinks.sink1.hdfs.rollCount = 0
>> collector110.sinks.sink1.hdfs.rollInterval = 300
>>
>> #gzip compression related settings
>> collector110.sinks.sink1.hdfs.codeC = gzip
>> collector110.sinks.sink1.hdfs.fileType = CompressedStream
>> collector110.sinks.sink1.hdfs.fileSuffix = .gz
>>
>> # Configure the sink
>>
>> collector110.sinks.sink2.type = hdfs
>>
>> # Configure the bucketing
>> collector110.sinks.sink2.hdfs.path=hdfs://
>> namenode5001.ngpipes.sac.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00
>>
>> # Prefix the file with the source so that we know where the events in the
>> file came from
>> collector110.sinks.sink2.hdfs.filePrefix = %{host}
>>
>> # We roll the flume output file based on time interval - currently every
>> 5 minutes
>> collector110.sinks.sink2.hdfs.rollSize = 0
>> collector110.sinks.sink2.hdfs.rollCount = 0
>> collector110.sinks.sink2.hdfs.rollInterval = 300
>>
>> #gzip compression related settings
>> collector110.sinks.sink2.hdfs.codeC = gzip
>> collector110.sinks.sink2.hdfs.fileType = CompressedStream
>> collector110.sinks.sink2.hdfs.fileSuffix = .gz
>>
>> # Configure the channel that connects the source to the sink
>>
>> # Use a channel which buffers events in filesystem
>> collector110.channels.channel1.type = file
>> collector110.channels.channel1.checkpointDir =
>> /data/flume_data/channel1/checkpoint
>> collector110.channels.channel1.dataDirs = /data/flume_data/channel1/data
>>
>> # Use a channel which buffers events in filesystem
>> collector110.channels.channel2.type = file
>> collector110.channels.channel2.checkpointDir =
>> /data/flume_data/channel2/checkpoint
>> collector110.channels.channel2.dataDirs = /data/flume_data/channel2/data
>>
>> # Bind the source and sink to the channel configured above
>> collector110.sources.source1.channels = channel1 channel2
>> collector110.sinks.sink1.channel = channel1
>> collector110.sinks.sink2.channel = channel2
>>
>> Sagar
>>
>>
>>
>>
>>
>