Posted to user@flume.apache.org by Liam Friel <li...@gmail.com> on 2011/08/14 14:22:41 UTC

Re: duplicates using tail source

Hi Jon,

Back from vacation.

Thanks for that explanation. It is quite possible that the roll time and
the retransmit settings are not compatible in our site settings; I'll
check that.
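
For anyone else checking the same thing: the guidance I've seen is that the
agent's retransmit timeout should be a comfortable multiple of the collector
roll interval, so events aren't re-sent while their ACKs are still in flight.
A minimal flume-site.xml sketch (flume.agent.logdir.retransmit is my reading
of the 0.9.x property name; the values are only illustrative):

  <property>
    <name>flume.collector.roll.millis</name>
    <value>300000</value>    <!-- roll collector output every 5 minutes -->
  </property>
  <property>
    <name>flume.agent.logdir.retransmit</name>
    <value>900000</value>    <!-- wait at least 3x the roll time to retry -->
  </property>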

Regards
Liam

On Mon, Jul 25, 2011 at 11:07 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> [Please subscribe to new flume-user@incubator.apache.org list, bcc
> flume-user@cloudera.org, cc flume-user@incubator.apache.org]
>
> Liam,
>
> I've written up a FAQ topic on duplicates here:
>
> https://cwiki.apache.org/confluence/display/FLUME/Troubleshooting+FAQ
>
> I believe the duplicates you are seeing with e2e mode here may be due to
> the latter two reasons: 1) the collector closing too infrequently, causing
> the agent to retry, and 2) recovered old logs being resent.
>
> Jon.
>
>
> On Wed, Jul 13, 2011 at 11:54 AM, Liam <li...@gmail.com> wrote:
>
>> Actually, a followup on this.
>>
>> I had this running using the OS tail (tail -n +0 -F ...) with an
>> autoE2EChain setup, sinking to HDFS.
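>>
>> Roughly, the config looked like this (the node names and the log path
>> are made up for illustration):
>>
>>   agent1 : exec("/usr/bin/tail -n +0 -F /var/log/app.log") | autoE2EChain;
>>   coll1 : autoCollectorSource | collectorSink("hdfs://namenode/flume/", "data-");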
>>
>> However when I looked in detail at the data in HDFS, I found that
>> there were sometimes still duplicates of the data.
>>
>> Not always, however.
>>
>> The duplicates, though, when they happened, would always be *of the
>> entire file being tailed*.
>>
>> I ran some tests and sometimes I saw the correct number of results,
>> sometimes 2x, 5x or even 10x the number of results.
>> It only happened when the file was actually growing: if the file was
>> not growing, then Flume using the OS tail quite reliably never sends
>> any data (if you see what I mean).
>>
>> And if the file was growing fast, then the number of records sent to
>> HDFS was never correct; it was always a multiple of the file's contents.
>>
>> I changed the auto failover chain to DFO, and this issue seems to have
>> gone away. Since I don't really care about E2E reliability in this
>> case, that will do for me.
>> But someone else might care ...
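>>
>> Concretely, the only change was the sink in the agent line above (same
>> made-up node name as in the sketch earlier):
>>
>>   agent1 : exec("/usr/bin/tail -n +0 -F /var/log/app.log") | autoDFOChain;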
>>
>>
>>
>> On Jul 1, 9:06 am, Liam Friel <li...@gmail.com> wrote:
>> > Yes, you do.
>> >
>> > tail -F is the same as tail --follow=name --retry, which periodically
>> > reopens the file by name, so it picks up a rotated or recreated file ...
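>> >
>> > For example (GNU tail; the log path is made up):
>> >
>> >   # --follow=descriptor tracks the open file handle, so it stops at rotation
>> >   tail --follow=descriptor /var/log/app.log
>> >   # -F follows the *name* and retries, so it survives rotation
>> >   tail -n +0 -F /var/log/app.log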
>> >
>> > On Fri, Jul 1, 2011 at 6:32 AM, Rob S. <ro...@gmail.com> wrote:
>> > > If you use the OS tail, do you still get the benefit of following
>> > > rolled files?
>> >
>> > > On Jun 30, 4:38 am, Liam Friel <li...@gmail.com> wrote:
>> > > > A previous poster suggested that instead of tail('file') I switch
>> > > > to using the OS tail.
>> >
>> > > > This fixed the problem for me.
>> >
>> > > > So instead of my source being tail(filename) my source is now:
>> >
>> > > > exec("/usr/bin/tail -n +0 -F filename")
>> >
>> > > > You will need to add appropriate command escaping, depending on
>> > > > how you are issuing the command.
>> >
>> > > > But using the OS tail made the problem go away.
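>> >
>> > > > For example, issuing the config from the flume shell, the quoting
>> > > > might look like this ('exec config' is the 0.9.x shell command;
>> > > > the node name is made up):
>> >
>> > > >   exec config agent1 'exec("/usr/bin/tail -n +0 -F filename")' 'autoE2EChain'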
>> >
>> > > > On Thu, Jun 30, 2011 at 7:30 AM, giridhar addepalli
>> > > > <gi...@gmail.com> wrote:
>> > > > > Hi,
>> >
>> > > > > We are using Flume version 0.9.4.
>> >
>> > > > > We are using the following configuration in pseudo-distributed mode:
>> >
>> > > > > tail("/tmp/source.txt") | agentSink("giridhar-dev", 35853);
>> > > > > collectorSource(35853) |
>> > > > > collectorSink("file:///tmp/flume/collected", "prefix")
>> >
>> > > > > We set flume.collector.roll.millis to 300000 (5 minutes).
>> >
>> > > > > We see a lot of duplicates in the output files.
>> >
>> > > > > We went through previous posts, and it looks like this problem
>> > > > > was fixed, but although we are using the latest version of
>> > > > > Flume, duplicates are still appearing.
>> >
>> > > > > Please help.
>> >
>> > > > > Thanks,
>> > > > > Giridhar.
>>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com