Posted to user@flume.apache.org by Joe Crobak <jo...@gmail.com> on 2011/08/16 16:52:42 UTC

E2E mode with decorators

According to the Flume FAQ [1], Flume acks events from the CollectorSink in
E2E mode.  If I have a Decorator running on the Collector that filters out
events (or transforms them or something), does that mean those events won't
get ACK'd and thus delivery will be retried for them indefinitely? IOW,
is E2E mode unsupported in this situation -- or is there maybe a way for me
to ACK events that I want to filter from the Decorator itself?

Thanks,
Joe


[1] https://github.com/cloudera/flume/wiki/FAQ

Re: E2E mode with decorators

Posted by Joe Crobak <jo...@gmail.com>.
Thanks, Jon.  This makes perfect sense and helps a lot.  It sounds like we
should change the decorator we've been working on so that it adds attributes
to the events rather than replacing them outright.

Joe

On Fri, Aug 19, 2011 at 3:45 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> The acks are generated from checksums of the body of events.  So if you
> augment your events with new attributes (regex, value) the acks will still
> work.  However, if you filter out events between the agentSink and the
> collectorSink, the checksums won't add up.
>
> You can, however, put filtering "after" the collector, or do filtering
> "next to" the collector.
>
> Ok because value adds attributes and does not modify the body:
> node : <source> | agentE2ESink("ip of collector");
> collector: collectorSource | value("newattr","newvalue") collectorSink("hdfs://xxxx", ...);
>
> Ok because the filter runs before the checksums are calculated:
> node : <source> | filterOutEvents agentE2ESink("ip of collector");
> collector: collectorSource | collectorSink("hdfs://xxxx", ...);
>
> Ok because the filter runs after the checksums are validated:
> node : <source> | agentE2ESink("ip of collector");
> collector: collectorSource | collector(xxx) { filterOutEvents escapedFormatDfs("hdfs://xxxx", ...) } ;
>
> Not ok -- the checksums won't work out, because events that the agent
> included in its checksum are dropped before the collectorSink calculates
> its own:
> node : <source> | agentE2ESink("ip of collector");
> collector: collectorSource | filterOutEvents collectorSink("hdfs://xxxx", ...);
>
> Does that make sense?
>
> Jon.

Re: E2E mode with decorators

Posted by Jonathan Hsieh <jo...@cloudera.com>.
The acks are generated from checksums of the body of events.  So if you
augment your events with new attributes (regex, value) the acks will still
work.  However, if you filter out events between the agentSink and the
collectorSink, the checksums won't add up.

You can, however, put filtering "after" the collector, or do filtering
"next to" the collector.
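
A minimal sketch of why this happens, assuming a simplified model in which
both ends keep a running XOR of a CRC32 over each event body. The real Flume
checksum code may differ in detail; the Event class and batch_checksum
function are hypothetical names used only for illustration.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Event:
    body: bytes
    attrs: dict = field(default_factory=dict)


def batch_checksum(events):
    """XOR together a CRC32 of each event body; attributes are ignored."""
    total = 0
    for ev in events:
        total ^= zlib.crc32(ev.body)
    return total


sent = [Event(b"GET /a"), Event(b"GET /b"), Event(b"GET /c")]
agent_sum = batch_checksum(sent)  # what the agent side would record

# A decorator that only adds an attribute leaves every body untouched.
tagged = [Event(ev.body, {**ev.attrs, "newattr": "newvalue"}) for ev in sent]

# A decorator that drops an event before the collector-side checksum.
filtered = [ev for ev in sent if ev.body != b"GET /b"]

print("attributes added:", batch_checksum(tagged) == agent_sum)
print("event filtered:  ", batch_checksum(filtered) == agent_sum)
```

In this model the tagged batch still matches the agent's value, while the
filtered batch does not, so the agent would never see a matching ack for it.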

Ok because value adds attributes and does not modify the body:
node : <source> | agentE2ESink("ip of collector");
collector: collectorSource | value("newattr","newvalue") collectorSink("hdfs://xxxx", ...);

Ok because the filter runs before the checksums are calculated:
node : <source> | filterOutEvents agentE2ESink("ip of collector");
collector: collectorSource | collectorSink("hdfs://xxxx", ...);

Ok because the filter runs after the checksums are validated:
node : <source> | agentE2ESink("ip of collector");
collector: collectorSource | collector(xxx) { filterOutEvents escapedFormatDfs("hdfs://xxxx", ...) } ;

Not ok -- the checksums won't work out, because events that the agent
included in its checksum are dropped before the collectorSink calculates
its own:
node : <source> | agentE2ESink("ip of collector");
collector: collectorSource | filterOutEvents collectorSink("hdfs://xxxx", ...);
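
The filter placements above can be compared with a small simulation. This is
only a sketch under assumptions: "inject" stands in for the point where the
agent-side checksum is taken, "validate" for the point where the collector
side recomputes it, and the XOR-of-CRC32 checksum and the keep() filter are
hypothetical stand-ins rather than Flume's actual code.

```python
import zlib


def checksum(bodies):
    """Running XOR of CRC32 over event bodies (a stand-in checksum)."""
    total = 0
    for body in bodies:
        total ^= zlib.crc32(body)
    return total


def keep(body):
    # Hypothetical filter: drop debug lines.
    return not body.startswith(b"DEBUG")


def run(stages, bodies):
    """Apply the stages in order; return True if both checksums agree."""
    injected = validated = None
    for stage in stages:
        if stage == "filter":
            bodies = [b for b in bodies if keep(b)]
        elif stage == "inject":      # agent side takes its checksum
            injected = checksum(bodies)
        elif stage == "validate":    # collector side recomputes it
            validated = checksum(bodies)
    return injected == validated


bodies = [b"INFO start", b"DEBUG noise", b"INFO stop"]
layouts = {
    "filter on agent": ["filter", "inject", "validate"],
    "filter inside collector": ["inject", "validate", "filter"],
    "filter before collectorSink": ["inject", "filter", "validate"],
}
for name, stages in layouts.items():
    result = "acked" if run(stages, bodies) else "checksum mismatch"
    print(name, "->", result)
```

Only the last layout, where the filter sits between the two checksum points,
ends in a mismatch in this model.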

Does that make sense?

Jon.


On Wed, Aug 17, 2011 at 2:39 AM, Bao Thai Ngo <ba...@gmail.com> wrote:

> Hi,
>
> As far as I understand, the ACK mechanism should work regardless of any
> decorator deployed at the Collector, as Mingjie said. I developed and
> deployed several plug-ins (decorators) that filter out events on the
> Collector side, and they work well with ACK. Another thing I can suggest:
> do not try to build event ACK handling into your decorator.
>
> @Felix: Some advantages of deploying a decorator on the collector side are:
> - no dependency on the agent side
> - we can collect the data we need and save the other data for future needs
> (what we need is just a small part of a very large data set)
>
> Just my 2 cents.
>
> ~Thai

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: E2E mode with decorators

Posted by Bao Thai Ngo <ba...@gmail.com>.
Hi,

As far as I understand, the ACK mechanism should work regardless of any
decorator deployed at the Collector, as Mingjie said. I developed and
deployed several plug-ins (decorators) that filter out events on the
Collector side, and they work well with ACK. Another thing I can suggest:
do not try to build event ACK handling into your decorator.

@Felix: Some advantages of deploying a decorator on the collector side are:
- no dependency on the agent side
- we can collect the data we need and save the other data for future needs
(what we need is just a small part of a very large data set)

Just my 2 cents.

~Thai


Re: E2E mode with decorators

Posted by Mingjie Lai <mj...@gmail.com>.
 > If I have a Decorator running on the
 > Collector that filters out events (or transforms them or something),
 > does that mean those events won't get ACK'd and thus delivery will
 > be retried for them indefinitely?

Ack of e2e mode should work regardless of any decorator configured on the
collector side. You can verify this with a simple example.

[flume localhost:35873:45678] getconfigs
NODE        FLOW          SOURCE                  SINK
collector1  default-flow  collectorSource(35853)  collector(10000) {regex("(.*)\\t(.*)\\t(.*)\\t", 1, "ip")  console}
agent1      default-flow  tail("/tmp/full.log")   agentE2ESink("localhost", 35853)

I didn't get a chance to look at the source code, but I think the e2e ack
should be passed through the whole collector path as an attribute, and it
won't be affected by any decorator (unless you modify the attribute
intentionally).
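
For illustration, a rough Python analogue of what the regex decorator in this
configuration appears to do, assuming its arguments mean (pattern, capture
group index, attribute name) and that it leaves the event body alone. The
dict-based event and the regex_decorator function are hypothetical, not
Flume's API.

```python
import re

# Same pattern as in the config, without the config-string escaping.
PATTERN = re.compile(r"(.*)\t(.*)\t(.*)\t")


def regex_decorator(event, pattern, group, attr):
    """Copy one capture group into an attribute; leave the body untouched."""
    match = pattern.search(event["body"].decode("utf-8", "replace"))
    if match:
        event["attrs"][attr] = match.group(group)
    return event


event = {"body": b"10.0.0.1\talice\tGET /index\t200", "attrs": {}}
before = event["body"]
regex_decorator(event, PATTERN, 1, "ip")

print(event["attrs"])
print("body unchanged:", event["body"] == before)
```

A decorator of this kind only adds an attribute, so this setup does not by
itself show what happens when a decorator drops or rewrites events.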

Correct me if my observation is wrong.

-mingjie


Re: E2E mode with decorators

Posted by Joe Crobak <jo...@gmail.com>.
On Tue, Aug 16, 2011 at 10:59 AM, Felix Giguere Villegas <
felix.giguere@mate1inc.com> wrote:

> Maybe I'm missing something, but why don't you put your filtering Decorator
> on the agent/source instead?
>
> What's the point of sending those events all the way to the CollectorSink
> if they're going to be filtered out in the end? The only reason I can see is
> if the only place where you can easily determine which events to filter out
> IS at the end of the flow, but I can't think of a reason why that would be
> the case...
>

Good question - this might be worth investigating.  The other reason I can
think of is if you're forwarding data from your collector to multiple sinks
-- e.g. both HDFS and HBase, but perhaps filtering out some of the data for
one or the other (we're not doing this, and I guess it doesn't work in E2E
mode according to FLUME-165 anyway).

I simplified our situation in my example -- we're really doing a
lightweight ETL with an in-memory aggregation. So for each event that comes
in, zero or more events might come out -- it's not just filtering, and it's
not one event in, one event out.
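
A minimal sketch of that kind of step, assuming a hypothetical aggregator
that counts events by their first token and emits summaries every few
inputs. The aggregate function, the flush_every parameter and the sample
bodies are made up for illustration and are not our actual decorator.

```python
from collections import Counter


def aggregate(bodies, flush_every=3):
    """Emit summary events every `flush_every` inputs.

    Most inputs produce no output at all; the input that triggers a flush
    produces one output per distinct key seen since the last flush.
    """
    counts = Counter()
    for i, body in enumerate(bodies, start=1):
        counts[body.split(b" ")[0]] += 1
        if i % flush_every == 0:
            for key, n in sorted(counts.items()):
                yield key + b"=" + str(n).encode()
            counts.clear()


incoming = [b"GET /a", b"GET /b", b"POST /c", b"GET /d"]
outgoing = list(aggregate(incoming))

print(len(incoming), "events in,", len(outgoing), "events out")
print(outgoing)
```

None of the output bodies matches an input body, and the last input is still
held in memory when the stream ends, so anything computed over the original
bodies upstream cannot be recomputed from what comes out.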


> I don't know the answer to your specific question (and I'd be curious to
> find out as well), so I'm sorry if my comment doesn't help :) ...
>
> --
> Felix

Re: E2E mode with decorators

Posted by Felix Giguere Villegas <fe...@mate1inc.com>.
Maybe I'm missing something, but why don't you put your filtering Decorator
on the agent/source instead?

What's the point of sending those events all the way to the CollectorSink if
they're going to be filtered out in the end? The only reason I can see is if
the only place where you can easily determine which events to filter out IS
at the end of the flow, but I can't think of a reason why that would be the
case...

I don't know the answer to your specific question (and I'd be curious to
find out as well), so I'm sorry if my comment doesn't help :) ...

--
Felix


