Posted to user@flume.apache.org by Guillermo Ortiz <ko...@gmail.com> on 2015/08/07 12:23:16 UTC

About duplicate events and how to deal with them in Flume with interceptors.

Hi,

I would like to remove duplicates in Flume with interceptors.
The idea is to calculate an MD5 (or a similar hash) for each event and store it
in Redis or another database. I just want to check the performance cost and
find out which is the best solution for dealing with it.

As I understand it, the maximum number of events that could be duplicated
depends on the batchSize, so you would only need to store that number of keys
in your database. I don't know if Redis has something like Mongo's capped
collections for that.
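
To make it concrete, the interceptor I have in mind would be something like
the sketch below. It's just an illustration, not tested code: it assumes the
Jedis client (2.x-style API) and Apache Commons Codec on the classpath, and
the class and property names are made up. Using SET with NX and EX gives each
hash a TTL, so the key space stays bounded, which I think could play the role
of Mongo's capped collections:

// Hypothetical de-duplicating interceptor (names invented for this sketch).
// Drops any event whose body hash has already been seen within the TTL.
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import redis.clients.jedis.Jedis;

public class RedisDedupInterceptor implements Interceptor {

  private final String redisHost;
  private final int ttlSeconds;
  private Jedis jedis;

  private RedisDedupInterceptor(String redisHost, int ttlSeconds) {
    this.redisHost = redisHost;
    this.ttlSeconds = ttlSeconds;
  }

  @Override
  public void initialize() {
    // A real version would use JedisPool and handle connection errors.
    jedis = new Jedis(redisHost);
  }

  @Override
  public Event intercept(Event event) {
    // Hash only the body; headers are ignored here.
    String key = "dedup:" + DigestUtils.md5Hex(event.getBody());
    // SET ... NX EX succeeds only if the key is new; the TTL keeps the key
    // space bounded, playing the role of a capped collection.
    String reply = jedis.set(key, "1", "NX", "EX", ttlSeconds);
    return "OK".equals(reply) ? event : null;   // null drops the duplicate
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> kept = new ArrayList<Event>(events.size());
    for (Event e : events) {
      if (intercept(e) != null) {
        kept.add(e);
      }
    }
    return kept;
  }

  @Override
  public void close() {
    jedis.close();
  }

  public static class Builder implements Interceptor.Builder {
    private String redisHost;
    private int ttlSeconds;

    @Override
    public void configure(Context context) {
      // Property names are invented for this example.
      redisHost = context.getString("redisHost", "localhost");
      ttlSeconds = context.getInteger("ttlSeconds", 300);
    }

    @Override
    public Interceptor build() {
      return new RedisDedupInterceptor(redisHost, ttlSeconds);
    }
  }
}

Even with Redis on the same host, that is still one round trip per event
unless the calls are batched or pipelined, and that's exactly the performance
cost I would like to measure.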

Has anyone done something similar and measured the performance cost? Which
would be the best place to store the keys for really fast access: Mongo,
Redis, ...? I think HBase or Cassandra could be worse, since Redis or
something similar can run on the same host as Flume, so you don't lose time
on the network.
Is there any other solution for dealing with duplicates in real time?

Re: About duplicate events and how to deal with them in Flume with interceptors.

Posted by Guillermo Ortiz <ko...@gmail.com>.
I want to store them in Kafka. If I use Kafka as the channel, I guess that
doesn't fix this case.

Re: About duplicate events and how to deal with them in Flume with interceptors.

Posted by Majid Alfifi <ma...@gmail.com>.
You may also find some hints in the discussions of FLUME-2173:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/FLUME-2173

-Majid

Re: About duplicate events and how to deal with them in Flume with interceptors.

Posted by Majid Alfifi <ma...@gmail.com>.
Right, the CircularFifoQueue example won't work for your case.
What's the ultimate destination for the Flume events?

-Majid

Re: About duplicate events and how to deal with them in Flume with interceptors.

Posted by Guillermo Ortiz <ko...@gmail.com>.
Thanks for the answer. I was talking more about possible failures of a
Flume agent. There's a tiny possibility of getting duplicates even when the
source isn't producing any. It's true that they should be a really small
percentage of the data size, but if the agent crashes you could get
duplicates when you start it again.

I guess you need a third player if you want to manage this kind of
duplicate, and it's not possible to use a CircularFifoQueue in the same JVM
as Flume; that's why I thought about Redis or something similar. Ideally,
that system should be independent of Flume and highly available.

Re: About duplicate events and how to deal with them in Flume with interceptors.

Posted by Majid Alfifi <ma...@gmail.com>.
It's not clear if you are referring to duplicates that come from the source or duplicates that result from Flume itself maintaining at-least-once delivery of events.

I had a case where the source was producing duplicates but the network bandwidth was almost fully utilized by the regular de-duplicated stream, so we couldn't afford to have duplicates travel all the way to the final destination (HDFS in our case). We ultimately just used a CircularFifoQueue in a Flume interceptor. It was a good fit because, in our case, all duplicates arrived within about a 30-second window. We were receiving about 600 events per second, so a CircularFifoQueue of size 18,000, for example, was an easy way to remove duplicates, at the expense of funneling everything through a single Flume agent (a SPOF).
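
In case it's useful, the interceptor was along these lines. This is a
simplified sketch rather than our actual code; it assumes Apache Commons
Collections 4 and Commons Codec on the classpath, and the names are only
illustrative:

// Sketch of a window-based de-duplicating interceptor.
// Keeps a bounded window of recently seen event hashes.
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.collections4.queue.CircularFifoQueue;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class WindowDedupInterceptor implements Interceptor {

  // Fixed-capacity queue: once full, adding a new hash evicts the oldest one.
  private final CircularFifoQueue<String> recentHashes;

  private WindowDedupInterceptor(int windowSize) {
    recentHashes = new CircularFifoQueue<String>(windowSize);
  }

  @Override
  public void initialize() {
    // nothing to set up
  }

  @Override
  public synchronized Event intercept(Event event) {
    String hash = DigestUtils.md5Hex(event.getBody());
    if (recentHashes.contains(hash)) {
      return null;                // duplicate within the window: drop it
    }
    recentHashes.add(hash);       // a full queue evicts the oldest hash
    return event;
  }

  @Override
  public synchronized List<Event> intercept(List<Event> events) {
    List<Event> kept = new ArrayList<Event>(events.size());
    for (Event e : events) {
      if (intercept(e) != null) {
        kept.add(e);
      }
    }
    return kept;
  }

  @Override
  public void close() {
  }

  public static class Builder implements Interceptor.Builder {
    private int windowSize;

    @Override
    public void configure(Context context) {
      // e.g. 600 events/s * 30 s window => 18,000 entries
      windowSize = context.getInteger("windowSize", 18000);
    }

    @Override
    public Interceptor build() {
      return new WindowDedupInterceptor(windowSize);
    }
  }
}

The contains() check is a linear scan over the window, so this only makes
sense while the window stays small; a much larger window would call for
keeping a HashSet next to the queue for the lookups.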

However, we still saw duplicates at the final destination, either a result of Flume's architecture or occasional duplicates from the source arriving more than 30 seconds apart, but they were a very small percentage of the data size. We had a MapReduce job that removed those remaining duplicates in HDFS.

-Majid
