Posted to user@flume.apache.org by Thanh Hong Dai <hd...@tma.com.vn> on 2016/07/27 11:39:43 UTC

Is it a good idea to use Flume Interceptor to process data?

Hi,

 

To give some background: We are currently buffering monitoring data into
Kafka, where each message in Kafka records several metrics at a point in
time.

For each record, we need to perform some calculations based on the metrics in
the record, append the results (there are multiple of them) to the record,
and send the resulting record to a data store (let's call it DS1). All the
data required for the calculation is encapsulated in the record, essentially
making this an embarrassingly parallel problem.

The formulas for the calculation are stored in a different data store (let's
call it DS2) and can be changed (added/deleted/modified by the user). We are
not required to react to a change immediately, but we should do so within a
reasonable time (e.g. 5 minutes).

 

Currently, we have prototyped an implementation of the data processing
described above as an Interceptor. We define the source as Kafka, the sink as
the sink for DS1, and we attach the Interceptor to the channel. As described
above, the Interceptor regularly re-reads the formulas from DS2 to pick up any
changes, and is responsible for processing the data as it comes in from Kafka.
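
In rough outline, the Interceptor does something like the sketch below (class
names, the messageType header, and the formula cache/reload details are
simplified placeholders rather than our actual code):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Illustrative enriching interceptor: applies cached formulas to each event
// and refreshes the formula cache from the formula store on a schedule.
public class EnrichingInterceptor implements Interceptor {

  // Illustrative formula abstraction: computes results and appends them to the record body.
  interface Formula {
    byte[] appendResults(byte[] body);
  }

  private final Map<String, Formula> formulas = new ConcurrentHashMap<>();
  private ScheduledExecutorService refresher;

  @Override
  public void initialize() {
    refresher = Executors.newSingleThreadScheduledExecutor();
    // Re-read the formulas from the formula store (DS2) every 5 minutes.
    refresher.scheduleAtFixedRate(this::reloadFormulas, 0, 5, TimeUnit.MINUTES);
  }

  @Override
  public Event intercept(Event event) {
    String messageType = event.getHeaders().get("messageType"); // assumed header name
    Formula formula = (messageType == null) ? null : formulas.get(messageType);
    if (formula != null) {
      event.setBody(formula.appendResults(event.getBody())); // compute and append results
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() {
    refresher.shutdown();
  }

  private void reloadFormulas() {
    // Placeholder: query the formula store (DS2) and repopulate the map here.
  }

  // Builder so the interceptor can be referenced from the agent configuration.
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new EnrichingInterceptor();
    }

    @Override
    public void configure(Context context) {
      // Placeholder: read the refresh interval and DS2 connection details here.
    }
  }
}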

 

We are aware of other streaming processing frameworks such as Spark of
Kafka. However, the implementation above is motivated by the fact that Flume
has provided reliable streaming, and we want to reuse as much code as
possible.

 

Is this usage of Flume a good idea in terms of performance and scalability?

 

Best regards,

Hong Dai Thanh.


Re: Is it a good idea to use Flume Interceptor to process data?

Posted by Gonzalo Herreros <gh...@gmail.com>.
I would avoid doing calculations in the source; that can impact the
ingestion and cause timeouts, duplicates, etc., especially for some sources
(e.g. HTTP).
However, what I have done in the past is to have a durable channel and create
a custom sink that extends a regular sink and does additional
calculations/enrichment based on custom configuration in the Flume config
file.

While not ideal, it's a very simple solution: it doesn't require extra
infrastructure, and since you do that processing after the events have been
stored in the durable channel, it doesn't impact the ingestion.
An example of this would be the Morphlines Sink for Solr.
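
In outline, such a sink can look like the sketch below (heavily simplified;
it extends AbstractSink directly rather than a concrete sink, and the
enrich/deliver methods are placeholders for your own logic):

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

// Illustrative enriching sink: takes events from the durable channel,
// applies the enrichment, then hands them to the store-specific delivery logic.
public class EnrichingSink extends AbstractSink implements Configurable {

  @Override
  public void configure(Context context) {
    // Placeholder: read enrichment settings from the Flume config file,
    // e.g. agent.sinks.enrich.formulaRefreshSeconds = 300
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        txn.commit();
        return Status.BACKOFF;        // nothing in the channel right now
      }
      enrich(event);                  // add the calculated metrics
      deliver(event);                 // write to the destination store
      txn.commit();
      return Status.READY;
    } catch (Exception e) {
      txn.rollback();                 // the event stays in the durable channel
      throw new EventDeliveryException("Enrichment/delivery failed", e);
    } finally {
      txn.close();
    }
  }

  private void enrich(Event event) {
    // Placeholder: apply the formulas to event.getBody() and append the results.
  }

  private void deliver(Event event) {
    // Placeholder: hand off to the regular sink logic (HDFS, HBase, Solr, etc.).
  }
}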

Gonzalo


Re: Is it a good idea to use Flume Interceptor to process data?

Posted by iain wright <ia...@gmail.com>.
Sounds reasonable, and agreed, that seems well below the ZK limits.

Best Regards,
Iain


Sent from my iPhone


RE: Is it a good idea to use Flume Interceptor to process data?

Posted by Thanh Hong Dai <hd...@tma.com.vn>.
If we are to use ZK, the formulae will be stored in something like this tree:

 

/formula/<message_type>[/<field>]

 

I’m not sure whether each formula should go into its own node or not, but each message type only has about 10 formulae on average, and there are around 400-500 message types. Correct me if I’m wrong, but from my reading on ZK, this shouldn’t run into the ZK node size problem.

 

The formulae are rarely changed, but we need to support such a use case when they do change. We plan to cache the formulae on heap and poll the primary source once in a while for updates.
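
If we went with watchers instead of polling, something like Apache Curator's TreeCache recipe could keep the cache in sync for us (it re-registers the watches itself, which sidesteps the gap-between-events issue from the tech note). A rough, purely illustrative sketch, not our actual code:

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.TreeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative formula cache backed by the /formula tree in ZooKeeper.
public class FormulaCache {

  private final Map<String, String> formulas = new ConcurrentHashMap<>();

  public void start(String zkConnect) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        zkConnect, new ExponentialBackoffRetry(1000, 3));
    client.start();

    TreeCache cache = new TreeCache(client, "/formula");
    cache.getListenable().addListener((c, event) -> {
      if (event.getData() == null) {
        return; // connection events etc. carry no node data
      }
      String path = event.getData().getPath(); // e.g. /formula/<message_type>/<field>
      switch (event.getType()) {
        case NODE_ADDED:
        case NODE_UPDATED:
          byte[] data = event.getData().getData();
          if (data != null) { // parent nodes may carry no payload
            formulas.put(path, new String(data, StandardCharsets.UTF_8));
          }
          break;
        case NODE_REMOVED:
          formulas.remove(path);
          break;
        default:
          break;
      }
    });
    cache.start();
    // Shutdown/cleanup of the cache and client omitted for brevity.
  }

  public String formulaFor(String path) {
    return formulas.get(path);
  }
}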

 

Best regards,

Thanh Hong.

 



Re: Is it a good idea to use Flume Interceptor to process data?

Posted by iain wright <ia...@gmail.com>.
You'll likely want to pose the ZK questions on the ZooKeeper list. I know I've seen folks have problems when receiving >1MB of data in a response, and definitely problems with >200k children of a znode.

That said, I've used it with HBase 0.94-0.98 with ~20k regions without issue; I believe region servers use watchers rather than polling.

How often do the formulas change? The doc below states there is a potential race condition or gap in events with watchers, in that you need to set an additional watcher after receiving an event.

Maybe it would be possible to use an on-heap cache, a pub/sub queue, and the DB as the source of truth? It's a pattern that has worked for us, although not in the context of Flume.

IE:
If you don't have the formula in cache go to DB (then cache it).
If you do have the formula in cache use it.
If something changes the formula, it writes to the DB and publishes a message to a topic that all agents listen on, and agents change their formula based on the published message. 

The caveat being: if an agent ever disconnects from the pub/sub topic, it should either kill itself or fall back to going to the DB every time.
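
Very roughly, the cache side of that pattern could look like the sketch below (all names are illustrative; the DB lookup and the pub/sub subscriber are stand-ins for whatever you actually use):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative cache-aside store for formulas with pub/sub invalidation.
public class CachedFormulaStore {

  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final Function<String, String> loadFromDb; // the DB is the source of truth

  public CachedFormulaStore(Function<String, String> loadFromDb) {
    this.loadFromDb = loadFromDb;
  }

  // Cache hit: use it. Cache miss: go to the DB, then remember the answer.
  public String formulaFor(String messageType) {
    return cache.computeIfAbsent(messageType, loadFromDb);
  }

  // Called when a "formula changed" message arrives on the shared pub/sub topic.
  public void invalidate(String messageType) {
    cache.remove(messageType); // the next lookup re-reads the DB
  }

  // If the subscriber ever loses its connection, drop everything:
  // the "go to the DB every time until we're sure we're current" fallback.
  public void invalidateAll() {
    cache.clear();
  }
}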

Relevant:
https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html
https://cwiki.apache.org/confluence/display/CURATOR/TN4

Sent from my iPhone


RE: Is it a good idea to use Flume Interceptor to process data?

Posted by Thanh Hong Dai <hd...@tma.com.vn>.
Hi,

 

We actually attach the Interceptor to the source, as you have said. Sorry for the confusion.

 

(I also found out that I wrote “other streaming processing frameworks such as Spark of Kafka”, which should be read as “other streaming processing frameworks such as Spark or Storm”)

 

Thanks for the suggestion about ZooKeeper. We are aware of the configuration storage functionality of ZooKeeper, but we don’t have much experience using it. Would storing around 5,000 formulas (usually simple ones, less than 100 bytes each) affect the overall performance of ZooKeeper? To detect updates, there are two approaches: poll all the formulas, or use watchers. Which approach would be better?

 

The monitoring data is not latency sensitive – the process that puts the data of the last hour into Kafka only runs at the 5th or 10th minute of the hour. We are allowed to take one more hour to process the data (which means that we can see the 8 AM data at 10 AM at the latest).

 

Best regards,

Thanh Hong.

 



Re: Is it a good idea to use Flume Interceptor to process data?

Posted by Chris Horrocks <ch...@hor.rocks>.
Some rough initial thoughts:

This is interesting but you might need to elaborate on how you've achieved attaching an interceptor to a channel (and why, in lieu of attaching it to the source):


we attach the Interceptor to the channel

Personally I'd have done this by feeding the data into Spark Streaming and keeping Flume as low-overhead as possible, particularly if it's monitoring data that's latency sensitive. For storing the calculation variables for consumption by the interceptor, I'd go with something like ZooKeeper.
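
Purely for illustration, the skeleton of such a Spark Streaming job might look like the following (this assumes the Kafka 0.10 direct stream API; broker, topic and store names are placeholders):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class EnrichmentJob {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("metric-enrichment");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(1));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "kafka:9092");        // placeholder broker list
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "metric-enrichment");

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("monitoring-metrics"), kafkaParams)); // placeholder topic

    // Each record is independent, so enrichment is a plain map; the formulas
    // would be broadcast or re-read periodically on the executors.
    stream.map(record -> enrich(record.value()))
          .foreachRDD(rdd -> rdd.foreach(enriched -> writeToStore(enriched)));

    jssc.start();
    jssc.awaitTermination();
  }

  private static String enrich(String record) {
    return record; // placeholder: apply the formulas and append the results
  }

  private static void writeToStore(String enriched) {
    // placeholder: write the enriched record to the destination store
  }
}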



--
Chris Horrocks

