Posted to user@flink.apache.org by Hauke Hans <ha...@posteo.de> on 2018/01/29 10:04:40 UTC

Advice or best practices on adding metadata to stream events

Hi everyone,

I am fairly new to the world of stream processing and I was wondering about best practices for adding metadata to a stream in Flink (or in stream processing in general). Searching for examples/discussions of this topic did not yield the results I was hoping for, so I figured I should try asking here.

Imagine the following (fictional) use case: 

We have a stream of events, let's say some kind of transaction events (as in buyer/seller). Let's also say that we have different "types" of sellers, each with a specific pricing model, which is manually changed up to several times a day by account managers. These pricing models are stored in a SQL database. I now want to build a streaming application with Flink that shows the current turnover rate per seller in an N-minute window. For this purpose I need to know which pricing model to apply to a given event in the stream when processing it.

My naive first idea would be to simply fire an SQL query on the invocation of my WindowFunction's apply method (probably with some caching). Would this be a reasonable thing to do? It somehow feels wrong to me. Or is there a more 'streamy' way? I could imagine somehow turning the metadata database into a stream source and then joining both streams, but this seems a lot more involved than the first idea. Or am I maybe approaching the problem in a completely wrong way? As I said, I'm very new to the whole stream processing thing.
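
For concreteness, here is a minimal sketch of that naive idea (Transaction, PricingModel, and the table schema are made up for illustration; the cache has no expiry, which is part of what feels wrong to me):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.windowing.RichWindowFunction;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    // Transaction and PricingModel are placeholders for my actual types.
    public class TurnoverPerSeller
            extends RichWindowFunction<Transaction, Tuple2<String, Double>, String, TimeWindow> {

        private transient Connection conn;
        private transient Map<String, PricingModel> cache;

        @Override
        public void open(Configuration parameters) throws Exception {
            // One connection per parallel instance.
            conn = DriverManager.getConnection("jdbc:postgresql://db-host/pricing");
            cache = new HashMap<>();
        }

        @Override
        public void apply(String sellerId, TimeWindow window,
                          Iterable<Transaction> events,
                          Collector<Tuple2<String, Double>> out) throws Exception {
            // Look the pricing model up in the database, with a naive cache
            // in front of it (no expiry, so model updates would be missed).
            PricingModel model = cache.get(sellerId);
            if (model == null) {
                model = queryPricingModel(sellerId);
                cache.put(sellerId, model);
            }

            double turnover = 0.0;
            for (Transaction t : events) {
                turnover += model.priceOf(t);
            }
            out.collect(Tuple2.of(sellerId, turnover));
        }

        private PricingModel queryPricingModel(String sellerId) throws Exception {
            // Made-up schema, just to illustrate the lookup.
            try (PreparedStatement stmt = conn.prepareStatement(
                    "SELECT model FROM pricing_models WHERE seller_id = ?")) {
                stmt.setString(1, sellerId);
                try (ResultSet rs = stmt.executeQuery()) {
                    rs.next();
                    return PricingModel.fromString(rs.getString(1));
                }
            }
        }
    }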

I would be super glad if anyone could point me to resources discussing a use case like this, or share their experience/opinions on this topic!

Best regards,
Hauke 

Re: Advice or best practices on adding metadata to stream events

Posted by Hung <un...@gmail.com>.
So there are three ways:
1. make your model a stream source
2. have the master read the model once, distribute it via the constructor,
and update it periodically
3. have each worker read the model and update it periodically (the approach
you mentioned)

Option 3 would be problematic if you scale out a lot and run with high
parallelism, because each parallel instance opens its own database
connection.

Option 2 is the best if you don't have to update your model. Otherwise you
have to restart your Flink job to pick up the new model, or implement the
update logic on your own.
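
A minimal sketch of option 2, assuming a hypothetical loadModelsFromDb()
helper and that PricingModel implements Serializable (Transaction and
PricedTransaction are also made up):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.api.common.functions.MapFunction;

    // The map is serialized together with the function and shipped to
    // every worker when the job is submitted.
    public class ApplyPricing implements MapFunction<Transaction, PricedTransaction> {

        private final HashMap<String, PricingModel> models;

        public ApplyPricing(Map<String, PricingModel> models) {
            this.models = new HashMap<>(models);
        }

        @Override
        public PricedTransaction map(Transaction t) {
            return new PricedTransaction(t, models.get(t.sellerId));
        }
    }

    // On the client, before submitting the job:
    //   Map<String, PricingModel> models = loadModelsFromDb(); // one-off read
    //   DataStream<PricedTransaction> priced = transactions.map(new ApplyPricing(models));

Since the map is baked into the job graph, picking up new models means
resubmitting the job.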

Option 1 is, for me, the best if you need to update the model, because you
can control how often it is read.
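
Roughly, option 1 could look like this (PricingModelUpdate,
PricingModelSource, and PricedTransaction are made up for illustration):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
    import org.apache.flink.util.Collector;

    public class EnrichWithModel
            implements CoFlatMapFunction<Transaction, PricingModelUpdate, PricedTransaction> {

        // Per-instance copy of the models. Not checkpointed in this sketch;
        // a production job would also keep it in operator state.
        private final Map<String, PricingModel> models = new HashMap<>();

        @Override
        public void flatMap1(Transaction t, Collector<PricedTransaction> out) {
            PricingModel model = models.get(t.sellerId);
            if (model != null) {
                out.collect(new PricedTransaction(t, model));
            }
            // else: no model seen yet; buffer or drop, depending on your needs
        }

        @Override
        public void flatMap2(PricingModelUpdate update, Collector<PricedTransaction> out) {
            models.put(update.sellerId, update.model);
        }
    }

    // Wiring it up (PricingModelSource is a hypothetical SourceFunction
    // that polls the SQL database every N minutes):
    //   DataStream<PricingModelUpdate> modelUpdates = env.addSource(new PricingModelSource());
    //   DataStream<PricedTransaction> priced = transactions
    //       .connect(modelUpdates.broadcast())  // every worker sees every update
    //       .flatMap(new EnrichWithModel());

Your N-minute window then runs on the already-priced stream.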

Best,

Sendoh


