You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by le...@tutanota.com on 2016/05/30 11:41:30 UTC
Elegantly sharing state in a streaming environment
Hello Flink team,
How can i partition and share static state among instances of a streaming
operator?
I have a huge list of keys and values, which are used to filter tuples in a
stream. The list does not change. Currently i am sharing the list with each
operator instance via the constructor, although only a subset of the list is
required per operator (the assignment of subset to operator instance is
known). I cannot use DataSet based functions in a streaming execution
environment to assign sub lists. I also cannot use DataStream based
partitioning functions as the list is static, i.e. not a DataStream. The
dilemma exists as i am mixing static (DataSet type) content with streaming
content. Is there any other approach aside from using an additional tool
(e.g. distributed cache)?
Thanks in advance.
Regards
Leon
Re: Elegantly sharing state in a streaming environment
Posted by Ufuk Celebi <uc...@apache.org>.
Aljoscha is working to properly expose this in Flink. The design
document is here:
https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#heading=h.pqg5z6g0mjm7
On Mon, May 30, 2016 at 2:31 PM, Philippe CAPARROY
<ph...@orange.fr> wrote:
>
> Just transform the list in a DataStream. A datastream can be finite.
>
>
> One solution, in the context of a Streaming environment is to use Kafka, or
> any other distributed broker, although Flink ships with a KafkaSource.
>
>
>
> 1)Create a Kafka Topic dedicated to your list of key/values. Inject your
> values into this topic, partitionned by the keys. So that you recover the
> keys in Flink.
>
>
>
> 2) Create a source for the stream of tuple your analysing -> output1
> (Tuples).
>
>
>
> 3) Create a KafkaSource, and parse/recover your key value pairs from this
> source (e.g a first map operator) : map1 -> output 2 (K,V), then :
>
>
>
>
>
>
>
> a) If you need all key/Value pairs at each operator :
> broadcast all partitions from the output 1 to the analysis operator
>
>
>
> b) if you dont need all key/values pairs, just chain
> output1 to the analysis operator. Partitioning of K,V pairs will depend on
> Kafka partitioning strategy, and can be controlled in Flink anyway.
>
>
>
> 4) The analysis operator : will perform a RichCoFlatMapFunction, and can be
> Checkpointed.
>
> When receiving K,V pairs from output2, store them in a local state.
>
> When receiving tuple, should be able to to filter with the help of the local
> state, and propagate downstream or not.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>> Message du 30/05/16 13:41
>> De : leon_mclare@tutanota.com
>> A : "User" <us...@flink.apache.org>
>> Copie à :
>> Objet : Elegantly sharing state in a streaming environment
>
>>
>>Hello Flink team,
>
> How can i partition and share static state among instances of a streaming
> operator?
>
> I have a huge list of keys and values, which are used to filter tuples in a
> stream. The list does not change. Currently i am sharing the list with each
> operator instance via the constructor, although only a subset of the list is
> required per operator (the assignment of subset to operator instance is
> known). I cannot use DataSet based functions in a streaming execution
> environment to assign sub lists. I also cannot use DataStream based
> partitioning functions as the list is static, i.e. not a DataStream. The
> dilemma exists as i am mixing static (DataSet type) content with streaming
> content. Is there any other approach aside from using an additional tool
> (e.g. distributed cache)?
>
> Thanks in advance.
>
> Regards
> Leon
>
>
>
Re: re: Elegantly sharing state in a streaming environment
Posted by le...@tutanota.com.
Dear Philippe,
that is exactly what i need. Thank you for the concise explanation.
This approach is excellent, as it also permits the values to be easily
updated externally.
Kind regards
Leon
30. May 2016 14:31 by philippe.caparroy@orange.fr:
>
>
> Just transform the list in a DataStream. A datastream can be finite.
>
>
> One solution, in the context of a Streaming environment is to use Kafka, or
> any other distributed broker, although Flink ships with a KafkaSource.
>
>
>
> 1)Create a Kafka Topic dedicated to your list of key/values. Inject your
> values into this topic, partitionned by the keys. So that you recover the
> keys in Flink.
>
>
>
> 2) Create a source for the stream of tuple your analysing -> output1
> (Tuples).
>
>
>
> 3) Create a KafkaSource, and parse/recover your key value pairs from this
> source (e.g a first map operator) : map1 -> output 2 (K,V), then :
>
>
>
>
>
>
>
> a) If you need all key/Value pairs at each operator :
> broadcast all partitions from the output 1 to the analysis operator
>
>
>
> b) if you dont need all key/values pairs, just chain
> output1 to the analysis operator. Partitioning of K,V pairs will depend on
> Kafka partitioning strategy, and can be controlled in Flink anyway.
>
>
>
> 4) The analysis operator : will perform a RichCoFlatMapFunction, and can
> be Checkpointed.
>
> When receiving K,V pairs from output2, store them in a local state.
>
> When receiving tuple, should be able to to filter with the help of the
> local state, and propagate downstream or not.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>> > Message du 30/05/16 13:41
>> > De : >> leon_mclare@tutanota.com
>> > A : "User" <>> user@flink.apache.org>> >
>> > Copie à :
>> > Objet : Elegantly sharing state in a streaming environment
>> >
>> >Hello Flink team,
>>
>> How can i partition and share static state among instances of a streaming
>> operator?
>>
>> I have a huge list of keys and values, which are used to filter tuples in
>> a stream. The list does not change. Currently i am sharing the list with
>> each operator instance via the constructor, although only a subset of the
>> list is required per operator (the assignment of subset to operator
>> instance is known). I cannot use DataSet based functions in a streaming
>> execution environment to assign sub lists. I also cannot use DataStream
>> based partitioning functions as the list is static, i.e. not a DataStream.
>> The dilemma exists as i am mixing static (DataSet type) content with
>> streaming content. Is there any other approach aside from using an
>> additional tool (e.g. distributed cache)?
>>
>> Thanks in advance.
>>
>> Regards
>> Leon
>>
>>
>>
>>
re: Elegantly sharing state in a streaming environment
Posted by Philippe CAPARROY <ph...@orange.fr>.
Just transform the list in a DataStream. A datastream can be finite.
One solution, in the context of a Streaming environment is to use Kafka, or any other distributed broker, although Flink ships with a KafkaSource.
1)Create a Kafka Topic dedicated to your list of key/values. Inject your values into this topic, partitionned by the keys. So that you recover the keys in Flink.
2) Create a source for the stream of tuple your analysing -> output1 (Tuples).
3) Create a KafkaSource, and parse/recover your key value pairs from this source (e.g a first map operator) : map1 -> output 2 (K,V), then :
a) If you need all key/Value pairs at each operator : broadcast all partitions from the output 1 to the analysis operator
b) if you dont need all key/values pairs, just chain output1 to the analysis operator. Partitioning of K,V pairs will depend on Kafka partitioning strategy, and can be controlled in Flink anyway.
4) The analysis operator : will perform a RichCoFlatMapFunction, and can be Checkpointed.
When receiving K,V pairs from output2, store them in a local state.
When receiving tuple, should be able to to filter with the help of the local state, and propagate downstream or not.
> Message du 30/05/16 13:41
> De : leon_mclare@tutanota.com
> A : "User"
> Copie à :
> Objet : Elegantly sharing state in a streaming environment
>
>Hello Flink team,
How can i partition and share static state among instances of a streaming operator?
I have a huge list of keys and values, which are used to filter tuples in a stream. The list does not change. Currently i am sharing the list with each operator instance via the constructor, although only a subset of the list is required per operator (the assignment of subset to operator instance is known). I cannot use DataSet based functions in a streaming execution environment to assign sub lists. I also cannot use DataStream based partitioning functions as the list is static, i.e. not a DataStream. The dilemma exists as i am mixing static (DataSet type) content with streaming content. Is there any other approach aside from using an additional tool (e.g. distributed cache)?
Thanks in advance.
Regards
Leon