Posted to dev@beam.apache.org by Mohil Khare <mo...@prosimo.io> on 2020/01/26 23:31:34 UTC

Using a very slowly changing stream of data (KV) as a side input

Hi,
This is Mohil Khare from San Jose, California. I work at an early-stage
startup: Prosimo.
We use Apache Beam with GCP Dataflow for all real-time stats processing,
with Kafka and Pub/Sub as data sources and Elasticsearch and GCS as sinks.

I am trying to solve the following use case with side inputs.

INPUT:
1. We have a continuous stream of data coming from pubsub topicA. This data
can be put into a KV PCollection, and each data item can be uniquely
identified by a certain key.
2. We have a very slowly changing stream of data coming from pubsub topicB,
i.e. data arrives on topicB for a few minutes, followed by no activity for a
long period of time. This stream can again be put into a KV PCollection with
the same keys as above. NOTE: after a long period of inactivity, it is
possible that data arrives for only certain keys.

DESIRED OUTPUT/PROCESSING:
1. I want to use the KV PCollection from topicB as a side input to enrich
data arriving on topicA. I think View.asMap could be a good choice for this.
2. After enriching data from topicA using the side input data from topicB,
write to GCS in a fixed window of 10 minutes.
3. I want to continue using the above PCollectionView as a side input as
long as no new data arrives on topicB.
4. Whenever new data arrives on topicB, I want to update the PCollectionView
map only for the set of keys that arrived in the new stream.

My question is: what would be the best approach to tackle this use case? I
would really appreciate it if someone could suggest a good solution.
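
For concreteness, here is a rough sketch (Java SDK) of the pipeline shape I
have in mind. The triggering on topicB, the enrichment logic, and the GCS
path are just placeholders, and how to keep the map correctly updated is
exactly the part I am unsure about:

import java.util.Map;

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

// topicBStream: PCollection<KV<String, String>> read from pubsub topicB
// topicAStream: PCollection<KV<String, String>> read from pubsub topicA

// Side input built from topicB. The global window needs a non-default
// trigger so the view can be (re)computed on an unbounded source; with
// discarding panes the map only reflects the latest pane, which is part of
// my question about how the updates should work.
final PCollectionView<Map<String, String>> enrichmentView =
    topicBStream
        .apply("GlobalWindowTopicB",
            Window.<KV<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
        .apply(View.<String, String>asMap());

// Main input: enrich each topicA element by key using the side input map.
PCollection<String> enriched =
    topicAStream.apply("Enrich",
        ParDo.of(new DoFn<KV<String, String>, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            Map<String, String> lookup = c.sideInput(enrichmentView);
            String extra = lookup.get(c.element().getKey());
            // Placeholder enrichment: append the looked-up value.
            c.output(c.element().getValue() + "," + extra);
          }
        }).withSideInputs(enrichmentView));

// Write the enriched data to GCS in fixed 10-minute windows.
enriched
    .apply("FixedWindow10m",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(10))))
    .apply(TextIO.write()
        .to("gs://my-bucket/enriched/part")   // placeholder path
        .withWindowedWrites()
        .withNumShards(1));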

Thanks and Regards
Mohil Khare

Re: Using a very slowly changing stream of data (KV) as a side input

Posted by Mikhail Gryzykhin <mi...@google.com>.
Hi Mohil,

Please take a look at:
https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs
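
Roughly, that pattern looks like the sketch below (Java; fetchLatestLookup()
is a placeholder for however you re-read the slowly changing data, and
`pipeline` is your Pipeline object). In your case you could also consider
triggering on the topicB elements themselves instead of a GenerateSequence
tick:

import java.util.Map;

import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

// Periodically rebuild the lookup map and expose it as a side input that
// downstream ParDos see refreshed on each trigger firing.
final PCollectionView<Map<String, String>> lookupView = pipeline
    .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
    .apply(Window.<Long>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()))
        .discardingFiredPanes())
    .apply("RefreshLookup", ParDo.of(new DoFn<Long, KV<String, String>>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // Re-read the whole slowly changing data set on every tick
        // (fetchLatestLookup() is a placeholder).
        for (Map.Entry<String, String> e : fetchLatestLookup().entrySet()) {
          c.output(KV.of(e.getKey(), e.getValue()));
        }
      }
    }))
    .apply(View.<String, String>asMap());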


Also, I have a design doc out that handles a similar case. I'm working on
prototyping it in Python at the moment:
https://lists.apache.org/thread.html/r792fcf4b6adbce79ea1eb81592d29a3cee7aef768ba4615ac2d078ad%40%3Cdev.beam.apache.org%3E


Regards,
--Mikhail

On Mon, Jan 27, 2020 at 8:56 AM Mohil Khare <mo...@prosimo.io> wrote:

> Hi,
> This is Mohil Khare from San Jose, California. I work at an early-stage
> startup: Prosimo.
> We use Apache Beam with GCP Dataflow for all real-time stats processing,
> with Kafka and Pub/Sub as data sources and Elasticsearch and GCS as sinks.
>
> I am trying to solve the following use case with side inputs.
>
> INPUT:
> 1. We have a continuous stream of data coming from pubsub topicA. This
> data can be put into a KV PCollection, and each data item can be uniquely
> identified by a certain key.
> 2. We have a very slowly changing stream of data coming from pubsub topicB,
> i.e. data arrives on topicB for a few minutes, followed by no activity for
> a long period of time. This stream can again be put into a KV PCollection
> with the same keys as above. NOTE: after a long period of inactivity, it is
> possible that data arrives for only certain keys.
>
> DESIRED OUTPUT/PROCESSING:
> 1. I want to use the KV PCollection from topicB as a side input to enrich
> data arriving on topicA. I think View.asMap could be a good choice for this.
> 2. After enriching data from topicA using the side input data from topicB,
> write to GCS in a fixed window of 10 minutes.
> 3. I want to continue using the above PCollectionView as a side input as
> long as no new data arrives on topicB.
> 4. Whenever new data arrives on topicB, I want to update the PCollectionView
> map only for the set of keys that arrived in the new stream.
>
> My question is: what would be the best approach to tackle this use case? I
> would really appreciate it if someone could suggest a good solution.
>
> Thanks and Regards
> Mohil Khare
>
>
>
>
>
