Posted to user@beam.apache.org by Praveen K Viswanathan <ha...@gmail.com> on 2020/06/21 02:13:57 UTC

Designing an existing pipeline in Beam

Hello Everyone,

I am in the process of implementing an existing pipeline (written using
Java and Kafka) in Apache Beam. The data from the source stream is
contrived and has to go through several steps of enrichment using REST API
calls and parsing of JSON data. The key transformation in the existing
pipeline is shown below (a super high-level flow):

*Method A*
----Calls *Method B*
      ----Creates *Map 1, Map 2*
----Calls *Method C*
     ----Read *Map 2*
     ----Create *Map 3*
----*Method C*
     ----Read *Map 3* and
     ----update *Map 1*

The Maps we use are multi-level maps, and I am thinking of having a
PCollection for each Map and passing them as side inputs to a DoFn wherever
I have a transformation that needs two or more Maps. But there are certain
tasks for which I want to make sure I am following the right approach, for
instance updating one of the side-input maps inside a DoFn.
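
For illustration, here is a minimal sketch of the side-input idea I am
describing (the names map1Pc and map2Pc are made up, and this assumes Map 2
is small enough to fit in memory as a View.asMap() side input):

// Assumed imports: org.apache.beam.sdk.transforms.*, org.apache.beam.sdk.values.*, java.util.Map
PCollection<KV<String, String>> map2Pc = ...;  // Map 2, built earlier in the pipeline
final PCollectionView<Map<String, String>> map2View = map2Pc.apply(View.asMap());

PCollection<KV<String, String>> map1Pc = ...;  // Map 1 entries to be enriched
PCollection<KV<String, String>> enriched = map1Pc.apply(
    ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Side inputs are read-only snapshots; they cannot be updated from here.
        Map<String, String> map2 = c.sideInput(map2View);
        String template = map2.get(c.element().getKey());
        c.output(KV.of(c.element().getKey(), c.element().getValue() + "|" + template));
      }
    }).withSideInputs(map2View));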

These are my initial thoughts/questions, and I would like to get some expert
advice on how such an interleaved transformation is typically designed in
Apache Beam. I appreciate your valuable insights on this.

-- 
Thanks,
Praveen K Viswanathan

Re: Designing an existing pipeline in Beam

Posted by Praveen K Viswanathan <ha...@gmail.com>.
Thanks Luke, I would like to try the latter approach. Would you be able to
share any pseudo-code, or point me to an example of how to call a common
method inside a DoFn's, let's say, ProcessElement method?

-- 
Thanks,
Praveen K Viswanathan

Re: Designing an existing pipeline in Beam

Posted by Luke Cwik <lc...@google.com>.
You can apply the same DoFn / Transform instance multiple times in the
graph, or you can follow regular development practices where the common code
is factored into a method and two different DoFns invoke it.
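
For example, a minimal sketch of the second approach (the class and method
names below are made up):

// Assumed imports: org.apache.beam.sdk.transforms.DoFn, org.apache.beam.sdk.values.KV
// A plain helper that both DoFns can call; a static method (or a small
// Serializable utility object) keeps the DoFns themselves serializable.
public class EnrichmentUtil {
  public static String normalizeKey(String raw) {
    return raw == null ? "" : raw.trim().toLowerCase();
  }
}

public class ParseFn extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<KV<String, String>> out) {
    String key = EnrichmentUtil.normalizeKey(line.split(",")[0]);  // shared logic
    out.output(KV.of(key, line));
  }
}

public class LookupFn extends DoFn<KV<String, String>, String> {
  @ProcessElement
  public void processElement(@Element KV<String, String> kv, OutputReceiver<String> out) {
    String key = EnrichmentUtil.normalizeKey(kv.getKey());         // same shared logic
    out.output(key + ":" + kv.getValue());
  }
}

Anything a DoFn captures (instance fields, constructor arguments) must be
Serializable, which is why a static helper or a small Serializable utility
class is usually the simplest way to share code between DoFns.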

Re: Designing an existing pipeline in Beam

Posted by Praveen K Viswanathan <ha...@gmail.com>.
Hi Luke - Thanks for the explanation. The limitation due to directed-graph
processing and the option of external storage clear up most of the questions
I had with respect to designing this pipeline. I do have one more scenario
to clarify on this thread.

If I have a certain piece of logic that I need to use in more than one
DoFn, how do we do that? In a normal Java application, we can put it in a
separate method and call it wherever we want. Is it possible to replicate
something like that in Beam's DoFn?

-- 
Thanks,
Praveen K Viswanathan

Re: Designing an existing pipeline in Beam

Posted by Luke Cwik <lc...@google.com>.
Beam is really about parallelizing the processing. Using a single DoFn that
does everything is fine as long as the DoFn can process elements in
parallel (e.g. the upstream source produces lots of elements). Composing
multiple DoFns is great for re-use and testing, but it isn't strictly
necessary. Also, Beam doesn't support back edges in the processing graph, so
all data flows in one direction and you can't have a cycle. This means map 1
can produce map 2, which then produces map 3, which is then used to update
map 1, only if all of that logic lives within a single DoFn/Transform, or if
you create a cycle through an external system, such as writing to Kafka
topic X and reading from Kafka topic X within the same pipeline, or updating
a database downstream from where it is read. There is a lot of ordering
complexity, and there are stale-data issues, whenever you use an external
store to create a cycle though.
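
As an illustration of the single-DoFn option, here is a rough sketch (all
the names are made up, and it assumes the maps can be built per element
rather than shared across the whole pipeline):

// Assumed imports: java.util.*, org.apache.beam.sdk.transforms.DoFn
// One DoFn mirroring Method A: build map 1 and map 2, derive map 3 from
// map 2, then fold map 3 back into map 1 -- all inside a single
// processElement call, so there is no back edge in the Beam graph.
public class EnrichAllFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(@Element String record, OutputReceiver<String> out) {
    Map<String, String> map1 = buildMap1(record);          // Method B
    Map<String, String> map2 = buildMap2(record);          // Method B
    Map<String, String> map3 = deriveMap3(map2);           // Method C
    for (Map.Entry<String, String> e : map3.entrySet()) {  // second Method C pass
      map1.merge(e.getKey(), e.getValue(), (a, b) -> a + "|" + b);
    }
    out.output(map1.toString());
  }

  // Placeholders standing in for the existing Java logic.
  private Map<String, String> buildMap1(String r) { return new HashMap<>(); }
  private Map<String, String> buildMap2(String r) { return new HashMap<>(); }
  private Map<String, String> deriveMap3(Map<String, String> m2) { return new HashMap<>(m2); }
}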

Re: Designing an existing pipeline in Beam

Posted by Praveen K Viswanathan <ha...@gmail.com>.
Another way to put this question is: how do we write a Beam pipeline for
an existing pipeline (in Java) that has a dozen custom objects, where you
have to work with multiple HashMaps of those custom objects in order to
transform the data? Currently, I am writing a Beam pipeline using the same
custom objects, getters and setters, and HashMaps of those objects, *but
inside a DoFn*. Is this the optimal way, or does Beam offer something else?

-- 
Thanks,
Praveen K Viswanathan

Re: Designing an existing pipeline in Beam

Posted by Praveen K Viswanathan <ha...@gmail.com>.
Hi Luke,

We can think of Map 2 as a kind of template with which we enrich the data
in Map 1. As I mentioned in my previous post, this is a high-level
scenario.

All this logic is spread across several classes (~4K lines of code in
total). As in any Java application:

1. The code has been modularized into multiple method calls
2. HashMaps of custom objects are passed around as arguments to each method
3. The attributes of the custom objects are accessed using getters and
setters.

This is a common pattern in a normal Java application, but I have not seen
an example of such code in Beam.
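
For concreteness, a trimmed-down sketch of the kind of code I mean, carried
into a DoFn (CustomObject and its fields are made up; this is the shape of
the code, not the real logic):

// Assumed imports: java.io.Serializable, java.util.*, org.apache.beam.sdk.transforms.DoFn, org.apache.beam.sdk.values.KV
// The existing POJO pattern carries over largely unchanged; making the
// class Serializable lets Beam fall back to SerializableCoder for it.
public class CustomObject implements Serializable {
  private String id;
  private String value;
  public String getId() { return id; }
  public void setId(String id) { this.id = id; }
  public String getValue() { return value; }
  public void setValue(String value) { this.value = value; }
}

public class TransformFn extends DoFn<String, KV<String, CustomObject>> {
  @ProcessElement
  public void processElement(@Element String json, OutputReceiver<KV<String, CustomObject>> out) {
    Map<String, CustomObject> lookup = new HashMap<>();  // per-element working map
    CustomObject obj = new CustomObject();
    obj.setId(json.substring(0, Math.min(8, json.length())));  // placeholder parsing
    obj.setValue(json);
    lookup.put(obj.getId(), obj);
    out.output(KV.of(obj.getId(), lookup.get(obj.getId())));
  }
}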


-- 
Thanks,
Praveen K Viswanathan

Re: Designing an existing pipeline in Beam

Posted by Luke Cwik <lc...@google.com>.
Who reads map 1?
Can it be stale?

It is unclear what you are trying to do in parallel and why you wouldn't
stick all this logic into a single DoFn / stateful DoFn.
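
For context, a bare-bones sketch of the stateful DoFn variant (this assumes
the input is keyed, which stateful DoFns require, and that elements carry a
hypothetical "attr=value" payload):

// Assumed imports: org.apache.beam.sdk.state.*, org.apache.beam.sdk.transforms.DoFn, org.apache.beam.sdk.values.KV
// "Map 1" lives in per-key state, so it survives across elements that share a key.
public class StatefulEnrichFn extends DoFn<KV<String, String>, String> {
  @StateId("map1")
  private final StateSpec<MapState<String, String>> map1Spec = StateSpecs.map();

  @ProcessElement
  public void processElement(
      @Element KV<String, String> element,
      @StateId("map1") MapState<String, String> map1,
      OutputReceiver<String> out) {
    String[] parts = element.getValue().split("=", 2);  // hypothetical "attr=value" payload
    String attr = parts[0];
    String value = parts.length > 1 ? parts[1] : "";
    map1.put(attr, value);                              // update map 1 for this key
    out.output(element.getKey() + ":" + attr + "=" + map1.get(attr).read());
  }
}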
