Posted to dev@beam.apache.org by amit kumar <ak...@gmail.com> on 2020/02/21 04:19:05 UTC

Discrete Transforms vs One Single transform

Hi All,

I am looking for input on the effects of merging multiple discrete
transforms into one single transform (that is, performing all steps inside
a single PTransform).

Which is the better approach: multiple discrete transforms, or one single
transform built from lambdas and multiple functions?

If I combine multiple transforms into one and do everything in a lambda via
functions, will there be any effect on performance, debugging, metrics, or
other factors? Are there best practices for this?

Version A
    PCollection<MyType> myRecords = pbegin
        .apply("Kinesis Source", readfromKinesis()) // transform1
        .apply(MapElements
            .into(TypeDescriptors.strings())
            .via(record -> new String(record.getDataAsBytes()))) // transform2
        .apply(convertByteStringToJsonNode()) // transform3
        .apply(schematizeElements()); // transform4

Version B
    PCollection<MyType> myRecords = pbegin
        .apply("Kinesis Source", readfromKinesis()) // transform1
        .apply(MapElements
            .into(TypeDescriptor.of(MyType.class))
            .via(inputKinesisRecord -> {
                // Decode the raw bytes, parse, and schematize in one lambda.
                String record = new String(inputKinesisRecord.getDataAsBytes());
                JsonNode jsonNode = convertByteStringToJsonNode(record);
                return getSchematzedElement(jsonNode);
            })); // transform2


Thanks in advance!
Amit

Re: Discrete Transforms vs One Single transform

Posted by Jeff Klukas <jk...@mozilla.com>.
Also note that runners in many cases will fuse discrete transforms together
for efficiency, so while you might worry about performance degradation from
breaking your logic into many discrete transforms, that likely won't be an
issue in practice.
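
As a rough illustration of why fusion makes the two versions comparable, here
is a plain-Java analogy. It is not Beam code and does not show how any runner
implements fusion; each Function stands in for one discrete transform, and the
names (FusionSketch, DECODE, PARSE, SCHEMATIZE) and the stand-in logic are
assumptions made for the sketch. The point is only that chaining discrete
steps yields one call per element, much like the hand-merged lambda.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.function.Function;

// Plain-Java analogy, not Beam code: each Function stands in for one transform.
final class FusionSketch {
    // Step 1: decode raw bytes as UTF-8 (stand-in for the MapElements step).
    static final Function<byte[], String> DECODE =
        bytes -> new String(bytes, StandardCharsets.UTF_8);

    // Step 2: hypothetical stand-in for JSON parsing; here it only trims.
    static final Function<String, String> PARSE = String::trim;

    // Step 3: hypothetical stand-in for schematizing; here it only upper-cases.
    static final Function<String, String> SCHEMATIZE =
        s -> s.toUpperCase(Locale.ROOT);

    // Conceptually what fusion gives you: the discrete steps chained into
    // a single call per element, with no hand-off between them.
    static final Function<byte[], String> FUSED =
        DECODE.andThen(PARSE).andThen(SCHEMATIZE);

    // The hand-merged equivalent, in the spirit of Version B.
    static String merged(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8)
            .trim()
            .toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        byte[] input = " hello ".getBytes(StandardCharsets.UTF_8);
        // Both paths produce the same result for the same element.
        System.out.println(FUSED.apply(input).equals(merged(input))); // prints true
    }
}
```

The discrete version keeps three separately named, separately testable steps
while still running as one chained call per element.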

Also note that you have the option of defining composite transforms [0]
that compose a series of smaller discrete transforms, but present an object
that follows the same API. Depending on your needs for modularity and
reuse, this can be a nice way of factoring out logic from your top-level
pipeline while still taking advantage of best practices for using Beam's
built-in transforms.

[0]
https://beam.apache.org/documentation/programming-guide/#composite-transforms

On Fri, Feb 21, 2020 at 1:08 AM Luke Cwik <lc...@google.com> wrote:

> Use discrete transforms.
>
> If you merge them all into one transform you will lose visibility into the
> different parts and will be rebuilding what already exists to provide that
> visibility. You'll also be rebuilding the APIs that help users combine all
> their functions together. You'll actually find that you'll be rebuilding
> lots of what Apache Beam provides.

Re: Discrete Transforms vs One Single transform

Posted by Luke Cwik <lc...@google.com>.
Use discrete transforms.

If you merge them all into one transform you will lose visibility into the
different parts and will be rebuilding what already exists to provide that
visibility. You'll also be rebuilding the APIs that help users combine all
their functions together. You'll actually find that you'll be rebuilding
lots of what Apache Beam provides.
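
To make the "rebuilding" point concrete, here is a small plain-Java sketch of
the bookkeeping you would have to write yourself to get per-step visibility
inside a single merged transform. It is not Beam code; the class
StepVisibilitySketch, its methods, and the step names are all hypothetical
and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical hand-rolled "pipeline" that tracks how many elements
// each named step has successfully processed.
final class StepVisibilitySketch {
    private final List<String> names = new ArrayList<>();
    private final List<Function<String, String>> steps = new ArrayList<>();
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Register a named step; names are what make the counts readable later.
    StepVisibilitySketch step(String name, Function<String, String> fn) {
        names.add(name);
        steps.add(fn);
        counts.put(name, 0);
        return this;
    }

    // Run one element through every step, counting each step it completes.
    String process(String element) {
        String current = element;
        for (int i = 0; i < steps.size(); i++) {
            current = steps.get(i).apply(current);
            counts.merge(names.get(i), 1, Integer::sum);
        }
        return current;
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        StepVisibilitySketch pipeline = new StepVisibilitySketch()
            .step("Trim", String::trim)
            .step("Reject empty", s -> {
                if (s.isEmpty()) {
                    throw new IllegalArgumentException("empty record");
                }
                return s;
            });

        pipeline.process(" a ");
        try {
            pipeline.process("   "); // fails in the second step
        } catch (IllegalArgumentException expected) {
            // The counts below show which step the element never completed.
        }
        System.out.println(pipeline.counts()); // prints {Trim=2, Reject empty=1}
    }
}
```

With one opaque lambda, the only observable fact would be that the whole
thing failed for one element. Keeping the transforms discrete leaves this
kind of per-step accounting to the framework instead of to code like the
above.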
