You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Ahmed Abualsaud via dev <de...@beam.apache.org> on 2023/05/30 14:41:47 UTC

Proposal to reduce the steps to make a Java transform portable

Hey everyone,

I was looking at how we use SchemaTransforms in our expansion service. From
what I see, there may be a redundant step in developing SchemaTransforms.
Currently, we have 3 pieces:
- SchemaTransformProvider [1]
- A configuration object
- SchemaTransform [2]

The API is generally used like this:
1. The SchemaTransformProvider takes a configuration object and returns a
SchemaTransform
2. The SchemaTransform is used to build a PTransform according to the
configuration

In these steps, the SchemaTransform class seems unnecessary. We can combine
the two steps if we have SchemaTransformProvider return the PTransform
directly.

We can then remove the SchemaTransform class as it will be obsolete. This
should be safe to do; the only place it's used in our API is here [3], and
that can be simplified if we make this change (we'd just trim `
.buildTransform()` off the end as `provider.from(configRow)` will directly
return the PTransform).

I'd like to first mention that I was not involved in the design process of
this API so I may be missing some information on why it was set up this way.

A few developers already raised questions about how there's seemingly
unnecessary boilerplate involved in making a Java transform portable. I
wasn't involved in the design process of this API so I may be missing some
information, but my assumption is this was designed to follow the pattern
of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
SchemaIO[5] -> PTransform. However, with the newer SchemaTransformProvider
API, we dropped a few methods and reduced the SchemaTransform class to have
a generic buildTransform() method. See the example of
PubsubReadSchemaTransformProvider [6], where the SchemaTransform interface
and buildTransform method are implemented just to satisfy the requirement
that SchemaTransformProvider::from return a SchemaTransform.

I'm bringing this up because if we are looking to encourage contribution to
cross-language use cases, we should make it simpler and less convoluted to
develop portable transforms.

There are a number of SchemaTransforms already developed, but applying
these changes to them should be straightforward. If people think this is a
good idea, I can open a PR and implement them.

Best,
Ahmed

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
[3]
https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
[4]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
[5]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
[6]
https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Kenneth Knowles <ke...@apache.org>.
+1 in general (I will dig into details too, but later)

I will add a couple more steps that I dislike, which exist only for
historical obsolete reasons:

The whole existence of runners/core-construction-java and everything in it.
The main reason it exists is to make it so that main SDK has no dependency
on protobuf. We thought maybe the portable model would be generic enough to
have multiple possible encodings. That idea is obsolete. We are using
protobuf.

So that whole module, everything about generating proto for transforms,
could be in the core SDK to remove the need for the service loader pattern,
which often breaks when users build uber jars, and just generally is
overused and needless complexity here.

Not sure how this relates in detail to your proposal, but I just wanted to
weigh in with a big +1 to making it easy to make a portable schema-based
transform by default. Imagine a world where the Java-centric API is
actually a wrapper on the schema transform!

Kenn

On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <de...@beam.apache.org>
wrote:

> Hey everyone,
>
> I was looking at how we use SchemaTransforms in our expansion service.
> From what I see, there may be a redundant step in developing
> SchemaTransforms. Currently, we have 3 pieces:
> - SchemaTransformProvider [1]
> - A configuration object
> - SchemaTransform [2]
>
> The API is generally used like this:
> 1. The SchemaTransformProvider takes a configuration object and returns a
> SchemaTransform
> 2. The SchemaTransform is used to build a PTransform according to the
> configuration
>
> In these steps, the SchemaTransform class seems unnecessary. We can
> combine the two steps if we have SchemaTransformProvider return the
> PTransform directly.
>
> We can then remove the SchemaTransform class as it will be obsolete. This
> should be safe to do; the only place it's used in our API is here [3], and
> that can be simplified if we make this change (we'd just trim `
> .buildTransform()` off the end as `provider.from(configRow)` will
> directly return the PTransform).
>
> I'd like to first mention that I was not involved in the design process of
> this API so I may be missing some information on why it was set up this way.
>
> A few developers already raised questions about how there's seemingly
> unnecessary boilerplate involved in making a Java transform portable. I
> wasn't involved in the design process of this API so I may be missing some
> information, but my assumption is this was designed to follow the pattern
> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
> SchemaIO[5] -> PTransform. However, with the newer
> SchemaTransformProvider API, we dropped a few methods and reduced the
> SchemaTransform class to have a generic buildTransform() method. See the
> example of PubsubReadSchemaTransformProvider [6], where the
> SchemaTransform interface and buildTransform method are implemented just
> to satisfy the requirement that SchemaTransformProvider::from return a
> SchemaTransform.
>
> I'm bringing this up because if we are looking to encourage contribution
> to cross-language use cases, we should make it simpler and less convoluted
> to develop portable transforms.
>
> There are a number of SchemaTransforms already developed, but applying
> these changes to them should be straightforward. If people think this is a
> good idea, I can open a PR and implement them.
>
> Best,
> Ahmed
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
> [3]
> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
> [5]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
> [6]
> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Steven van Rossum via dev <de...@beam.apache.org>.
>
> Is it actually necessary for a PTransform that is configured via the
> Schema mechanism to also be one that uses RowCoder? Those strike me as two
> separate concerns and unnecessarily limiting
>

+1 to this, I have a wip branch somewhere with a coder for
https://github.com/apache/incubator-fury/ to benchmark performance in Beam
as a coder and compare it to the published benchmarks.
With a portable encoding scheme across a broad selection of languages it
could be an interesting candidate for a custom row oriented serialization
scheme tied to schema awareness in Beam.
Decoupling the coder from schemas seems like it would be a prerequisite
change to improve columnar transfer and processing by removing the need to
transpose a batch to/from Row as is currently the case with the Arrow
extension.

On Thu, Mar 7, 2024 at 6:18 PM Ahmed Abualsaud via dev <de...@beam.apache.org>
wrote:

> In the same vein of reducing boilerplate to make a schema-aware transform,
> I've opened a small PR to relax the requirement of 3 methods:
> https://github.com/apache/beam/pull/30560
>
> Would appreciate someone taking a look!
>
> On Thu, Jun 29, 2023 at 9:43 AM Ahmed Abualsaud <ah...@google.com>
> wrote:
>
>> Can someone take a quick look at
>> https://github.com/apache/beam/pull/27202? If things look good, let's
>> try getting it in before the release cut as I'm also updating our
>> cross-language documentation and would like to include these changes.
>>
>> Thank you,
>> Ahmed
>>
>> On Thu, Jun 22, 2023 at 8:06 PM Reuven Lax <re...@google.com> wrote:
>>
>>> The goal was to make schema transforms as efficient as hand-written
>>> coders. We found the avro encoding/decoding to often be quite inefficient,
>>> which is one reason we didn't use it for schemas.
>>>
>>> Our schema encoding is internal to Beam though, and not suggested for
>>> use external to a pipeline. For sources or sinks we still recommend using
>>> Avro (or proto).
>>>
>>> On Thu, Jun 22, 2023 at 4:14 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <
>>>> ahmedabualsaud@google.com> wrote:
>>>>
>>>>> Thank you all for your input. I have a PR for the changes I mentioned
>>>>> in my initial email: https://github.com/apache/beam/pull/27202.
>>>>> Please review when you get a chance!
>>>>>
>>>>> > perhaps we should consider just going to something Avro for portable
>>>>> coding rather than something custom
>>>>>
>>>>> Did you mean using some Avro object (GenericRecord?) besides Beam Row
>>>>> elements? We would still run into the problem Cham mentioned earlier (of
>>>>> making sure existing PTransform inputs/outputs are compatible with
>>>>> cross-language-valid types).
>>>>>
>>>>
>>>> I don't remember why Avro was rejected in favor of our own encoding
>>>> format, but it probably doesn't make sense to revisit that without
>>>> understanding the full history. I do know that avro versioning and diamond
>>>> dependencies cause a lot of pain for users and there's a concerted effort
>>>> to remove Avro from Beam core altogether.
>>>>
>>>> In any case, this is quite orthogonal to the proposal here which we
>>>> should move forward on.
>>>>
>>>>
>>>>> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <by...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Sure, I get that… though perhaps we should consider just going to
>>>>>> something Avro for portable coding rather than something custom.
>>>>>>
>>>>>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <
>>>>>> chamikara@google.com> wrote:
>>>>>>
>>>>>>> Input/output PCollection types at least have to be portable Beam
>>>>>>> types [1] for cross-language to work.
>>>>>>>
>>>>>>> I think we restricted schema-aware transforms to PCollection<Row>
>>>>>>> since Row was expected to be an efficient replacement for arbitrary
>>>>>>> portable Beam types (not sure how true that is in practice currently).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cham
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <by...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is it actually necessary for a PTransform that is configured via
>>>>>>>> the Schema mechanism to also be one that uses RowCoder? Those strike me as
>>>>>>>> two separate concerns and unnecessarily limiting.
>>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>>>>>> chamikara@google.com> wrote:
>>>>>>>>
>>>>>>>>> +1 for the simplification.
>>>>>>>>>
>>>>>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <
>>>>>>>>> robertwb@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yeah. Essentially one needs do (1) name the arguments and (2)
>>>>>>>>>> implement the transform. Hopefully (1) could be done in a concise way that
>>>>>>>>>> allows for easy consumption from both Java and cross-language.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +1 but I think the hard part today is to convert existing
>>>>>>>>> PTransforms to be schema-aware transform compatible (for example, change
>>>>>>>>> input/output types and make sure parameters take Beam Schema compatible
>>>>>>>>> types). But this makes sense for new transforms.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <
>>>>>>>>>> byronellis@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>>>>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>>>>>>>>> DRY.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 to this simplification, it's a historical artifact that
>>>>>>>>>>>> provides no value.
>>>>>>>>>>>>
>>>>>>>>>>>> I would love it if we also looked into ways to auto-generate
>>>>>>>>>>>> the SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>>>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>>>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>>>>>>>>> itself.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>>>>>> - A configuration object
>>>>>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>> The API is generally used like this:
>>>>>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object
>>>>>>>>>>>>> and returns a SchemaTransform
>>>>>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according
>>>>>>>>>>>>> to the configuration
>>>>>>>>>>>>>
>>>>>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary.
>>>>>>>>>>>>> We can combine the two steps if we have SchemaTransformProvider return the
>>>>>>>>>>>>> PTransform directly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We can then remove the SchemaTransform class as it will be
>>>>>>>>>>>>> obsolete. This should be safe to do; the only place it's used in our API is
>>>>>>>>>>>>> here [3], and that can be simplified if we make this change (we'd just trim
>>>>>>>>>>>>> `.buildTransform()` off the end as `provider.from(configRow)`
>>>>>>>>>>>>> will directly return the PTransform).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to first mention that I was not involved in
>>>>>>>>>>>>> the design process of this API so I may be missing some information on why
>>>>>>>>>>>>> it was set up this way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A few developers already raised questions about how there's
>>>>>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java transform
>>>>>>>>>>>>> portable. I wasn't involved in the design process of this API so I may be
>>>>>>>>>>>>> missing some information, but my assumption is this was designed to follow
>>>>>>>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However,
>>>>>>>>>>>>> with the newer SchemaTransformProvider API, we dropped a few methods and
>>>>>>>>>>>>> reduced the SchemaTransform class to have a generic buildTransform()
>>>>>>>>>>>>> method. See the example of PubsubReadSchemaTransformProvider [6], where the
>>>>>>>>>>>>> SchemaTransform interface and buildTransform method are
>>>>>>>>>>>>> implemented just to satisfy the requirement that
>>>>>>>>>>>>> SchemaTransformProvider::from return a SchemaTransform.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>>>>>>>>> less convoluted to develop portable transforms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>>>>>> applying these changes to them should be straightforward. If people think
>>>>>>>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Ahmed
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>>>>>> [3]
>>>>>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>>>>>> [4]
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>>>>>> [5]
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>>>>>> [6]
>>>>>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Ahmed Abualsaud via dev <de...@beam.apache.org>.
In the same vein of reducing boilerplate to make a schema-aware transform,
I've opened a small PR to relax the requirement of 3 methods:
https://github.com/apache/beam/pull/30560

Would appreciate someone taking a look!

On Thu, Jun 29, 2023 at 9:43 AM Ahmed Abualsaud <ah...@google.com>
wrote:

> Can someone take a quick look at https://github.com/apache/beam/pull/27202?
> If things look good, let's try getting it in before the release cut as I'm
> also updating our cross-language documentation and would like to include
> these changes.
>
> Thank you,
> Ahmed
>
> On Thu, Jun 22, 2023 at 8:06 PM Reuven Lax <re...@google.com> wrote:
>
>> The goal was to make schema transforms as efficient as hand-written
>> coders. We found the avro encoding/decoding to often be quite inefficient,
>> which is one reason we didn't use it for schemas.
>>
>> Our schema encoding is internal to Beam though, and not suggested for use
>> external to a pipeline. For sources or sinks we still recommend using Avro
>> (or proto).
>>
>> On Thu, Jun 22, 2023 at 4:14 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <
>>> ahmedabualsaud@google.com> wrote:
>>>
>>>> Thank you all for your input. I have a PR for the changes I mentioned
>>>> in my initial email: https://github.com/apache/beam/pull/27202. Please
>>>> review when you get a chance!
>>>>
>>>> > perhaps we should consider just going to something Avro for portable
>>>> coding rather than something custom
>>>>
>>>> Did you mean using some Avro object (GenericRecord?) besides Beam Row
>>>> elements? We would still run into the problem Cham mentioned earlier (of
>>>> making sure existing PTransform inputs/outputs are compatible with
>>>> cross-language-valid types).
>>>>
>>>
>>> I don't remember why Avro was rejected in favor of our own encoding
>>> format, but it probably doesn't make sense to revisit that without
>>> understanding the full history. I do know that avro versioning and diamond
>>> dependencies cause a lot of pain for users and there's a concerted effort
>>> to remove Avro from Beam core altogether.
>>>
>>> In any case, this is quite orthogonal to the proposal here which we
>>> should move forward on.
>>>
>>>
>>>> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <by...@google.com>
>>>> wrote:
>>>>
>>>>> Sure, I get that… though perhaps we should consider just going to
>>>>> something Avro for portable coding rather than something custom.
>>>>>
>>>>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <
>>>>> chamikara@google.com> wrote:
>>>>>
>>>>>> Input/output PCollection types at least have to be portable Beam
>>>>>> types [1] for cross-language to work.
>>>>>>
>>>>>> I think we restricted schema-aware transforms to PCollection<Row>
>>>>>> since Row was expected to be an efficient replacement for arbitrary
>>>>>> portable Beam types (not sure how true that is in practice currently).
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>>>>
>>>>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <by...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is it actually necessary for a PTransform that is configured via the
>>>>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>>>>>>> separate concerns and unnecessarily limiting.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>>>>> chamikara@google.com> wrote:
>>>>>>>
>>>>>>>> +1 for the simplification.
>>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <
>>>>>>>> robertwb@google.com> wrote:
>>>>>>>>
>>>>>>>>> Yeah. Essentially one needs do (1) name the arguments and (2)
>>>>>>>>> implement the transform. Hopefully (1) could be done in a concise way that
>>>>>>>>> allows for easy consumption from both Java and cross-language.
>>>>>>>>>
>>>>>>>>
>>>>>>>> +1 but I think the hard part today is to convert existing
>>>>>>>> PTransforms to be schema-aware transform compatible (for example, change
>>>>>>>> input/output types and make sure parameters take Beam Schema compatible
>>>>>>>> types). But this makes sense for new transforms.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <
>>>>>>>>> byronellis@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>>>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>>>>>>>> DRY.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 to this simplification, it's a historical artifact that
>>>>>>>>>>> provides no value.
>>>>>>>>>>>
>>>>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>>>>>>>> itself.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>>>>> - A configuration object
>>>>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>>>>
>>>>>>>>>>>> The API is generally used like this:
>>>>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>>>>> returns a SchemaTransform
>>>>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according
>>>>>>>>>>>> to the configuration
>>>>>>>>>>>>
>>>>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>>>>> can combine the two steps if we have SchemaTransformProvider return the
>>>>>>>>>>>> PTransform directly.
>>>>>>>>>>>>
>>>>>>>>>>>> We can then remove the SchemaTransform class as it will be
>>>>>>>>>>>> obsolete. This should be safe to do; the only place it's used in our API is
>>>>>>>>>>>> here [3], and that can be simplified if we make this change (we'd just trim
>>>>>>>>>>>> `.buildTransform()` off the end as `provider.from(configRow)`
>>>>>>>>>>>> will directly return the PTransform).
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>>>>>> process of this API so I may be missing some information on why it was set
>>>>>>>>>>>> up this way.
>>>>>>>>>>>>
>>>>>>>>>>>> A few developers already raised questions about how there's
>>>>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java transform
>>>>>>>>>>>> portable. I wasn't involved in the design process of this API so I may be
>>>>>>>>>>>> missing some information, but my assumption is this was designed to follow
>>>>>>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However,
>>>>>>>>>>>> with the newer SchemaTransformProvider API, we dropped a few methods and
>>>>>>>>>>>> reduced the SchemaTransform class to have a generic buildTransform()
>>>>>>>>>>>> method. See the example of PubsubReadSchemaTransformProvider [6], where the
>>>>>>>>>>>> SchemaTransform interface and buildTransform method are
>>>>>>>>>>>> implemented just to satisfy the requirement that
>>>>>>>>>>>> SchemaTransformProvider::from return a SchemaTransform.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>>>>>>>> less convoluted to develop portable transforms.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>>>>> applying these changes to them should be straightforward. If people think
>>>>>>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Ahmed
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>>>>> [3]
>>>>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>>>>> [4]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>>>>> [5]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>>>>> [6]
>>>>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>>>>>>
>>>>>>>>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Ahmed Abualsaud via dev <de...@beam.apache.org>.
Can someone take a quick look at https://github.com/apache/beam/pull/27202?
If things look good, let's try getting it in before the release cut as I'm
also updating our cross-language documentation and would like to include
these changes.

Thank you,
Ahmed

On Thu, Jun 22, 2023 at 8:06 PM Reuven Lax <re...@google.com> wrote:

> The goal was to make schema transforms as efficient as hand-written
> coders. We found the avro encoding/decoding to often be quite inefficient,
> which is one reason we didn't use it for schemas.
>
> Our schema encoding is internal to Beam though, and not suggested for use
> external to a pipeline. For sources or sinks we still recommend using Avro
> (or proto).
>
> On Thu, Jun 22, 2023 at 4:14 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <
>> ahmedabualsaud@google.com> wrote:
>>
>>> Thank you all for your input. I have a PR for the changes I mentioned in
>>> my initial email: https://github.com/apache/beam/pull/27202. Please
>>> review when you get a chance!
>>>
>>> > perhaps we should consider just going to something Avro for portable
>>> coding rather than something custom
>>>
>>> Did you mean using some Avro object (GenericRecord?) besides Beam Row
>>> elements? We would still run into the problem Cham mentioned earlier (of
>>> making sure existing PTransform inputs/outputs are compatible with
>>> cross-language-valid types).
>>>
>>
>> I don't remember why Avro was rejected in favor of our own encoding
>> format, but it probably doesn't make sense to revisit that without
>> understanding the full history. I do know that avro versioning and diamond
>> dependencies cause a lot of pain for users and there's a concerted effort
>> to remove Avro from Beam core altogether.
>>
>> In any case, this is quite orthogonal to the proposal here which we
>> should move forward on.
>>
>>
>>> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <by...@google.com>
>>> wrote:
>>>
>>>> Sure, I get that… though perhaps we should consider just going to
>>>> something Avro for portable coding rather than something custom.
>>>>
>>>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <
>>>> chamikara@google.com> wrote:
>>>>
>>>>> Input/output PCollection types at least have to be portable Beam types
>>>>> [1] for cross-language to work.
>>>>>
>>>>> I think we restricted schema-aware transforms to PCollection<Row>
>>>>> since Row was expected to be an efficient replacement for arbitrary
>>>>> portable Beam types (not sure how true that is in practice currently).
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>>>
>>>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <by...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Is it actually necessary for a PTransform that is configured via the
>>>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>>>>>> separate concerns and unnecessarily limiting.
>>>>>>
>>>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>>>> chamikara@google.com> wrote:
>>>>>>
>>>>>>> +1 for the simplification.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <
>>>>>>> robertwb@google.com> wrote:
>>>>>>>
>>>>>>>> Yeah. Essentially one needs do (1) name the arguments and (2)
>>>>>>>> implement the transform. Hopefully (1) could be done in a concise way that
>>>>>>>> allows for easy consumption from both Java and cross-language.
>>>>>>>>
>>>>>>>
>>>>>>> +1 but I think the hard part today is to convert existing
>>>>>>> PTransforms to be schema-aware transform compatible (for example, change
>>>>>>> input/output types and make sure parameters take Beam Schema compatible
>>>>>>> types). But this makes sense for new transforms.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>>>>>>> DRY.
>>>>>>>>>
>>>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> +1 to this simplification, it's a historical artifact that
>>>>>>>>>> provides no value.
>>>>>>>>>>
>>>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>>>>>>> itself.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>
>>>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>>>> - A configuration object
>>>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>>>
>>>>>>>>>>> The API is generally used like this:
>>>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>>>> returns a SchemaTransform
>>>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according
>>>>>>>>>>> to the configuration
>>>>>>>>>>>
>>>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>>>> can combine the two steps if we have SchemaTransformProvider return the
>>>>>>>>>>> PTransform directly.
>>>>>>>>>>>
>>>>>>>>>>> We can then remove the SchemaTransform class as it will be
>>>>>>>>>>> obsolete. This should be safe to do; the only place it's used in our API is
>>>>>>>>>>> here [3], and that can be simplified if we make this change (we'd just trim
>>>>>>>>>>> `.buildTransform()` off the end as `provider.from(configRow)`
>>>>>>>>>>> will directly return the PTransform).
>>>>>>>>>>>
>>>>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>>>>> process of this API so I may be missing some information on why it was set
>>>>>>>>>>> up this way.
>>>>>>>>>>>
>>>>>>>>>>> A few developers already raised questions about how there's
>>>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java transform
>>>>>>>>>>> portable. I wasn't involved in the design process of this API so I may be
>>>>>>>>>>> missing some information, but my assumption is this was designed to follow
>>>>>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However, with
>>>>>>>>>>> the newer SchemaTransformProvider API, we dropped a few methods and reduced
>>>>>>>>>>> the SchemaTransform class to have a generic buildTransform() method. See
>>>>>>>>>>> the example of PubsubReadSchemaTransformProvider [6], where the
>>>>>>>>>>> SchemaTransform interface and buildTransform method are
>>>>>>>>>>> implemented just to satisfy the requirement that
>>>>>>>>>>> SchemaTransformProvider::from return a SchemaTransform.
>>>>>>>>>>>
>>>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>>>>>>> less convoluted to develop portable transforms.
>>>>>>>>>>>
>>>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>>>> applying these changes to them should be straightforward. If people think
>>>>>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Ahmed
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>>>> [2]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>>>> [3]
>>>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>>>> [4]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>>>> [5]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>>>> [6]
>>>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>>>>>
>>>>>>>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Reuven Lax via dev <de...@beam.apache.org>.
The goal was to make schema transforms as efficient as hand-written coders.
We found the avro encoding/decoding to often be quite inefficient, which is
one reason we didn't use it for schemas.

Our schema encoding is internal to Beam though, and not suggested for use
external to a pipeline. For sources or sinks we still recommend using Avro
(or proto).

On Thu, Jun 22, 2023 at 4:14 PM Robert Bradshaw <ro...@google.com> wrote:

> On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <ah...@google.com>
> wrote:
>
>> Thank you all for your input. I have a PR for the changes I mentioned in
>> my initial email: https://github.com/apache/beam/pull/27202. Please
>> review when you get a chance!
>>
>> > perhaps we should consider just going to something Avro for portable
>> coding rather than something custom
>>
>> Did you mean using some Avro object (GenericRecord?) besides Beam Row
>> elements? We would still run into the problem Cham mentioned earlier (of
>> making sure existing PTransform inputs/outputs are compatible with
>> cross-language-valid types).
>>
>
> I don't remember why Avro was rejected in favor of our own encoding
> format, but it probably doesn't make sense to revisit that without
> understanding the full history. I do know that avro versioning and diamond
> dependencies cause a lot of pain for users and there's a concerted effort
> to remove Avro from Beam core altogether.
>
> In any case, this is quite orthogonal to the proposal here which we should
> move forward on.
>
>
>> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <by...@google.com>
>> wrote:
>>
>>> Sure, I get that… though perhaps we should consider just going to
>>> something Avro for portable coding rather than something custom.
>>>
>>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>> Input/output PCollection types at least have to be portable Beam types
>>>> [1] for cross-language to work.
>>>>
>>>> I think we restricted schema-aware transforms to PCollection<Row> since
>>>> Row was expected to be an efficient replacement for arbitrary portable Beam
>>>> types (not sure how true that is in practice currently).
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>>
>>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <by...@google.com>
>>>> wrote:
>>>>
>>>>> Is it actually necessary for a PTransform that is configured via the
>>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>>>>> separate concerns and unnecessarily limiting.
>>>>>
>>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>>> chamikara@google.com> wrote:
>>>>>
>>>>>> +1 for the simplification.
>>>>>>
>>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yeah. Essentially one needs do (1) name the arguments and (2)
>>>>>>> implement the transform. Hopefully (1) could be done in a concise way that
>>>>>>> allows for easy consumption from both Java and cross-language.
>>>>>>>
>>>>>>
>>>>>> +1 but I think the hard part today is to convert existing PTransforms
>>>>>> to be schema-aware transform compatible (for example, change input/output
>>>>>> types and make sure parameters take Beam Schema compatible types). But this
>>>>>> makes sense for new transforms.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>>>>>> DRY.
>>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> +1 to this simplification, it's a historical artifact that
>>>>>>>>> provides no value.
>>>>>>>>>
>>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>>>>>> itself.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>>> - A configuration object
>>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>>
>>>>>>>>>> The API is generally used like this:
>>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>>> returns a SchemaTransform
>>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according to
>>>>>>>>>> the configuration
>>>>>>>>>>
>>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>>> can combine the two steps if we have SchemaTransformProvider return the
>>>>>>>>>> PTransform directly.
>>>>>>>>>>
>>>>>>>>>> We can then remove the SchemaTransform class as it will be
>>>>>>>>>> obsolete. This should be safe to do; the only place it's used in our API is
>>>>>>>>>> here [3], and that can be simplified if we make this change (we'd just trim
>>>>>>>>>> `.buildTransform()` off the end as `provider.from(configRow)`
>>>>>>>>>> will directly return the PTransform).
>>>>>>>>>>
>>>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>>>> process of this API so I may be missing some information on why it was set
>>>>>>>>>> up this way.
>>>>>>>>>>
>>>>>>>>>> A few developers already raised questions about how there's
>>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java transform
>>>>>>>>>> portable. I wasn't involved in the design process of this API so I may be
>>>>>>>>>> missing some information, but my assumption is this was designed to follow
>>>>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However, with
>>>>>>>>>> the newer SchemaTransformProvider API, we dropped a few methods and reduced
>>>>>>>>>> the SchemaTransform class to have a generic buildTransform() method. See
>>>>>>>>>> the example of PubsubReadSchemaTransformProvider [6], where the
>>>>>>>>>> SchemaTransform interface and buildTransform method are
>>>>>>>>>> implemented just to satisfy the requirement that
>>>>>>>>>> SchemaTransformProvider::from return a SchemaTransform.
>>>>>>>>>>
>>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>>>>>> less convoluted to develop portable transforms.
>>>>>>>>>>
>>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>>> applying these changes to them should be straightforward. If people think
>>>>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Ahmed
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>>> [2]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>>> [3]
>>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>>> [4]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>>> [5]
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>>> [6]
>>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>>>>
>>>>>>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.
On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <ah...@google.com>
wrote:

> Thank you all for your input. I have a PR for the changes I mentioned in
> my initial email: https://github.com/apache/beam/pull/27202. Please
> review when you get a chance!
>
> > perhaps we should consider just going to something Avro for portable
> coding rather than something custom
>
> Did you mean using some Avro object (GenericRecord?) besides Beam Row
> elements? We would still run into the problem Cham mentioned earlier (of
> making sure existing PTransform inputs/outputs are compatible with
> cross-language-valid types).
>

I don't remember why Avro was rejected in favor of our own encoding format,
but it probably doesn't make sense to revisit that without
understanding the full history. I do know that avro versioning and diamond
dependencies cause a lot of pain for users and there's a concerted effort
to remove Avro from Beam core altogether.

In any case, this is quite orthogonal to the proposal here which we should
move forward on.


> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <by...@google.com>
> wrote:
>
>> Sure, I get that… though perhaps we should consider just going to
>> something Avro for portable coding rather than something custom.
>>
>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> Input/output PCollection types at least have to be portable Beam types
>>> [1] for cross-language to work.
>>>
>>> I think we restricted schema-aware transforms to PCollection<Row> since
>>> Row was expected to be an efficient replacement for arbitrary portable Beam
>>> types (not sure how true that is in practice currently).
>>>
>>> Thanks,
>>> Cham
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>
>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <by...@google.com>
>>> wrote:
>>>
>>>> Is it actually necessary for a PTransform that is configured via the
>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>>>> separate concerns and unnecessarily limiting.
>>>>
>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>> chamikara@google.com> wrote:
>>>>
>>>>> +1 for the simplification.
>>>>>
>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Yeah. Essentially one needs do (1) name the arguments and (2)
>>>>>> implement the transform. Hopefully (1) could be done in a concise way that
>>>>>> allows for easy consumption from both Java and cross-language.
>>>>>>
>>>>>
>>>>> +1 but I think the hard part today is to convert existing PTransforms
>>>>> to be schema-aware transform compatible (for example, change input/output
>>>>> types and make sure parameters take Beam Schema compatible types). But this
>>>>> makes sense for new transforms.
>>>>>
>>>>>
>>>>>
>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>>>>> DRY.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> +1 to this simplification, it's a historical artifact that provides
>>>>>>>> no value.
>>>>>>>>
>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>>>>> itself.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hey everyone,
>>>>>>>>>
>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>> - A configuration object
>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>
>>>>>>>>> The API is generally used like this:
>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>> returns a SchemaTransform
>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according to
>>>>>>>>> the configuration
>>>>>>>>>
>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>> can combine the two steps if we have SchemaTransformProvider return the
>>>>>>>>> PTransform directly.
>>>>>>>>>
>>>>>>>>> We can then remove the SchemaTransform class as it will be
>>>>>>>>> obsolete. This should be safe to do; the only place it's used in our API is
>>>>>>>>> here [3], and that can be simplified if we make this change (we'd just trim
>>>>>>>>> `.buildTransform()` off the end as `provider.from(configRow)`
>>>>>>>>> will directly return the PTransform).
>>>>>>>>>
>>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>>> process of this API so I may be missing some information on why it was set
>>>>>>>>> up this way.
>>>>>>>>>
>>>>>>>>> A few developers already raised questions about how there's
>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java transform
>>>>>>>>> portable. I wasn't involved in the design process of this API so I may be
>>>>>>>>> missing some information, but my assumption is this was designed to follow
>>>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However, with
>>>>>>>>> the newer SchemaTransformProvider API, we dropped a few methods and reduced
>>>>>>>>> the SchemaTransform class to have a generic buildTransform() method. See
>>>>>>>>> the example of PubsubReadSchemaTransformProvider [6], where the
>>>>>>>>> SchemaTransform interface and buildTransform method are
>>>>>>>>> implemented just to satisfy the requirement that
>>>>>>>>> SchemaTransformProvider::from return a SchemaTransform.
>>>>>>>>>
>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>>>>> less convoluted to develop portable transforms.
>>>>>>>>>
>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>> applying these changes to them should be straightforward. If people think
>>>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Ahmed
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>> [2]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>> [3]
>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>> [4]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>> [5]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>> [6]
>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>>>
>>>>>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Ahmed Abualsaud via dev <de...@beam.apache.org>.
Thank you all for your input. I have a PR for the changes I mentioned in my
initial email: https://github.com/apache/beam/pull/27202. Please review
when you get a chance!

> perhaps we should consider just going to something Avro for portable
coding rather than something custom

Did you mean using some Avro object (GenericRecord?) besides Beam Row
elements? We would still run into the problem Cham mentioned earlier (of
making sure existing PTransform inputs/outputs are compatible with
cross-language-valid types).

Ahmed

On Tue, May 30, 2023 at 10:53 PM Byron Ellis <by...@google.com> wrote:

> Sure, I get that… though perhaps we should consider just going to
> something Avro for portable coding rather than something custom.
>
> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> Input/output PCollection types at least have to be portable Beam types
>> [1] for cross-language to work.
>>
>> I think we restricted schema-aware transforms to PCollection<Row> since
>> Row was expected to be an efficient replacement for arbitrary portable Beam
>> types (not sure how true that is in practice currently).
>>
>> Thanks,
>> Cham
>>
>> [1]
>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>
>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <by...@google.com>
>> wrote:
>>
>>> Is it actually necessary for a PTransform that is configured via the
>>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>>> separate concerns and unnecessarily limiting.
>>>
>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>> +1 for the simplification.
>>>>
>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> Yeah. Essentially one needs do (1) name the arguments and (2)
>>>>> implement the transform. Hopefully (1) could be done in a concise way that
>>>>> allows for easy consumption from both Java and cross-language.
>>>>>
>>>>
>>>> +1 but I think the hard part today is to convert existing PTransforms
>>>> to be schema-aware transform compatible (for example, change input/output
>>>> types and make sure parameters take Beam Schema compatible types). But this
>>>> makes sense for new transforms.
>>>>
>>>>
>>>>
>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>>>> DRY.
>>>>>>
>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> +1 to this simplification, it's a historical artifact that provides
>>>>>>> no value.
>>>>>>>
>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>>>> itself.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>> - A configuration object
>>>>>>>> - SchemaTransform [2]
>>>>>>>>
>>>>>>>> The API is generally used like this:
>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>> returns a SchemaTransform
>>>>>>>> 2. The SchemaTransform is used to build a PTransform according to
>>>>>>>> the configuration
>>>>>>>>
>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>>>>>>> combine the two steps if we have SchemaTransformProvider return the
>>>>>>>> PTransform directly.
>>>>>>>>
>>>>>>>> We can then remove the SchemaTransform class as it will be
>>>>>>>> obsolete. This should be safe to do; the only place it's used in our API is
>>>>>>>> here [3], and that can be simplified if we make this change (we'd just trim
>>>>>>>> `.buildTransform()` off the end as `provider.from(configRow)` will
>>>>>>>> directly return the PTransform).
>>>>>>>>
>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>> process of this API so I may be missing some information on why it was set
>>>>>>>> up this way.
>>>>>>>>
>>>>>>>> A few developers already raised questions about how there's
>>>>>>>> seemingly unnecessary boilerplate involved in making a Java transform
>>>>>>>> portable. I wasn't involved in the design process of this API so I may be
>>>>>>>> missing some information, but my assumption is this was designed to follow
>>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However, with
>>>>>>>> the newer SchemaTransformProvider API, we dropped a few methods and reduced
>>>>>>>> the SchemaTransform class to have a generic buildTransform() method. See
>>>>>>>> the example of PubsubReadSchemaTransformProvider [6], where the
>>>>>>>> SchemaTransform interface and buildTransform method are
>>>>>>>> implemented just to satisfy the requirement that
>>>>>>>> SchemaTransformProvider::from return a SchemaTransform.
>>>>>>>>
>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>>>> less convoluted to develop portable transforms.
>>>>>>>>
>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>> applying these changes to them should be straightforward. If people think
>>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Ahmed
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>> [2]
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>> [3]
>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>> [4]
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>> [5]
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>> [6]
>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>>
>>>>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Byron Ellis via dev <de...@beam.apache.org>.
Sure, I get that… though perhaps we should consider just going to something
Avro for portable coding rather than something custom.

On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <ch...@google.com>
wrote:

> Input/output PCollection types at least have to be portable Beam types [1]
> for cross-language to work.
>
> I think we restricted schema-aware transforms to PCollection<Row> since
> Row was expected to be an efficient replacement for arbitrary portable Beam
> types (not sure how true that is in practice currently).
>
> Thanks,
> Cham
>
> [1]
> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>
> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <by...@google.com> wrote:
>
>> Is it actually necessary for a PTransform that is configured via the
>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>> separate concerns and unnecessarily limiting.
>>
>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> +1 for the simplification.
>>>
>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> Yeah. Essentially one needs do (1) name the arguments and (2) implement
>>>> the transform. Hopefully (1) could be done in a concise way that allows for
>>>> easy consumption from both Java and cross-language.
>>>>
>>>
>>> +1 but I think the hard part today is to convert existing PTransforms to
>>> be schema-aware transform compatible (for example, change input/output
>>> types and make sure parameters take Beam Schema compatible types). But this
>>> makes sense for new transforms.
>>>
>>>
>>>
>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com>
>>>> wrote:
>>>>
>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>>> DRY.
>>>>>
>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> +1 to this simplification, it's a historical artifact that provides
>>>>>> no value.
>>>>>>
>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>>> itself.
>>>>>>
>>>>>>
>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>> - SchemaTransformProvider [1]
>>>>>>> - A configuration object
>>>>>>> - SchemaTransform [2]
>>>>>>>
>>>>>>> The API is generally used like this:
>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>> returns a SchemaTransform
>>>>>>> 2. The SchemaTransform is used to build a PTransform according to
>>>>>>> the configuration
>>>>>>>
>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>>>>>> combine the two steps if we have SchemaTransformProvider return the
>>>>>>> PTransform directly.
>>>>>>>
>>>>>>> We can then remove the SchemaTransform class as it will be obsolete.
>>>>>>> This should be safe to do; the only place it's used in our API is here [3],
>>>>>>> and that can be simplified if we make this change (we'd just trim `
>>>>>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>>>>>> directly return the PTransform).
>>>>>>>
>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>> process of this API so I may be missing some information on why it was set
>>>>>>> up this way.
>>>>>>>
>>>>>>> A few developers already raised questions about how there's
>>>>>>> seemingly unnecessary boilerplate involved in making a Java transform
>>>>>>> portable. I wasn't involved in the design process of this API so I may be
>>>>>>> missing some information, but my assumption is this was designed to follow
>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However, with the
>>>>>>> newer SchemaTransformProvider API, we dropped a few methods and reduced the
>>>>>>> SchemaTransform class to have a generic buildTransform() method. See the
>>>>>>> example of PubsubReadSchemaTransformProvider [6], where the
>>>>>>> SchemaTransform interface and buildTransform method are implemented
>>>>>>> just to satisfy the requirement that SchemaTransformProvider::from
>>>>>>> return a SchemaTransform.
>>>>>>>
>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>>> less convoluted to develop portable transforms.
>>>>>>>
>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>> applying these changes to them should be straightforward. If people think
>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>
>>>>>>> Best,
>>>>>>> Ahmed
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>> [2]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>> [3]
>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>> [4]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>> [5]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>> [6]
>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>
>>>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Chamikara Jayalath via dev <de...@beam.apache.org>.
Input/output PCollection types at least have to be portable Beam types [1]
for cross-language to work.

I think we restricted schema-aware transforms to PCollection<Row> since Row
was expected to be an efficient replacement for arbitrary portable Beam
types (not sure how true that is in practice currently).

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829

On Tue, May 30, 2023 at 1:54 PM Byron Ellis <by...@google.com> wrote:

> Is it actually necessary for a PTransform that is configured via the
> Schema mechanism to also be one that uses RowCoder? Those strike me as two
> separate concerns and unnecessarily limiting.
>
> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> +1 for the simplification.
>>
>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> Yeah. Essentially one needs do (1) name the arguments and (2) implement
>>> the transform. Hopefully (1) could be done in a concise way that allows for
>>> easy consumption from both Java and cross-language.
>>>
>>
>> +1 but I think the hard part today is to convert existing PTransforms to
>> be schema-aware transform compatible (for example, change input/output
>> types and make sure parameters take Beam Schema compatible types). But this
>> makes sense for new transforms.
>>
>>
>>
>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com>
>>> wrote:
>>>
>>>> Or perhaps the other way around? If you have a Schema we can
>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>> DRY.
>>>>
>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> +1 to this simplification, it's a historical artifact that provides no
>>>>> value.
>>>>>
>>>>> I would love it if we also looked into ways to auto-generate the
>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>> itself.
>>>>>
>>>>>
>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>> - SchemaTransformProvider [1]
>>>>>> - A configuration object
>>>>>> - SchemaTransform [2]
>>>>>>
>>>>>> The API is generally used like this:
>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>> returns a SchemaTransform
>>>>>> 2. The SchemaTransform is used to build a PTransform according to the
>>>>>> configuration
>>>>>>
>>>>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>>>>> combine the two steps if we have SchemaTransformProvider return the
>>>>>> PTransform directly.
>>>>>>
>>>>>> We can then remove the SchemaTransform class as it will be obsolete.
>>>>>> This should be safe to do; the only place it's used in our API is here [3],
>>>>>> and that can be simplified if we make this change (we'd just trim `
>>>>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>>>>> directly return the PTransform).
>>>>>>
>>>>>> I'd like to first mention that I was not involved in the design
>>>>>> process of this API so I may be missing some information on why it was set
>>>>>> up this way.
>>>>>>
>>>>>> A few developers already raised questions about how there's seemingly
>>>>>> unnecessary boilerplate involved in making a Java transform portable. I
>>>>>> wasn't involved in the design process of this API so I may be missing some
>>>>>> information, but my assumption is this was designed to follow the pattern
>>>>>> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>>>>>> SchemaIO[5] -> PTransform. However, with the newer
>>>>>> SchemaTransformProvider API, we dropped a few methods and reduced the
>>>>>> SchemaTransform class to have a generic buildTransform() method. See the
>>>>>> example of PubsubReadSchemaTransformProvider [6], where the
>>>>>> SchemaTransform interface and buildTransform method are implemented
>>>>>> just to satisfy the requirement that SchemaTransformProvider::from
>>>>>> return a SchemaTransform.
>>>>>>
>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>>> less convoluted to develop portable transforms.
>>>>>>
>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>> applying these changes to them should be straightforward. If people think
>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>
>>>>>> Best,
>>>>>> Ahmed
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>> [2]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>> [3]
>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>> [4]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>> [5]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>> [6]
>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>
>>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Byron Ellis via dev <de...@beam.apache.org>.
Is it actually necessary for a PTransform that is configured via the Schema
mechanism to also be one that uses RowCoder? Those strike me as two
separate concerns and unnecessarily limiting.

On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <ch...@google.com>
wrote:

> +1 for the simplification.
>
> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Yeah. Essentially one needs do (1) name the arguments and (2) implement
>> the transform. Hopefully (1) could be done in a concise way that allows for
>> easy consumption from both Java and cross-language.
>>
>
> +1 but I think the hard part today is to convert existing PTransforms to
> be schema-aware transform compatible (for example, change input/output
> types and make sure parameters take Beam Schema compatible types). But this
> makes sense for new transforms.
>
>
>
>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com>
>> wrote:
>>
>>> Or perhaps the other way around? If you have a Schema we can
>>> auto-generate the associated builder on the PTransform? Either way, more
>>> DRY.
>>>
>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> +1 to this simplification, it's a historical artifact that provides no
>>>> value.
>>>>
>>>> I would love it if we also looked into ways to auto-generate the
>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>> small number of arguments, or uses the standard builder pattern...),
>>>> ideally with something as simple as adding a decorator to the PTransform
>>>> itself.
>>>>
>>>>
>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I was looking at how we use SchemaTransforms in our expansion service.
>>>>> From what I see, there may be a redundant step in developing
>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>> - SchemaTransformProvider [1]
>>>>> - A configuration object
>>>>> - SchemaTransform [2]
>>>>>
>>>>> The API is generally used like this:
>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>> returns a SchemaTransform
>>>>> 2. The SchemaTransform is used to build a PTransform according to the
>>>>> configuration
>>>>>
>>>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>>>> combine the two steps if we have SchemaTransformProvider return the
>>>>> PTransform directly.
>>>>>
>>>>> We can then remove the SchemaTransform class as it will be obsolete.
>>>>> This should be safe to do; the only place it's used in our API is here [3],
>>>>> and that can be simplified if we make this change (we'd just trim `
>>>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>>>> directly return the PTransform).
>>>>>
>>>>> I'd like to first mention that I was not involved in the design
>>>>> process of this API so I may be missing some information on why it was set
>>>>> up this way.
>>>>>
>>>>> A few developers already raised questions about how there's seemingly
>>>>> unnecessary boilerplate involved in making a Java transform portable. I
>>>>> wasn't involved in the design process of this API so I may be missing some
>>>>> information, but my assumption is this was designed to follow the pattern
>>>>> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>>>>> SchemaIO[5] -> PTransform. However, with the newer
>>>>> SchemaTransformProvider API, we dropped a few methods and reduced the
>>>>> SchemaTransform class to have a generic buildTransform() method. See the
>>>>> example of PubsubReadSchemaTransformProvider [6], where the
>>>>> SchemaTransform interface and buildTransform method are implemented
>>>>> just to satisfy the requirement that SchemaTransformProvider::from
>>>>> return a SchemaTransform.
>>>>>
>>>>> I'm bringing this up because if we are looking to encourage
>>>>> contribution to cross-language use cases, we should make it simpler and
>>>>> less convoluted to develop portable transforms.
>>>>>
>>>>> There are a number of SchemaTransforms already developed, but applying
>>>>> these changes to them should be straightforward. If people think this is a
>>>>> good idea, I can open a PR and implement them.
>>>>>
>>>>> Best,
>>>>> Ahmed
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>> [2]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>> [3]
>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>> [4]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>> [5]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>> [6]
>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>
>>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Chamikara Jayalath via dev <de...@beam.apache.org>.
+1 for the simplification.

On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
wrote:

> Yeah. Essentially one needs do (1) name the arguments and (2) implement
> the transform. Hopefully (1) could be done in a concise way that allows for
> easy consumption from both Java and cross-language.
>

+1 but I think the hard part today is to convert existing PTransforms to be
schema-aware transform compatible (for example, change input/output types
and make sure parameters take Beam Schema compatible types). But this makes
sense for new transforms.



> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com>
> wrote:
>
>> Or perhaps the other way around? If you have a Schema we can
>> auto-generate the associated builder on the PTransform? Either way, more
>> DRY.
>>
>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 to this simplification, it's a historical artifact that provides no
>>> value.
>>>
>>> I would love it if we also looked into ways to auto-generate the
>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>> small number of arguments, or uses the standard builder pattern...),
>>> ideally with something as simple as adding a decorator to the PTransform
>>> itself.
>>>
>>>
>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> I was looking at how we use SchemaTransforms in our expansion service.
>>>> From what I see, there may be a redundant step in developing
>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>> - SchemaTransformProvider [1]
>>>> - A configuration object
>>>> - SchemaTransform [2]
>>>>
>>>> The API is generally used like this:
>>>> 1. The SchemaTransformProvider takes a configuration object and returns
>>>> a SchemaTransform
>>>> 2. The SchemaTransform is used to build a PTransform according to the
>>>> configuration
>>>>
>>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>>> combine the two steps if we have SchemaTransformProvider return the
>>>> PTransform directly.
>>>>
>>>> We can then remove the SchemaTransform class as it will be obsolete.
>>>> This should be safe to do; the only place it's used in our API is here [3],
>>>> and that can be simplified if we make this change (we'd just trim `
>>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>>> directly return the PTransform).
>>>>
>>>> I'd like to first mention that I was not involved in the design process
>>>> of this API so I may be missing some information on why it was set up this
>>>> way.
>>>>
>>>> A few developers already raised questions about how there's seemingly
>>>> unnecessary boilerplate involved in making a Java transform portable. I
>>>> wasn't involved in the design process of this API so I may be missing some
>>>> information, but my assumption is this was designed to follow the pattern
>>>> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>>>> SchemaIO[5] -> PTransform. However, with the newer
>>>> SchemaTransformProvider API, we dropped a few methods and reduced the
>>>> SchemaTransform class to have a generic buildTransform() method. See the
>>>> example of PubsubReadSchemaTransformProvider [6], where the
>>>> SchemaTransform interface and buildTransform method are implemented
>>>> just to satisfy the requirement that SchemaTransformProvider::from
>>>> return a SchemaTransform.
>>>>
>>>> I'm bringing this up because if we are looking to encourage
>>>> contribution to cross-language use cases, we should make it simpler and
>>>> less convoluted to develop portable transforms.
>>>>
>>>> There are a number of SchemaTransforms already developed, but applying
>>>> these changes to them should be straightforward. If people think this is a
>>>> good idea, I can open a PR and implement them.
>>>>
>>>> Best,
>>>> Ahmed
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>> [2]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>> [3]
>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>> [4]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>> [5]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>> [6]
>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>
>>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.
Yeah. Essentially one needs do (1) name the arguments and (2) implement the
transform. Hopefully (1) could be done in a concise way that allows for
easy consumption from both Java and cross-language.

On Tue, May 30, 2023 at 12:25 PM Byron Ellis <by...@google.com> wrote:

> Or perhaps the other way around? If you have a Schema we can auto-generate
> the associated builder on the PTransform? Either way, more DRY.
>
> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> +1 to this simplification, it's a historical artifact that provides no
>> value.
>>
>> I would love it if we also looked into ways to auto-generate the
>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>> small number of arguments, or uses the standard builder pattern...),
>> ideally with something as simple as adding a decorator to the PTransform
>> itself.
>>
>>
>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> I was looking at how we use SchemaTransforms in our expansion service.
>>> From what I see, there may be a redundant step in developing
>>> SchemaTransforms. Currently, we have 3 pieces:
>>> - SchemaTransformProvider [1]
>>> - A configuration object
>>> - SchemaTransform [2]
>>>
>>> The API is generally used like this:
>>> 1. The SchemaTransformProvider takes a configuration object and returns
>>> a SchemaTransform
>>> 2. The SchemaTransform is used to build a PTransform according to the
>>> configuration
>>>
>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>> combine the two steps if we have SchemaTransformProvider return the
>>> PTransform directly.
>>>
>>> We can then remove the SchemaTransform class as it will be obsolete.
>>> This should be safe to do; the only place it's used in our API is here [3],
>>> and that can be simplified if we make this change (we'd just trim `
>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>> directly return the PTransform).
>>>
>>> I'd like to first mention that I was not involved in the design process
>>> of this API so I may be missing some information on why it was set up this
>>> way.
>>>
>>> A few developers already raised questions about how there's seemingly
>>> unnecessary boilerplate involved in making a Java transform portable. I
>>> wasn't involved in the design process of this API so I may be missing some
>>> information, but my assumption is this was designed to follow the pattern
>>> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>>> SchemaIO[5] -> PTransform. However, with the newer
>>> SchemaTransformProvider API, we dropped a few methods and reduced the
>>> SchemaTransform class to have a generic buildTransform() method. See the
>>> example of PubsubReadSchemaTransformProvider [6], where the
>>> SchemaTransform interface and buildTransform method are implemented
>>> just to satisfy the requirement that SchemaTransformProvider::from
>>> return a SchemaTransform.
>>>
>>> I'm bringing this up because if we are looking to encourage contribution
>>> to cross-language use cases, we should make it simpler and less convoluted
>>> to develop portable transforms.
>>>
>>> There are a number of SchemaTransforms already developed, but applying
>>> these changes to them should be straightforward. If people think this is a
>>> good idea, I can open a PR and implement them.
>>>
>>> Best,
>>> Ahmed
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>> [3]
>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>> [4]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>> [5]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>> [6]
>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>
>>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Byron Ellis via dev <de...@beam.apache.org>.
Or perhaps the other way around? If you have a Schema we can auto-generate
the associated builder on the PTransform? Either way, more DRY.

On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
dev@beam.apache.org> wrote:

> +1 to this simplification, it's a historical artifact that provides no
> value.
>
> I would love it if we also looked into ways to auto-generate the
> SchemaTransformProvider (e.g. via introspection if a transform takes a
> small number of arguments, or uses the standard builder pattern...),
> ideally with something as simple as adding a decorator to the PTransform
> itself.
>
>
> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
> dev@beam.apache.org> wrote:
>
>> Hey everyone,
>>
>> I was looking at how we use SchemaTransforms in our expansion service.
>> From what I see, there may be a redundant step in developing
>> SchemaTransforms. Currently, we have 3 pieces:
>> - SchemaTransformProvider [1]
>> - A configuration object
>> - SchemaTransform [2]
>>
>> The API is generally used like this:
>> 1. The SchemaTransformProvider takes a configuration object and returns a
>> SchemaTransform
>> 2. The SchemaTransform is used to build a PTransform according to the
>> configuration
>>
>> In these steps, the SchemaTransform class seems unnecessary. We can
>> combine the two steps if we have SchemaTransformProvider return the
>> PTransform directly.
>>
>> We can then remove the SchemaTransform class as it will be obsolete. This
>> should be safe to do; the only place it's used in our API is here [3], and
>> that can be simplified if we make this change (we'd just trim `
>> .buildTransform()` off the end as `provider.from(configRow)` will
>> directly return the PTransform).
>>
>> I'd like to first mention that I was not involved in the design process
>> of this API so I may be missing some information on why it was set up this
>> way.
>>
>> A few developers already raised questions about how there's seemingly
>> unnecessary boilerplate involved in making a Java transform portable. I
>> wasn't involved in the design process of this API so I may be missing some
>> information, but my assumption is this was designed to follow the pattern
>> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>> SchemaIO[5] -> PTransform. However, with the newer
>> SchemaTransformProvider API, we dropped a few methods and reduced the
>> SchemaTransform class to have a generic buildTransform() method. See the
>> example of PubsubReadSchemaTransformProvider [6], where the
>> SchemaTransform interface and buildTransform method are implemented just
>> to satisfy the requirement that SchemaTransformProvider::from return a
>> SchemaTransform.
>>
>> I'm bringing this up because if we are looking to encourage contribution
>> to cross-language use cases, we should make it simpler and less convoluted
>> to develop portable transforms.
>>
>> There are a number of SchemaTransforms already developed, but applying
>> these changes to them should be straightforward. If people think this is a
>> good idea, I can open a PR and implement them.
>>
>> Best,
>> Ahmed
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>> [3]
>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>> [4]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>> [5]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>> [6]
>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>
>

Re: Proposal to reduce the steps to make a Java transform portable

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.
+1 to this simplification, it's a historical artifact that provides no
value.

I would love it if we also looked into ways to auto-generate the
SchemaTransformProvider (e.g. via introspection if a transform takes a
small number of arguments, or uses the standard builder pattern...),
ideally with something as simple as adding a decorator to the PTransform
itself.


On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <de...@beam.apache.org>
wrote:

> Hey everyone,
>
> I was looking at how we use SchemaTransforms in our expansion service.
> From what I see, there may be a redundant step in developing
> SchemaTransforms. Currently, we have 3 pieces:
> - SchemaTransformProvider [1]
> - A configuration object
> - SchemaTransform [2]
>
> The API is generally used like this:
> 1. The SchemaTransformProvider takes a configuration object and returns a
> SchemaTransform
> 2. The SchemaTransform is used to build a PTransform according to the
> configuration
>
> In these steps, the SchemaTransform class seems unnecessary. We can
> combine the two steps if we have SchemaTransformProvider return the
> PTransform directly.
>
> We can then remove the SchemaTransform class as it will be obsolete. This
> should be safe to do; the only place it's used in our API is here [3], and
> that can be simplified if we make this change (we'd just trim `
> .buildTransform()` off the end as `provider.from(configRow)` will
> directly return the PTransform).
>
> I'd like to first mention that I was not involved in the design process of
> this API so I may be missing some information on why it was set up this way.
>
> A few developers already raised questions about how there's seemingly
> unnecessary boilerplate involved in making a Java transform portable. I
> wasn't involved in the design process of this API so I may be missing some
> information, but my assumption is this was designed to follow the pattern
> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
> SchemaIO[5] -> PTransform. However, with the newer
> SchemaTransformProvider API, we dropped a few methods and reduced the
> SchemaTransform class to have a generic buildTransform() method. See the
> example of PubsubReadSchemaTransformProvider [6], where the
> SchemaTransform interface and buildTransform method are implemented just
> to satisfy the requirement that SchemaTransformProvider::from return a
> SchemaTransform.
>
> I'm bringing this up because if we are looking to encourage contribution
> to cross-language use cases, we should make it simpler and less convoluted
> to develop portable transforms.
>
> There are a number of SchemaTransforms already developed, but applying
> these changes to them should be straightforward. If people think this is a
> good idea, I can open a PR and implement them.
>
> Best,
> Ahmed
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
> [3]
> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
> [5]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
> [6]
> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>