Posted to dev@beam.apache.org by Anton Kedin <ke...@google.com> on 2018/10/03 18:25:47 UTC

Java SDK Extensions

Hi dev@,

*TL;DR:* `sdks/java/extensions` is hard to discover, navigate and
understand.

*Current State:*

I was looking at `sdks/java/extensions`[1] and realized that I don't know
what half of those things are. Only `join-library` and `sorter` seem to be
documented and discoverable on the Beam website, under the SDKs section [2].

Here's the list of all extensions with my questions/comments:
 - *google-cloud-platform-core*. What is this? Is this used in GCP IOs? If
so, is `extensions` the right place for it? If it is, then why is it a
`-core` extension? It feels like it's a utility package, not an extension;
 - *jackson*. I can guess what it is but we should document it somewhere;
 - *join-library*. It is documented, but I think we should add more
documentation to explain how it works, maybe some caveats, and link to/from
the `CoGBK` section of the doc;
 - *protobuf*. I can probably guess what it is. Is 'extensions' the right
place for it though? We use this library in IOs (`PubsubIO.readProtos()`),
so should we move it to IO then? Same as with the GCP extension, it feels
like a utility library, not an extension;
 - *sketching*. No idea what to expect from this without reading the code;
 - *sorter*. Documented on the website;
 - *sql*. This looks familiar :) It is documented but not linked from the
extensions section, and it's unclear whether this is the whole of SQL or
just some related components;

[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/

*Questions:*

 - should we minimally document (at least describe) all extensions and add
at least short readme.md's with the links to the Beam website?
 - is it the right thing to depend on `extensions` in other components like
IOs?
 - would it make sense to move some things out of 'extensions'? E.g. IO
components to IO or utility package, SQL into new DSLs package;

*Opinion:*

Maybe I am misunderstanding the intent and meaning of 'extensions', but
from my perspective:

 - I think that extensions should be more or less isolated from the Beam
SDK itself, so that if you delete or modify them, no Beam-internal changes
will be required (changes to something that's not an extension). And my
feeling is that they should provide value by themselves to users other than
SDK authors. They are called 'extensions', not 'critical components' or
'sdk utilities';

 - I don't think that IOs should depend on 'extensions'. Otherwise the
question is, is it ok for other components, like runners, to do the same?
Or even core?

 - I think there are a few distinguishable classes of things in 'extensions'
right now:
     - collections of `PTransforms` with some business logic (Sorter, Join,
Sketching);
     - collections of `PTransforms` with a focus on parsing (Jackson,
Protobuf);
     - DSLs: the SQL DSL is more than just a few `PTransforms`; it can be
used almost as a standalone SDK. Things like Euphoria will probably end up
in the same class;
     - utility libraries that are shared by some parts of the SDK, where it
is unclear whether they are valuable by themselves to external users
(Protobuf, GCP core);
   To me it makes sense for the business logic and parsing libraries to stay
in extensions, but probably under different subdirectories. I think it would
make sense to split the others out of extensions into separate parts of the
SDK.

 - I think we should add readme.md's with short descriptions and links to
the Beam website;

Thoughts, comments?



Re: Java SDK Extensions

Posted by Ben Chambers <bc...@apache.org>.
On Wed, Oct 3, 2018 at 12:16 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Anton,
>
> jackson is the JSON extension, analogous to what we have for XML. Agree
> that it should be documented.
>
> Agree about join-library.
>
> sketching is a statistics extension providing ready-to-use stats
> CombineFns.
>
> Regards
> JB
>
> On 03/10/2018 20:25, Anton Kedin wrote:
> > Hi dev@,
> >
> > *TL;DR:* `sdks/java/extensions` is hard to discover, navigate and
> > understand.
> >
> > *Current State:*
> > I was looking at `sdks/java/extensions`[1] and realized that I don't
> > know what half of those things are. Only `join library` and `sorter`
> > seem to be documented and discoverable on Beam website, under SDKs
> > section [2].
> >
> > Here's the list of all extensions with my questions/comments:
> >   - /google-cloud-platform-core/. What is this? Is this used in GCP IOs?
> > If so, is `extensions` the right place for it? If it is, then why is it
> > a `-core` extension? It feels like it's a utility package, not an
> > extension;
> >   - /jackson/. I can guess what it is but we should document it
> > somewhere;
> >   - /join-library/. It is documented, but I think we should add more
> > documentation to explain how it works, maybe some caveats, and link
> > to/from the `CoGBK` section of the doc;
>

The docs should probably also point out that using the join-library twice to
join 3 input collections is less efficient than a single CoGBK over those 3
input collections.
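
A rough plain-Java model of that point, in case it helps the docs. This is
not Beam or join-library code: every name below is made up for illustration,
and it assumes each binary join costs one grouping pass (a stand-in for a
shuffle), which is what a CoGBK-based join would do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only: none of these names are Beam or join-library APIs.
class JoinPassesSketch {

  // Number of grouping passes performed so far; stands in for shuffles.
  static int groupingPasses = 0;

  // Groups N keyed inputs by key in ONE pass, loosely like a single CoGBK.
  // Result: key -> one value list per input, in input order.
  @SafeVarargs
  static Map<String, List<List<String>>> coGroup(Map<String, List<String>>... inputs) {
    groupingPasses++;
    Map<String, List<List<String>>> grouped = new HashMap<>();
    for (int i = 0; i < inputs.length; i++) {
      for (Map.Entry<String, List<String>> entry : inputs[i].entrySet()) {
        List<List<String>> slots = grouped.get(entry.getKey());
        if (slots == null) {
          slots = new ArrayList<>();
          for (int j = 0; j < inputs.length; j++) {
            slots.add(new ArrayList<>());
          }
          grouped.put(entry.getKey(), slots);
        }
        slots.get(i).addAll(entry.getValue());
      }
    }
    return grouped;
  }

  // Binary inner join; assumed to cost one grouping pass per call.
  static Map<String, List<String>> innerJoin(
      Map<String, List<String>> left, Map<String, List<String>> right) {
    Map<String, List<String>> joined = new HashMap<>();
    for (Map.Entry<String, List<List<String>>> entry : coGroup(left, right).entrySet()) {
      List<String> rows = new ArrayList<>();
      for (String l : entry.getValue().get(0)) {
        for (String r : entry.getValue().get(1)) {
          rows.add(l + "|" + r);
        }
      }
      if (!rows.isEmpty()) {
        joined.put(entry.getKey(), rows);
      }
    }
    return joined;
  }

  // Three-way inner join built on a single grouping pass over all inputs.
  static Map<String, List<String>> threeWayJoin(
      Map<String, List<String>> a, Map<String, List<String>> b, Map<String, List<String>> c) {
    Map<String, List<String>> joined = new HashMap<>();
    for (Map.Entry<String, List<List<String>>> entry : coGroup(a, b, c).entrySet()) {
      List<String> rows = new ArrayList<>();
      for (String x : entry.getValue().get(0)) {
        for (String y : entry.getValue().get(1)) {
          for (String z : entry.getValue().get(2)) {
            rows.add(x + "|" + y + "|" + z);
          }
        }
      }
      if (!rows.isEmpty()) {
        joined.put(entry.getKey(), rows);
      }
    }
    return joined;
  }

  public static void main(String[] args) {
    Map<String, List<String>> a = Map.of("k1", List.of("a1"), "k2", List.of("a2"));
    Map<String, List<String>> b = Map.of("k1", List.of("b1"));
    Map<String, List<String>> c = Map.of("k1", List.of("c1"), "k3", List.of("c3"));

    groupingPasses = 0;
    Map<String, List<String>> chained = innerJoin(innerJoin(a, b), c);
    System.out.println("chained binary joins, grouping passes: " + groupingPasses);

    groupingPasses = 0;
    Map<String, List<String>> single = threeWayJoin(a, b, c);
    System.out.println("single three-way co-group, grouping passes: " + groupingPasses);

    System.out.println("same result: " + chained.equals(single));
  }
}
```

In this model the chained version groups twice and the three-way version
groups once, while both produce the same joined rows. How much that costs in
a real pipeline depends on the runner and the data, so treat it as a picture
of the caveat rather than a measurement.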



Re: Java SDK Extensions

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Anton,

jackson is the JSON extension, analogous to what we have for XML. Agree that
it should be documented.

Agree about join-library.

sketching is a statistics extension providing ready-to-use stats
CombineFns.

Regards
JB
