Posted to dev@beam.apache.org by Shannon Duncan <jo...@liveramp.com> on 2019/07/03 22:03:49 UTC

Python Utilities

I have been writing a bunch of utilities for the Python SDK, such as joins,
selections, composite transforms, etc.

I am working with my company to see if I can open source the utilities.
Would it be best to post them as a separate PyPI project, or to PR them
into the Beam SDK? I assume that if they let me open source them, they will
want some attribution or something like that.

Thanks,
Shannon

Re: Python Utilities

Posted by Reuven Lax <re...@google.com>.
On Wed, Jul 10, 2019 at 9:56 AM Rui Wang <ru...@google.com> wrote:

> The second link points to the first join utility in Beam. The idea is
> similar: people can use the utility to do joins without writing their own.
> BeamSQL also uses it.
>
> The first link points to the Schema API. I actually thought the Schema API
> also used the join utility, but it turns out it doesn't (I am not sure
> why).
>

The Schema one is more general as well, in that it supports joining N
inputs.
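
As a rough illustration of what "joining N inputs" means, here is a
plain-Python sketch of the semantics only. It is not the Schema API or any
Beam code; the names cogroup and inner_join and the sample data are
hypothetical and exist only for this example.

```python
from collections import defaultdict
from itertools import product


def cogroup(*inputs):
    """Group N keyed inputs by key.

    Each input is an iterable of (key, value) pairs. The result maps every
    key to a list of N lists, one per input, holding that input's values.
    """
    grouped = defaultdict(lambda: [[] for _ in inputs])
    for index, pairs in enumerate(inputs):
        for key, value in pairs:
            grouped[key][index].append(value)
    return dict(grouped)


def inner_join(*inputs):
    """Inner-join N keyed inputs: keep only keys present in every input."""
    joined = []
    for key, value_lists in cogroup(*inputs).items():
        # product() yields nothing when any list is empty, which drops keys
        # that are missing from at least one input.
        for combo in product(*value_lists):
            joined.append((key, combo))
    return joined


# Hypothetical sample data: three inputs keyed by a user id.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "ink")]
cities = [(1, "Ghent"), (2, "Austin")]

result = inner_join(users, orders, cities)
```

Only key 1 appears in all three inputs, so result holds one row per matching
order for that key. A two-input join is the same function called with two
arguments.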


>
> Basically I think it's encouraged to reuse the same join utility if
> possible.
>
> -Rui
>
> On Wed, Jul 10, 2019 at 8:01 AM Shannon Duncan <jo...@liveramp.com>
> wrote:
>
>> So it seems that the Java SDK has two different Join libraries?
>>
>> With Schema:
>> https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>> And Another one:
>> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java
>>
>> So how does it handle that?
>>
>> On Mon, Jul 8, 2019 at 12:39 PM Shannon Duncan <
>> joseph.duncan@liveramp.com> wrote:
>>
>>> Yeah these are for local testing right now. I was hoping to gain insight
>>> on better naming.
>>>
>>> I was thinking of creating an "extras" module.
>>>
>>> On Mon, Jul 8, 2019, 12:28 PM Robin Qiu <ro...@google.com> wrote:
>>>
>>>> Hi Shannon,
>>>>
>>>> Thanks for sharing the repo! I took a quick look and I have a concern
>>>> with the naming of the transforms.
>>>>
>>>> Currently, Beam Java already has "Select" and "Join" transforms.
>>>> However, they work on schemas, a feature that is not yet implemented in
>>>> Beam Python. (See
>>>> https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>>>> )
>>>>
>>>> To maintain consistency between SDKs, I think it is good to avoid
>>>> having two different transforms with the same name but different functions.
>>>> So maybe you can consider renaming the transforms and/or putting them in an
>>>> extension Python module, instead of the main ones?
>>>>
>>>> Best,
>>>> Robin
>>>>
>>>> On Mon, Jul 8, 2019 at 9:19 AM Shannon Duncan <
>>>> joseph.duncan@liveramp.com> wrote:
>>>>
>>>>> As a follow-up, here is the repo that contains the utilities for now:
>>>>> https://github.com/shadowcodex/apache-beam-utilities. I will put
>>>>> together a proper PR as the code gets closer to production quality.
>>>>>
>>>>> - Shannon
>>>>>
>>>>> On Mon, Jul 8, 2019 at 9:20 AM Shannon Duncan <
>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>
>>>>>> Thanks Frederik,
>>>>>>
>>>>>> That's exactly where I was looking. I did get permission to open
>>>>>> source the utilities module, so I'm going to put them up on my personal
>>>>>> GitHub soon and share them with the email group for a look over.
>>>>>>
>>>>>> I'm going to work on the utilities there because it's a quick dev
>>>>>> environment, and once they are ready I'll begin working them into the
>>>>>> actual SDK for a proper PR.
>>>>>>
>>>>>> I also joined the Slack #beam and #beam-python channels, as I was unsure
>>>>>> where most collaborators discussed items.
>>>>>>
>>>>>> - Shannon
>>>>>>
>>>>>> On Mon, Jul 8, 2019 at 9:09 AM Frederik Bode <fr...@ml6.eu>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Shannon,
>>>>>>>
>>>>>>> This is probably a good starting point:
>>>>>>> https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68
>>>>>>> .
>>>>>>>
>>>>>>> Frederik
>>>>>>>
>>>>>>> On Mon, 8 Jul 2019 at 15:40, Shannon Duncan <
>>>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>>>
>>>>>>>> I'm sure I could use some of the existing aggregations as a guide
>>>>>>>> for writing the missing ones, such as Sum/Max/Min.
>>>>>>>>
>>>>>>>> GroupBy is really already handled with GroupByKey and CoGroupByKey
>>>>>>>> unless you are thinking of a different type of GroupBy?
>>>>>>>>
>>>>>>>> - Shannon
>>>>>>>>
>>>>>>>> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <ru...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Maybe also adding Aggregation/GroupBy as utilities?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Rui
>>>>>>>>>
>>>>>>>>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <
>>>>>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Valentyn,
>>>>>>>>>>
>>>>>>>>>> I'll outline the utilities and accept any suggestions to add /
>>>>>>>>>> modify. These are really just shortcut PTransforms that I am working on to
>>>>>>>>>> simplify creating pipelines.
>>>>>>>>>>
>>>>>>>>>> Currently the utilities contain the following PTransforms:
>>>>>>>>>>
>>>>>>>>>> - Inner Join
>>>>>>>>>> - Left Outer Join
>>>>>>>>>> - Right Outer Join
>>>>>>>>>> - Full Outer Join
>>>>>>>>>> - PrepareKey (For selecting items in a dictionary to act as a key
>>>>>>>>>> for the joins)
>>>>>>>>>> - Select (very simple filter that returns only items you want
>>>>>>>>>> from the dictionary) (allows for defining a default nullValue)
>>>>>>>>>>
>>>>>>>>>> Currently these operations only work with dictionaries, but I'd
>>>>>>>>>> be interested to see how it would work for <K,V> tuples.
>>>>>>>>>>
>>>>>>>>>> I'm new to Python, so they may not be optimized or the best way,
>>>>>>>>>> but from my understanding these seem to be the best way to do these types
>>>>>>>>>> of operations. Essentially, I created a pipeline that converts a
>>>>>>>>>> simple SQL query into a flow of these utilities. Using PrepareKey to define
>>>>>>>>>> your joining key, joining, and then selecting from the join allows you to
>>>>>>>>>> do a lot of powerful manipulation in a simple, familiar way.
>>>>>>>>>>
>>>>>>>>>> If this is something that we'd like to add to the Beam SDK I
>>>>>>>>>> don't mind looking at the contributor license agreement, and conversing
>>>>>>>>>> more on how to get them in.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shannon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <
>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Shannon,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for considering a contribution to the Beam Python SDK. With a
>>>>>>>>>>> direct contribution to the Beam SDK, your change will reach a larger
>>>>>>>>>>> audience of users, and you will not have to maintain a separate project
>>>>>>>>>>> and keep it up to date with new releases of Beam.
>>>>>>>>>>>
>>>>>>>>>>> I encourage you to take a look at
>>>>>>>>>>> https://beam.apache.org/contribute/ for general advice on how
>>>>>>>>>>> to get started. To echo some points mentioned in the guide:
>>>>>>>>>>>
>>>>>>>>>>> - If your change is large or it is your first change, it is a
>>>>>>>>>>> good idea to discuss it on the dev@ mailing list
>>>>>>>>>>> - For large changes create a design doc (template, examples) and
>>>>>>>>>>> email it to the dev@ mailing list.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Valentyn
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <
>>>>>>>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I have been writing a bunch of utilities for the Python SDK,
>>>>>>>>>>>> such as joins, selections, composite transforms, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> I am working with my company to see if I can open source the
>>>>>>>>>>>> utilities. Would it be best to post them as a separate PyPI project, or to
>>>>>>>>>>>> PR them into the Beam SDK? I assume that if they let me open source them,
>>>>>>>>>>>> they will want some attribution or something like that.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Shannon
>>>>>>>>>>>>
>>>>>>>>>>>
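
The PrepareKey, join, Select flow described in the quoted July 7 message can
be sketched in plain Python as follows. This is only an illustration of the
idea on lists of dictionaries, not the code from the linked repository and
not Beam PTransforms; the function names, signatures, and sample data are
assumptions made for this example.

```python
from collections import defaultdict


def prepare_key(records, key_fields):
    """Pair each dictionary with a tuple key built from the chosen fields."""
    return [(tuple(r[f] for f in key_fields), r) for r in records]


def left_outer_join(left, right):
    """Join keyed dictionaries, keeping every left row even without a match."""
    right_by_key = defaultdict(list)
    for key, row in right:
        right_by_key[key].append(row)
    joined = []
    for key, row in left:
        # An unmatched left row is merged with an empty dictionary. On a
        # field-name collision the left row's value wins.
        for match in right_by_key.get(key, [{}]):
            joined.append({**match, **row})
    return joined


def select(records, fields, null_value=None):
    """Keep only the requested fields, filling missing ones with null_value."""
    return [{f: r.get(f, null_value) for f in fields} for r in records]


# Hypothetical sample data.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "item": "book"}]

# Roughly: SELECT name, item FROM users LEFT JOIN orders USING (id)
joined = left_outer_join(prepare_key(users, ["id"]), prepare_key(orders, ["id"]))
rows = select(joined, ["name", "item"], null_value="n/a")
```

Here "bob" has no order, so the left outer join keeps his row and the select
step fills the missing "item" field with the default null value. In a Beam
pipeline the grouping step would presumably be done with CoGroupByKey rather
than an in-memory dictionary.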

Re: Python Utilities

Posted by Rui Wang <ru...@google.com>.
The second link points to the first join utility in Beam. The idea is
similar: people can use the utility to do joins without writing their own.
BeamSQL also uses it.

The first link points to the Schema API. I actually thought the Schema API
also used the join utility, but it turns out it doesn't (I am not sure
why).

Basically I think it's encouraged to reuse the same join utility if
possible.

-Rui

Re: Python Utilities

Posted by Shannon Duncan <jo...@liveramp.com>.
So it seems that the Java SDK has two different Join libraries?

With Schema:
https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
And another one:
https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java

So how does it handle that?

Re: Python Utilities

Posted by Shannon Duncan <jo...@liveramp.com>.
Yeah these are for local testing right now. I was hoping to gain insight on
better naming.

I was thinking of creating an "extras" module.

On Mon, Jul 8, 2019, 12:28 PM Robin Qiu <ro...@google.com> wrote:

> Hi Shannon,
>
> Thanks for sharing the repo! I took a quick look and I have a concern with
> the naming of the transforms.
>
> Currently, Beam Java already has "Select" and "Join" transforms. However,
> they work on schemas, a feature that is not yet implemented in Beam Python.
> (See
> https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
> )
>
> To maintain consistency between SDKs, I think it is good to avoid having
> two different transforms with the same name but different functions. So
> maybe you can consider renaming the transforms and/or putting them in an
> extension Python module, instead of the main ones?
>
> Best,
> Robin
>
> On Mon, Jul 8, 2019 at 9:19 AM Shannon Duncan <jo...@liveramp.com>
> wrote:
>
>> As a follow-up, here is the repo that contains the utilities for now:
>> https://github.com/shadowcodex/apache-beam-utilities. I will put together
>> a proper PR as the code gets closer to production quality.
>>
>> - Shannon
>>
>> On Mon, Jul 8, 2019 at 9:20 AM Shannon Duncan <jo...@liveramp.com>
>> wrote:
>>
>>> Thanks Frederik,
>>>
>>> That's exactly where I was looking. I did get permission to open source
>>> the utilities module, so I'm going to put them up on my personal GitHub
>>> soon and share them with the email group for a look over.
>>>
>>> I'm going to work on the utilities there because it's a quick dev
>>> environment, and once they are ready I'll begin working them into the
>>> actual SDK for a proper PR.
>>>
>>> I also joined the Slack #beam and #beam-python channels, as I was unsure
>>> where most collaborators discussed items.
>>>
>>> - Shannon
>>>
>>> On Mon, Jul 8, 2019 at 9:09 AM Frederik Bode <fr...@ml6.eu>
>>> wrote:
>>>
>>>> Hi Shannon,
>>>>
>>>> This is probably a good starting point:
>>>> https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68
>>>> .
>>>>
>>>> Frederik
>>>>
>>>> On Mon, 8 Jul 2019 at 15:40, Shannon Duncan <jo...@liveramp.com>
>>>> wrote:
>>>>
>>>>> I'm sure I could use some of the existing aggregations as a guide for
>>>>> writing the missing ones, such as Sum/Max/Min.
>>>>>
>>>>> GroupBy is really already handled with GroupByKey and CoGroupByKey
>>>>> unless you are thinking of a different type of GroupBy?
>>>>>
>>>>> - Shannon
>>>>>
>>>>> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <ru...@google.com> wrote:
>>>>>
>>>>>> Maybe also adding Aggregation/GroupBy as utilities?
>>>>>>
>>>>>>
>>>>>> -Rui
>>>>>>
>>>>>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <
>>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>>
>>>>>>> Thanks Valentyn,
>>>>>>>
>>>>>>> I'll outline the utilities and accept any suggestions to add /
>>>>>>> modify. These are really just shortcut PTransforms that I am working on to
>>>>>>> simplify creating pipelines.
>>>>>>>
>>>>>>> Currently the utilities contain the following PTransforms:
>>>>>>>
>>>>>>> - Inner Join
>>>>>>> - Left Outer Join
>>>>>>> - Right Outer Join
>>>>>>> - Full Outer Join
>>>>>>> - PrepareKey (For selecting items in a dictionary to act as a key
>>>>>>> for the joins)
>>>>>>> - Select (very simple filter that returns only items you want from
>>>>>>> the dictionary) (allows for defining a default nullValue)
>>>>>>>
>>>>>>> Currently these operations only work with dictionaries, but I'd be
>>>>>>> interested to see how it would work for <K,V> tuples.
>>>>>>>
>>>>>>> I'm new to python so they may not be optimized or the best way, but
>>>>>>> from my understanding these seem to be the best way to do these types of
>>>>>>> operations. Essentially I created a pipeline to be able to convert a simple
>>>>>>> sql query into a flow of these utilities. Using prepareKey to define your
>>>>>>> joining key, joining, and then selecting from the join allows you to do a
>>>>>>> lot of powerful manipulation in a simple / familiar way.
>>>>>>>
>>>>>>> If this is something that we'd like to add to the Beam SDK I don't
>>>>>>> mind looking at the contributor license agreement, and conversing more on
>>>>>>> how to get them in.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shannon
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <
>>>>>>> valentyn@google.com> wrote:
>>>>>>>
>>>>>>>> Hi Shannon,
>>>>>>>>
>>>>>>>> Thanks for considering a contribution to Beam Python SDK. With a
>>>>>>>> direct contribution to Beam SDK, your change will reach larger audience of
>>>>>>>> users, and you will not have to maintain a separate project and keep it up
>>>>>>>> to date with new releases of Beam.
>>>>>>>>
>>>>>>>> I encourage you to take a look at
>>>>>>>> https://beam.apache.org/contribute/ for general advice on how to
>>>>>>>> get started. To echo some points mentioned in the guide:
>>>>>>>>
>>>>>>>> - If your change is large or it is your first change, it is a good
>>>>>>>> idea to discuss it on the dev@ mailing list
>>>>>>>> - For large changes create a design doc (template, examples) and
>>>>>>>> email it to the dev@ mailing list.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Valentyn
>>>>>>>>
>>>>>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <
>>>>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>>>>
>>>>>>>>> I have been writing a bunch of utilities for the python SDK such
>>>>>>>>> as joins, selections, composite transforms, etc...
>>>>>>>>>
>>>>>>>>> I am working with my company to see if I can open source the
>>>>>>>>> utilities. Would it be best to post them on a separate PyPi project, or to
>>>>>>>>> PR them into the beam SDK? I assume if they let me open source it they will
>>>>>>>>> want some attribution or something like that.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shannon
>>>>>>>>>
>>>>>>>>

Re: Python Utilities

Posted by Robin Qiu <ro...@google.com>.
Hi Shannon,

Thanks for sharing the repo! I took a quick look and I have a concern with
the naming of the transforms.

Currently, Beam Java already has "Select" and "Join" transforms. However,
they work on schemas, a feature that is not yet implemented in Beam Python.
(See
https://github.com/apache/beam/tree/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
)

To maintain consistency between SDKs, I think it is good to avoid having
two different transforms with the same name but different functions. So
maybe you can consider renaming the transforms and/or putting them in an
extension Python module, instead of the main ones?

Best,
Robin

On Mon, Jul 8, 2019 at 9:19 AM Shannon Duncan <jo...@liveramp.com>
wrote:

> As a follow up. Here is the repo that contains the utilities for now.
> https://github.com/shadowcodex/apache-beam-utilities. Will put together a
> proper PR as code gets closer to production quality.
>
> - Shannon
>
> On Mon, Jul 8, 2019 at 9:20 AM Shannon Duncan <jo...@liveramp.com>
> wrote:
>
>> Thanks Frederik,
>>
>> That's exactly where I was looking. I did get permission to open source
>> the utilities module. So I'm going to throw them up on my personal github
>> soon and share with the email group for a look over.
>>
>> I'm going to work on the utilities there because it's a quick dev
>> environment and then once they are ready for proper PR I'll begin working
>> them into the actual SDK for a PR.
>>
>> I also joined the slack #beam and #beam-python channels, I was unsure of
>> where most collaborators discussed items.
>>
>> - Shannon
>>
>> On Mon, Jul 8, 2019 at 9:09 AM Frederik Bode <fr...@ml6.eu>
>> wrote:
>>
>>> Hi Shannon,
>>>
>>> This is probably a good starting point:
>>> https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68
>>> .
>>>
>>> Frederik
>>>
>>>
>>> On Mon, 8 Jul 2019 at 15:40, Shannon Duncan <jo...@liveramp.com>
>>> wrote:
>>>
>>>> I'm sure I could use some of the existing aggregations as a guide on
>>>> how to make aggregations to fill the gap of missing ones. Such as creating
>>>> Sum/Max/Min.
>>>>
>>>> GroupBy is really already handled with GroupByKey and CoGroupByKey
>>>> unless you are thinking of a different type of GroupBy?
>>>>
>>>> - Shannon
>>>>
>>>> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <ru...@google.com> wrote:
>>>>
>>>>> Maybe also adding Aggregation/GroupBy as utilities?
>>>>>
>>>>>
>>>>> -Rui
>>>>>
>>>>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <
>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>
>>>>>> Thanks Valentyn,
>>>>>>
>>>>>> I'll outline the utilities and accept any suggestions to add /
>>>>>> modify. These are really just shortcut PTransforms that I am working on to
>>>>>> simplify creating pipelines.
>>>>>>
>>>>>> Currently the utilities contain the following PTransforms:
>>>>>>
>>>>>> - Inner Join
>>>>>> - Left Outer Join
>>>>>> - Right Outer Join
>>>>>> - Full Outer Join
>>>>>> - PrepareKey (For selecting items in a dictionary to act as a key for
>>>>>> the joins)
>>>>>> - Select (very simple filter that returns only items you want from
>>>>>> the dictionary) (allows for defining a default nullValue)
>>>>>>
>>>>>> Currently these operations only work with dictionaries, but I'd be
>>>>>> interested to see how it would work for <K,V> tuples.
>>>>>>
>>>>>> I'm new to python so they may not be optimized or the best way, but
>>>>>> from my understanding these seem to be the best way to do these types of
>>>>>> operations. Essentially I created a pipeline to be able to convert a simple
>>>>>> sql query into a flow of these utilities. Using prepareKey to define your
>>>>>> joining key, joining, and then selecting from the join allows you to do a
>>>>>> lot of powerful manipulation in a simple / familiar way.
>>>>>>
>>>>>> If this is something that we'd like to add to the Beam SDK I don't
>>>>>> mind looking at the contributor license agreement, and conversing more on
>>>>>> how to get them in.
>>>>>>
>>>>>> Thanks,
>>>>>> Shannon
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <
>>>>>> valentyn@google.com> wrote:
>>>>>>
>>>>>>> Hi Shannon,
>>>>>>>
>>>>>>> Thanks for considering a contribution to Beam Python SDK. With a
>>>>>>> direct contribution to Beam SDK, your change will reach larger audience of
>>>>>>> users, and you will not have to maintain a separate project and keep it up
>>>>>>> to date with new releases of Beam.
>>>>>>>
>>>>>>> I encourage you to take a look at
>>>>>>> https://beam.apache.org/contribute/ for general advice on how to
>>>>>>> get started. To echo some points mentioned in the guide:
>>>>>>>
>>>>>>> - If your change is large or it is your first change, it is a good
>>>>>>> idea to discuss it on the dev@ mailing list
>>>>>>> - For large changes create a design doc (template, examples) and
>>>>>>> email it to the dev@ mailing list.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Valentyn
>>>>>>>
>>>>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <
>>>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>>>
>>>>>>>> I have been writing a bunch of utilities for the python SDK such as
>>>>>>>> joins, selections, composite transforms, etc...
>>>>>>>>
>>>>>>>> I am working with my company to see if I can open source the
>>>>>>>> utilities. Would it be best to post them on a separate PyPi project, or to
>>>>>>>> PR them into the beam SDK? I assume if they let me open source it they will
>>>>>>>> want some attribution or something like that.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shannon
>>>>>>>>
>>>>>>>

Re: Python Utilities

Posted by Shannon Duncan <jo...@liveramp.com>.
As a follow-up, here is the repo that contains the utilities for now:
https://github.com/shadowcodex/apache-beam-utilities
I will put together a proper PR as the code gets closer to production quality.

- Shannon

On Mon, Jul 8, 2019 at 9:20 AM Shannon Duncan <jo...@liveramp.com>
wrote:

> Thanks Frederik,
>
> That's exactly where I was looking. I did get permission to open source
> the utilities module. So I'm going to throw them up on my personal github
> soon and share with the email group for a look over.
>
> I'm going to work on the utilities there because it's a quick dev
> environment and then once they are ready for proper PR I'll begin working
> them into the actual SDK for a PR.
>
> I also joined the slack #beam and #beam-python channels, I was unsure of
> where most collaborators discussed items.
>
> - Shannon
>
> On Mon, Jul 8, 2019 at 9:09 AM Frederik Bode <fr...@ml6.eu> wrote:
>
>> Hi Shannon,
>>
>> This is probably a good starting point:
>> https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68
>> .
>>
>> Frederik
>>
>>
>> On Mon, 8 Jul 2019 at 15:40, Shannon Duncan <jo...@liveramp.com>
>> wrote:
>>
>>> I'm sure I could use some of the existing aggregations as a guide on how
>>> to make aggregations to fill the gap of missing ones. Such as creating
>>> Sum/Max/Min.
>>>
>>> GroupBy is really already handled with GroupByKey and CoGroupByKey
>>> unless you are thinking of a different type of GroupBy?
>>>
>>> - Shannon
>>>
>>> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <ru...@google.com> wrote:
>>>
>>>> Maybe also adding Aggregation/GroupBy as utilities?
>>>>
>>>>
>>>> -Rui
>>>>
>>>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <
>>>> joseph.duncan@liveramp.com> wrote:
>>>>
>>>>> Thanks Valentyn,
>>>>>
>>>>> I'll outline the utilities and accept any suggestions to add / modify.
>>>>> These are really just shortcut PTransforms that I am working on to simplify
>>>>> creating pipelines.
>>>>>
>>>>> Currently the utilities contain the following PTransforms:
>>>>>
>>>>> - Inner Join
>>>>> - Left Outer Join
>>>>> - Right Outer Join
>>>>> - Full Outer Join
>>>>> - PrepareKey (For selecting items in a dictionary to act as a key for
>>>>> the joins)
>>>>> - Select (very simple filter that returns only items you want from the
>>>>> dictionary) (allows for defining a default nullValue)
>>>>>
>>>>> Currently these operations only work with dictionaries, but I'd be
>>>>> interested to see how it would work for <K,V> tuples.
>>>>>
>>>>> I'm new to python so they may not be optimized or the best way, but
>>>>> from my understanding these seem to be the best way to do these types of
>>>>> operations. Essentially I created a pipeline to be able to convert a simple
>>>>> sql query into a flow of these utilities. Using prepareKey to define your
>>>>> joining key, joining, and then selecting from the join allows you to do a
>>>>> lot of powerful manipulation in a simple / familiar way.
>>>>>
>>>>> If this is something that we'd like to add to the Beam SDK I don't
>>>>> mind looking at the contributor license agreement, and conversing more on
>>>>> how to get them in.
>>>>>
>>>>> Thanks,
>>>>> Shannon
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <
>>>>> valentyn@google.com> wrote:
>>>>>
>>>>>> Hi Shannon,
>>>>>>
>>>>>> Thanks for considering a contribution to Beam Python SDK. With a
>>>>>> direct contribution to Beam SDK, your change will reach larger audience of
>>>>>> users, and you will not have to maintain a separate project and keep it up
>>>>>> to date with new releases of Beam.
>>>>>>
>>>>>> I encourage you to take a look at https://beam.apache.org/contribute/ for
>>>>>> general advice on how to get started. To echo some points mentioned in the
>>>>>> guide:
>>>>>>
>>>>>> - If your change is large or it is your first change, it is a good
>>>>>> idea to discuss it on the dev@ mailing list
>>>>>> - For large changes create a design doc (template, examples) and
>>>>>> email it to the dev@ mailing list.
>>>>>>
>>>>>> Thanks,
>>>>>> Valentyn
>>>>>>
>>>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <
>>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>>
>>>>>>> I have been writing a bunch of utilities for the python SDK such as
>>>>>>> joins, selections, composite transforms, etc...
>>>>>>>
>>>>>>> I am working with my company to see if I can open source the
>>>>>>> utilities. Would it be best to post them on a separate PyPi project, or to
>>>>>>> PR them into the beam SDK? I assume if they let me open source it they will
>>>>>>> want some attribution or something like that.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shannon
>>>>>>>
>>>>>>

Re: Python Utilities

Posted by Shannon Duncan <jo...@liveramp.com>.
Thanks Frederik,

That's exactly where I was looking. I did get permission to open source the
utilities module, so I'm going to put it up on my personal GitHub soon
and share it with the list for a look over.

I'm going to work on the utilities there because it's a quick dev
environment; once they are ready, I'll begin working them into the
actual SDK for a proper PR.

I also joined the Slack #beam and #beam-python channels, since I was unsure
where most collaborators discuss things.

- Shannon

On Mon, Jul 8, 2019 at 9:09 AM Frederik Bode <fr...@ml6.eu> wrote:

> Hi Shannon,
>
> This is probably a good starting point:
> https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68
> .
>
> Frederik
>
>
> On Mon, 8 Jul 2019 at 15:40, Shannon Duncan <jo...@liveramp.com>
> wrote:
>
>> I'm sure I could use some of the existing aggregations as a guide on how
>> to make aggregations to fill the gap of missing ones. Such as creating
>> Sum/Max/Min.
>>
>> GroupBy is really already handled with GroupByKey and CoGroupByKey unless
>> you are thinking of a different type of GroupBy?
>>
>> - Shannon
>>
>> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <ru...@google.com> wrote:
>>
>>> Maybe also adding Aggregation/GroupBy as utilities?
>>>
>>>
>>> -Rui
>>>
>>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <
>>> joseph.duncan@liveramp.com> wrote:
>>>
>>>> Thanks Valentyn,
>>>>
>>>> I'll outline the utilities and accept any suggestions to add / modify.
>>>> These are really just shortcut PTransforms that I am working on to simplify
>>>> creating pipelines.
>>>>
>>>> Currently the utilities contain the following PTransforms:
>>>>
>>>> - Inner Join
>>>> - Left Outer Join
>>>> - Right Outer Join
>>>> - Full Outer Join
>>>> - PrepareKey (For selecting items in a dictionary to act as a key for
>>>> the joins)
>>>> - Select (very simple filter that returns only items you want from the
>>>> dictionary) (allows for defining a default nullValue)
>>>>
>>>> Currently these operations only work with dictionaries, but I'd be
>>>> interested to see how it would work for <K,V> tuples.
>>>>
>>>> I'm new to python so they may not be optimized or the best way, but
>>>> from my understanding these seem to be the best way to do these types of
>>>> operations. Essentially I created a pipeline to be able to convert a simple
>>>> sql query into a flow of these utilities. Using prepareKey to define your
>>>> joining key, joining, and then selecting from the join allows you to do a
>>>> lot of powerful manipulation in a simple / familiar way.
>>>>
>>>> If this is something that we'd like to add to the Beam SDK I don't mind
>>>> looking at the contributor license agreement, and conversing more on how to
>>>> get them in.
>>>>
>>>> Thanks,
>>>> Shannon
>>>>
>>>>
>>>>
>>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <va...@google.com>
>>>> wrote:
>>>>
>>>>> Hi Shannon,
>>>>>
>>>>> Thanks for considering a contribution to Beam Python SDK. With a
>>>>> direct contribution to Beam SDK, your change will reach larger audience of
>>>>> users, and you will not have to maintain a separate project and keep it up
>>>>> to date with new releases of Beam.
>>>>>
>>>>> I encourage you to take a look at https://beam.apache.org/contribute/ for
>>>>> general advice on how to get started. To echo some points mentioned in the
>>>>> guide:
>>>>>
>>>>> - If your change is large or it is your first change, it is a good
>>>>> idea to discuss it on the dev@ mailing list
>>>>> - For large changes create a design doc (template, examples) and email
>>>>> it to the dev@ mailing list.
>>>>>
>>>>> Thanks,
>>>>> Valentyn
>>>>>
>>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <
>>>>> joseph.duncan@liveramp.com> wrote:
>>>>>
>>>>>> I have been writing a bunch of utilities for the python SDK such as
>>>>>> joins, selections, composite transforms, etc...
>>>>>>
>>>>>> I am working with my company to see if I can open source the
>>>>>> utilities. Would it be best to post them on a separate PyPi project, or to
>>>>>> PR them into the beam SDK? I assume if they let me open source it they will
>>>>>> want some attribution or something like that.
>>>>>>
>>>>>> Thanks,
>>>>>> Shannon
>>>>>>
>>>>>

Re: Python Utilities

Posted by Frederik Bode <fr...@ml6.eu>.
Hi Shannon,

This is probably a good starting point:
https://github.com/apache/beam/blob/2d5e493abf39ee6fc89831bb0b7ec9fee592b9c5/sdks/python/apache_beam/transforms/combiners.py#L68

Frederik

Frederik Bode, ML6 Ghent

On Mon, 8 Jul 2019 at 15:40, Shannon Duncan <jo...@liveramp.com>
wrote:

> I'm sure I could use some of the existing aggregations as a guide on how
> to make aggregations to fill the gap of missing ones. Such as creating
> Sum/Max/Min.
>
> GroupBy is really already handled with GroupByKey and CoGroupByKey unless
> you are thinking of a different type of GroupBy?
>
> - Shannon
>
> On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <ru...@google.com> wrote:
>
>> Maybe also adding Aggregation/GroupBy as utilities?
>>
>>
>> -Rui
>>
>> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <jo...@liveramp.com>
>> wrote:
>>
>>> Thanks Valentyn,
>>>
>>> I'll outline the utilities and accept any suggestions to add / modify.
>>> These are really just shortcut PTransforms that I am working on to simplify
>>> creating pipelines.
>>>
>>> Currently the utilities contain the following PTransforms:
>>>
>>> - Inner Join
>>> - Left Outer Join
>>> - Right Outer Join
>>> - Full Outer Join
>>> - PrepareKey (For selecting items in a dictionary to act as a key for
>>> the joins)
>>> - Select (very simple filter that returns only items you want from the
>>> dictionary) (allows for defining a default nullValue)
>>>
>>> Currently these operations only work with dictionaries, but I'd be
>>> interested to see how it would work for <K,V> tuples.
>>>
>>> I'm new to python so they may not be optimized or the best way, but from
>>> my understanding these seem to be the best way to do these types of
>>> operations. Essentially I created a pipeline to be able to convert a simple
>>> sql query into a flow of these utilities. Using prepareKey to define your
>>> joining key, joining, and then selecting from the join allows you to do a
>>> lot of powerful manipulation in a simple / familiar way.
>>>
>>> If this is something that we'd like to add to the Beam SDK I don't mind
>>> looking at the contributor license agreement, and conversing more on how to
>>> get them in.
>>>
>>> Thanks,
>>> Shannon
>>>
>>>
>>>
>>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <va...@google.com>
>>> wrote:
>>>
>>>> Hi Shannon,
>>>>
>>>> Thanks for considering a contribution to Beam Python SDK. With a direct
>>>> contribution to Beam SDK, your change will reach larger audience of users,
>>>> and you will not have to maintain a separate project and keep it up to date
>>>> with new releases of Beam.
>>>>
>>>> I encourage you to take a look at https://beam.apache.org/contribute/ for
>>>> general advice on how to get started. To echo some points mentioned in the
>>>> guide:
>>>>
>>>> - If your change is large or it is your first change, it is a good idea
>>>> to discuss it on the dev@ mailing list
>>>> - For large changes create a design doc (template, examples) and email
>>>> it to the dev@ mailing list.
>>>>
>>>> Thanks,
>>>> Valentyn
>>>>
>>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <
>>>> joseph.duncan@liveramp.com> wrote:
>>>>
>>>>> I have been writing a bunch of utilities for the python SDK such as
>>>>> joins, selections, composite transforms, etc...
>>>>>
>>>>> I am working with my company to see if I can open source the
>>>>> utilities. Would it be best to post them on a separate PyPi project, or to
>>>>> PR them into the beam SDK? I assume if they let me open source it they will
>>>>> want some attribution or something like that.
>>>>>
>>>>> Thanks,
>>>>> Shannon
>>>>>
>>>>

Re: Python Utilities

Posted by Shannon Duncan <jo...@liveramp.com>.
I'm sure I could use some of the existing aggregations as a guide for
writing the missing ones, such as Sum/Max/Min.
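
As a rough illustration of what such an aggregation could look like, here is
a minimal sketch in plain Python. It does not import Beam; the class only
mirrors the four-method shape of a Beam CombineFn (create_accumulator,
add_input, merge_accumulators, extract_output). The names MaxFn and
combine_in_bundles are hypothetical and exist only for this example.

```python
class MaxFn:
    """Plain-Python stand-in mirroring the method shape of a Beam CombineFn."""

    def create_accumulator(self):
        # None means "no input seen yet".
        return None

    def add_input(self, accumulator, element):
        # Fold one element into the running maximum.
        if accumulator is None:
            return element
        return max(accumulator, element)

    def merge_accumulators(self, accumulators):
        # Combine partial maxima computed independently (e.g. on different workers).
        seen = [acc for acc in accumulators if acc is not None]
        return max(seen) if seen else None

    def extract_output(self, accumulator):
        return accumulator


def combine_in_bundles(fn, bundles):
    """Hypothetical driver: fold each bundle separately, then merge the partials."""
    partials = []
    for bundle in bundles:
        acc = fn.create_accumulator()
        for element in bundle:
            acc = fn.add_input(acc, element)
        partials.append(acc)
    return fn.extract_output(fn.merge_accumulators(partials))


largest = combine_in_bundles(MaxFn(), [[3, 9, 2], [], [7]])
```

Splitting the work into per-bundle accumulators that are merged afterwards is
what would let a runner compute the aggregate in parallel; Sum and Min could
follow the same pattern with a different fold.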

GroupBy is already handled by GroupByKey and CoGroupByKey, unless
you are thinking of a different type of GroupBy?
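
To make the relationship between co-grouping and the join utilities concrete,
here is a minimal sketch in plain Python, not Beam code. It assumes rows are
dictionaries, as in the utilities discussed in this thread. The helpers
key_by, co_group_by_key, inner_join and left_outer_join are hypothetical names
for this example; co_group_by_key only imitates the idea of CoGroupByKey on
in-memory lists.

```python
from collections import defaultdict


def key_by(rows, field):
    """Turn dict rows into (key, row) pairs, similar in spirit to a PrepareKey step."""
    return [(row[field], row) for row in rows]


def co_group_by_key(left, right):
    """Group two lists of (key, value) pairs into {key: (left_values, right_values)}."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


def inner_join(left, right):
    """Emit one merged dict per matching (left row, right row) pair."""
    out = []
    for lefts, rights in co_group_by_key(left, right).values():
        for left_row in lefts:
            for right_row in rights:
                out.append({**left_row, **right_row})
    return out


def left_outer_join(left, right, right_fields, null_value=None):
    """Like inner_join, but keep unmatched left rows, padding the right-hand fields."""
    out = []
    for lefts, rights in co_group_by_key(left, right).values():
        # With no right-hand match, pair each left row with one all-null row.
        padded = rights or [dict.fromkeys(right_fields, null_value)]
        for left_row in lefts:
            for right_row in padded:
                out.append({**left_row, **right_row})
    return out


users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "total": 30}, {"id": 1, "total": 12}]

inner = inner_join(key_by(users, "id"), key_by(orders, "id"))
outer = left_outer_join(key_by(users, "id"), key_by(orders, "id"), ["total"])
```

In this sketch each join is just a different way of expanding the co-grouped
result, which is why a separate GroupBy utility may add little on top of
GroupByKey and CoGroupByKey.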

- Shannon

On Sun, Jul 7, 2019 at 10:47 PM Rui Wang <ru...@google.com> wrote:

> Maybe also adding Aggregation/GroupBy as utilities?
>
>
> -Rui
>
> On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <jo...@liveramp.com>
> wrote:
>
>> Thanks Valentyn,
>>
>> I'll outline the utilities and accept any suggestions to add / modify.
>> These are really just shortcut PTransforms that I am working on to simplify
>> creating pipelines.
>>
>> Currently the utilities contain the following PTransforms:
>>
>> - Inner Join
>> - Left Outer Join
>> - Right Outer Join
>> - Full Outer Join
>> - PrepareKey (For selecting items in a dictionary to act as a key for the
>> joins)
>> - Select (very simple filter that returns only items you want from the
>> dictionary) (allows for defining a default nullValue)
>>
>> Currently these operations only work with dictionaries, but I'd be
>> interested to see how it would work for <K,V> tuples.
>>
>> I'm new to python so they may not be optimized or the best way, but from
>> my understanding these seem to be the best way to do these types of
>> operations. Essentially I created a pipeline to be able to convert a simple
>> sql query into a flow of these utilities. Using prepareKey to define your
>> joining key, joining, and then selecting from the join allows you to do a
>> lot of powerful manipulation in a simple / familiar way.
>>
>> If this is something that we'd like to add to the Beam SDK I don't mind
>> looking at the contributor license agreement, and conversing more on how to
>> get them in.
>>
>> Thanks,
>> Shannon
>>
>>
>>
>> On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> Hi Shannon,
>>>
>>> Thanks for considering a contribution to Beam Python SDK. With a direct
>>> contribution to Beam SDK, your change will reach larger audience of users,
>>> and you will not have to maintain a separate project and keep it up to date
>>> with new releases of Beam.
>>>
>>> I encourage you to take a look at https://beam.apache.org/contribute/ for
>>> general advice on how to get started. To echo some points mentioned in the
>>> guide:
>>>
>>> - If your change is large or it is your first change, it is a good idea
>>> to discuss it on the dev@ mailing list
>>> - For large changes create a design doc (template, examples) and email
>>> it to the dev@ mailing list.
>>>
>>> Thanks,
>>> Valentyn
>>>
>>> On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <
>>> joseph.duncan@liveramp.com> wrote:
>>>
>>>> I have been writing a bunch of utilities for the python SDK such as
>>>> joins, selections, composite transforms, etc...
>>>>
>>>> I am working with my company to see if I can open source the utilities.
>>>> Would it be best to post them on a separate PyPi project, or to PR them
>>>> into the beam SDK? I assume if they let me open source it they will want
>>>> some attribution or something like that.
>>>>
>>>> Thanks,
>>>> Shannon
>>>>
>>>

Re: Python Utilities

Posted by Rui Wang <ru...@google.com>.
Maybe also add Aggregation/GroupBy as utilities?


-Rui

On Sun, Jul 7, 2019 at 1:46 PM Shannon Duncan <jo...@liveramp.com>
wrote:

> Thanks Valentyn,
>
> I'll outline the utilities and accept any suggestions to add / modify.
> These are really just shortcut PTransforms that I am working on to simplify
> creating pipelines.
>
> Currently the utilities contain the following PTransforms:
>
> - Inner Join
> - Left Outer Join
> - Right Outer Join
> - Full Outer Join
> - PrepareKey (selects items in a dictionary to act as the key for the
> joins)
> - Select (a very simple filter that returns only the items you want from
> the dictionary and allows defining a default nullValue)
>
> Currently these operations only work with dictionaries, but I'd be
> interested to see how it would work for <K,V> tuples.
>
> I'm new to Python, so the implementations may not be optimized, but from
> my understanding this seems to be the best way to do these types of
> operations. Essentially, I created a pipeline that can convert a simple
> SQL query into a flow of these utilities. Using PrepareKey to define your
> joining key, joining, and then selecting from the join allows you to do a
> lot of powerful manipulation in a simple, familiar way.
>
> If this is something that we'd like to add to the Beam SDK, I don't mind
> looking at the contributor license agreement and discussing how to get
> them in.
>
> Thanks,
> Shannon

Re: Python Utilities

Posted by Shannon Duncan <jo...@liveramp.com>.
Thanks Valentyn,

I'll outline the utilities and accept any suggestions to add / modify.
These are really just shortcut PTransforms that I am working on to simplify
creating pipelines.

Currently the utilities contain the following PTransforms:

- Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
- PrepareKey (selects items in a dictionary to act as the key for the
joins)
- Select (a very simple filter that returns only the items you want from
the dictionary and allows defining a default nullValue)

Currently these operations only work with dictionaries, but I'd be
interested to see how it would work for <K,V> tuples.
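
To make the <K,V> idea concrete, here is a minimal sketch of an inner join
over (key, value) tuples. It is plain Python rather than Beam code, and the
function name inner_join_kv and the sample data are made up for
illustration; in a real pipeline the grouping step would presumably be done
by something like CoGroupByKey.

```python
from collections import defaultdict


def inner_join_kv(left, right):
    """Inner-join two iterables of (key, value) tuples.

    Returns a list of (key, (left_value, right_value)), one entry for
    every matching pair. Hypothetical helper, not part of the Beam SDK.
    """
    # Group the right-hand values by key so each left element can look
    # up its matches directly.
    right_groups = defaultdict(list)
    for key, value in right:
        right_groups[key].append(value)

    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_groups.get(key, [])
    ]


# Illustrative data only.
emails = [("ann", "ann@example.com"), ("bob", "bob@example.com")]
cities = [("ann", "Paris"), ("ann", "Rome")]
pairs = inner_join_kv(emails, cities)
```

With tuples the key is already explicit, so no PrepareKey step is needed;
the trade-off is that the joined value is a nested tuple rather than a
merged dictionary.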

I'm new to Python, so the implementations may not be optimized, but from my
understanding this seems to be the best way to do these types of
operations. Essentially, I created a pipeline that can convert a simple SQL
query into a flow of these utilities. Using PrepareKey to define your
joining key, joining, and then selecting from the join allows you to do a
lot of powerful manipulation in a simple, familiar way.
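
As a rough illustration of that flow, the sketch below mimics the three
steps (prepare a key, join, select) on plain lists of dictionaries. It is
not the actual PTransform code: the function names, the how argument, and
the sample data are assumptions made for the example, and it runs in plain
Python without Beam.

```python
from collections import defaultdict


def prepare_key(rows, key_fields):
    """Pair each dict with a tuple key built from the chosen fields."""
    return [(tuple(row.get(f) for f in key_fields), row) for row in rows]


def join(left, right, how="inner"):
    """Join two lists of (key, dict) pairs.

    how is one of 'inner', 'left', 'right' or 'full'. Matching rows are
    merged into one dict; on a name clash the right-hand value wins.
    """
    left_groups, right_groups = defaultdict(list), defaultdict(list)
    for key, row in left:
        left_groups[key].append(row)
    for key, row in right:
        right_groups[key].append(row)

    if how == "inner":
        keys = [k for k in left_groups if k in right_groups]
    elif how == "left":
        keys = list(left_groups)
    elif how == "right":
        keys = list(right_groups)
    elif how == "full":
        keys = list(left_groups) + [
            k for k in right_groups if k not in left_groups
        ]
    else:
        raise ValueError("unknown join type: %r" % how)

    joined = []
    for key in keys:
        # A missing side contributes one empty dict so that rows from
        # the other side are still emitted for outer joins.
        for left_row in left_groups.get(key, [{}]):
            for right_row in right_groups.get(key, [{}]):
                joined.append({**left_row, **right_row})
    return joined


def select(rows, fields, null_value=None):
    """Keep only the requested fields, filling gaps with null_value."""
    return [{f: row.get(f, null_value) for f in fields} for row in rows]


# Roughly: SELECT name, total FROM users LEFT JOIN orders USING (id)
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "total": 30}]
joined = join(prepare_key(users, ["id"]), prepare_key(orders, ["id"]), how="left")
result = select(joined, ["name", "total"], null_value=0)
# result == [{'name': 'ann', 'total': 30}, {'name': 'bob', 'total': 0}]
```

The default null value in the select step is what lets outer joins produce
uniform rows even when one side had no match.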

If this is something that we'd like to add to the Beam SDK, I don't mind
looking at the contributor license agreement and discussing how to get
them in.

Thanks,
Shannon



On Wed, Jul 3, 2019 at 5:16 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> Hi Shannon,
>
> Thanks for considering a contribution to the Beam Python SDK. With a direct
> contribution to the Beam SDK, your change will reach a larger audience of
> users, and you will not have to maintain a separate project and keep it up
> to date with new releases of Beam.
>
> I encourage you to take a look at https://beam.apache.org/contribute/ for
> general advice on how to get started. To echo some points mentioned in the
> guide:
>
> - If your change is large or it is your first change, it is a good idea to
> discuss it on the dev@ mailing list.
> - For large changes, create a design doc (template, examples) and email it
> to the dev@ mailing list.
>
> Thanks,
> Valentyn

Re: Python Utilities

Posted by Valentyn Tymofieiev <va...@google.com>.
Hi Shannon,

Thanks for considering a contribution to the Beam Python SDK. With a direct
contribution to the Beam SDK, your change will reach a larger audience of
users, and you will not have to maintain a separate project and keep it up
to date with new releases of Beam.

I encourage you to take a look at https://beam.apache.org/contribute/ for
general advice on how to get started. To echo some points mentioned in the
guide:

- If your change is large or it is your first change, it is a good idea to
discuss it on the dev@ mailing list.
- For large changes, create a design doc (template, examples) and email it
to the dev@ mailing list.

Thanks,
Valentyn

On Wed, Jul 3, 2019 at 3:04 PM Shannon Duncan <jo...@liveramp.com>
wrote:

> I have been writing a bunch of utilities for the Python SDK, such as joins,
> selections, composite transforms, etc.
>
> I am working with my company to see if I can open source the utilities.
> Would it be best to post them on a separate PyPI project, or to PR them
> into the Beam SDK? I assume if they let me open source it they will want
> some attribution or something like that.
>
> Thanks,
> Shannon
>