Posted to issues@beam.apache.org by "Rui Wang (Jira)" <ji...@apache.org> on 2020/04/30 20:01:00 UTC

[jira] [Commented] (BEAM-9825) Transforms for Intersect, Difference and Commons

    [ https://issues.apache.org/jira/browse/BEAM-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096943#comment-17096943 ] 

Rui Wang commented on BEAM-9825:
--------------------------------

Hello Darshan,

Thanks for opening this Jira.

First of all, I think your proposal is to implement a few more composed transforms that further encapsulate SQL's UNION/INTERSECT/EXCEPT. Right now BeamSQL implements such set operations in two steps: a CoGroup and then a filter [1]. Your proposal would therefore merge these two steps into a single composed transform.
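
To make the two-step idea concrete, here is a rough sketch in plain Java. It is not Beam code and not the code at [1]: the class and method names are made up for illustration, and in-memory lists stand in for PCollections. Step 1 groups both inputs by element and counts occurrences per side; step 2 filters the grouped result, here for INTERSECT DISTINCT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names are not taken from Beam.
class CoGroupThenFilterSketch {

    // Step 1: group both inputs by element value.
    // counts[0] = occurrences on the left, counts[1] = occurrences on the right.
    static <T> Map<T, int[]> coGroup(List<T> left, List<T> right) {
        Map<T, int[]> grouped = new LinkedHashMap<>();
        for (T e : left) {
            grouped.computeIfAbsent(e, k -> new int[2])[0]++;
        }
        for (T e : right) {
            grouped.computeIfAbsent(e, k -> new int[2])[1]++;
        }
        return grouped;
    }

    // Step 2: filter the grouped result.
    // INTERSECT DISTINCT keeps each element seen on both sides, once.
    static <T> List<T> intersectDistinct(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>();
        for (Map.Entry<T, int[]> entry : coGroup(left, right).entrySet()) {
            int[] counts = entry.getValue();
            if (counts[0] > 0 && counts[1] > 0) {
                out.add(entry.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b", "b", "c");
        List<String> right = Arrays.asList("b", "c", "c", "d");
        System.out.println(intersectDistinct(left, right)); // prints [b, c]
    }
}
```

A single composed transform would hide both steps behind one call, which is what the proposal amounts to.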


I am OK with having these transforms implemented in [2], because such set operations on relations have clear semantics, and building SQL on schema operations is the ultimate goal for BeamSQL. Furthermore, other users can reuse such transforms rather than doing a two-step operation.

If you want to open a PR, please consider the following advice:
1. Use terms from SQL, e.g. name your transforms UNION, INTERSECT, EXCEPT (or MINUS).
2. Support SET ALL and SET DISTINCT semantics.
3. Migrate the BeamSQL SET implementation to your implementation.
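
For point 2, the difference between ALL and DISTINCT comes down to how many copies of each element are emitted. The following is a minimal sketch in plain Java under the usual SQL multiset semantics. It is not Beam code; the class, enum, and method names are hypothetical and lists stand in for PCollections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names are not taken from Beam.
class SetSemanticsSketch {

    enum Op { UNION, INTERSECT, EXCEPT }

    // Applies a set operation with either ALL (multiset) or DISTINCT semantics.
    static <T> List<T> apply(Op op, boolean all, List<T> left, List<T> right) {
        // Count occurrences per side: [0] = left, [1] = right.
        Map<T, int[]> counts = new LinkedHashMap<>();
        for (T e : left) {
            counts.computeIfAbsent(e, k -> new int[2])[0]++;
        }
        for (T e : right) {
            counts.computeIfAbsent(e, k -> new int[2])[1]++;
        }

        List<T> out = new ArrayList<>();
        for (Map.Entry<T, int[]> entry : counts.entrySet()) {
            int l = entry.getValue()[0];
            int r = entry.getValue()[1];
            int copies; // how many times this element is emitted
            switch (op) {
                case UNION:
                    copies = all ? l + r : 1;
                    break;
                case INTERSECT:
                    copies = all ? Math.min(l, r) : (l > 0 && r > 0 ? 1 : 0);
                    break;
                default: // EXCEPT
                    copies = all ? Math.max(l - r, 0) : (l > 0 && r == 0 ? 1 : 0);
                    break;
            }
            out.addAll(Collections.nCopies(copies, entry.getKey()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b", "b", "b");
        List<String> right = Arrays.asList("b", "b", "c");
        System.out.println(apply(Op.INTERSECT, true, left, right));  // prints [b, b]
        System.out.println(apply(Op.INTERSECT, false, left, right)); // prints [b]
        System.out.println(apply(Op.EXCEPT, true, left, right));     // prints [a, b]
        System.out.println(apply(Op.EXCEPT, false, left, right));    // prints [a]
    }
}
```

Note that EXCEPT ALL still emits one "b" here (three on the left minus two on the right), while EXCEPT DISTINCT drops "b" entirely because it appears on the right at all.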


[1]: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSetOperatorRelBase.java#L86
[2]: https://github.com/apache/beam/tree/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
 

> Transforms for Intersect, Difference and Commons 
> -------------------------------------------------
>
>                 Key: BEAM-9825
>                 URL: https://issues.apache.org/jira/browse/BEAM-9825
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Darshan Jani
>            Assignee: Darshan Jani
>            Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> I'd like to propose the following new high-level transforms.
>  * Intersect
> Compute the intersection between elements of two PCollections.
> Given _leftCollection_ and _rightCollection_, this transform returns a collection containing elements that are common to both _leftCollection_ and _rightCollection_
>  
>  * Difference
> Compute the difference between elements of two PCollections.
> Given _leftCollection_ and _rightCollection_, this transform returns a collection containing elements that are in _leftCollection_ but not in _rightCollection_
>  * Commons
> Find the elements that are common to two PCollections, similar to the Unix
> comm utility.
> Given _leftCollection_ and _rightCollection_, this transform returns a CommonsResults with the following:
>  # elements only in _leftCollection_
>  # elements only in _rightCollection_
>  # elements in both collections
> I would like to work on these changes and submit a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)