Posted to issues@flink.apache.org by "Fabian Hueske (JIRA)" <ji...@apache.org> on 2016/05/18 08:53:12 UTC

[jira] [Commented] (FLINK-3910) New self-join operator

    [ https://issues.apache.org/jira/browse/FLINK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288635#comment-15288635 ] 

Fabian Hueske commented on FLINK-3910:
--------------------------------------

I think it is a good idea to have strategies for self-joins. At the moment you can join a data set with itself in two ways:

- Use a regular join: {{dataset.join(dataset)}}. In this case, Flink treats the data set as two separate inputs, i.e., depending on the chosen strategy, it shuffles the data twice, sorts it twice, and possibly buffers it temporarily.
- Use a reduce function and implement the join manually, as done in the TriangleEnumeration example. The drawback is that the join logic must be written by hand, does not run in managed memory, and might therefore fail.
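
To make the second option concrete, here is a minimal sketch in plain Java collections, not the Flink API, of what such a hand-written self-join does. The class and method names are made up for illustration, and emitting each unordered pair of distinct records is an assumption about the desired output (a regular equi-join of a data set with itself would instead produce all ordered pairs, including each record paired with itself). Note that each group is buffered in an ordinary heap list, which is the reason this approach bypasses managed memory.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: none of these names come from Flink.
final class ManualSelfJoinSketch {

    /**
     * Hand-written self-join on the entry key: group the records by key,
     * buffer every group in a heap list, then emit each unordered pair of
     * distinct records within a group as "left-right".
     */
    static List<String> selfJoin(List<Map.Entry<Integer, String>> input) {
        // Step 1: group by key. The data is partitioned only once,
        // unlike a regular join that treats it as two inputs.
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> rec : input) {
            groups.computeIfAbsent(rec.getKey(), k -> new ArrayList<>()).add(rec.getValue());
        }

        // Step 2: pair up the buffered records of each group.
        // A very large group (skewed key) makes this list large as well.
        List<String> out = new ArrayList<>();
        for (List<String> group : groups.values()) {
            for (int i = 0; i < group.size(); i++) {
                for (int j = i + 1; j < group.size(); j++) {
                    out.add(group.get(i) + "-" + group.get(j));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> input = Arrays.asList(
                new AbstractMap.SimpleEntry<>(1, "a"),
                new AbstractMap.SimpleEntry<>(1, "b"),
                new AbstractMap.SimpleEntry<>(1, "c"),
                new AbstractMap.SimpleEntry<>(2, "x"));
        System.out.println(selfJoin(input)); // prints [a-b, a-c, b-c]
    }
}
```

Key 2 contributes nothing because its group holds a single record. The size of the buffered list grows with the largest group, so a heavily skewed key is the case where this approach is most likely to run into trouble.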

I would not add a dedicated {{selfjoin}} method to {{DataSet}} because a self-join can be detected automatically when both inputs of a join are identical. Extending {{JoinHint}} with strategies for self-joins sounds good to me.
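
A rough sketch of the detection idea, in plain Java with hypothetical types: none of the classes, enum constants, or method names below exist in Flink, and the fallback to the non-skewed variant is an arbitrary placeholder rather than a proposed policy. The only point is that a join whose two inputs are the same object can be recognized without a dedicated API method, and that a hint can then select the variant.

```java
// Hypothetical sketch; these types are not Flink classes.
final class SelfJoinDetectionSketch {

    /** Stand-in for a join hint extended with self-join strategies (names are assumptions). */
    enum Hint { OPTIMIZER_CHOOSES, SELF_JOIN_SKEWED, SELF_JOIN_NON_SKEWED }

    /** What the planner would run in this sketch. */
    enum Plan { REGULAR_JOIN, SELF_JOIN_SKEWED, SELF_JOIN_NON_SKEWED }

    /** Picks a plan from the two join inputs and the user-supplied hint. */
    static Plan plan(Object left, Object right, Hint hint) {
        // A self-join is detected by reference identity of the two inputs.
        if (left != right) {
            return Plan.REGULAR_JOIN;
        }
        switch (hint) {
            case SELF_JOIN_SKEWED:
                return Plan.SELF_JOIN_SKEWED;
            case SELF_JOIN_NON_SKEWED:
                return Plan.SELF_JOIN_NON_SKEWED;
            default:
                // Placeholder choice when no self-join hint is given.
                return Plan.SELF_JOIN_NON_SKEWED;
        }
    }

    public static void main(String[] args) {
        Object dataset = new Object();
        System.out.println(plan(dataset, dataset, Hint.SELF_JOIN_SKEWED));
        System.out.println(plan(dataset, new Object(), Hint.OPTIMIZER_CHOOSES));
    }
}
```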

[~greghogan] can you describe the driver strategies that you are planning to implement for self joins? What will characterize the skewed and non-skewed variants?

> New self-join operator
> ----------------------
>
>                 Key: FLINK-3910
>                 URL: https://issues.apache.org/jira/browse/FLINK-3910
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataSet API, Java API, Scala API
>    Affects Versions: 1.1.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
>
> Flink currently provides inner- and outer-joins as well as cogroup and the non-keyed cross. {{JoinOperator}} hints at future support for semi- and anti-joins.
> Many Gelly algorithms perform a self-join [0]. Both still pending review, FLINK-3768 performs a self-join on non-skewed data in TriangleListing.java and FLINK-3780 performs a self-join on skewed data in JaccardSimilarity.java. A {{SelfJoinHint}} will select between skewed and non-skewed implementations.
> The object-reuse-disabled case can be simply handled with a new {{Operator}}. The object-reuse-enabled case requires either {{CopyableValue}} types (as in the code above) or a custom driver which has access to the serializer (or making the serializer accessible to rich functions, and I think there be dragons).
> If the idea of a self-join is agreeable, I'd like to work out a rough implementation and go from there.
> [0] https://en.wikipedia.org/wiki/Join_%28SQL%29#Self-join



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)