Posted to issues@spark.apache.org by "Stanislav Bytsko (JIRA)" <ji...@apache.org> on 2019/04/27 06:41:00 UTC
[jira] [Updated] (SPARK-27582) Add Dataset DSL for left_anti and left_semi joins
[ https://issues.apache.org/jira/browse/SPARK-27582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stanislav Bytsko updated SPARK-27582:
-------------------------------------
Description:
Currently we have
{code:scala}
org.apache.spark.sql.Dataset[T]#joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]
{code}
which explicitly excludes left_anti and left_semi joins. That is understandable, because the result would have a different signature: those join types return only rows of the left side, so Dataset[(T, U)] does not fit.
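For context, the workaround today is to drop down to the untyped join and re-encode the result, losing the typed DSL along the way. A minimal sketch, assuming hypothetical User and Order case classes of my own invention:
{code:scala}
import org.apache.spark.sql.{Dataset, SparkSession}

case class User(id: Long, name: String)
case class Order(userId: Long, amount: Double)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val users: Dataset[User] = Seq(User(1L, "alice"), User(2L, "bob")).toDS()
val orders: Dataset[Order] = Seq(Order(1L, 10.0)).toDS()

// joinWith rejects "left_semi"/"left_anti" at analysis time, so today one
// has to fall back to the untyped join and re-encode the DataFrame:
val semi: Dataset[User] =
  users.join(orders, users("id") === orders("userId"), "left_semi").as[User]
{code}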
I think this is an easily fixed drawback, and I can see two solutions:
- Extend the current joinWith to return null for the second ({{_2}}) item in the tuple. Not ideal, since no one likes nulls, but workable: the client can handle it by calling {{.map(_._1)}} immediately afterwards.
- Add two new methods, {{org.apache.spark.sql.Dataset[T]#joinSemiWith[U](other: Dataset[U], condition: Column): Dataset[T]}} and {{org.apache.spark.sql.Dataset[T]#joinAntiWith[U](other: Dataset[U], condition: Column): Dataset[T]}}. This is much nicer, but adds two methods to the API. The names could instead be semiJoinWith and antiJoinWith, which reads more logically, but then they would not sort next to the other join methods in the list of org.apache.spark.sql.Dataset methods. A sketch of this option follows the list.
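A minimal sketch of the second option, written here as an enrichment from the outside rather than a change to Dataset itself. The implementation via the untyped join plus {{as[T]}} is my assumption of how it could work, not necessarily how it would be done inside Dataset (where the typed plan could be built directly):
{code:scala}
import org.apache.spark.sql.{Column, Dataset, Encoder}

object TypedSemiJoinSyntax {
  implicit class TypedSemiJoinOps[T](private val ds: Dataset[T]) extends AnyVal {
    // left_semi keeps only the left side's columns, so the untyped result
    // has exactly the schema of T and can be re-encoded safely.
    def joinSemiWith[U](other: Dataset[U], condition: Column)
                       (implicit enc: Encoder[T]): Dataset[T] =
      ds.join(other, condition, "left_semi").as[T]

    // left_anti: rows of ds with no match in other, same schema argument.
    def joinAntiWith[U](other: Dataset[U], condition: Column)
                       (implicit enc: Encoder[T]): Dataset[T] =
      ds.join(other, condition, "left_anti").as[T]
  }
}
{code}
With that in scope, call sites stay typed and read naturally (reusing the users and orders datasets from the sketch above):
{code:scala}
import TypedSemiJoinSyntax._

val withOrders: Dataset[User] =
  users.joinSemiWith(orders, users("id") === orders("userId"))
val withoutOrders: Dataset[User] =
  users.joinAntiWith(orders, users("id") === orders("userId"))
{code}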
> Add Dataset DSL for left_anti and left_semi joins
> -------------------------------------------------
>
> Key: SPARK-27582
> URL: https://issues.apache.org/jira/browse/SPARK-27582
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.2
> Reporter: Stanislav Bytsko
> Priority: Major