Posted to issues@spark.apache.org by "Valeria Vasylieva (JIRA)" <ji...@apache.org> on 2019/03/19 12:45:00 UTC

[jira] [Commented] (SPARK-26739) Standardized Join Types for DataFrames

    [ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796045#comment-16796045 ] 

Valeria Vasylieva commented on SPARK-26739:
-------------------------------------------

[~slehan] hi! I strongly agree that this change would make Dataset/DataFrame API users much happier, myself included.

I would like to work on this task.

 

There are several questions though:

1) Will this change be accepted, if implemented?

2) If yes, do we need to mark the methods that take string join types as deprecated?

 

[~hyukjin.kwon], [~rxin@databricks.com], [~cloud_fan] could one of you please answer these questions?

Thank you!

> Standardized Join Types for DataFrames
> --------------------------------------
>
>                 Key: SPARK-26739
>                 URL: https://issues.apache.org/jira/browse/SPARK-26739
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Skyler Lehan
>            Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.
> Currently, in the join functions on [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it leaves room for typos that result in improper joins and/or unexpected errors that aren't evident at compile time. The objective of this improvement would be to allow developers to use a common definition for join types (via an enum or constants) called JoinTypes. This would contain the possible join types, remove the possibility of a typo, and allow Spark to alter the names of the joins in the future without impacting end users.
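> For illustration, a minimal sketch of the failure mode (reusing the leftDF/rightDF DataFrames from the Q3 example below):
> {code:java}
> // "left_outter" is a typo for "left_outer". It compiles without complaint;
> // Spark rejects the unknown join type only at runtime, when the plan is analyzed.
> val typoDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outter")
> {code}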
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
> {code}
> Here, the developer manually types the join type as a string. As stated above, this:
>  * Requires developers to manually type in the join type
>  * Creates the possibility of typos
>  * Restricts renaming of join types, since each type is a literal string
>  * Does not restrict or compile-time check the join type being used, leading to runtime errors
> h3. *Q4.* What is new in your approach and why do you think it will be successful?
> The new approach would use constants or *preferably an enum*, something like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
> {code}
> This would provide:
>  * In-code references/definitions of the possible join types
>  ** This subsequently allows scaladoc to be added describing what each join type does and how it operates
>  * Removes the possibility of a typo in the join type
>  * Provides compile-time checking of the join type (only if an enum is used)
> To clarify, if JoinType is a set of string constants, each constant would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement.
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This will make the join function easier to wield and lead to fewer runtime errors. It will save time by moving join type validation to compile time. It will also provide an in-code reference to the join types, which saves developers the time of having to look up and navigate the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants, which would be exactly the same strings as the literals used today. For example:
> {code:java}
> assert(JoinType.INNER == "inner") // the constant matches today's literal exactly
> {code}
> If an enum is still the preferred way of defining the join types, new join functions could be added that take these enums, and the join calls that take a string joinType parameter could be deprecated, as sketched below. This would give developers a chance to migrate to the new join types.
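> For example, the deprecation could look roughly like this (the message and version string are placeholders, not a committed choice):
> {code:java}
> @deprecated("Use the overload that takes a JoinType instead", "x.y.z")
> def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
> {code}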
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take the join type enum/constant. The final exam would be working tests that verify the functionality of these new join functions, with the string-based joinType functions deprecated.
> h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account.
> {color:#FF0000}*It is heavily recommended that enums, not string constants, be used.*{color} String constants are presented as a possible solution, but not the ideal one.
> *If enums are used (preferred):*
> The following join function signatures would be added to the Dataset API:
> {code:java}
> def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
> def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame
> {code}
> The following functions would be deprecated:
> {code:java}
> def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
> def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
> {code}
> A new enum called JoinType would be created, sketched below. Developers would be encouraged to adopt JoinType instead of the literal strings.
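> One possible shape for the enum, sketched here as a Scala 2 sealed hierarchy (the member names and the sql field are illustrative, not a committed design):
> {code:java}
> /** Sketch only: enumerates the join types and the SQL name each one maps to. */
> sealed abstract class JoinType(val sql: String)
> object JoinType {
>   /** Keeps only rows with matches on both sides. */
>   case object Inner extends JoinType("inner")
>   /** Keeps all left rows, with nulls where the right side has no match. */
>   case object LeftOuter extends JoinType("left_outer")
>   /** Keeps all right rows, with nulls where the left side has no match. */
>   case object RightOuter extends JoinType("right_outer")
>   /** Keeps all rows from both sides, padding missing matches with nulls. */
>   case object FullOuter extends JoinType("full_outer")
>   case object LeftSemi extends JoinType("left_semi")
>   case object LeftAnti extends JoinType("left_anti")
>   case object Cross extends JoinType("cross")
> }
> {code}
> Because the hierarchy is sealed, pattern matches over JoinType are checked for exhaustiveness at compile time, and the scaladoc shown above addresses the documentation point from Q4.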
> *If string constants are used:*
> No current API changes; however, a new Scala object with string constants would be defined like so:
> {code:java}
> object JoinType {
>   final val INNER: String = "inner"
>   final val LEFT_OUTER: String = "left_outer"
> }
> {code}
> This approach would not allow for compile-time checking of the join types, as the usage sketch below illustrates.
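> A short usage sketch of the constants, reusing the leftDF/rightDF DataFrames from the Q3 example:
> {code:java}
> // Compiles and runs: the constant simply supplies today's string value.
> val okDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
> // Also compiles: nothing forces callers to go through the constants,
> // so a misspelled raw string still fails only at runtime.
> val typoDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outter")
> {code}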



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org