Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/04/26 09:14:12 UTC

[jira] [Assigned] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified

     [ https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14761:
------------------------------------

    Assignee:     (was: Apache Spark)

> PySpark DataFrame.join should reject invalid join methods even when join columns are not specified
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14761
>                 URL: https://issues.apache.org/jira/browse/SPARK-14761
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Josh Rosen
>            Priority: Minor
>              Labels: starter
>
> In PySpark, the following invalid DataFrame join will not result in an error:
> {code}
> df1.join(df2, how='not-a-valid-join-type')
> {code}
> The signature for `join` is
> {code}
>     def join(self, other, on=None, how=None):
> {code}
> and its code ends up completely skipping handling of the `how` parameter when `on` is `None`:
> {code}
>         if on is not None and not isinstance(on, list):
>             on = [on]
>         if on is None or len(on) == 0:
>             jdf = self._jdf.join(other._jdf)
>         elif isinstance(on[0], basestring):
>             if how is None:
>                 jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
>             else:
>                 assert isinstance(how, basestring), "how should be basestring"
>                 jdf = self._jdf.join(other._jdf, self._jseq(on), how)
>         else:
> {code}
> Given that this behavior can mask user errors (as in the above example), I think that we should refactor this to first process all arguments and then call the three-argument {{_jdf.join}}. This would handle the above invalid example by passing all arguments, including {{how}}, to the JVM DataFrame for analysis.
> I'm not planning to work on this myself, so this bugfix (+ regression test!) is up for grabs in case someone else wants to do it.
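The refactoring proposed in the description could look roughly like the sketch below. It is not the actual Spark patch. `FakeJvmDataFrame`, `ASSUMED_JOIN_TYPES`, `normalize_join_args`, and the module-level `join` are hypothetical names introduced only for illustration, and the set of accepted join types is an assumption rather than Spark's real list. The sketch also uses `str` where the quoted code uses the Python 2 `basestring`. The point it illustrates is that `on` and `how` are both normalized first, and the three-argument join is then always called, so an invalid `how` reaches the validating side even when `on` is omitted.

```python
# Illustrative stand-in for the JVM DataFrame; NOT a Spark API.
# The accepted join types below are an assumption made for this sketch.
ASSUMED_JOIN_TYPES = {"inner", "outer", "left_outer", "right_outer", "leftsemi"}


class FakeJvmDataFrame:
    """Hypothetical stub that validates the join type, as the JVM side would."""

    def join(self, other, on, how):
        if how not in ASSUMED_JOIN_TYPES:
            raise ValueError("Unsupported join type: %r" % (how,))
        return ("joined", tuple(on), how)


def normalize_join_args(on=None, how=None):
    """Process `on` and `how` up front so neither is silently dropped."""
    if on is None:
        on = []
    elif not isinstance(on, list):
        on = [on]
    if how is None:
        how = "inner"
    if not isinstance(how, str):
        raise TypeError("how should be a string")
    return on, how


def join(left, right, on=None, how=None):
    """Always take the three-argument path, even when `on` is omitted."""
    on, how = normalize_join_args(on, how)
    return left.join(right, on, how)
```

With two `FakeJvmDataFrame` instances `left` and `right`, `join(left, right, on="id")` is accepted, while `join(left, right, how="not-a-valid-join-type")` raises `ValueError` instead of silently ignoring `how`.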


