You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/05/29 12:39:47 UTC

[GitHub] spark pull request #21449: [SPARK-24385][SQL] Resolve self-join condition am...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/21449

    [SPARK-24385][SQL] Resolve self-join condition ambiguity for all BinaryComparisons

    ## What changes were proposed in this pull request?
    
    In `Dataset.join` we have a small hack for resolving ambiguity in the column name for self-joins. The current code supports only `EqualTo`, but we may have other conditions involving columns on self joins: in general any `BinaryComparison` can be specified and faces the same issue.
    
    The PR extends the fix to all `BinaryComparison`s.
    
    ## How was this patch tested?
    
    added UT


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-24385

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21449.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21449
    
----
commit 92cb513416c5dd0e9fa690c25cfae0565471a5e1
Author: Marco Gaido <ma...@...>
Date:   2018-05-29T12:34:57Z

    [SPARK-24385][SQL] Resolve self-join condition ambiguity for all BinaryComparisons

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3732/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    **[Test build #91298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91298/testReport)** for PR 21449 at commit [`e8a5fa3`](https://github.com/apache/spark/commit/e8a5fa33d56187a6e30e81ba9439cd097fff5b2c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    **[Test build #91303 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91303/testReport)** for PR 21449 at commit [`b8d5057`](https://github.com/apache/spark/commit/b8d50570b7b172ef310fdfb12b01be1598ff5481).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    > In the short term we should make the behavior of EqualTo and EqualNullSafe identical.
    
    This seems pretty safe and reasonable to me


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    **[Test build #91298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91298/testReport)** for PR 21449 at commit [`e8a5fa3`](https://github.com/apache/spark/commit/e8a5fa33d56187a6e30e81ba9439cd097fff5b2c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3704/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Sure, thanks for your time.
    
    PS `df.join(df, df("id") >= df("id"))` may be ambiguous, but in the example above
    `df1.join(df2, df2['id'].eqNullSafe(df1['id'])).collect()` where `df1` and `df2` are created from the same dataframe it is not ambiguous IMHO.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by daniel-shields <gi...@git.apache.org>.

Github user daniel-shields commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    @mgaido91 I looked at the test failures and I think the changes to the Dataset,resolve method are causing havoc. Consider the Dataset.drop method with the following signature:
    ` def drop(col: Column): DataFrame`
    There is a statement that may be comparing an AttributeReference with the new metadata to one without it:
    ```
    val colsAfterDrop = attrs.filter { attr =>
          attr != expression
        }.map(attr => Column(attr))```
    This may be resulting in columns not getting dropped. I haven't verified but this is the first thing I would check. The change to resolve may be too drastic.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    @cloud-fan do you have any further comments about this? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by daniel-shields <gi...@git.apache.org>.

Github user daniel-shields commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    I'm not sure that this behavior should be applied to all binary comparisons. It could result in unexpected behavior in some rare cases. For example:
    `df1.join(df2, df2['x'] < df1['x'])`
    If 'x' is ambiguous, this would result in the conditional being flipped.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21449

I see what you mean. Honestly I have not thought of a full design for this problem (so I can't state what we should support and what not), but focusing on this specific case I think that:

- at the moment we do support self-joins (at least in the case `df.join(df, df("id") >= df("id"))`) so considering this invalid would cause a big behavior change (potentially causing user workflows to break).
- even though we might consider acceptable such a change in a major release, I think that we should support with the Dataframe API what we support in the SQL API, and SQL standard supports self joins (using aliases for the tables). So I do believe we should support this use case.
- the case presented by @daniel-shields in https://github.com/apache/spark/pull/21449#issuecomment-392947474, I think is a valid one without any doubt. As of now we are not supporting it, though.

So I think that in the holistic approach we shouldn't change the current behavior/approach which is present now and will be (IMHO) improved by this patch.

What I do think we have to discuss in order not to have to change it - once we want to solve the more generic issue - is the way to track the dataset an attribute is coming from. Here I decided to use the metadata, since I thought this is the cleanest approach. Another approach might be to introduce a new `Option` in the `AttributeReference` a reference to the dataset it is coming from.
For the generic solution, this might have the advantage that having a reference to the provenance dataset, where we might want to store some kind of DAG of the datasets this one is coming from in order to take more complex decision about the validity of the syntax and/or the resolution of the attribute.

What do you think?

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91343/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by WenboZhao <gi...@git.apache.org>.

Github user WenboZhao commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    I like the proposal by @daniel-shields. If we could get it fixed soon, we will be able to catch up the Spark 2.3.2 release.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    My point is that, we may have a different design if we wanna solve this problem holistically, which may conflict with this patch. We should prove that this is in the right direction and future fix will not conflict with it, or come out with the final fix directly.
    
    An example is, we may want to treat `df.join(df, df("id") >= df("id"))` as invalid in the final design.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    @daniel-shields in that case you have 2 different datasets `df1` and `df2`. So they are 2 distinct attributes and the check `a.sameRef(b)` would return false. This is applied only in case you have self-joins, ie. you have the same dataset on both sides.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    ok so I created https://github.com/apache/spark/pull/21605 for the fix proposed by @daniel-shields. I'd like to leave this open in order to go on with the discussion for a long-term better fix.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by daniel-shields <gi...@git.apache.org>.

Github user daniel-shields commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    In the short term we should make the behavior of EqualTo and EqualNullSafe identical. We could do that by adding a case for EqualNullSafe that mirrors that of EqualTo.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by daniel-shields <gi...@git.apache.org>.

Github user daniel-shields commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    This case can also occur when the datasets are different but share a common lineage. Consider the following:
    `df = spark.range(10)
    df1 = df.groupby('id').count()
    df2 = df.groupby('id').sum('id')
    df1.join(df2, df2['id'].eqNullSafe(df1['id'])).collect()`
    This currently fails with eqNullSafe, but works with ==.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3666/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91253/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    thanks @daniel-shields , you're right. I am working to check if and how this can be fixed. Thanks for your catch!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    **[Test build #91343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91343/testReport)** for PR 21449 at commit [`8e6e5c0`](https://github.com/apache/spark/commit/8e6e5c0059574c1e171e589fcf533c6b5669499f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    This will definitely not go into 2.3.1, so we have plenty of time. I'll think deeper into it after the spark summit.
    
    IMO `df.join(df, df("id") >= df("id"))` is ambiguous, especially when it's not an inner join.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    **[Test build #91303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91303/testReport)** for PR 21449 at commit [`b8d5057`](https://github.com/apache/spark/commit/b8d50570b7b172ef310fdfb12b01be1598ff5481).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    **[Test build #91343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91343/testReport)** for PR 21449 at commit [`8e6e5c0`](https://github.com/apache/spark/commit/8e6e5c0059574c1e171e589fcf533c6b5669499f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    yes @daniel-shields, you are right with your analysis. The problem was that we were sometimes using `==`, sometimes `semanticEquals`. And `equals` has the problem you mentioned.
    
    I think this is the only way for addressing the problem described here is to reference which dataset the column is coming from. I think adding a metadata for it is the cleanest way. We may also add a new attribute to the `Attribute` class instead of using metadata, but honestly this way seemed cleaner to me. What do you think? Do you have other suggestions?
    
    cc @cloud-fan @hvanhovell @gatorsmile  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    @daniel-shields do you want to open a PR for that? I'll leave this PR open as it is a more general fix so we can go on with the long-term discussion here in this PR. Do you agree with this approach @cloud-fan ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3702/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    **[Test build #91253 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91253/testReport)** for PR 21449 at commit [`92cb513`](https://github.com/apache/spark/commit/92cb513416c5dd0e9fa690c25cfae0565471a5e1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    This is a long-standing issue, I've seen many attempts to fix it (including myself) but no one success.
    
    The major problem is, there is no clear definition of the expected behavior, i.e. what's the semantic of `Dataset.col`?
    
    some examples
    ```
    df.select(df.col("i")) // valid
    
    val df1 = df.filter(...)
    df1.select(df.col("i")) // still valid
    
    df.join(otherDF, df.col("i") === otherDF.col("i")) // valid
    
    df.join(otherDF).select(df.col("i"), otherDF("i"))  // valid
    
    val df2 = df.select(df.col("i") + 1)
    df2.select(df.col("i"))   // invalid
    ```
    
    Sometime we can use an ancestor's column in a new Dataset but sometimes we can't. We should make the condition clear first.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91298/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91303/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    Thanks for your comment @cloud-fan. I understand your point. That is quite a tricky problem, since we should know probably also the "DAG" of the dataframes in order to take the right decision.
    
    But despite this change is related to that problem, I think it is different and with a much smaller scope. Indeed, while we can use the metadata information in many places, actually in this patch is is used only in the self-join case when there is ambiguity in which column to take. The behavior in any other case in unchanged.
    
    So after this patch, the situation in resolving column using `col` is unchanged. The only places where the dataset of provenance is checked is in self joins. The goal here is only to support cases which were throwing exceptions in resolving the right column.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21449: [SPARK-24385][SQL] Resolve self-join condition ambiguity...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21449
  
    **[Test build #91253 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91253/testReport)** for PR 21449 at commit [`92cb513`](https://github.com/apache/spark/commit/92cb513416c5dd0e9fa690c25cfae0565471a5e1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org