Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/07/17 08:18:52 UTC

[GitHub] spark pull request #18652: [WIP] Pull non-deterministic conditions from Join...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/18652

    [WIP] Pull non-deterministic conditions from Join operator if possible

    ## What changes were proposed in this pull request?
    
    Currently, non-deterministic conditions cannot be used in the Join operator. However, in certain scenarios we may be able to pull the non-deterministic conditions out of the Join operator.
    
    Related discussion:
    
    1. https://github.com/apache/spark/pull/15417#discussion_r85295977
    2. http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973
    
    ## How was this patch tested?
    
    Added unit test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 nondeter-join

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18652.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18652
    
----
commit ef5394027b0033a772757497d79ac0c04ccf37e0
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-07-17T07:46:54Z

    Pull non-deterministic conditions from Join operator if possible.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    I think the goal is just to resolve the migration issues for Hive users. If we provide only very limited support, I do not think it will help workload migration.
    
    If we really want to guarantee correctness, we need to resolve many issues (e.g., `EnsureRequirements` could also change the call order of non-deterministic expressions). That would take a lot of effort.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79820/
    Test PASSed.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79672/testReport)** for PR 18652 at commit [`08581d9`](https://github.com/apache/spark/commit/08581d9e84be4e7d18c87339b77565054d392586).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by baibaichen <gi...@git.apache.org>.

Github user baibaichen commented on the issue:

https://github.com/apache/spark/pull/18652

The naive database join implementation looks like:

```
for each tuple in left relation
    for each tuple in right relation
        if the tuple pair matches the join condition then ..
        else ..
```
Conceptually, both inner and outer joins first build a cross join and then remove the tuple pairs that don't match the join condition. In the deterministic case, any optimization is allowed as long as the final result is the same as that of the computation above.

However, the join has no unique result in the non-deterministic case. For example, consider the pseudo-random condition `on rand(10) < 0.5`: we get the same sequence of values for the same seed, but the final result depends on the order in which tuple pairs are produced.
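
To make the order dependence concrete, here is a minimal Python sketch of the nested-loop join above. It is not Spark or Hive code: `make_rand` is a hypothetical stand-in for a seeded `rand()` that replays a fixed sequence of draws, and the two-row tables are made up for illustration.

```python
def make_rand(draws):
    """Hypothetical stand-in for a seeded rand(): replays a fixed sequence."""
    it = iter(draws)
    return lambda: next(it)


def nested_loop_join(left, right, rand, left_is_outer=True):
    """Cross product filtered by `rand() < 0.5`, evaluated once per tuple pair."""
    pairs = (
        [(l, r) for l in left for r in right]
        if left_is_outer
        else [(l, r) for r in right for l in left]
    )
    return [pair for pair in pairs if rand() < 0.5]


DRAWS = [0.1, 0.1, 0.9, 0.9]  # same "seed" => same sequence of draws
LEFT, RIGHT = ["l1", "l2"], ["r1", "r2"]

left_major = nested_loop_join(LEFT, RIGHT, make_rand(DRAWS), left_is_outer=True)
right_major = nested_loop_join(LEFT, RIGHT, make_rand(DRAWS), left_is_outer=False)

print(left_major)   # [('l1', 'r1'), ('l1', 'r2')]
print(right_major)  # [('l1', 'r1'), ('l2', 'r1')]
```

The draws are identical in both runs; only the order in which pairs are enumerated differs, and that alone changes which pairs survive.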

Since the result highly depends on internal execution engine, there is no standard behavior. For example, running EXPLAIN on the following SQL in Hive (version 1.2.1):

```
SELECT a.date_id from tmp.tmp_lifan_trfc_tpa_hive a left outer join dw.dim_site_categ_ext c
on case
     when a.nav_tcdt is null then
       cast(rand(9) * 1000 - 9999999999 as string)
     else
       a.nav_tcdt
   end = c.site_categ_id
and rand(c.site_categ_skid) < 0.5
and rand(a.pltfm_id) >=0.5;
```
I find that Hive pushes down `rand(c.site_categ_skid) < 0.5` and `rand(a.pltfm_id) >=0.5` to the Filter operator. I guess that Hive doesn't take non-determinism in the join condition into account. I will verify this later.

By the way, Spark is a distributed execution engine, which is different from a traditional DBMS (MySQL, Oracle), so we can't do the same thing. For example, rand starts from its initial seed in each worker.
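
As a rough illustration of the per-worker seeding point, the Python sketch below gives each partition its own generator seeded with `seed + partition_index`. That seeding scheme and the `rand_per_partition` helper are assumptions made for the example, not a description of any engine's internals; the point is only that the value a row receives depends on how rows are split across partitions.

```python
import random


def rand_per_partition(partitions, seed):
    """Attach a pseudo-random value to every row, one generator per partition.

    Assumption for this sketch: partition i uses a generator seeded with
    seed + i, and rows are visited in partition order.
    """
    result = {}
    for index, rows in enumerate(partitions):
        rng = random.Random(seed + index)
        for row in rows:
            result[row] = rng.random()
    return result


one_partition = rand_per_partition([["a", "b", "c", "d"]], seed=10)
two_partitions = rand_per_partition([["a", "b"], ["c", "d"]], seed=10)

# Rows that stay in partition 0 see the same draws in both layouts.
print(one_partition["a"] == two_partitions["a"])
# Row "c" is drawn from a different generator once it moves to partition 1.
print(one_partition["c"], two_partitions["c"])
```

With the same seed, the value attached to a row is reproducible only as long as the partitioning stays the same.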


[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127909294
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    I just did a simple test on Oracle. Looks like it allows the following query:
    
    SELECT * from test1 join test2 on test1.a + FLOOR(DBMS_RANDOM.VALUE()) = test2.b + FLOOR(DBMS_RANDOM.VALUE());
    
    Furthermore, it also allows non-deterministic functions in joining conditions other than the joining keys.




[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79663/
    Test FAILed.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79672/testReport)** for PR 18652 at commit [`08581d9`](https://github.com/apache/spark/commit/08581d9e84be4e7d18c87339b77565054d392586).



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79821/
    Test FAILed.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    I just checked Hive's behavior (2.1.0). I tried a query like `select * from l left outer join r on rand(l.a) > 0.1 and rand(cast(l.b as int)) > 0.2 and rand(r.c) > 0.2 and rand(cast(r.d as int)) > 0.5;`.
    
    The conditions `rand(r.c) > 0.2 and rand(cast(r.d as int)) > 0.5` are pushed down to the Filter operator.
    
              TableScan
                alias: r
                Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: c (type: int), d (type: double)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((rand(UDFToInteger(_col1)) > 0.5) and (rand(_col0) > 0.2)) (type: boolean)
                    Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      filter predicates:
                        0 {(rand(_col0) > 0.1)} {(rand(UDFToInteger(_col1)) > 0.2)}
                        1 
                      keys:
                        0 
    
    The other conditions, `rand(l.a) > 0.1 and rand(cast(l.b as int)) > 0.2`, are kept as filter predicates in the Join operator.
    
                  Map Join Operator
                    condition map:
                         Left Outer Join0 to 1
                    filter predicates:
                      0 {(rand(_col0) > 0.1)} {(rand(UDFToInteger(_col1)) > 0.2)}
                      1 
                    keys:
                      0 
                      1 
    
    For a query with non-deterministic joining keys, such as `select * from l left outer join r on rand(l.a) = rand(r.c);`, there is no pushdown. Hive simply evaluates the joining keys.
    
                  Map Join Operator
                    condition map:
                         Left Outer Join0 to 1
                    keys:
                      0 rand(_col0) (type: double)
                      1 rand(_col0) (type: double)
                    outputColumnNames: _col0, _col1, _col2, _col3
    




[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79663/testReport)** for PR 18652 at commit [`ef53940`](https://github.com/apache/spark/commit/ef5394027b0033a772757497d79ac0c04ccf37e0).



[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79663/testReport)** for PR 18652 at commit [`ef53940`](https://github.com/apache/spark/commit/ef5394027b0033a772757497d79ac0c04ccf37e0).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18652

@baibaichen Thanks for providing the info.

In the case of Filter, rand already starts from its initial seed in each worker. I'm not sure whether this is the reason that prevents us from doing this.

The main concern I have for now is multiple non-deterministic joining conditions (not joining keys). Pushing down multiple non-deterministic joining conditions changes the number of times the expressions are called. Take the example you showed, with `rand(c.site_categ_skid) < 0.5` and `rand(a.pltfm_id) >=0.5`. Without pushdown, you may call only the first rand and skip the second one. With pushdown, both rands are called for every row of their respective tables.
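
The change in call counts can be sketched in a few lines of Python. `CountingRand` is a hypothetical stand-in for a seeded `rand()` that replays fixed draws and counts how often it is called; the tables and draws are made up. The first function evaluates the conjunction per tuple pair with short-circuiting, the second filters each side before joining.

```python
from itertools import cycle


class CountingRand:
    """Hypothetical stand-in for rand(seed): replays fixed draws, counts calls."""

    def __init__(self, draws):
        self._draws = cycle(draws)
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return next(self._draws)


def join_without_pushdown(left, right, rand_l, rand_r):
    # Condition evaluated per tuple pair; `and` skips rand_r when rand_l fails.
    return [(l, r) for l in left for r in right
            if rand_l() < 0.5 and rand_r() >= 0.5]


def join_with_pushdown(left, right, rand_l, rand_r):
    # Each predicate is evaluated once per row of its own table, before the join.
    kept_left = [l for l in left if rand_l() < 0.5]
    kept_right = [r for r in right if rand_r() >= 0.5]
    return [(l, r) for l in kept_left for r in kept_right]


LEFT, RIGHT = [1, 2, 3], [10, 20]

a1, b1 = CountingRand([0.1, 0.9]), CountingRand([0.7])
no_pushdown = join_without_pushdown(LEFT, RIGHT, a1, b1)

a2, b2 = CountingRand([0.1, 0.9]), CountingRand([0.7])
pushdown = join_with_pushdown(LEFT, RIGHT, a2, b2)

print(a1.calls, b1.calls, no_pushdown)  # 6 3 [(1, 10), (2, 10), (3, 10)]
print(a2.calls, b2.calls, pushdown)     # 3 2 [(1, 10), (1, 20), (3, 10), (3, 20)]
```

Both the number of calls and the joined rows differ, even though the draws fed to each stand-in are the same.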

A lesser concern for me is non-deterministic joining keys. Under the current Spark SQL join implementations, joining keys are evaluated exactly once for each row of the two joined tables, so pushdown won't change the number of times the expressions are called. IIUC, it is safer to push down non-deterministic joining keys. Please correct me if I'm wrong about this part.
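
A minimal Python sketch of the joining-key case, under the assumption that the key expression is materialized as a column by a projection below the join: each row's key is computed exactly once, and the join then compares only the stored values. The helper names (`project_key`, `hash_join`) and the sample rows are hypothetical; the key expression loosely mirrors the `case when ... is null then rand ...` pattern from the Hive query quoted earlier in the thread.

```python
import random


def project_key(rows, key_expr):
    """Evaluate key_expr once per row and keep the result next to the row."""
    return [(key_expr(row), row) for row in rows]


def hash_join(left_keyed, right_keyed):
    """Inner equi-join on the pre-computed keys; key_expr is never re-run here."""
    table = {}
    for key, row in right_keyed:
        table.setdefault(key, []).append(row)
    return [(l, r) for key, l in left_keyed for r in table.get(key, [])]


rng = random.Random(9)
calls = {"left": 0}


def left_key(row):
    # NULL ids get a random negative key so they never match a real id.
    calls["left"] += 1
    if row["id"] is not None:
        return row["id"]
    return str(rng.random() * 1000 - 9999999999)


left = [{"id": "x"}, {"id": None}, {"id": "y"}]
right = [{"id": "x"}, {"id": "y"}, {"id": "y"}]

joined = hash_join(project_key(left, left_key), project_key(right, lambda r: r["id"]))
print(calls["left"], len(joined))  # 3 3
```

The non-deterministic key is evaluated once per left row regardless of how many right rows it is later compared against.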

> Since the result highly depends on internal execution engine, there is no standard behavior.

I tend to agree with that, based on my recent thinking about this. So for now my proposal is to:

1. Support pushdown of non-deterministic joining keys.
2. Add a config to control it, disabled by default.
3. Not support pushdown of non-deterministic joining conditions for now.


[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Once we pull them out into a downstream Project, should we still worry about the call order? They are evaluated before any sort or shuffle that is added later.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79674/testReport)** for PR 18652 at commit [`21c1fed`](https://github.com/apache/spark/commit/21c1fedf35cf23d91f1ca348ecf1b76bb742b44b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79674/testReport)** for PR 18652 at commit [`21c1fed`](https://github.com/apache/spark/commit/21c1fedf35cf23d91f1ca348ecf1b76bb742b44b).



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79821/testReport)** for PR 18652 at commit [`b80d2fc`](https://github.com/apache/spark/commit/b80d2fca6362d63a5eecf43a40b30040d4ac392b).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127903005
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    +          // We can't push down non-deterministic conditions.
    +          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
    --- End diff --
    
    I do not think there is an easy solution that ensures it always works as you expect. `EnsureRequirements` is just one of the rules that could break it.
    
    Other join conditions that precede the equi-join condition could also affect it. It could be skipped if a preceding join condition is false, right?
    




[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    @viirya I have not started reading the comments and the code carefully yet. I just want to confirm whether the code changes in this PR follow what Hive does when the flag is turned on. If not, what is the difference in behavior? Thanks!



[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    I have a concern about @jiangxb1987's previous comment:
    
    > The reason I don't use Project operator is that might change the result correctness, for example, suppose we have join condition rand(1) > 0 && rand(11) < 0, the operator Project(Seq(rand(1), rand(11)), child) will always eval the both rand functions, which will change the result.
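
To illustrate the concern in the quoted comment, the Python sketch below compares short-circuit evaluation of a two-part condition with a projection that computes both operands for every row. `Draws` is a hypothetical stand-in for a seeded `rand()` that replays a fixed sequence, and the thresholds differ from the quoted `rand(1) > 0 && rand(11) < 0` so that both outcomes occur.

```python
class Draws:
    """Hypothetical stand-in for rand(seed): replays a fixed sequence of draws."""

    def __init__(self, values):
        self._it = iter(values)

    def __call__(self):
        return next(self._it)


ROWS = ["r1", "r2", "r3"]
FIRST = [0.9, 0.1, 0.1]   # draws of the first rand
SECOND = [0.1, 0.9, 0.1]  # draws of the second rand


def filter_short_circuit(rows):
    first, second = Draws(FIRST), Draws(SECOND)
    # The second rand only advances for rows where the first test passes.
    return [row for row in rows if first() < 0.5 and second() < 0.5]


def filter_after_project(rows):
    first, second = Draws(FIRST), Draws(SECOND)
    # Project: both rands are evaluated for every row, then the filter runs.
    projected = [(row, first(), second()) for row in rows]
    return [row for row, a, b in projected if a < 0.5 and b < 0.5]


print(filter_short_circuit(ROWS))  # ['r2']
print(filter_after_project(ROWS))  # ['r3']
```

Because the projection advances the second generator on every row, each row is paired with a different draw than under short-circuit evaluation, and a different row passes the filter.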
    




[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    `t1.a = t2.b` is an equi join condition. `t1.c > rand()` is not. They will be split and considered individually.
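
As a rough sketch of that split (not the actual `ExtractEquiJoinKeys` implementation), the Python below models each conjunct as a small tuple and separates equalities that compare one column from each side from everything else. The tuple encoding and the `split_join_condition` helper are hypothetical.

```python
def split_join_condition(conjuncts, left_cols, right_cols):
    """Split AND-ed conjuncts into equi-join key pairs and remaining conditions.

    Each conjunct is a tuple (op, lhs, rhs); column operands are strings such
    as "t1.a", anything else (e.g. "rand()") is treated as a non-column.
    """
    keys, others = [], []
    for op, lhs, rhs in conjuncts:
        if op == "=" and lhs in left_cols and rhs in right_cols:
            keys.append((lhs, rhs))
        elif op == "=" and lhs in right_cols and rhs in left_cols:
            keys.append((rhs, lhs))  # normalize to (left key, right key)
        else:
            others.append((op, lhs, rhs))
    return keys, others


condition = [("=", "t1.a", "t2.b"), (">", "t1.c", "rand()")]
keys, others = split_join_condition(condition, {"t1.a", "t1.c"}, {"t2.b"})

print(keys)    # [('t1.a', 't2.b')]
print(others)  # [('>', 't1.c', 'rand()')]
```

Only the entries in `keys` would be candidates for being pulled out as joining keys; the `rand()` comparison stays in the remaining conditions.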



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    ping @cloud-fan Do you have time to review this? Thanks.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    We could add a `Sort` above the `Project`, and then the order becomes different, right?



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127918499
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    Most RDBMSs allow non-deterministic join conditions. To support them correctly in Spark, we need to check how the other systems behave. Once we decide on the rule, we can't break it, so the initial version has to be designed very carefully.
    
    At the current stage, I do not think we have the bandwidth to make it perfect. If you want to continue the PR, could you just check how Hive works and add an extra flag for Hive users? It can simplify their migration task. Turn it off by default.


[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    When we join two tables whose equi-join keys are non-deterministic, for example `t1.a = rand(t2.b)` and `t1.c = rand(t2.d)`, we pull the keys out into a `Project` below the join:
    
        Join [t1.a = rand(t2.b), t1.c = rand(t2.d)]
          Project [t1.a, t1.c]
            TableScan t1
          Project [rand(t2.b) as rand(t2.b), rand(t2.d) as rand(t2.d)]
            TableScan t2
    
    `rand(t2.b)` and `rand(t2.d)` are evaluated in the projection. Why would the Join change their order?
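
The plan above can be sketched in plain Scala without any Spark dependency. This is only an illustration under stated assumptions: `T1`, `T2`, and `PullOutJoinKeys` are hypothetical names, `scala.util.Random` stands in for `rand`, and the column arguments of `rand` are not modelled. The point it shows is that the projection evaluates each non-deterministic key once per `t2` row, and the join then compares stored values only.

```scala
import scala.util.Random

// Hypothetical row types standing in for tables t1 and t2.
final case class T1(a: Double, c: Double)
final case class T2(b: Int, d: Int)

object PullOutJoinKeys {
  // "Project" below the join: evaluate both non-deterministic keys exactly
  // once per t2 row and keep the results next to the row.
  def project(t2: Seq[T2], rng: Random): Seq[(T2, Double, Double)] =
    t2.map(r => (r, rng.nextDouble(), rng.nextDouble()))

  // "Join": compares only the already-materialized key values, so no
  // non-deterministic expression is evaluated here.
  def join(t1: Seq[T1], projected: Seq[(T2, Double, Double)]): Seq[(T1, T2)] =
    for {
      l <- t1
      (r, k1, k2) <- projected
      if l.a == k1 && l.c == k2
    } yield (l, r)
}
```

In this sketch the join step could be swapped for a hash-based or sort-based lookup without changing which key value each `t2` row carries, because the values were fixed in `project`.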



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127891910
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    For the different join types, I think the joining keys are only used to find matching/non-matching rows. Currently I can't think of a case where we can't push down non-deterministic joining keys. Maybe you can show an example?



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127894313
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    IIUC, joining keys actually satisfy what you said: they are evaluated in the same order and the same number of times as when we don't push them down.
    
    I can't think of an example where they aren't, so may I ask if you have one?



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127897096
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    How would `rand(a)` and `rand(b)` share the same state? They are different expression instances.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Then, will this PR resolve the migration issue from Hive workloads?



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79819/
    Test PASSed.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    It seems we can't reach an agreement on this topic, so I'll close this for now.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127892847
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    What is the join key? Any definition?



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Yeah, I know that. I'm thinking about whether we need to change it by considering the position.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79673/testReport)** for PR 18652 at commit [`03f4d9f`](https://github.com/apache/spark/commit/03f4d9fcba3be969a9ae0176ce7370e250033473).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127875986
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    I meant joining keys. I am not sure if `a = c && rand(b) < 0` is a joining key?



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    @gatorsmile Ok. No problem. Thanks.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127897413
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    > One row in side A could match multiple rows in side B. The join conditions could be also evaluated multiple times for the same row in side A, right? Then, if we push it down to the side A, it could also break the number of rand calls, right?
    
    No. Joining keys are evaluated once, up front, on each of the two tables. Then we simply match the evaluated results.




[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Btw, I guess that is why we also pull non-deterministic grouping expressions for Aggregate?



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    You are talking about the number of calls. I am worried about the call order. We could add a `Sort`.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    The order is different from the original one that is evaluated in the join conditions. 



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    If we simply allow it, the evaluation order of non-deterministic join conditions will differ across join implementations, e.g. sort-based and hash-based. Then we will get inconsistent join results.
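
To make the order dependence concrete, here is a minimal plain-Scala sketch. It is an assumption-laden model, not Spark code: `EvalOrder` and `assign` are hypothetical names, and a seeded `scala.util.Random` stands in for a stateful non-deterministic expression. It shows that the value each row receives depends on the order in which rows are visited.

```scala
import scala.util.Random

object EvalOrder {
  // Give each row id one draw from a seeded generator, visiting the rows in
  // the given order. The generator is stateful, so the visiting order decides
  // which draw each row receives.
  def assign(rowIds: Seq[Int], seed: Long): Map[Int, Double] = {
    val rng = new Random(seed)
    rowIds.map(id => id -> rng.nextDouble()).toMap
  }
}
```

Calling `EvalOrder.assign(Seq(1, 2, 3), 7L)` and `EvalOrder.assign(Seq(3, 2, 1), 7L)` produces the same three draws, but row 1 receives the first draw in one case and the third draw in the other. Two join implementations that visit rows in different orders would therefore compare different values for the same row.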



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    > The order is different from the original one that is evaluated in the join conditions.
    
    I'm not sure which original order you mean. By pulling them out to `Project`, they are evaluated in the order of the rows in the tables.
    
    If you mean the original order is the one after `Sort`, I don't think that is correct. `Sort` is an implementation detail; we should stick with the order of rows in the joining tables.




[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    @cloud-fan @gatorsmile More thoughts or comments for this change? Thanks.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    
        Join [t1.a = rand(t2.b), t1.c = rand(t2.d)]
          Sort
            Project [t1.a, t1.c]
              TableScan t1
          Sort
            Project [rand(t2.b) as rand(t2.b), rand(t2.d) as rand(t2.d)]
              TableScan t2
    
    Aren't `rand(t2.b)` and `rand(t2.d)` already evaluated in `Project`? Why would `Sort` change the evaluation order?
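
The argument in this plan can be sketched in plain Scala. The names `SortAfterProject`, `project`, and `sortByKey` are hypothetical, and a seeded `scala.util.Random` stands in for `rand`; this is not how Spark executes the plan, only a model of it. It shows that a sort placed above the projection reorders rows but does not re-evaluate the generated values.

```scala
import scala.util.Random

object SortAfterProject {
  // "Project": attach one draw to each row, in scan order.
  def project(rows: Seq[Int], seed: Long): Seq[(Int, Double)] = {
    val rng = new Random(seed)
    rows.map(r => (r, rng.nextDouble()))
  }

  // "Sort" above the projection: reorders the rows by the generated key.
  // The key values are read, never recomputed.
  def sortByKey(projected: Seq[(Int, Double)]): Seq[(Int, Double)] =
    projected.sortBy(_._2)
}
```

After `sortByKey`, every row still carries the value it was given in `project`; only the position of the row changes. Whether this matches the semantics other systems give to a non-deterministic join condition is the separate question raised in the review.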



[GitHub] spark pull request #18652: [SPARK-21497][SQL] Pull non-deterministic equi jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/18652



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #80376 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80376/testReport)** for PR 18652 at commit [`abf51f7`](https://github.com/apache/spark/commit/abf51f7c76016737d494ac23d3071b2301f96445).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127901508
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    Could you check the behavior of DB2 and Oracle? This is about semantics, not performance. We need to check what the correct behavior is.
    
    BTW, `EnsureRequirements` could also add an extra `Sort` below the join. In our implementation, we have never considered supporting this. Many factors could break this assumption.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    I agree it would greatly simplify the non-deterministic case if we allowed at most one non-deterministic predicate in `Join.condition`, but from a user's point of view, wouldn't this constraint look a little weird?



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127895419
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    One row in side A could match multiple rows in side B. The join conditions could be also evaluated multiple times for the same row in side A, right? Then, if we push it down to the side A, it could also break the number of `rand` calls, right?
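
The concern about the number of calls can be illustrated with a small plain-Scala sketch. It models a naive nested-loop join, which is an assumption made for illustration and not a description of how Spark evaluates equi-join keys; `EvalCount` and `CountingExpr` are hypothetical names. A counter stands in for `rand` so that every evaluation is observable.

```scala
object EvalCount {
  // Stand-in for a non-deterministic expression: each call is counted.
  final class CountingExpr {
    var calls = 0
    def eval(): Int = { calls += 1; calls }
  }

  // Condition evaluated inside a nested-loop join: once per (left, right) pair.
  def inJoin(left: Seq[Int], right: Seq[Int]): Int = {
    val e = new CountingExpr
    for (_ <- left; _ <- right) e.eval()
    e.calls
  }

  // Expression pushed below the join: once per row of the side it reads.
  def pushedDown(side: Seq[Int]): Int = {
    val e = new CountingExpr
    side.foreach(_ => e.eval())
    e.calls
  }
}
```

With three rows on one side and two on the other, `inJoin` performs six evaluations while `pushedDown` on the two-row side performs two. Whether the push-down preserves results therefore depends on which of these two behaviors is taken as the reference semantics.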



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79827/testReport)** for PR 18652 at commit [`b80d2fc`](https://github.com/apache/spark/commit/b80d2fca6362d63a5eecf43a40b30040d4ac392b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79819 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79819/testReport)** for PR 18652 at commit [`73f9827`](https://github.com/apache/spark/commit/73f982763a5d2991c51903aaad19c9ce6d06a49d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79673/
    Test PASSed.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127893543
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    The major point here is that pushing down a non-deterministic join condition is safe only when the results are exactly the same before and after the push-down. After we push it down, it will basically be evaluated for each row of that side. Will it be evaluated in the same order and the same number of times as when we do not push it down? We can find many different scenarios that break this.
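
To make the concern about the number of evaluations concrete, here is a minimal Python sketch. It is a toy model, not Spark code: the nested-loop join, the function names, and the use of `random.Random` as a stand-in for `rand(seed)` are all assumptions for illustration. Evaluated inside the join, the expression is drawn once per pair of rows; pushed down, it is drawn once per row of one side.

```python
import random

def join_with_inline_condition(left, right, seed):
    """Nested-loop join: the random key is drawn once per (left, right) pair."""
    rng = random.Random(seed)
    calls = 0
    out = []
    for l in left:
        for r in right:
            calls += 1
            if rng.randrange(3) == r:
                out.append((l, r))
    return out, calls

def join_with_pushed_down_key(left, right, seed):
    """The random key is drawn once per left row, then used as an equi-join key."""
    rng = random.Random(seed)
    calls = 0
    keyed = []
    for l in left:
        calls += 1
        keyed.append((l, rng.randrange(3)))
    out = [(l, r) for l, k in keyed for r in right if k == r]
    return out, calls

left = ["a", "b", "c", "d"]
right = [0, 1, 2]
inline_rows, inline_calls = join_with_inline_condition(left, right, seed=42)
pushed_rows, pushed_calls = join_with_pushed_down_key(left, right, seed=42)
print(inline_calls, pushed_calls)  # 12 draws versus 4 draws
```

In the pushed-down variant every left row matches exactly one right row, while in the inline variant a left row may match zero, one, or several right rows, so the two plans are not guaranteed to return the same result.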



[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127907280
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    +          // We can't push down non-deterministic conditions.
    +          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
    --- End diff --
    
    We do not support non-deterministic join conditions. Thus, the current execution order in our join implementations might not behave correctly.
    
    If we really need to support them, we have to check what the right behavior is in traditional DB systems.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    @gatorsmile @cloud-fan Do you have more comments or thoughts on this? Thanks.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127965749
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    Sure. I agree.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18652

It is a good question. Based on the previous discussion, I think the Join operator has no unique result in the non-deterministic case. The migration issue from Hive arises because this kind of query can't be run by the current SparkSQL. Whether we exactly follow Hive's behavior is not the real pain point, unless I have missed something.

Take the non-deterministic join in the query seen in the discussion thread at http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Syntax-quot-case-when-quot-doesn-t-be-supported-in-JOIN-tc21953.html#a21973 as an example:

SELECT a.col1
FROM tbl1 a
LEFT OUTER JOIN tbl2 b
ON
  CASE
    WHEN a.col2 IS NULL
      THEN cast(rand(9)*1000 - 9999999999 as string)
    ELSE a.col2
  END = b.col3;

Here the differing join result (the randomized value replacing NULL values of a.col2) doesn't actually matter, because the query retains only a.col1 in the end. The purpose of this query is to mitigate data skew (a huge number of NULLs) when joining.

That said, the columns from non-deterministic join conditions are not really useful in such cases. I even think this may be the reason Hive doesn't give special consideration to non-deterministic expressions when joining. As there's no unique result, users have to take care of what they do with it.

I am not sure if I have conveyed the idea clearly.

As an analogy, consider a query like `SELECT a.col1 FROM table a WHERE rand() > 0.5`. Even if our rand generates different numbers than Hive's, so that the final result is different, supporting it still helps with migrating Hive workloads that use rand.
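
The claim that the randomized key does not affect the projected column can be checked with a small Python sketch. This is a toy model of the query above, not Spark or Hive code: the table contents, the function name, and the use of `random.Random` in place of `rand(9)` are assumptions for illustration.

```python
import random
from collections import Counter

def left_outer_join_col1(tbl1, tbl2, seed):
    """Return a.col1 for each row of the LEFT OUTER JOIN with salted NULL keys.

    tbl1 rows are (col1, col2); tbl2 rows are (col3,).
    """
    rng = random.Random(seed)
    right_keys = Counter(col3 for (col3,) in tbl2)
    out = []
    for col1, col2 in tbl1:
        # Mirrors the CASE expression: a NULL key becomes a random negative string.
        key = str(rng.random() * 1000 - 9999999999) if col2 is None else col2
        matches = right_keys.get(key, 0)
        # An outer join keeps the left row once even when nothing matches.
        out.extend([col1] * max(matches, 1))
    return out

tbl1 = [("r1", "k1"), ("r2", None), ("r3", None), ("r4", "k2"), ("r5", "k1")]
tbl2 = [("k1",), ("k1",), ("k3",)]

result_a = left_outer_join_col1(tbl1, tbl2, seed=9)
result_b = left_outer_join_col1(tbl1, tbl2, seed=10)
print(result_a == result_b)
```

Because the salted keys never equal any value of b.col3 in this data, the NULL rows behave like unmatched rows under every seed, and only their distribution across join partitions would change in a real engine.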


[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    retest this please.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127895248
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    +          // We can't push down non-deterministic conditions.
    +          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
    --- End diff --
    
    Joining keys can only come from equi-join conditions. This is exactly the use case discussed on the dev mailing list, and it is actually useful there.
    
    A general non-deterministic join condition pushdown doesn't make a lot of sense. Predicates like `rand(1) > 0 && rand(11) < 0` can be a serious concern: the join results can be different before and after pushdown.




[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79820/testReport)** for PR 18652 at commit [`951124f`](https://github.com/apache/spark/commit/951124fccbb3a0d15066c4c8db1c6805af686384).



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    @gatorsmile When the flag is enabled, we don't follow Hive on non-deterministic join conditions.
    
    The differences are:
    
    * Hive allows non-deterministic expressions in equi join keys and other join conditions. We only allow this kind of expression as equi join keys.
    * From inspecting Hive, non-deterministic expressions used as join conditions are not treated any differently from deterministic ones. We pull non-deterministic equi join keys out of Join operators.
    




[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    I mean, `t1.a = t2.b` placed before a non-deterministic condition is an equi join condition, but `t1.a = t2.b` placed after a non-deterministic condition is not. Is this true?



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81054/



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79821/testReport)** for PR 18652 at commit [`b80d2fc`](https://github.com/apache/spark/commit/b80d2fca6362d63a5eecf43a40b30040d4ac392b).



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Can we say that `t1.a = t2.b && t1.c > rand()` is an equi-join condition, but `t1.c > rand() && t1.a = t2.b` is not?



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127895399
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    `rand(a)` and `rand(b)` each belong to an individual table, so they are evaluated separately on their respective tables.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127893174
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    We use `ExtractEquiJoinKeys` to extract joining keys. You can check it.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79820/testReport)** for PR 18652 at commit [`951124f`](https://github.com/apache/spark/commit/951124fccbb3a0d15066c4c8db1c6805af686384).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Let me talk with more people to get feedback. Will respond to you later. Thanks!



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79674/



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Yeah, for non-deterministic non-equi join conditions, you'd face the issue of changing the number of calls. So I currently don't plan to support them here.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #81054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81054/testReport)** for PR 18652 at commit [`793dac4`](https://github.com/apache/spark/commit/793dac4403926fb9f1421f4bbee59a8e9b82d7e8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79827/testReport)** for PR 18652 at commit [`b80d2fc`](https://github.com/apache/spark/commit/b80d2fca6362d63a5eecf43a40b30040d4ac392b).



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #80376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80376/testReport)** for PR 18652 at commit [`abf51f7`](https://github.com/apache/spark/commit/abf51f7c76016737d494ac23d3071b2301f96445).



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127894772
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    Even for an equi join, how about `rand(a) = rand(b)`?



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79819/testReport)** for PR 18652 at commit [`73f9827`](https://github.com/apache/spark/commit/73f982763a5d2991c51903aaad19c9ce6d06a49d).



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by baibaichen <gi...@git.apache.org>.

Github user baibaichen commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    @viirya , @jiangxb1987 @gatorsmile 
    
    In general, Hive doesn't take non-determinism into account in join conditions.
    
    Some terms:
    
    1. Equi-join on keys, e.g. a.key = b.key, denoted **Joink**.
    2. Filter, e.g. a.key = 2 or a.key > 1, denoted **JoinF**.
    
    Prior to 2.2.0, Hive didn't support OR, so a join condition looks like the following:
    
    > _Joink_ **and** _Joink_ **and** _JoinF_
    
    For **Joink**, keys are extracted for later hashing (reduce-side or map-side join). For **JoinF**, filters are pushed down according to [OuterJoinBehavior](https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior).
    
    All the code is in [`SemanticAnalyzer.parseJoinCondition`](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L2854). Predicate pushdown starts at line [2902](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L2902).
    
    Since 2.2.0 (with [HIVE-15211](https://issues.apache.org/jira/browse/HIVE-15211), [HIVE-15251](https://issues.apache.org/jira/browse/HIVE-15251)), Hive supports complex expressions in ON clauses, but it still doesn't take non-determinism into account.
    
    Hive just pushes down filters whenever possible! Given that, I agree with @viirya's suggestion.
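
The Joink/JoinF split described above can be sketched in a few lines of Python. This is an illustrative toy, not Hive's `parseJoinCondition`: the `Pred` tuple, the helper names, and the rule "column references start with a letter and contain a dot" are assumptions made for the example. Note that nothing in the classification looks at whether an operand is deterministic, which is the point being made about Hive.

```python
from typing import NamedTuple

class Pred(NamedTuple):
    left: str   # e.g. "a.key"
    op: str     # e.g. "=", ">"
    right: str  # e.g. "b.key", or a literal such as "2"

def table_of(operand):
    """Return the table alias of a column reference, or None for a literal."""
    if operand[0].isalpha() and "." in operand:
        return operand.split(".", 1)[0]
    return None

def split_join_condition(conjuncts):
    """Split AND-ed predicates into equi-join keys (Joink) and filters (JoinF)."""
    keys, filters = [], []
    for p in conjuncts:
        lt, rt = table_of(p.left), table_of(p.right)
        # An equality between columns of two different tables is a join key.
        if p.op == "=" and lt and rt and lt != rt:
            keys.append(p)
        else:
            filters.append(p)
    return keys, filters

cond = [Pred("a.key", "=", "b.key"), Pred("a.id", "=", "b.id"), Pred("a.key", ">", "1")]
keys, filters = split_join_condition(cond)
print(len(keys), len(filters))
```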



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127874213
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    For `a = c && rand(3) * b < 0`, are we able to push down the second predicate?



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127874260
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    The join type also matters. For example, are we able to push it to the left side for the right outer join?



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    No, I don't think that's true. I don't think we consider the position of the equi-join condition.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    What if we simply allow non-deterministic join conditions? Since we allow non-deterministic filter conditions, shouldn't we do the same for join conditions?



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127853936
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    IIUC, we should not have the same concern (e.g., `rand(1) > 0 && rand(11) < 0`) for joining keys. Joining keys have the form "x equals y", and we evaluate x and y for the two tables being joined. Could you show an example where we can't push down the joining keys?



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80376/



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #81054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81054/testReport)** for PR 18652 at commit [`793dac4`](https://github.com/apache/spark/commit/793dac4403926fb9f1421f4bbee59a8e9b82d7e8).



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127903537
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    +          // We can't push down non-deterministic conditions.
    +          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
    --- End diff --
    
    > Other join conditions preceding the equi-join condition could also affect it. It could be skipped if the preceding join conditions are false, right?
    
    No. We evaluate the joining keys first to find matching and non-matching rows, and then evaluate the other join conditions.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Alternatively, we can support only non-deterministic joining keys and not non-deterministic conditions. That is the case discussed on the dev mailing list.
    
    I think non-deterministic joining keys have clearer semantics, so the above concern doesn't apply to them.




[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    > As said in the previous discussion, we can't avoid a few issues regarding non-deterministic non-equi join conditions. We can simply allow them, but that leads to inconsistency across different join implementations. We can pull them out into a downstream project, but that possibly changes the number of calls. EnsureRequirements can change the call order.
    
    > Notice that those issues apply to non-equi join conditions; equi-join conditions are free from them.
    
    Why is equi-join free from these issues?



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    > Why is equi-join free from these issues?
    
    Assume the equi-join predicates have the form `t1.a = rand(t2.b) && t1.c = rand(t2.d)`. When we compare the equi-join keys `(t1.a, t1.c)` and `(rand(t2.b), rand(t2.d))`, we compare all of them and won't skip `t1.c = rand(t2.d)` if `t1.a = rand(t2.b)` is false. That means we can pull them out into a downstream project without worrying about changing the number of calls.
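
To make the argument concrete, here is a small simulation in plain Scala with no Spark dependency. `Nondet` is a hypothetical stand-in for a non-deterministic expression such as `rand`, and its counter only records how often it is invoked. A short-circuiting conjunction may skip the second call, whereas materializing both keys first, as a projection below the join would, invokes each expression exactly once per row. It models only the call count, not Spark's actual join operators.

```scala
object EquiKeyCallCount {
  // Hypothetical stand-in for a non-deterministic expression; counts its calls.
  final class Nondet(seed: Long) {
    private val rng = new scala.util.Random(seed)
    var calls: Int = 0
    // Returns x plus a random offset in [0, 2].
    def apply(x: Int): Int = { calls += 1; rng.nextInt(3) + x }
  }

  final case class Row(a: Int, b: Int, c: Int, d: Int)

  // Predicate style: `a = f(b) && c = f(d)` with short-circuit evaluation.
  def shortCircuitCalls(rows: Seq[Row], seed: Long): Int = {
    val f = new Nondet(seed)
    rows.foreach(r => r.a == f(r.b) && r.c == f(r.d))
    f.calls
  }

  // Key style: compute both right-hand keys for every row, then compare tuples.
  def projectedKeyCalls(rows: Seq[Row], seed: Long): Int = {
    val f = new Nondet(seed)
    rows.foreach { r =>
      val rightKeys = (f(r.b), f(r.d)) // both evaluated unconditionally
      (r.a, r.c) == rightKeys
    }
    f.calls
  }
}
```

When the first comparison can never succeed, `shortCircuitCalls` makes one call per row, while `projectedKeyCalls` always makes two.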
    
    
    
    




[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18652#discussion_r127893995

+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

Supporting only equi-joins does not sound reasonable here. The join condition can be any predicate.

How about adding a SQLConf flag to control it? When the flag is on, we simply push the condition down, whether or not the semantics stay the same, to be consistent with Hive. The flag is off by default.


[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by baibaichen <gi...@git.apache.org>.

Github user baibaichen commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Can we add a flag, e.g. `ignore-non-deterministic`, so that we can treat non-deterministic expressions as deterministic? I believe this is what Hive does.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    **[Test build #79673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79673/testReport)** for PR 18652 at commit [`03f4d9f`](https://github.com/apache/spark/commit/03f4d9fcba3be969a9ae0176ce7370e250033473).



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127898565
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    +          // We can't push down non-deterministic conditions.
    +          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
    --- End diff --
    
    From this line of discussion, it seems to me you are still treating joining keys and other join conditions as the same thing. However, pushing down non-deterministic joining keys doesn't actually change the join results, as I said above. I am not sure why that doesn't make sense.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    It seems to me that a non-deterministic predicate in a Join's condition is a corner use case. General (multiple) non-deterministic predicates in a Join's condition are probably even rarer.
    
    Another reason is that supporting multiple non-deterministic predicates in a Join's condition seems to inevitably require a much more complex query plan (such as many additional Joins). I'd like to avoid that if possible.
    




[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    @baibaichen If we do so, I think the result won't be the same as Hive's join result. Is it still useful?



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    I did not get your point. Could you give an example showing why the non-deterministic expressions are always evaluated in the same order, no matter which join type is chosen during physical planning?



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127965550
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    +          // We can't push down non-deterministic conditions.
    +          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
    --- End diff --
    
    cc @cloud-fan and @hvanhovell if you have more insights that can be shared with us about this part.



[GitHub] spark issue #18652: [WIP] Pull non-deterministic joining keys from Join oper...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79672/



[GitHub] spark issue #18652: [SPARK-21497][SQL] Pull non-deterministic equi join keys...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    @gatorsmile Actually, it is not rare for us to add a feature step by step in Spark SQL. That is not a reason to hold back this support. I think this change already helps this kind of workload a lot.
    
    As said in the previous discussion, we can't avoid a few issues regarding non-deterministic non-equi join conditions. We can simply allow them, but that leads to inconsistency across different join implementations. We can pull them out into a downstream project, but that possibly changes the number of calls. `EnsureRequirements` can change the call order.
    
    Notice that those issues apply to non-equi join conditions; equi-join conditions are free from them.
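
The inconsistency across join implementations can be illustrated with a simplified model in plain Scala. This is a sketch under stated assumptions, not Spark's physical operators: `CountingPredicate` is a hypothetical non-deterministic predicate, the "nested-loop" variant evaluates the condition `p() && l == r` for every pair of rows, and the "hash-join" variant matches on the equi-join key first and evaluates `p()` only for pairs whose keys are equal.

```scala
object JoinImplCallCount {
  // Hypothetical non-deterministic predicate that records how often it runs.
  final class CountingPredicate(seed: Long) {
    private val rng = new scala.util.Random(seed)
    var calls: Int = 0
    def apply(): Boolean = { calls += 1; rng.nextBoolean() }
  }

  // Nested-loop style: evaluate `p() && l == r` for every pair of rows.
  def nestedLoopCalls(left: Seq[Int], right: Seq[Int], seed: Long): Int = {
    val p = new CountingPredicate(seed)
    for (l <- left; r <- right) { p() && l == r }
    p.calls
  }

  // Hash-join style: match on the equi-join key first, then evaluate `p()`
  // only for pairs whose keys are equal.
  def hashJoinCalls(left: Seq[Int], right: Seq[Int], seed: Long): Int = {
    val p = new CountingPredicate(seed)
    val buildSide = right.groupBy(identity)
    for (l <- left; _ <- buildSide.getOrElse(l, Seq.empty)) { p() }
    p.calls
  }
}
```

With three left rows, two right rows, and a single matching key, the first variant invokes the predicate six times and the second only once, so a stateful generator would hand different draws to the same pair of rows.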




[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127895586
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    However, `rand(a)` and `rand(b)` could share the same state inside of `rand`. 
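
The concern can be modelled in a few lines of plain Scala. This is only a model of the scenario raised here, assuming a hypothetical generator shared by two expressions named `a` and `b`; it makes no claim about how Spark's `rand` is actually implemented. With shared state, the value each expression sees depends on the order in which the two are evaluated.

```scala
object SharedRandState {
  // Hypothetical model: every expression draws from ONE shared generator.
  // `order` lists expression names in evaluation order; the result maps
  // each name to the value it observed.
  def evaluate(order: Seq[String], seed: Long): Map[String, Double] = {
    val shared = new scala.util.Random(seed)
    order.map(name => name -> shared.nextDouble()).toMap
  }
}
```

Evaluating `a` before `b` and then `b` before `a` with the same seed swaps the values the two expressions observe.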



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127896217
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    +          // We can't push down non-deterministic conditions.
    +          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
    --- End diff --
    
    The whole thing does not make sense to me at all. Here, I think we are just trying to behave consistently with Hive, although this looks like a bug to me. We should really check how Hive supports it.



[GitHub] spark issue #18652: [SPARK-21497][SQL][WIP] Pull non-deterministic equi join...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79827/



[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18652#discussion_r127763010
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1912,6 +1913,26 @@ class Analyzer(
               nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
             }.copy(child = newChild)
     
    +      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
    +        j match {
    +          // We can push down non-deterministic joining keys.
    --- End diff --
    
    This is not always true. It depends on other factors, e.g., the position of the predicates and the type of join.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18652: [WIP] Pull non-deterministic conditions from Join operat...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18652
  
    cc @cloud-fan for more advice.

