Posted to reviews@spark.apache.org by zhonghaihua <gi...@git.apache.org> on 2015/12/03 11:21:01 UTC

[GitHub] spark pull request: pull out nondeterministic expressions from Joi...

GitHub user zhonghaihua opened a pull request:

    https://github.com/apache/spark/pull/10128

    pull out nondeterministic expressions from Join

    Currently, nondeterministic expressions are only allowed in `Project` or `Filter`, and they can be pulled out only when they are used in a `UnaryNode`.
    
    However, in many cases we use nondeterministic expressions on the join keys to avoid data skew. For example:
    ```
    select * 
    from tableA a 
    join 
    (select * from tableB) b 
    on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( (-rand() * 10000000 ) as string ) else a.brand_code end ))  = b.brand_code
    
    ```
    
    This PR introduces a mechanism to pull nondeterministic expressions out of `Join`, so that they can be used in a `Join` condition.
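
To illustrate why the query above replaces missing keys with random values, here is a small simulation, written as a sketch rather than as Spark code. The partition count, the CRC-based `partition_of` function, and the row counts are all made-up assumptions; `salt` mirrors the `CASE WHEN` expression only loosely (it skips `upper`).

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical shuffle partition count


def partition_of(key):
    """Stand-in for a hash partitioner: stable hash of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def salt(brand_code, rng):
    """Replace a null or empty key with a random negative number rendered as a string."""
    if brand_code is None or brand_code == "":
        return str(-rng.random() * 10000000)
    return brand_code


def partition_loads(keys):
    """Count how many rows land in each partition."""
    return Counter(partition_of(k) for k in keys)


rng = random.Random(33)
# Made-up data: 9,000 rows with a missing brand_code and 1,000 rows with real codes.
rows = [None] * 9000 + ["B%03d" % (i % 50) for i in range(1000)]

unsalted = partition_loads("" if k is None else k for k in rows)
salted = partition_loads(salt(k, rng) for k in rows)

print("largest partition without salting:", max(unsalted.values()))
print("largest partition with salting:", max(salted.values()))
```

Without salting, every row with a missing key hashes to the same partition; with salting, those rows are spread across partitions, and the random negative strings are presumably not expected to match any real `brand_code`.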

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhonghaihua/spark pulloutJoinNondeterministic

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10128.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10128
    
----
commit 6e166578a5c1a1faf260389509663ac8c71ec015
Author: zhonghaihua <79...@qq.com>
Date:   2015-11-30T07:44:49Z

    pull out nondeterministic expressions from Join

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

Posted by jeanlyn <gi...@git.apache.org>.

Github user jeanlyn commented on the pull request:

    https://github.com/apache/spark/pull/10128#issuecomment-169563822
  
    This is different from join selection: it only pulls the nondeterministic expressions in the join condition out to the left or right child. It does seem that it could reuse the code of `ExtractEquiJoinKeys`, though.



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10128#issuecomment-169231327
  
    @marmbrus Thanks for your suggestions. I think your idea solves the problem simply, but in some situations it does not seem suitable.
    For example:
    `Join(testRelation, testRelation2, Inner, Some(And(EqualTo(a, b), EqualTo(Rand(33) * c, d))))`. If `c` is an attribute belonging to `testRelation2`, I think `Rand(33)` is more appropriately pulled out to the right child of the `Join`; if it belongs to `testRelation`, it is more appropriately pulled out to the left child.
    
    When a nondeterministic expression is used together with a table attribute, I think the side it is pulled out to should depend on that attribute. What do you think?
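
The side-selection rule described in this comment can be sketched on a toy expression tree. This is not Catalyst code: the classes `Attr`, `Rand`, `Mul`, `EqualTo`, and `And`, the helper `placements`, and the attribute sets passed in are hypothetical stand-ins chosen only to mirror the example expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Rand:
    seed: int


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


@dataclass(frozen=True)
class EqualTo:
    left: object
    right: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


def children(e):
    return [] if isinstance(e, (Attr, Rand)) else [e.left, e.right]


def references(e):
    """Names of the attributes an expression reads."""
    if isinstance(e, Attr):
        return {e.name}
    refs = set()
    for child in children(e):
        refs |= references(child)
    return refs


def deterministic(e):
    return not isinstance(e, Rand) and all(deterministic(c) for c in children(e))


def side_for(operand, left_attrs, right_attrs):
    """Pick the join child whose attributes cover everything the operand reads."""
    refs = references(operand)
    if not refs:
        return "either"
    if refs <= left_attrs:
        return "left"
    if refs <= right_attrs:
        return "right"
    return "mixed"


def placements(cond, left_attrs, right_attrs):
    """List each nondeterministic operand of an EqualTo with the side it could be projected on."""
    if isinstance(cond, And):
        return (placements(cond.left, left_attrs, right_attrs)
                + placements(cond.right, left_attrs, right_attrs))
    if isinstance(cond, EqualTo):
        return [(op, side_for(op, left_attrs, right_attrs))
                for op in (cond.left, cond.right) if not deterministic(op)]
    return []


a, b, c, d = (Attr(n) for n in "abcd")
cond = And(EqualTo(a, b), EqualTo(Mul(Rand(33), c), d))

# Assumed schemas: first `c` belongs to the right relation, then to the left one.
print(placements(cond, {"a"}, {"b", "c", "d"}))
print(placements(cond, {"a", "c"}, {"b", "d"}))
```

The first call places `Rand(33) * c` on the right child and the second on the left, following whichever relation `c` belongs to.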



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10128#issuecomment-169093499
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48778/



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10128#issuecomment-169257014
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10128#issuecomment-169089482
  
    This seems like a reasonable thing to do, but the implementation seems unnecessarily complex. Why not just:
     - `transform` the condition, matching on nondeterministic subtrees with no references.
     - When you find one, create an alias and add it to an `ArrayBuffer`, replacing the subtree with `alias.toAttribute`.
     - If the array buffer is empty, return the same tree. Otherwise, add the two projections as you do now and use the transformed condition for the join.
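
The three steps above can be sketched on a toy plan model. This is not Catalyst code: every class and function below is a hypothetical stand-in, the alias name `_nondeterministic_N` is invented, and "the two projections" is read here as one `Project` under the join that evaluates the aliases plus one on top that restores the original output. Placing the inner `Project` on the left child is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Rand:
    seed: int


@dataclass(frozen=True)
class Binary:
    op: str  # "*", "=" or "and"
    left: object
    right: object


@dataclass(frozen=True)
class Alias:
    child: object
    name: str

    def to_attribute(self):
        return Attr(self.name)


@dataclass(frozen=True)
class Relation:
    output: Tuple[Attr, ...]


@dataclass(frozen=True)
class Project:
    exprs: Tuple[object, ...]
    child: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    condition: object


def references(e):
    """Names of the attributes an expression reads."""
    if isinstance(e, Attr):
        return {e.name}
    if isinstance(e, Binary):
        return references(e.left) | references(e.right)
    return set()


def deterministic(e):
    if isinstance(e, Binary):
        return deterministic(e.left) and deterministic(e.right)
    return not isinstance(e, Rand)


def output(plan):
    """Attributes produced by a plan node."""
    if isinstance(plan, Relation):
        return plan.output
    if isinstance(plan, Project):
        return tuple(e.to_attribute() if isinstance(e, Alias) else e for e in plan.exprs)
    return output(plan.left) + output(plan.right)


def pull_out(join):
    buffer = []  # plays the role of the ArrayBuffer of aliases

    def rewrite(e):
        # Match a nondeterministic subtree with no references and swap in an attribute.
        if not deterministic(e) and not references(e):
            alias = Alias(e, "_nondeterministic_%d" % len(buffer))
            buffer.append(alias)
            return alias.to_attribute()
        if isinstance(e, Binary):
            return Binary(e.op, rewrite(e.left), rewrite(e.right))
        return e

    condition = rewrite(join.condition)
    if not buffer:
        return join  # nothing matched: hand back the same tree
    # Assumption: the aliases are evaluated in a Project over the left child.
    left = Project(output(join.left) + tuple(buffer), join.left)
    # The outer Project restores the join's original output columns.
    return Project(output(join), Join(left, join.right, condition))


a, b, c, d = (Attr(n) for n in "abcd")
join = Join(Relation((a,)), Relation((b, c, d)),
            Binary("and", Binary("=", a, b), Binary("=", Binary("*", Rand(33), c), d)))
rewritten = pull_out(join)
print(rewritten.child.condition)
```

In this sketch only `Rand(33)` is replaced, because `Rand(33) * c` references `c`; the multiplication stays in the join condition and reads the new attribute instead.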



[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10128#issuecomment-161969421
  
    @cloud-fan Thanks for your advice. But, as @jeanlyn said, `Repartition` deals with all of the data, whereas this PR deals only with the join keys that cause data skew.
    In some situations we use this kind of expression in SQL to avoid data skew, so I think we should support it. What do you think?



[GitHub] spark issue #10128: [SPARK-12125][SQL] pull out nondeterministic expressions...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/10128
  
    Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.




[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10128#issuecomment-169093176
  
    **[Test build #48778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48778/consoleFull)** for PR 10128 at commit [`6e16657`](https://github.com/apache/spark/commit/6e166578a5c1a1faf260389509663ac8c71ec015).

