You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2014/07/16 05:42:11 UTC

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

GitHub user ueshin opened a pull request:

    https://github.com/apache/spark/pull/1428

    [SPARK-2509][SQL] Add optimization for Substring.

    `Substring` including `null` literal cases could be added to `NullPropagation`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-2509

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1428
    
----
commit d9eb85f902c5498f9b7b6c6035471f622dd7d629
Author: Takuya UESHIN <ue...@happy-camper.st>
Date:   2014-07-16T03:31:45Z

    Add Substring cases to NullPropagation.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by ueshin <gi...@git.apache.org>.

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1428#discussion_r14981973
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -171,6 +171,9 @@ object NullPropagation extends Rule[LogicalPlan] {
               case Literal(candidate, _) if candidate == v => true
               case _ => false
             })) => Literal(true, BooleanType)
    +      case e @ Substring(Literal(null, _), _, _) => Literal(null, e.dataType)
    --- End diff --
    
    @rxin, thank you for your comment.
    The rule `ConstantFolding` can be applied only if the expression is foldable, i.e. ALL of the child expressions are foldable, and the `NullPropagation` covers the case there exists at least one `null` literal regardless of foldabilities of the others.
    So I think the cases are needed.
    
    BTW, `Substring` can be foldable, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by ueshin <gi...@git.apache.org>.

Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49125499
  
    I'll open another issue about foldability of `Substring`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49205341
  
    It doesn't bring much benefit right now, but what we are doing here is creating patterns in NullPropagation to specify the semantics of each individual expression ... not very scalable in maintaining this code base. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by ueshin <gi...@git.apache.org>.

Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49125889
  
    @rxin I agree that we need the better way for `NullPropagation ` instead of too much pattern matching.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49210382
  
    I'm not sure I agree with that.  This is a pretty niche optimization not something fundamental about the expressions that is required for correct evaluation (and the first version of this code had a bunch of mistakes).  Having all of these rules in one place made it easier for me to realize that when doing the review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1428#discussion_r14980614
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -171,6 +171,9 @@ object NullPropagation extends Rule[LogicalPlan] {
               case Literal(candidate, _) if candidate == v => true
               case _ => false
             })) => Literal(true, BooleanType)
    +      case e @ Substring(Literal(null, _), _, _) => Literal(null, e.dataType)
    --- End diff --
    
    Is this not doable with constant folding? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49125623
  
    I've already merged this. I think we should re-evaluate in the future whether we should push some stuff into foldable itself, because as I see it a lot of the pattern matching rules in NullPropagation could potentially be implemented by each expression itself to be foldable. I think that might also lead to higher planner performance (although that's probably less important). More importantly, it does feel more natural to me that this stuff goes into each expression, rather than having a rule for each NullPropagation. 
    
    cc @marmbrus


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49125317
  
    Thanks. Merging this in master & branch-1.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49120808
  
    QA tests have started for PR 1428. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16705/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49125284
  
    QA results for PR 1428:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16705/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by ueshin <gi...@git.apache.org>.

Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49125413
  
    @rxin Ah, wait for a moment.
    What do you think about foldability of the `Substring` I mentioned above?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49211732
  
    I would be supportive of changing it to match my original proposal...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49211681
  
    That is exactly the argument I made when the folding logic was added. :) I suggested that we add `deterministic` instead and then have a rule that folds things that are `deterministic` and have no `references`.  `deterministic` I would argue is something fundamental about the expression itself, unlike these optimizations we are making.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1428#discussion_r14980864
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -171,6 +171,9 @@ object NullPropagation extends Rule[LogicalPlan] {
               case Literal(candidate, _) if candidate == v => true
               case _ => false
             })) => Literal(true, BooleanType)
    +      case e @ Substring(Literal(null, _), _, _) => Literal(null, e.dataType)
    --- End diff --
    
    Now I think about it more, this is probably ok to leave it here since in constant folding we don't really special case for null literals. However, I looked into other rules in NullPropagation and a lot of them don't really belong there... (has nothing to do with your PR though)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49188774
  
    You are right that this rule does more than null propagation now.  I'm not sure what a better name would be.  `DegenerateExpressionSimplification`?
    
    Regarding moving null propagation into the expressions, you could do it... but what would it look like?  You specify which of the children make the entire expression null if they are null?  Seems like a lot of refactoring for little benefit...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1428


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1428#issuecomment-49210870
  
    Well with that argument we shouldn't have any of the constant folding logic in expressions themselves either.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---