Posted to reviews@spark.apache.org by vidma <gi...@git.apache.org> on 2015/11/04 00:13:09 UTC

[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

GitHub user vidma opened a pull request:

    https://github.com/apache/spark/pull/9451

    WIP: Optimize Inner joins with skewed null values

    Draft of a first step in optimizing skew in joins. Skewed data is quite common, and for us so are lots of `null`s on either side of a join, especially with left joins, say when joining `dimensions` to `fact` tables.
    
    feel free to propose a better approach / add commits.
    
    Any ideas for an easy way to check whether the rule was already applied? After adding an `isNotNull` filter, `someAttribute.nullable` still returns `true`. I couldn't come up with anything better than simply running the rule in a separate batch of 1 iteration.
    
    @marmbrus  (as discussed at Spark Summit EU)
    
    ---
    
    Getting more serious: a draft for fighting skew in left joins is [rather simple with DataFrames](https://gist.github.com/vidma/98332db0f82e7e5b09e5). It solves the null skew and doesn't seem to add much overhead (though I tried it only on a subset of our joins, which used another abstraction of ours).
    
    However, so far this seems harder to express in optimizer rules:
    - it needs to add "fake" columns, and I have no idea yet how to do this so that the added column can be referred to in join conditions
    ```scala
    // Spray key: a random integer when any join key is null, 0 otherwise.
    // CaseWhen is given a flat Seq here: condition, value, else-value.
    val leftNullsSprayValue = CaseWhen(
      Seq(
        nullableJoinKeys(left).map(IsNull).reduceLeft(Or), // if any join keys are null
        Cast(Multiply(new Rand(), Literal(100000)), IntegerType),
        Literal(0) // otherwise
      ))
    // but how to add this column to left & right relations?
    // e.g. this fails, saying it's not `resolved`
    Alias(leftNullsSprayValue, "leftNullsSprayKey")()
    ```
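
For illustration only, the null-spraying idea described above can be mimicked in plain Scala, without Spark. This is a minimal sketch, not the code from the gist or the PR: `Fact`, `Dim`, `sprayKey`, `partitionOf`, `leftOuterJoin` and `maxPartitionSize` are hypothetical names, and the tuple `hashCode` merely stands in for a shuffle partitioner. Rows with a null key get a random extra key, so they spread across partitions, while the join result stays the same because a null key never matches anyway.

```scala
import scala.util.Random

object NullSpraySketch {
  // Hypothetical row shapes, used only for this illustration.
  final case class Fact(id: Int, dimKey: Option[Int]) // None models a SQL null
  final case class Dim(key: Int, name: String)

  // Extra join key: random when the real key is null, 0 otherwise.
  def sprayKey(f: Fact, rng: Random): Int =
    if (f.dimKey.isEmpty) rng.nextInt(100000) else 0

  // Stand-in for hash partitioning on the composite (key, sprayKey).
  def partitionOf(key: Option[Int], spray: Int, numPartitions: Int): Int =
    Math.floorMod((key, spray).hashCode, numPartitions)

  // Left outer join on (key, sprayKey). The right side always uses spray 0
  // and never has a null key, so null-keyed left rows match nothing.
  def leftOuterJoin(facts: Seq[Fact], dims: Seq[Dim], spray: Boolean,
                    rng: Random): Seq[(Fact, Option[Dim])] = {
    val right: Map[(Option[Int], Int), Dim] =
      dims.map(d => (Option(d.key), 0) -> d).toMap
    facts.map { f =>
      val salt = if (spray) sprayKey(f, rng) else 0
      f -> right.get((f.dimKey, salt))
    }
  }

  // Size of the largest left-side partition, a rough proxy for skew.
  // Assumes `facts` is non-empty.
  def maxPartitionSize(facts: Seq[Fact], numPartitions: Int, spray: Boolean): Int = {
    val rng = new Random(42)
    facts
      .groupBy { f =>
        val salt = if (spray) sprayKey(f, rng) else 0
        partitionOf(f.dimKey, salt, numPartitions)
      }
      .values.map(_.size).max
  }
}
```

With many null-keyed facts, `maxPartitionSize(facts, 8, spray = false)` counts all of them in a single partition, while `spray = true` spreads them out; `leftOuterJoin` returns the same rows either way.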


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vidma/spark feature/fight-skew-in-inner-join

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9451
    
----
commit 214deeae2d4c634536df0d9bd6c2ffcfc573ce7b
Author: vidmantas zemleris <vi...@vinted.com>
Date:   2015-11-03T22:38:08Z

    Optimize Inner joins with skewed null values

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-216623160
  
    @vidma I think this is already fixed in master (joins now have constraints, the constraints are turned into predicates, and the predicates are pushed down). Would you mind closing this PR?


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by vidma <gi...@git.apache.org>.
Github user vidma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9451#discussion_r44215375
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -904,3 +907,49 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] {
           a.copy(groupingExpressions = newGrouping)
       }
     }
    +
    +/**
    + * For an inner join - remove rows with null keys on both sides
    + */
    +object JoinSkewOptimizer extends Rule[LogicalPlan] with PredicateHelper {
    +  /**
    +   * Adds a null filter on given columns, if any
    +   */
    +  def addNullFilter(columns: AttributeSet, expr: LogicalPlan): LogicalPlan = {
    +    columns.map(IsNotNull(_))
    +      .reduceLeftOption(And)
    +      .map(Filter(_, expr))
    +      .getOrElse(expr)
    +  }
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case f@Join(left, right, joinType, joinCondition) =>
    +      // get "real" join conditions, which refer both left and right
    +      val joinConditionsOnBothRelations = joinCondition
    +        .map(splitConjunctivePredicates).getOrElse(Nil)
    +        .filter(_.isInstanceOf[EqualTo])
    +        .filter(cond => !canEvaluate(cond, left) && !canEvaluate(cond, right))
    +
    +      def nullableJoinKeys(leftOrRight: LogicalPlan) = {
    --- End diff --
    
    Any ideas on a better/simpler way to extract the left/right join key columns?
    
    maybe:
    ```scala
      // flatMap (not map) so the result is a flat list of key expressions
      joinConditionsOnBothRelations.flatMap { case EqualTo(leftColumn, rightColumn) =>
        // check columns on both sides of join condition,
        // and take the one which refers to the required join side
        Seq(leftColumn, rightColumn)
          .filter(canEvaluate(_, leftOrRight))
          .filter(_.nullable)
      }
    ```
    
    Is there a big difference between checking nullability on one side of the `EqualTo()` predicate vs. somehow extracting the equivalent attribute from the `left`/`right` LogicalPlans?
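
The key-extraction idea in this comment can be tried out on a toy model in plain Scala. This is a sketch under stated assumptions: `Expr`, `Attr`, `EqualTo` and `Relation` below are hypothetical stand-ins, not the Catalyst classes, and `canEvaluate` is modeled as "all referenced columns come from that side".

```scala
object JoinKeySketch {
  // Toy stand-ins for Catalyst expressions and plans (hypothetical, not Spark classes).
  sealed trait Expr {
    def references: Set[String]
    def nullable: Boolean
  }
  final case class Attr(name: String, nullable: Boolean) extends Expr {
    def references: Set[String] = Set(name)
  }
  final case class EqualTo(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
    def nullable: Boolean = left.nullable || right.nullable
  }
  final case class Relation(output: Set[String])

  // An expression can be evaluated on a plan if it only uses that plan's columns.
  def canEvaluate(e: Expr, plan: Relation): Boolean =
    e.references.subsetOf(plan.output)

  // Nullable join keys that belong to the given side of the join.
  def nullableJoinKeys(conditions: Seq[Expr], side: Relation): Seq[Expr] =
    conditions.flatMap {
      case EqualTo(l, r) => Seq(l, r).filter(canEvaluate(_, side)).filter(_.nullable)
      case _             => Nil
    }
}
```

In this model, the condition `a.k = b.id` with a nullable `a.k` yields `a.k` for the left relation and nothing for the right one.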


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168879127
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48694/
    Test FAILed.


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153529582
  
    How about we update the title to include the jira? Is https://issues.apache.org/jira/browse/SPARK-9372 the right one?


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by vidma <gi...@git.apache.org>.
Github user vidma commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154731768
  
    so any comments, guys? 
    @marmbrus ?


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154743562
  
    Merged build started.


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153529631
  
    Regarding the format of the title, we can do `[SPARK-xxxxx] [SQL] ...`


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168758764
  
    **[Test build #48673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48673/consoleFull)** for PR 9451 at commit [`d05a63d`](https://github.com/apache/spark/commit/d05a63d62a6cdea4949ba0f7efc9e021308070bd).


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9451


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by vidma <gi...@git.apache.org>.
Github user vidma commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168755578
  
    @marmbrus ping ;)


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153529828
  
    Merged build started.


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154821753
  
    **[Test build #45301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45301/consoleFull)** for PR 9451 at commit [`0fa27c4`](https://github.com/apache/spark/commit/0fa27c46b7e423ad43261f625ed6d9a007465882).


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168837211
  
    **[Test build #48694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48694/consoleFull)** for PR 9451 at commit [`1bcf9aa`](https://github.com/apache/spark/commit/1bcf9aacd1d9415e57aa416a259ca61b6534c08d).


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154743895
  
    **[Test build #45291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45291/consoleFull)** for PR 9451 at commit [`70d1fad`](https://github.com/apache/spark/commit/70d1fad8f110c556d51aacf16f08ce89db3b7ca1).


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168832426
  
    **[Test build #48673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48673/consoleFull)** for PR 9451 at commit [`d05a63d`](https://github.com/apache/spark/commit/d05a63d62a6cdea4949ba0f7efc9e021308070bd).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153529807
  
     Merged build triggered.


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153530818
  
    **[Test build #44974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/consoleFull)** for PR 9451 at commit [`9a6d9dc`](https://github.com/apache/spark/commit/9a6d9dc1fa097dd015b4bf83d9002a3f3d19d8ec).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by vidma <gi...@git.apache.org>.
Github user vidma commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-204493340
  
    @marmbrus pinging you to tag a Catalyst plan rewriting guru.
    
    For the current inner join PR, there's a flaky test in Python that I couldn't track down yet.
    
    For the more generic case (next PR), it doesn't seem easy to add a randomized spraying column as an extra left join key without losing table aliases.
    
    P.S. I'm in SF bay until Fri 8 Apr (better before Thu 9), so I could come over to chat to you guys live. 
    Cheers.


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153530822
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168963214
  
    **[Test build #48754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48754/consoleFull)** for PR 9451 at commit [`cd8ca34`](https://github.com/apache/spark/commit/cd8ca343019d1e7a2a43128ea070f9cda828dc81).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154795942
  
     Merged build triggered.


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153529478
  
    ok to test


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154821210
  
    Merged build started.


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154752200
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153687177
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153530824
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/
    Test FAILed.


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by vidma <gi...@git.apache.org>.
Github user vidma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9451#discussion_r44222500
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -904,3 +909,36 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] {
           a.copy(groupingExpressions = newGrouping)
       }
     }
    +
    +/**
    + * For an inner join - remove rows with null keys on both sides
    + */
    +object JoinSkewOptimizer extends Rule[LogicalPlan] with PredicateHelper {
    +  /**
    +   * Adds a null filter on given columns, if any
    +   */
    +  def addNullFilter(columns: Seq[Expression], expr: LogicalPlan): LogicalPlan = {
    +    columns.map(IsNotNull)
    +      .reduceLeftOption(And)
    +      .map(Filter(_, expr))
    +      .getOrElse(expr)
    +  }
    +
    +  private def hasNullableKeys(leftKeys: Seq[Expression], rightKeys: Seq[Expression]) = {
    +    leftKeys.exists(_.nullable) || rightKeys.exists(_.nullable)
    +  }
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case join @ Join(left, right, joinType, originalJoinCondition) =>
    +      join match {
    +        case ExtractEquiJoinKeys(_, leftKeys, rightKeys, _, _, _)
    +          if hasNullableKeys(leftKeys, rightKeys) && Seq(Inner, LeftSemi).contains(joinType) =>
    +            // add a non-null join-key filter on both sides of join
    +            val newLeft = addNullFilter(leftKeys.filter(_.nullable), left)
    +            val newRight = addNullFilter(rightKeys.filter(_.nullable), right)
    +            Join(newLeft, newRight, joinType, originalJoinCondition)
    --- End diff --
    
    In the `Inner | Semi` join case, the null filter could be added to `joinCondition` (instead of to the left/right relations), assuming that it'll be pushed down by subsequent optimizer rules.
    Which do you prefer?
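
Why the filter is safe for inner joins can be checked on a small plain-Scala model. This is only a sketch: `Row`, `innerJoin` and `dropNullKeys` are hypothetical names, `None` models a SQL null key, and nothing here says which placement of the filter Catalyst should prefer.

```scala
object NullFilterSketch {
  // (join key, payload); None models a SQL null key. Hypothetical row shape.
  type Row = (Option[Int], String)

  // Inner equi-join with SQL semantics: a null key never equals anything.
  def innerJoin(left: Seq[Row], right: Seq[Row]): Seq[(String, String)] =
    for {
      (lk, lv) <- left
      (rk, rv) <- right
      if lk.isDefined && lk == rk
    } yield (lv, rv)

  // The filter that such a rule would add below the join on each side.
  def dropNullKeys(rows: Seq[Row]): Seq[Row] = rows.filter(_._1.isDefined)
}
```

Joining `dropNullKeys(left)` with `dropNullKeys(right)` gives the same pairs as joining the unfiltered inputs, but the null-keyed rows never reach the join.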


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154832382
  
    **[Test build #45301 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45301/consoleFull)** for PR 9451 at commit [`0fa27c4`](https://github.com/apache/spark/commit/0fa27c46b7e423ad43261f625ed6d9a007465882).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168937118
  
    **[Test build #48754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48754/consoleFull)** for PR 9451 at commit [`cd8ca34`](https://github.com/apache/spark/commit/cd8ca343019d1e7a2a43128ea070f9cda828dc81).




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by vidma <gi...@git.apache.org>.
Github user vidma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9451#discussion_r44215458
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -904,3 +907,49 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] {
           a.copy(groupingExpressions = newGrouping)
       }
     }
    +
    +/**
    + * For an inner join - remove rows with null keys on both sides
    + */
    +object JoinSkewOptimizer extends Rule[LogicalPlan] with PredicateHelper {
    +  /**
    +   * Adds a null filter on given columns, if any
    +   */
    +  def addNullFilter(columns: AttributeSet, expr: LogicalPlan): LogicalPlan = {
    +    columns.map(IsNotNull(_))
    +      .reduceLeftOption(And)
    +      .map(Filter(_, expr))
    +      .getOrElse(expr)
    +  }
    +
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case f@Join(left, right, joinType, joinCondition) =>
    +      // get "real" join conditions, which refer both left and right
    +      val joinConditionsOnBothRelations = joinCondition
    +        .map(splitConjunctivePredicates).getOrElse(Nil)
    +        .filter(_.isInstanceOf[EqualTo])
    +        .filter(cond => !canEvaluate(cond, left) && !canEvaluate(cond, right))
    +
    +      def nullableJoinKeys(leftOrRight: LogicalPlan) = {
    --- End diff --
    
    so it seems [ExtractEquiJoinKeys](https://github.com/apache/spark/blob/67d468f8d9172569ec9846edc6432240547696dd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L167) is the right way to go.
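
As a rough illustration of what an equi-join-key extractor does, the sketch below uses a toy expression tree rather than Catalyst's classes. Every name in it (`Expr`, `Attr`, `splitConjuncts`, `equiJoinKeys`, `nullableLeftKeys`) is hypothetical, and it is not how Spark's `ExtractEquiJoinKeys` is implemented. It only shows the idea of splitting a conjunctive condition and keeping the equalities that relate one side of the join to the other.

```scala
object EquiJoinKeysSketch {
  // Toy expression tree; `relation` records which side an attribute comes from.
  sealed trait Expr
  final case class Attr(relation: String, name: String, nullable: Boolean) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr
  final case class GreaterThan(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Flatten nested ANDs into a list of conjuncts.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Keep only equalities between an attribute of `leftRel` and an attribute
  // of `rightRel`, normalised to (leftKey, rightKey) order.
  def equiJoinKeys(cond: Expr, leftRel: String, rightRel: String): Seq[(Attr, Attr)] =
    splitConjuncts(cond).collect {
      case EqualTo(l: Attr, r: Attr) if l.relation == leftRel && r.relation == rightRel => (l, r)
      case EqualTo(l: Attr, r: Attr) if l.relation == rightRel && r.relation == leftRel => (r, l)
    }

  // The left-side keys that would need an IsNotNull-style filter.
  def nullableLeftKeys(keys: Seq[(Attr, Attr)]): Seq[Attr] =
    keys.map(_._1).filter(_.nullable)

  def main(args: Array[String]): Unit = {
    val factKey = Attr("fact", "dim_id", nullable = true)
    val dimKey  = Attr("dim", "id", nullable = false)
    val cond    = And(EqualTo(dimKey, factKey), GreaterThan(factKey, Lit(0)))
    val keys    = equiJoinKeys(cond, "fact", "dim")
    println(keys)
    println(nullableLeftKeys(keys))
  }
}
```

Compared with filtering on `isInstanceOf[EqualTo]` and two `canEvaluate` checks, an extractor of this shape returns the left and right keys already separated, which is what the null filter needs.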




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168879046
  
    **[Test build #48694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48694/consoleFull)** for PR 9451 at commit [`1bcf9aa`](https://github.com/apache/spark/commit/1bcf9aacd1d9415e57aa416a259ca61b6534c08d).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153646072
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168832634
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48673/




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154832396
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168963367
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168879126
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153519356
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-155232148
  
    Hey, thanks for working on this!  I probably won't have time to look at this in depth until after the Spark 1.6 release (early December).




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168963368
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48754/




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154752182
  
    **[Test build #45291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45291/consoleFull)** for PR 9451 at commit [`70d1fad`](https://github.com/apache/spark/commit/70d1fad8f110c556d51aacf16f08ce89db3b7ca1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153647469
  
    **[Test build #45010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45010/consoleFull)** for PR 9451 at commit [`4490d9d`](https://github.com/apache/spark/commit/4490d9d214e80a4f232c38f6756588aea8e7b941).




[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153687182
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45010/




[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153530486
  
    **[Test build #44974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44974/consoleFull)** for PR 9451 at commit [`9a6d9dc`](https://github.com/apache/spark/commit/9a6d9dc1fa097dd015b4bf83d9002a3f3d19d8ec).




[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153687049
  
    **[Test build #45010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45010/consoleFull)** for PR 9451 at commit [`4490d9d`](https://github.com/apache/spark/commit/4490d9d214e80a4f232c38f6756588aea8e7b941).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-204542058
  
    @sameeragarwal 




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-154743558
  
     Merged build triggered.




[GitHub] spark pull request: WIP: Optimize Inner joins with skewed null val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-153646052
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9372] [SQL] Filter nulls in Inner joins...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9451#issuecomment-168832628
  
    Merged build finished. Test FAILed.

