You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2014/08/29 02:38:08 UTC

[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/2188

    [SPARK-2961][SQL] Use statistics to prune batches within cached partitions

    This PR is based on #1883 authored by @marmbrus. Key differences:
    
    1. Batch pruning instead of partition pruning
    
       When #1883 was authored, batched column buffer building (#1880) hadn't been introduced. This PR combines these two and provide partition batch level pruning, which leads to smaller memory footprints and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (partition number multiplies batch number per partition).
    
    1. More filters are supported
    
       Filter predicates consist of `=`, `<`, `<=`, `>`, `>=` and their conjunctions are supported. Disjunctions are not supported yet. In short, 
    
       ```sql
       SELECT * FROM t WHERE i > 0 AND i < 10
       ```
    
       is executed in a more smart way, while
    
       ```sql
       SELECT * FROM t WHERE i < 0 OR i > 10
       ```
    
       still scans the whole table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark in-mem-batch-pruning

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2188
    
----
commit 1a31cd519b999848f0b76c94a9e258b17d977cc6
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-16T09:49:34Z

    Merge branch 'inMemStats' into in-mem-batch-pruning
    
    Tried to combine Michael's partition pruning branch and the batched
    column buffer building. In this way, we actually got "batch pruning"
    rather than partition pruning.
    
    Conflicts:
    	sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
    
    Conflicts:
    	sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala

commit 520f41a0618a9529dfa86e2f748db405a4802f4f
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-18T02:44:38Z

    Added more predicate filters, fixed table scan stats for testing purposes

commit 77a2203ceb18546ddee2373af29036971bebd572
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-28T23:00:44Z

    Code cleanup, bugfix, and adding tests
    
    * Bugfix: gatherStats() should be called in NullableColumnBuilder,
      otherwise null values are skipped
    * Bugfix: fixed lower bound comparison in StringColumnStats and
      TimestampColumnStats

commit 46890319002be21061bcb1931ea9ccb43063fcf0
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-29T00:17:21Z

    More test cases

commit 04a956d6ed415b0ecfa717f7067231026ba25414
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-29T00:18:02Z

    Renamed PartitionSkippingSuite to PartitionBatchPruningSuite

commit ef057747fa3d9f85615c5650873abc374e21e16d
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-29T00:20:20Z

    Minor code cleanup

commit b4a82810b3567a9a7ba085750f6f03103ed83ae7
Author: Cheng Lian <li...@gmail.com>
Date:   2014-08-29T00:22:32Z

    Minor code cleanup

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53830460
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19441/consoleFull) for   PR 2188 at commit [`9bd234b`](https://github.com/apache/spark/commit/9bd234bed6c2550218a997fb20684a0c750372c1).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54269670
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19650/consoleFull) for   PR 2188 at commit [`d2a1d66`](https://github.com/apache/spark/commit/d2a1d6621a4b28bc9dc351204963659abab89d75).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54250488
  
    Thanks for taking this over!  A few minor comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54383012
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19705/consoleFull) for   PR 2188 at commit [`68cf019`](https://github.com/apache/spark/commit/68cf0197a9b3159ce93909d5d66c35e84507e802).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54373129
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19684/consoleFull) for   PR 2188 at commit [`4254f6c`](https://github.com/apache/spark/commit/4254f6c2e05e523c051442edcd183e2cefc32f7e).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54362420
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19684/consoleFull) for   PR 2188 at commit [`4254f6c`](https://github.com/apache/spark/commit/4254f6c2e05e523c051442edcd183e2cefc32f7e).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53954582
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19517/consoleFull) for   PR 2188 at commit [`270ca61`](https://github.com/apache/spark/commit/270ca6122be1e4a32c40edc66298b9bc7ed81a6c).
     * This patch **passes** unit tests.
     * This patch **does not** merge cleanly!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17032500
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/columnar/compression/BooleanBitSetSuite.scala ---
    @@ -31,7 +31,7 @@ class BooleanBitSetSuite extends FunSuite {
         // Tests encoder
         // -------------
     
    -    val builder = TestCompressibleColumnBuilder(new BooleanColumnStats, BOOLEAN, BooleanBitSet)
    +    val builder = TestCompressibleColumnBuilder(new NoopColumnStats, BOOLEAN, BooleanBitSet)
    --- End diff --
    
    laziness :) We should probably implement the statistics that make sense here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17032061
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala ---
    @@ -110,8 +199,26 @@ private[sql] case class InMemoryColumnarTableScan(
             attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
           }
     
    -      iterator
    -        .map(batch => requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
    +      val rows = iterator
    +        // Skip pruned batches
    +        .filter { cachedBatch =>
    +          if (!partitionFilter(cachedBatch.stats)) {
    --- End diff --
    
    Yeah exactly.  I think in general it might be a good idea to guard these kind of optimizations behind feature flags in case bugs are found in the future.  We can remove them after enough time has passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17030853
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala ---
    @@ -110,8 +199,26 @@ private[sql] case class InMemoryColumnarTableScan(
             attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
           }
     
    -      iterator
    -        .map(batch => requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
    +      val rows = iterator
    +        // Skip pruned batches
    +        .filter { cachedBatch =>
    +          if (!partitionFilter(cachedBatch.stats)) {
    --- End diff --
    
    How risky do you think this is?  I'm wondering if we should add a flag that turns off the partition skipping.  We don't even have to document it...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54382647
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53858144
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19463/consoleFull) for   PR 2188 at commit [`b6f9f6c`](https://github.com/apache/spark/commit/b6f9f6c1e91bada438f1de1440de2719899f88e6).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17031043
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala ---
    @@ -0,0 +1,41 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalyst.expressions
    +
    +/**
    + * Builds a map that is keyed by an Attribute's expression id. Using the expression id allows values
    + * to be looked up even when the attributes used differ cosmetically (i.e., the capitalization
    + * of the name, or the expected nullability).
    + */
    +object AttributeMap {
    --- End diff --
    
    Perhaps not in this PR, but we should search for all instances where we create Maps keyed by Attributes or expressionId and switch it to use this class instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17031879
  
    --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---
    @@ -30,31 +30,30 @@ import org.apache.spark.sql.hive.test.TestHive
      */
     class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
       // TODO: bundle in jar files... get from classpath
    -  private lazy val hiveQueryDir = TestHive.getHiveFile(
    --- End diff --
    
    It's changed by mistake. Will make it `private` again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2188


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53826091
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19441/consoleFull) for   PR 2188 at commit [`9bd234b`](https://github.com/apache/spark/commit/9bd234bed6c2550218a997fb20684a0c750372c1).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17030988
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala ---
    @@ -96,13 +103,95 @@ private[sql] case class InMemoryRelation(
     
     private[sql] case class InMemoryColumnarTableScan(
         attributes: Seq[Attribute],
    +    predicates: Seq[Expression],
         relation: InMemoryRelation)
       extends LeafNode {
     
       override def output: Seq[Attribute] = attributes
     
    +  import org.apache.spark.sql.catalyst.dsl.expressions._
    +  import org.apache.spark.sql.catalyst.expressions._
    --- End diff --
    
    These probably don't belong here...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53851776
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19463/consoleFull) for   PR 2188 at commit [`b6f9f6c`](https://github.com/apache/spark/commit/b6f9f6c1e91bada438f1de1440de2719899f88e6).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54391806
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19705/consoleFull) for   PR 2188 at commit [`68cf019`](https://github.com/apache/spark/commit/68cf0197a9b3159ce93909d5d66c35e84507e802).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class SparkListenerBlockManagerAdded(time: Long, blockManagerId: BlockManagerId, maxMem: Long)`
      * `case class SparkListenerBlockManagerRemoved(time: Long, blockManagerId: BlockManagerId)`
      * `case class SparkListenerApplicationStart(appName: String, appId: Option[String], time: Long,`
      * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53825264
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19439/consoleFull) for   PR 2188 at commit [`b4a8281`](https://github.com/apache/spark/commit/b4a82810b3567a9a7ba085750f6f03103ed83ae7).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54374907
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17032019
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala ---
    @@ -110,8 +199,26 @@ private[sql] case class InMemoryColumnarTableScan(
             attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
           }
     
    -      iterator
    -        .map(batch => requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
    +      val rows = iterator
    +        // Skip pruned batches
    +        .filter { cachedBatch =>
    +          if (!partitionFilter(cachedBatch.stats)) {
    --- End diff --
    
    Do you mean the risk of generating wrong query results? Hm... agree to add a flag. I did encounter several cases that I didn't covered at first when testing this branch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53952977
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19517/consoleFull) for   PR 2188 at commit [`270ca61`](https://github.com/apache/spark/commit/270ca6122be1e4a32c40edc66298b9bc7ed81a6c).
     * This patch **does not** merge cleanly!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53954660
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19518/consoleFull) for   PR 2188 at commit [`062c315`](https://github.com/apache/spark/commit/062c3157f0aaebe8bd6e92b15b0efa1ac1983041).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53825215
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19439/consoleFull) for   PR 2188 at commit [`b4a8281`](https://github.com/apache/spark/commit/b4a82810b3567a9a7ba085750f6f03103ed83ae7).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53825846
  
    Scala style check failed, although the code style is actually OK...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17030908
  
    --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---
    @@ -30,31 +30,30 @@ import org.apache.spark.sql.hive.test.TestHive
      */
     class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
       // TODO: bundle in jar files... get from classpath
    -  private lazy val hiveQueryDir = TestHive.getHiveFile(
    --- End diff --
    
    Why did you change the visibility here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-53953073
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19518/consoleFull) for   PR 2188 at commit [`062c315`](https://github.com/apache/spark/commit/062c3157f0aaebe8bd6e92b15b0efa1ac1983041).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54382609
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54262405
  
    @marmbrus Addressed all the comments, thanks for the detailed review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17032472
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/columnar/compression/BooleanBitSetSuite.scala ---
    @@ -31,7 +31,7 @@ class BooleanBitSetSuite extends FunSuite {
         // Tests encoder
         // -------------
     
    -    val builder = TestCompressibleColumnBuilder(new BooleanColumnStats, BOOLEAN, BooleanBitSet)
    +    val builder = TestCompressibleColumnBuilder(new NoopColumnStats, BOOLEAN, BooleanBitSet)
    --- End diff --
    
    @marmbrus Would you mind to elaborate a bit on why you changed `BooleanColumnStats` to `NoopColumnStats` in #1883?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2188#discussion_r17030789
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala ---
    @@ -96,13 +103,95 @@ private[sql] case class InMemoryRelation(
     
     private[sql] case class InMemoryColumnarTableScan(
         attributes: Seq[Attribute],
    +    predicates: Seq[Expression],
         relation: InMemoryRelation)
       extends LeafNode {
     
       override def output: Seq[Attribute] = attributes
     
    +  import org.apache.spark.sql.catalyst.dsl.expressions._
    +  import org.apache.spark.sql.catalyst.expressions._
    +
    +  val buildFilter: PartialFunction[Expression, Expression] = {
    --- End diff --
    
    Can we document the contract for these filters?  "Returns false iff it is impossible for the input expression to evaluate to true based on statistics collected about this partition"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2188#issuecomment-54262762
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19650/consoleFull) for   PR 2188 at commit [`d2a1d66`](https://github.com/apache/spark/commit/d2a1d6621a4b28bc9dc351204963659abab89d75).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org