Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2016/05/28 05:08:15 UTC

[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/13371

    [SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader

    ## What changes were proposed in this pull request?
    
    When we use the vectorized parquet reader, the base reader (i.e., `SpecificParquetRecordReaderBase`) retrieves pushed-down filters for RowGroups-level filtering, but we do not seem to actually set up the filters to be pushed down. This patch tries to set the pushed-down filters in the configuration used by the reader.
    
    ## How was this patch tested?
    Existing tests should pass.
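
As a rough illustration of what "setting the filters in the configuration for the reader" can look like, here is a minimal sketch that uses parquet-mr's `ParquetInputFormat.setFilterPredicate`. It is not taken from the patch itself: the column name `_1`, the bound `100`, and the object name are assumptions made for the example, and it needs parquet-hadoop and hadoop-common on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

// Hypothetical standalone sketch; the object name is illustrative only.
object SetParquetFilterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // Assumed predicate, equivalent to `_1 < 100` on an INT32 column named "_1".
    val predicate = FilterApi.lt(FilterApi.intColumn("_1"), Integer.valueOf(100))

    // Store the predicate in the Hadoop configuration so that a reader
    // initialized from this configuration can pick it up.
    ParquetInputFormat.setFilterPredicate(conf, predicate)

    // Read the filter back, the way a reader would when it initializes.
    val filter = ParquetInputFormat.getFilter(conf)
    println(filter)
  }
}
```

A reader that calls `ParquetInputFormat.getFilter` on the same configuration can then hand the result to row-group filtering; if nothing was set, there is no predicate to apply.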
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 vectorized-reader-push-down-filter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13371.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13371
    
----
commit 5687a3b5527817c809244305468bfe4968bedcec
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-05-28T05:03:06Z

    Try to push down filter at RowGroups level for parquet reader.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13371#discussion_r65301812
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -578,62 +583,6 @@ private[sql] object ParquetFileFormat extends Logging {
         }
       }
     
    -  /** This closure sets various Parquet configurations at both driver side and executor side. */
    -  private[parquet] def initializeLocalJobFunc(
    --- End diff --
    
    We no longer use these two functions.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    Yea. Since this one was closed by asfgit, I am not sure you can reopen it.
    
    On Wed, Jun 15, 2016 at 7:39 PM -0700, "Liang-Chi Hsieh" <no...@github.com> wrote:
    
    @yhuai ok. Do you mean I need to create a new PR for this?





[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222408282
  
    cc @nongli @liancheng 




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222293505
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59550/
    Test PASSed.




[GitHub] spark pull request #13371: [SPARK-15639][SQL] Try to push down filter at Row...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13371#discussion_r65503930
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -344,6 +344,11 @@ private[sql] class ParquetFileFormat
           val hadoopAttemptContext =
             new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId)
     
    +      // Try to push down filters when filter push-down is enabled.
    +      // Notice: This push-down is RowGroups level, not individual records.
    --- End diff --
    
    We use `org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups` in `SpecificParquetRecordReaderBase` to do filtering.
    
    The implementation of `RowGroupFilter` is at https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java.
    
    From [this](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L91), it looks like it does the filtering.
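
To make the idea of row-group-level filtering concrete, here is a small self-contained model of statistics-based pruning. It is only a sketch: `RowGroupStats`, `canDropLessThan`, and this `filterRowGroups` are hypothetical names for illustration and are not Parquet's real classes or the actual `RowGroupFilter` implementation.

```scala
// Simplified model of one row group's column statistics (hypothetical type,
// not a Parquet class).
final case class RowGroupStats(id: Int, min: Int, max: Int, rowCount: Long)

object RowGroupPruningSketch {
  // For the predicate `col < bound`, a row group can be skipped when even its
  // smallest value fails the predicate, i.e. when min >= bound.
  def canDropLessThan(stats: RowGroupStats, bound: Int): Boolean =
    stats.min >= bound

  // Keep only the row groups that might contain matching rows.
  def filterRowGroups(groups: Seq[RowGroupStats], bound: Int): Seq[RowGroupStats] =
    groups.filterNot(canDropLessThan(_, bound))

  def main(args: Array[String]): Unit = {
    val groups = Seq(
      RowGroupStats(id = 0, min = 101, max = 101, rowCount = 1000L), // every value is 101
      RowGroupStats(id = 1, min = 5, max = 250, rowCount = 1000L),   // may contain values < 100
      RowGroupStats(id = 2, min = 100, max = 300, rowCount = 1000L)  // smallest value is exactly 100
    )
    val kept = filterRowGroups(groups, bound = 100)
    println(kept.map(_.id)) // prints List(1)
  }
}
```

Note that this works on whole row groups only: a kept row group can still contain rows that fail the predicate, so the query's filter must still be evaluated on individual records afterwards.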





[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @yhuai Your step 3 may not work. We are going to filter the row groups for each parquet file to read in `VectorizedParquetRecordReader`. I think we don't do anything regarding creating splits?




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/13371
  
    It is a good idea to add it if parquet supports it (I have the impression that parquet does not support it, but maybe I am wrong). I think having benchmark results is good practice, so we can avoid hitting any obvious issue.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    Can you add results showing that there are skipped row groups with this change (and before this patch all row groups are loaded)?
    
    For those results, let's also put them in the description of the new PR.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    **[Test build #60246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60246/consoleFull)** for PR 13371 at commit [`077f7f8`](https://github.com/apache/spark/commit/077f7f8813a76d38c8a6d898ec54e401c91b6014).




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @liancheng Got it.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @liancheng Thanks! I didn't notice that. I will rerun the benchmark. I've re-submitted this PR at #13701.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    cc @rxin Can you also take a look at this? It has been waiting for a while too. Thanks!




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    cc @cloud-fan too.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    ping @yhuai I've addressed the comments. Please take a look again. Thanks! 




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222293479
  
    **[Test build #59550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59550/consoleFull)** for PR 13371 at commit [`5687a3b`](https://github.com/apache/spark/commit/5687a3b5527817c809244305468bfe4968bedcec).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    **[Test build #60256 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60256/consoleFull)** for PR 13371 at commit [`077f7f8`](https://github.com/apache/spark/commit/077f7f8813a76d38c8a6d898ec54e401c91b6014).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222290971
  
    retest this please.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    ping @yhuai @rxin @cloud-fan 




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13371#discussion_r65303008
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -578,62 +583,6 @@ private[sql] object ParquetFileFormat extends Logging {
         }
       }
     
    -  /** This closure sets various Parquet configurations at both driver side and executor side. */
    -  private[parquet] def initializeLocalJobFunc(
    --- End diff --
    
    Nor `overrideMinSplitSize()`. I guess the issue in https://issues.apache.org/jira/browse/SPARK-10143 would still be happening. I wanted to remove this after verifying it, but I have not had time to do so.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @liancheng 
    
    I reran the benchmark, this time excluding the time spent writing the Parquet file:
    
        test("Benchmark for Parquet") {
          // 2^18 = 262144 rows
          val N = 1 << 18
          withParquetTable((0 until N).map(i => (101, i)), "t") {
            val benchmark = new Benchmark("Parquet reader", N)
            // Only the query is timed; the table is written once, outside the benchmark case.
            benchmark.addCase("reading Parquet file", 10) { iter =>
              sql("SELECT _1 FROM t where t._1 < 100").collect()
            }
            benchmark.run()
          }
        }
    
    By default, `withParquetTable` runs the tests for both the vectorized and the non-vectorized readers. I only let it run the vectorized reader.
    
    After this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 3.13.0-57-generic
        Westmere E56xx/L56xx/X56xx (Nehalem-C)
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                            76 /   88          3.4         291.0       1.0X
    
    Before this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 3.13.0-57-generic
        Westmere E56xx/L56xx/X56xx (Nehalem-C)
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                            81 /   91          3.2         310.2       1.0X
    
    Next, I ran the benchmark for the non-pushdown case, using the same benchmark code but with the pushdown configuration disabled.
    
    After this patch:
    
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                            80 /   95          3.3         306.5       1.0X
    
    Before this patch:
    
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                            80 /  103          3.3         306.7       1.0X
    
    For the non-pushdown case, the results suggest that this patch doesn't affect the normal code path.
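
For readers who want the relative differences rather than the raw numbers, the following sketch computes them from the Per Row(ns) column of the tables above. The helper name `relativeReduction` is made up for this example, and since each table is a single benchmark run, the percentages should be read as indicative only.

```scala
object BenchmarkDelta {
  // Relative reduction in per-row time, as a fraction of the "before" value.
  def relativeReduction(beforeNs: Double, afterNs: Double): Double =
    (beforeNs - afterNs) / beforeNs

  def main(args: Array[String]): Unit = {
    // Per Row(ns) values copied from the benchmark tables in this comment.
    val pushdown = relativeReduction(beforeNs = 310.2, afterNs = 291.0)
    val noPushdown = relativeReduction(beforeNs = 306.7, afterNs = 306.5)

    println(f"pushdown enabled:  ${pushdown * 100}%.1f%% less time per row")
    println(f"pushdown disabled: ${noPushdown * 100}%.1f%% less time per row")
  }
}
```

With these inputs the pushdown case comes out at roughly 6% less time per row and the non-pushdown case at about 0.1%, which is in line with the observation that the normal code path is essentially unchanged.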




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/13371
  
    BTW, I can't see any reason not to add a row-group level filter for parquet.




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222290740
  
    **[Test build #59549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59549/consoleFull)** for PR 13371 at commit [`5687a3b`](https://github.com/apache/spark/commit/5687a3b5527817c809244305468bfe4968bedcec).




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    I just talked to @liancheng offline. I don't think we should've merged this until we have verified there is no performance regression, and we definitely shouldn't have merged this in 2.0.
    
    @liancheng can you revert this from both master and branch-2.0?
    
    @viirya can you run some parquet scan benchmark and make sure this does not result in perf regression?




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @yhuai We used to support row group level filter push-down before refactoring `HadoopFsRelation` into `FileFormat`, but lost it (by accident I guess) after the refactoring. So now we only have row group level filtering when the vectorized reader is not used, [see here][1].
    
    And yes, both `ParquetInputFormat` and `ParquetRecordReader` do row group level filtering.
    
    This LGTM. Thanks for fixing it! Merging to master and 2.0.
    
    [1]: https://github.com/apache/spark/blob/54f758b5fc60ecb0da6b191939a72ef5829be38c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L371-L378




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @yhuai I've run a simple benchmark as follows:
    
        test("Benchmark for Parquet") {
          val N = 1 << 20
    
          val benchmark = new Benchmark("Parquet reader", N)
          benchmark.addCase("reading Parquet file", 1) { iter =>
            withParquetTable((0 until N).map(i => (101, i)), "t") {
              sql("SELECT _1 FROM t where t._1 < 100").show()
            }
          }
          benchmark.run()
        }
    
    Before this patch:
    
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                        34225 / 34225          0.0       32639.5       1.0X
    
    After this patch:
    
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                        31350 / 31350          0.0       29897.6       1.0X
    





[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222293503
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    And once we have more data, it might make sense to merge this in 2.0!





[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @rxin One thing that needs explaining: because we have only one configuration to control filter push-down, it affects both row-based filter push-down and this row-group filter push-down.
    
    The benchmark I posted above was run against this patch and the master branch individually. It does include the time to write the Parquet data; I will change that. Can you confirm whether this kind of benchmark is enough?




[GitHub] spark pull request #13371: [SPARK-15639][SQL] Try to push down filter at Row...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13371#discussion_r66308569
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -344,6 +344,11 @@ private[sql] class ParquetFileFormat
           val hadoopAttemptContext =
             new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId)
     
    +      // Try to push down filters when filter push-down is enabled.
    +      // Notice: This push-down is RowGroups level, not individual records.
    --- End diff --
    
    Besides, we use the metadata in the merged schema to determine whether a field is optional (i.e. not present in all Parquet files), and therefore whether to push down a filter on it, but `FileSourceStrategy` currently ignores this information. Without the fix in this change, row-group-level filter push-down fails when the field does not exist in a Parquet file.
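    
    One rough way to picture the optional-field issue is the plain Scala model below. It is only a sketch and not Spark's actual code: `ColumnFilter`, `OptionalFieldPushDown`, and both helper methods are hypothetical names, and a file schema is simplified to a set of column names.
    
    ```scala
    // Hypothetical model: a pushed-down filter that reads a single column.
    final case class ColumnFilter(column: String, description: String)

    object OptionalFieldPushDown {
      // Columns that appear in every file; all others are "optional" in the merged schema.
      def columnsInAllFiles(fileSchemas: Seq[Set[String]]): Set[String] =
        if (fileSchemas.isEmpty) Set.empty[String] else fileSchemas.reduce(_ intersect _)

      // Only filters on columns present in all files are safe to hand to every file's reader.
      def safeToPushDown(
          filters: Seq[ColumnFilter],
          fileSchemas: Seq[Set[String]]): Seq[ColumnFilter] = {
        val required = columnsInAllFiles(fileSchemas)
        filters.filter(f => required.contains(f.column))
      }

      def main(args: Array[String]): Unit = {
        // Two files: column "b" exists only in the second one.
        val schemas = Seq(Set("a"), Set("a", "b"))
        val filters = Seq(ColumnFilter("a", "a < 100"), ColumnFilter("b", "b = 1"))
        // Only the filter on "a" survives; the one on "b" would hit a missing field.
        println(safeToPushDown(filters, schemas).map(_.description))
      }
    }
    ```
    
    In this model a filter on `b` is held back because the first file has no such column, which is the situation where the row-group filter would otherwise fail.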




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222291012
  
    **[Test build #59550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59550/consoleFull)** for PR 13371 at commit [`5687a3b`](https://github.com/apache/spark/commit/5687a3b5527817c809244305468bfe4968bedcec).




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @yhuai Parquet also does this filtering in ParquetRecordReader (https://github.com/apache/parquet-mr/blob/4b1ff8f4b9dfa0ccb064ef286cf2953bfb2c492d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java#L178) and ParquetReader (https://github.com/apache/parquet-mr/blob/4b1ff8f4b9dfa0ccb064ef286cf2953bfb2c492d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L145).
    
    In Spark, we also do this in SpecificParquetRecordReaderBase (https://github.com/apache/spark/blob/f958c1c3e292aba98d283637606890f353a9836c/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L103).
    
    I've tested it manually, but as you said it would be good to have a formal test case for it. I will try to add one later, probably when I am back at work in a few days.





[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    The description is updated.




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/13371
  
    Can you provide a test case that shows the problem? Also, can you provide benchmark results of the performance benefit?




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222408752
  
    also cc @yhuai 




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13371#discussion_r65301661
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -578,62 +583,6 @@ private[sql] object ParquetFileFormat extends Logging {
         }
       }
     
    -  /** This closure sets various Parquet configurations at both driver side and executor side. */
    -  private[parquet] def initializeLocalJobFunc(
    --- End diff --
    
    reason?




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13371#discussion_r65302925
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -344,6 +344,11 @@ private[sql] class ParquetFileFormat
           val hadoopAttemptContext =
             new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId)
     
    +      // Try to push down filters when filter push-down is enabled.
    +      // Notice: This push-down is RowGroups level, not individual records.
    --- End diff --
    
    Also, does parquet support row group level predicate evaluation?




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    retest this please.




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13371#discussion_r65301654
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -344,6 +344,11 @@ private[sql] class ParquetFileFormat
           val hadoopAttemptContext =
             new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId)
     
    +      // Try to push down filters when filter push-down is enabled.
    +      // Notice: This push-down is RowGroups level, not individual records.
    --- End diff --
    
    Can you provide link to the doc saying it is row group level?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    **[Test build #60256 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60256/consoleFull)** for PR 13371 at commit [`077f7f8`](https://github.com/apache/spark/commit/077f7f8813a76d38c8a6d898ec54e401c91b6014).




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    **[Test build #60246 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60246/consoleFull)** for PR 13371 at commit [`077f7f8`](https://github.com/apache/spark/commit/077f7f8813a76d38c8a6d898ec54e401c91b6014).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/13371
  
    @yhuai As you can see, this does not fix a bug or problem, so I think it might be hard to provide a test case for it. I will try to run a benchmark.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60246/
    Test FAILed.




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222290965
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59549/
    Test FAILed.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    To be clearer, please write a proper benchmark that reads data in a case where filter push-down is not useful, so we can check whether this regresses performance for the non-push-down case. Also make sure the benchmark does not include the time it takes to write the Parquet data.





[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @viirya One problem in your new benchmark code is that `1 << 50` is actually very small since it's an `Int`:
    
    ```
    scala> 1 << 50
    res0: Int = 262144
    ```
    
    Anyway, the intended value of `1L << 50`, which is 1PB, might be too large for such a microbenchmark :)
    
    So the generated Parquet file probably contains only a single row group. I guess that's why the numbers are quite close whether or not you enable row group filter push-down.
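    
    For reference, a minimal sketch of the `Int` versus `Long` shift behaviour described above (the object name `ShiftWidth` is made up for illustration):
    
    ```scala
    object ShiftWidth {
      // An Int shift uses only the low 5 bits of the shift count, so 50 acts as 50 % 32 = 18.
      val asInt: Int = 1 << 50
      // With a Long left operand the low 6 bits are used, so this is a real 50-bit shift.
      val asLong: Long = 1L << 50

      def main(args: Array[String]): Unit = {
        println(asInt)   // 262144
        println(asLong)  // 1125899906842624
      }
    }
    ```
    
    Writing the literal as `1L` is enough to get the 64-bit result; no other change to the expression is needed.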




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    It is not really a bug fix, because things still work without this filter push-down. It should be a performance fix. I should update the description.




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    Is this a bug fix or performance fix? Sorry I don't really understand after reading your description.





[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222290957
  
    **[Test build #59549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59549/consoleFull)** for PR 13371 at commit [`5687a3b`](https://github.com/apache/spark/commit/5687a3b5527817c809244305468bfe4968bedcec).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13371#discussion_r65302899
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -344,6 +344,11 @@ private[sql] class ParquetFileFormat
           val hadoopAttemptContext =
             new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId)
     
    +      // Try to push down filters when filter push-down is enabled.
    +      // Notice: This push-down is RowGroups level, not individual records.
    --- End diff --
    
    (it is not obvious that this is just for row group level)




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @yhuai ok. Do you mean I need to create a new PR for this?




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    Reverted from master and branch-2.0.
    
    @viirya For the benchmark, there are two things:
    
    1. The benchmark also counts the Parquet file writing, so the real number should be much better than the posted one.
    2. We should also benchmark cases where no filters are pushed down, to verify that this patch doesn't affect the normal code path.
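    
    As a sketch of the shape such a benchmark could take, the plain Scala harness below does its setup once outside the timed region and then times a selective and a non-selective read separately. It is an assumption-laden stand-in: `ReadOnlyTiming` and `bestOfMs` are hypothetical names, and an in-memory array takes the place of the Parquet file, so it shows only the structure and says nothing about Spark's actual numbers.
    
    ```scala
    object ReadOnlyTiming {
      // Runs `body` `iters` times and returns the best wall-clock time in milliseconds.
      def bestOfMs(iters: Int)(body: => Any): Double = {
        require(iters > 0, "iters must be positive")
        (1 to iters).map { _ =>
          val start = System.nanoTime()
          body
          (System.nanoTime() - start) / 1e6
        }.min
      }

      def main(args: Array[String]): Unit = {
        // Setup runs once, outside any timed region (this stands in for writing the file).
        val data: Array[Int] = Array.tabulate(1 << 20)(identity)

        // Case 1: a selective predicate, where skipping data could help.
        val selective = bestOfMs(5) { data.count(_ < 100) }
        // Case 2: a predicate matching every row, so nothing can be skipped.
        val nonSelective = bestOfMs(5) { data.count(_ >= 0) }

        println(f"selective: $selective%.3f ms, non-selective: $nonSelective%.3f ms")
      }
    }
    ```
    
    Running both cases before and after the patch would give the four numbers needed to judge the benefit and any regression.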




[GitHub] spark pull request #13371: [SPARK-15639][SQL] Try to push down filter at Row...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13371




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    ping @yhuai again




[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    @viirya I took a look at Parquet's code. It seems Parquet only evaluates row group level filters when generating splits (https://github.com/apache/parquet-mr/blob/apache-parquet-1.7.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L673). With FileSourceStrategy in Spark, I am not sure we will actually filter out unneeded row groups as expected. Can you take a look? Also, it would be great if you could add a test to make sure that we actually skip unneeded row groups. This test can be created as follows.
    
    1. We first write a Parquet file containing multiple row groups. Let's say that there is a column `c` and that those row groups have disjoint ranges of `c`'s values.
    2. We write a query with a filter on `c` and make sure that this query needs only a subset of the row groups.
    3. We verify that we only create splits for the needed row groups.



[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13371
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60256/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15639][SQL] Try to push down filter at ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13371#issuecomment-222290964
  
    Merged build finished. Test FAILed.

