Posted to reviews@spark.apache.org by saucam <gi...@git.apache.org> on 2014/10/18 15:20:47 UTC

[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

GitHub user saucam opened a pull request:

    https://github.com/apache/spark/pull/2841

    SPARK-3968 Use parquet-mr filter2 api in spark sql

    The parquet-mr project has introduced a new filter API (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire row groups based on column statistics such as min/max.
    We can leverage that to further improve the performance of queries with filters.
    The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself.
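
    As a rough illustration of the row-group elimination idea described above (a self-contained sketch, not the actual parquet-mr filter2 API; `RowGroupStats` and `filterRowGroups` are hypothetical names):

    ```scala
    // Hypothetical sketch: a row group whose [min, max] statistics cannot
    // contain a matching value can be dropped without reading its data.
    case class RowGroupStats(min: Int, max: Int)

    // Keep only the row groups that could satisfy the predicate `col > threshold`.
    def filterRowGroups(groups: Seq[RowGroupStats], threshold: Int): Seq[RowGroupStats] =
      groups.filter(_.max > threshold)

    object RowGroupDemo extends App {
      val groups = Seq(RowGroupStats(0, 50), RowGroupStats(90, 200))
      // Only the second group can possibly hold a value greater than 100,
      // so the first one is skipped entirely.
      println(filterRowGroups(groups, 100))
    }
    ```

    The real API performs the same kind of check per column chunk, using the statistics stored in the Parquet footer.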

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2841.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2841
    
----
commit 0bc54d3c0384606886a3e297cfdc85f44e2ea7e9
Author: Yash Datta <ya...@guavus.com>
Date:   2014-10-16T12:15:21Z

    SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr

commit 6adaf9b33f07ed247643ddf14020db69ece9d4fb
Author: Yash Datta <ya...@guavus.com>
Date:   2014-10-18T13:15:45Z

    SPARK-3968: Not pushing the filters in case of OPTIONAL columns

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60609282
  
      [Test build #22299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22299/consoleFull) for   PR 2841 at commit [`5f4530e`](https://github.com/apache/spark/commit/5f4530e8990df6ec93db2a68f872f7bcf807a093).
     * This patch merges cleanly.



Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60713180
  
    Added a unit test for filter pushdown on optional column



Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60491281
  
    There are style violations.  Run `sbt/sbt scalastyle` to see what they are.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60591473
  
      [Test build #22297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22297/consoleFull) for   PR 2841 at commit [`f304667`](https://github.com/apache/spark/commit/f30466707e97b06f47b506699f7e9a043b1d7366).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19383897
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -460,27 +481,43 @@ private[parquet] class FilteringParquetRowInputFormat
           val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
           val parquetMetaData = footer.getParquetMetadata
           val blocks = parquetMetaData.getBlocks
    -      var blockLocations: Array[BlockLocation] = null
    -      if (!cacheMetadata) {
    -        blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    -      } else {
    -        blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    -          def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    -        })
    -      }
    -      splits.addAll(
    -        generateSplits.invoke(
    -          null,
    -          blocks,
    -          blockLocations,
    -          status,
    -          parquetMetaData.getFileMetaData,
    -          readContext.getRequestedSchema.toString,
    -          readContext.getReadSupportMetadata,
    -          minSplitSize,
    -          maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +      totalRowGroups = totalRowGroups + blocks.size
    +      val filteredBlocks = RowGroupFilter.filterRowGroups(
    +        filter,
    +        blocks,
    +        parquetMetaData.getFileMetaData.getSchema)
    +      rowGroupsDropped = rowGroupsDropped + (blocks.size - filteredBlocks.size)
    +      
    +      if(!filteredBlocks.isEmpty){
    +          var blockLocations: Array[BlockLocation] = null
    +          if (!cacheMetadata) {
    +            blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    +          } else {
    +            blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    +              def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    +            })
    +          }
    +          splits.addAll(
    +            generateSplits.invoke(
    +              null,
    +              filteredBlocks,
    +              blockLocations,
    +              status,
    +              readContext.getRequestedSchema.toString,
    +              readContext.getReadSupportMetadata,
    +              minSplitSize,
    +              maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +        }
         }
     
    +    if(rowGroupsDropped > 0 && totalRowGroups > 0){
    --- End diff --
    
    Space between `if` and `(`.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60718584
  
      [Test build #22339 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22339/consoleFull) for   PR 2841 at commit [`515df1c`](https://github.com/apache/spark/commit/515df1cc70fec520828f45fb4b52a2643a6e7dcb).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-61358979
  
    Thanks, closed it and assigned it to you.



Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60490511
  
    Can someone help me find where the build is failing? I can make a distribution without errors, and dev/lint-scala also ran successfully ...



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60898409
  
      [Test build #22448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22448/consoleFull) for   PR 2841 at commit [`8282ba0`](https://github.com/apache/spark/commit/8282ba0752951fc9d7a7593c6f6c89815bb92b3a).
     * This patch merges cleanly.



Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19383900
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -460,27 +481,43 @@ private[parquet] class FilteringParquetRowInputFormat
           val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
           val parquetMetaData = footer.getParquetMetadata
           val blocks = parquetMetaData.getBlocks
    -      var blockLocations: Array[BlockLocation] = null
    -      if (!cacheMetadata) {
    -        blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    -      } else {
    -        blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    -          def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    -        })
    -      }
    -      splits.addAll(
    -        generateSplits.invoke(
    -          null,
    -          blocks,
    -          blockLocations,
    -          status,
    -          parquetMetaData.getFileMetaData,
    -          readContext.getRequestedSchema.toString,
    -          readContext.getReadSupportMetadata,
    -          minSplitSize,
    -          maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +      totalRowGroups = totalRowGroups + blocks.size
    +      val filteredBlocks = RowGroupFilter.filterRowGroups(
    +        filter,
    +        blocks,
    +        parquetMetaData.getFileMetaData.getSchema)
    +      rowGroupsDropped = rowGroupsDropped + (blocks.size - filteredBlocks.size)
    +      
    +      if(!filteredBlocks.isEmpty){
    --- End diff --
    
    Space after `if`



Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-61025530
  
    Alright, thanks for adding the tests. Let's get Michael's feedback on the metadata thing, I don't fully understand it. I guess it allows tasks to query different subsets of the metadata in parallel?



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60475599
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22208/
    Test FAILed.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60511211
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22238/
    Test PASSed.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60505589
  
      [Test build #22227 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22227/consoleFull) for   PR 2841 at commit [`48163c3`](https://github.com/apache/spark/commit/48163c3192bb1b9ff47ad0fdd191e3ab8cb6099a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60475598
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22208/consoleFull) for   PR 2841 at commit [`e8bf033`](https://github.com/apache/spark/commit/e8bf0332e205ac66342e46367c2e6c397931fea2).
     * This patch **fails Scala style tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.



Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19498939
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala ---
    @@ -209,25 +221,25 @@ private[sql] object ParquetFilters {
               case _ => None
             }
           }
    -      case p @ EqualTo(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ EqualTo(left: Literal, right: NamedExpression) =>
             Some(createEqualityFilter(right.name, left, p))
    -      case p @ EqualTo(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ EqualTo(left: NamedExpression, right: Literal) =>
             Some(createEqualityFilter(left.name, right, p))
    -      case p @ LessThan(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ LessThan(left: Literal, right: NamedExpression) =>
             Some(createLessThanFilter(right.name, left, p))
    -      case p @ LessThan(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ LessThan(left: NamedExpression, right: Literal) =>
             Some(createLessThanFilter(left.name, right, p))
    -      case p @ LessThanOrEqual(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ LessThanOrEqual(left: Literal, right: NamedExpression) =>
             Some(createLessThanOrEqualFilter(right.name, left, p))
    -      case p @ LessThanOrEqual(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ LessThanOrEqual(left: NamedExpression, right: Literal) =>
             Some(createLessThanOrEqualFilter(left.name, right, p))
    -      case p @ GreaterThan(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ GreaterThan(left: Literal, right: NamedExpression) =>
             Some(createGreaterThanFilter(right.name, left, p))
    -      case p @ GreaterThan(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ GreaterThan(left: NamedExpression, right: Literal) =>
             Some(createGreaterThanFilter(left.name, right, p))
    -      case p @ GreaterThanOrEqual(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ GreaterThanOrEqual(left: Literal, right: NamedExpression) =>
             Some(createGreaterThanOrEqualFilter(right.name, left, p))
    -      case p @ GreaterThanOrEqual(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ GreaterThanOrEqual(left: NamedExpression, right: Literal) =>
    --- End diff --
    
    It would be nice to add some tests with `== null` or `>= null` as well, to make sure these filters work.
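    
    A test along the lines suggested would exercise SQL's three-valued logic for null comparisons. A minimal sketch of that behaviour (with a hypothetical `sqlEq` helper, not the actual filter code):
    
    ```scala
    // Hypothetical illustration: comparing anything with NULL yields
    // "unknown" (None here), and an unknown predicate must not let a
    // pushed-down filter treat the row as an ordinary match.
    def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
      for (x <- a; y <- b) yield x == y
    
    // sqlEq(Some(1), Some(1)) == Some(true)
    // sqlEq(Some(1), None)    == None   (unknown, so the row is not kept)
    ```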



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-59612787
  
    Can one of the admins verify this patch?



Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-61193438
  
    Cool, that makes sense. Anyway if this looks good to you, Michael, you should merge it.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60475816
  
      [Test build #22211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22211/consoleFull) for   PR 2841 at commit [`cc7b596`](https://github.com/apache/spark/commit/cc7b596a16c2d2999e5f012ca382c0daa724c743).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60475818
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22211/
    Test FAILed.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60475518
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22208/consoleFull) for   PR 2841 at commit [`e8bf033`](https://github.com/apache/spark/commit/e8bf0332e205ac66342e46367c2e6c397931fea2).
     * This patch **does not merge cleanly**.



Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-61052070
  
    Yes. In the task-side metadata strategy, the tasks are spawned first, and each task then reads the metadata and drops the row groups. So if I am using YARN and the data is huge (so the metadata is large), the memory will be consumed on the YARN side. In the client-side metadata strategy, by contrast, the whole of the metadata is read on a single node before the tasks are spawned.
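    
    The difference can be sketched roughly as follows (hypothetical names, not the actual Spark or parquet-mr classes):
    
    ```scala
    // Hypothetical sketch of where footer metadata is materialized.
    case class Footer(file: String, rowGroupCount: Int)
    
    // Client-side strategy: the driver reads every footer up front,
    // so all metadata is held on a single node before tasks are spawned.
    def planClientSide(files: Seq[String], readFooter: String => Footer): Seq[Footer] =
      files.map(readFooter)
    
    // Task-side strategy: each task reads only the footer of the file it
    // was assigned, spreading the metadata cost across the cluster.
    def planTaskSide(assignedFile: String, readFooter: String => Footer): Footer =
      readFooter(assignedFile)
    ```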



Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19383955
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -106,7 +110,13 @@ case class ParquetTableScan(
         // "spark.sql.hints.parquetFilterPushdown" to false inside SparkConf.
         if (columnPruningPred.length > 0 &&
           sc.conf.getBoolean(ParquetFilters.PARQUET_FILTER_PUSHDOWN_ENABLED, true)) {
    -      ParquetFilters.serializeFilterExpressions(columnPruningPred, conf)
    +      
    +      // Set this in configuration of ParquetInputFormat, needed for RowGroupFiltering
    +      val filter: Filter = ParquetFilters.createRecordFilter(columnPruningPred)
    +      if (filter!= null){
    --- End diff --
    
    Space after `filter`



Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60591479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22297/
    Test PASSed.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60505590
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22227/
    Test FAILed.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60625677
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22299/
    Test PASSed.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19521851
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -460,29 +515,85 @@ private[parquet] class FilteringParquetRowInputFormat
           val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
           val parquetMetaData = footer.getParquetMetadata
           val blocks = parquetMetaData.getBlocks
    -      var blockLocations: Array[BlockLocation] = null
    -      if (!cacheMetadata) {
    -        blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    -      } else {
    -        blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    -          def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    -        })
    -      }
    +      totalRowGroups = totalRowGroups + blocks.size
    +      val filteredBlocks = RowGroupFilter.filterRowGroups(
    +        filter,
    +        blocks,
    +        parquetMetaData.getFileMetaData.getSchema)
    +      rowGroupsDropped = rowGroupsDropped + (blocks.size - filteredBlocks.size)
    +      
    +      if (!filteredBlocks.isEmpty){
    +          var blockLocations: Array[BlockLocation] = null
    +          if (!cacheMetadata) {
    +            blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    +          } else {
    +            blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    +              def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    +            })
    +          }
    +          splits.addAll(
    +            generateSplits.invoke(
    +              null,
    +              filteredBlocks,
    +              blockLocations,
    +              status,
    +              readContext.getRequestedSchema.toString,
    +              readContext.getReadSupportMetadata,
    +              minSplitSize,
    +              maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +        }
    +    }
    +
    +    if (rowGroupsDropped > 0 && totalRowGroups > 0){
    +      val percentDropped = ((rowGroupsDropped/totalRowGroups.toDouble) * 100).toInt
    +      logInfo(s"Dropping $rowGroupsDropped row groups that do not pass filter predicate "
    +        + s"($percentDropped %) !")
    +    }
    +    else {
    +      logInfo("There were no row groups that could be dropped due to filter predicates")
    +    }
    +    splits
    +
    +  }
    +
    +  def getTaskSideSplits(
    +    configuration: Configuration,
    +    footers: JList[Footer],
    +    maxSplitSize: JLong,
    +    minSplitSize: JLong,
    +    readContext: ReadContext): JList[ParquetInputSplit] = {
    +
    +    val splits = mutable.ArrayBuffer.empty[ParquetInputSplit]
    +    
    +    // Ugly hack, stuck with it until PR:
    +    // https://github.com/apache/incubator-parquet-mr/pull/17
    +    // is resolved
    +    val generateSplits =
    +      Class.forName("parquet.hadoop.TaskSideMetadataSplitStrategy")
    +       .getDeclaredMethods.find(_.getName == "generateTaskSideMDSplits").getOrElse(
    +         sys.error(
    +           s"Failed to reflectively invoke TaskSideMetadataSplitStrategy.generateTaskSideMDSplits"))
    +    generateSplits.setAccessible(true)
    + 
    +    for (footer <- footers) {
    +      val file = footer.getFile
    +      val fs = file.getFileSystem(configuration)
    +      val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
    +      val blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
           splits.addAll(
    --- End diff --
    
    I am not sure I follow here. If globalMetaData is non-null, it means there is data, so splits will be generated by the generateSplits function, which takes the HDFS blocks to process as an argument. The generated splits are then added to the list of splits returned later:
    
    splits.addAll(
                generateSplits.invoke(
                  null,
                  filteredBlocks,
                  blockLocations,
                  status,
                  readContext.getRequestedSchema.toString,
                  readContext.getReadSupportMetadata,
                  minSplitSize,
                  maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
            }
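
    For context, the accounting around that call can be sketched as below. This is a hedged, self-contained stand-in, not the PR's code: a (min, max) pair of column statistics represents a Parquet BlockMetaData, and the stand-in predicate mimics what parquet-mr's RowGroupFilter does with real statistics.

    ```scala
    // Simplified sketch of the row-group filtering and drop accounting in the
    // diff above. The (min, max) pairs stand in for BlockMetaData statistics;
    // the real code delegates to parquet-mr's RowGroupFilter.filterRowGroups.
    object RowGroupStats {
      // Keep only row groups whose max statistic could satisfy `value >= lowerBound`.
      def filterRowGroups(blocks: Seq[(Long, Long)], lowerBound: Long): Seq[(Long, Long)] =
        blocks.filter { case (_, max) => max >= lowerBound }

      // Integer percentage of dropped row groups, as logged in the diff.
      def percentDropped(dropped: Long, total: Long): Int =
        if (total > 0) ((dropped / total.toDouble) * 100).toInt else 0

      def main(args: Array[String]): Unit = {
        val blocks = Seq((0L, 10L), (20L, 30L), (40L, 50L), (60L, 70L))
        val kept = filterRowGroups(blocks, lowerBound = 35L)
        val dropped = blocks.size - kept.size
        println(s"Dropping $dropped row groups (${percentDropped(dropped, blocks.size)} %)")
      }
    }
    ```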




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60908920
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22448/
    Test PASSed.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19522625
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala ---
    @@ -209,25 +221,25 @@ private[sql] object ParquetFilters {
               case _ => None
             }
           }
    -      case p @ EqualTo(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ EqualTo(left: Literal, right: NamedExpression) =>
             Some(createEqualityFilter(right.name, left, p))
    -      case p @ EqualTo(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ EqualTo(left: NamedExpression, right: Literal) =>
             Some(createEqualityFilter(left.name, right, p))
    -      case p @ LessThan(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ LessThan(left: Literal, right: NamedExpression) =>
             Some(createLessThanFilter(right.name, left, p))
    -      case p @ LessThan(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ LessThan(left: NamedExpression, right: Literal) =>
             Some(createLessThanFilter(left.name, right, p))
    -      case p @ LessThanOrEqual(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ LessThanOrEqual(left: Literal, right: NamedExpression) =>
             Some(createLessThanOrEqualFilter(right.name, left, p))
    -      case p @ LessThanOrEqual(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ LessThanOrEqual(left: NamedExpression, right: Literal) =>
             Some(createLessThanOrEqualFilter(left.name, right, p))
    -      case p @ GreaterThan(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ GreaterThan(left: Literal, right: NamedExpression) =>
             Some(createGreaterThanFilter(right.name, left, p))
    -      case p @ GreaterThan(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ GreaterThan(left: NamedExpression, right: Literal) =>
             Some(createGreaterThanFilter(left.name, right, p))
    -      case p @ GreaterThanOrEqual(left: Literal, right: NamedExpression) if !right.nullable =>
    +      case p @ GreaterThanOrEqual(left: Literal, right: NamedExpression) =>
             Some(createGreaterThanOrEqualFilter(right.name, left, p))
    -      case p @ GreaterThanOrEqual(left: NamedExpression, right: Literal) if !left.nullable =>
    +      case p @ GreaterThanOrEqual(left: NamedExpression, right: Literal) =>
    --- End diff --
    
    The nullable option is set when the field is optional, so I am adding tests for those cases.
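
    The shape of this diff (dropping the `!nullable` guards so pushdown also applies to OPTIONAL columns) can be sketched with a toy expression ADT. This is an illustrative stand-in, not Spark's Catalyst classes:

    ```scala
    // Toy sketch of the pattern-match change above. The pre-filter2 code guarded
    // each case with `!col.nullable`, skipping OPTIONAL columns entirely; with
    // filter2, the match is unconditional because null handling is done via
    // row-group statistics inside parquet-mr.
    sealed trait Expr
    case class Lit(value: Int) extends Expr
    case class Col(name: String, nullable: Boolean) extends Expr
    case class Gt(left: Expr, right: Expr) extends Expr

    object ToyFilterPushdown {
      // filter2-era behavior: push down regardless of nullability.
      def pushable(e: Expr): Option[String] = e match {
        case Gt(Col(name, _), Lit(_)) => Some(name)
        case Gt(Lit(_), Col(name, _)) => Some(name)
        case _                        => None
      }
    }
    ```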




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60625667
  
      [Test build #22299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22299/consoleFull) for   PR 2841 at commit [`5f4530e`](https://github.com/apache/spark/commit/5f4530e8990df6ec93db2a68f872f7bcf807a093).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-61156595
  
    Also looks like they are switching the default in parquet to task side: https://issues.apache.org/jira/browse/PARQUET-122




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19410910
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -450,8 +465,14 @@ private[parquet] class FilteringParquetRowInputFormat
             globalMetaData.getKeyValueMetaData(),
             globalMetaData.getSchema()))
     
    +    val filter: Filter = ParquetInputFormat.getFilter(configuration)
    +    var rowGroupsDropped :Long = 0
    +    var totalRowGroups :Long  = 0
    +
    +    // Ugly hack, stuck with it until mentioned PR is resolved
         val generateSplits =
    -      classOf[ParquetInputFormat[_]].getDeclaredMethods.find(_.getName == "generateSplits").get
    +      Class.forName("parquet.hadoop.ClientSideMetadataSplitStrategy")
    --- End diff --
    
    Thanks for pointing this out. I had chosen ClientSideMetadataSplitStrategy because it was the default one (and had left the other as a TODO), but we should indeed use TaskSideMetadataSplitStrategy. For now I have included the code for both strategies and made TaskSideMetadataSplitStrategy the default. Should we create a config for this?




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60475724
  
      [Test build #22211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22211/consoleFull) for   PR 2841 at commit [`cc7b596`](https://github.com/apache/spark/commit/cc7b596a16c2d2999e5f012ca382c0daa724c743).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19383980
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -450,8 +465,14 @@ private[parquet] class FilteringParquetRowInputFormat
             globalMetaData.getKeyValueMetaData(),
             globalMetaData.getSchema()))
     
    +    val filter: Filter = ParquetInputFormat.getFilter(configuration)
    +    var rowGroupsDropped :Long = 0
    +    var totalRowGroups :Long  = 0
    +
    +    // Ugly hack, stuck with it until mentioned PR is resolved
         val generateSplits =
    -      classOf[ParquetInputFormat[_]].getDeclaredMethods.find(_.getName == "generateSplits").get
    +      Class.forName("parquet.hadoop.ClientSideMetadataSplitStrategy")
    --- End diff --
    
    Is there a reason we are using this instead of `TaskSideMetadataSplitStrategy`?  Should we at least be reflecting on the interface so that we can choose between the two?
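
    The reflective lookup pattern under discussion can be sketched as follows. This is a hedged stand-in: `java.lang.StringBuilder` substitutes for the private `parquet.hadoop` split-strategy classes, since the point is only the lookup-by-name-with-clear-failure idiom.

    ```scala
    import java.lang.reflect.Method

    // Sketch of reflecting on a method that is not part of the public API:
    // find it by name on the named class and fail with a descriptive error if
    // a library upgrade removes or renames it (the risk with such hacks).
    object ReflectiveLookup {
      def findMethod(className: String, methodName: String): Method =
        Class.forName(className)
          .getDeclaredMethods
          .find(_.getName == methodName)
          .getOrElse(sys.error(s"Failed to reflectively look up $className.$methodName"))

      def main(args: Array[String]): Unit = {
        val m = findMethod("java.lang.StringBuilder", "reverse")
        println(m.getName)
      }
    }
    ```

    In the PR, `setAccessible(true)` is additionally needed because the Parquet split-strategy methods are package-private.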




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60504390
  
      [Test build #22227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22227/consoleFull) for   PR 2841 at commit [`48163c3`](https://github.com/apache/spark/commit/48163c3192bb1b9ff47ad0fdd191e3ab8cb6099a).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60718587
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22339/
    Test PASSed.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19383917
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -450,8 +465,14 @@ private[parquet] class FilteringParquetRowInputFormat
             globalMetaData.getKeyValueMetaData(),
             globalMetaData.getSchema()))
     
    +    val filter: Filter = ParquetInputFormat.getFilter(configuration)
    +    var rowGroupsDropped :Long = 0
    --- End diff --
    
    Should be `var rowGroupsDropped: Long = 0`




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-61193622
  
    Thanks!  Merged to master.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60820609
  
    Hey @saucam, I took a look at this too because I had tried upgrading to Parquet 1.6 in a different branch to use decimals. Made a few comments above.
    
    Apart from that, this PR doesn't seem to have any tests for the new functionality (in particular, skipping row groups) or for the methods that build up Parquet filters. Do you mind adding some of those?




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19383923
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -450,8 +465,14 @@ private[parquet] class FilteringParquetRowInputFormat
             globalMetaData.getKeyValueMetaData(),
             globalMetaData.getSchema()))
     
    +    val filter: Filter = ParquetInputFormat.getFilter(configuration)
    +    var rowGroupsDropped :Long = 0
    +    var totalRowGroups :Long  = 0
    +
    +    // Ugly hack, stuck with it until mentioned PR is resolved
    --- End diff --
    
    Can you include the link here?




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60014052
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19383935
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -450,8 +465,14 @@ private[parquet] class FilteringParquetRowInputFormat
             globalMetaData.getKeyValueMetaData(),
             globalMetaData.getSchema()))
     
    +    val filter: Filter = ParquetInputFormat.getFilter(configuration)
    +    var rowGroupsDropped :Long = 0
    +    var totalRowGroups :Long  = 0
    +
    +    // Ugly hack, stuck with it until mentioned PR is resolved
         val generateSplits =
    -      classOf[ParquetInputFormat[_]].getDeclaredMethods.find(_.getName == "generateSplits").get
    +      Class.forName("parquet.hadoop.ClientSideMetadataSplitStrategy")
    +       .getDeclaredMethods.find(_.getName == "generateSplits").get
    --- End diff --
    
    `.getOrElse(sys.error(s"Failed to reflectively invoke ClientSideMetadataSplitStrategy.generateSplits")` instead of `get`




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19498615
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala ---
    @@ -255,6 +256,10 @@ private[sql] object ParquetTestData {
           record.add(3, i.toLong)
           record.add(4, i.toFloat + 0.5f)
           record.add(5, i.toDouble + 0.5d)
    +      if (i % 4 == 0){
    --- End diff --
    
    Space before { here too




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19498480
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -423,10 +436,8 @@ private[parquet] class FilteringParquetRowInputFormat
           configuration: Configuration,
           footers: JList[Footer]): JList[ParquetInputSplit] = {
     
    -    import FilteringParquetRowInputFormat.blockLocationCache
    -
    -    val cacheMetadata = configuration.getBoolean(SQLConf.PARQUET_CACHE_METADATA, false)
    -
    +    // Use task side strategy by default
    +    val taskSideMetaData = configuration.getBoolean(ParquetInputFormat.TASK_SIDE_METADATA, true)
    --- End diff --
    
    Is this true by default in Parquet? It seems to be false in the implementation of isTaskSideMetadata in https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19498392
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -435,23 +446,67 @@ private[parquet] class FilteringParquetRowInputFormat
             s"maxSplitSize or minSplitSie should not be negative: maxSplitSize = $maxSplitSize;" +
               s" minSplitSize = $minSplitSize")
         }
    -    val splits = mutable.ArrayBuffer.empty[ParquetInputSplit]
    +    
    +    // Uses strict type checking by default
         val getGlobalMetaData =
           classOf[ParquetFileWriter].getDeclaredMethod("getGlobalMetaData", classOf[JList[Footer]])
         getGlobalMetaData.setAccessible(true)
         val globalMetaData = getGlobalMetaData.invoke(null, footers).asInstanceOf[GlobalMetaData]
    -    // if parquet file is empty, return empty splits.
    -    if (globalMetaData == null) {
    -      return splits
    -    }
     
    +    if (globalMetaData == null){
    --- End diff --
    
    Small style thing: you should have spaces before { in here and a few other places (search the diff for `){`).




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19521440
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -423,10 +436,8 @@ private[parquet] class FilteringParquetRowInputFormat
           configuration: Configuration,
           footers: JList[Footer]): JList[ParquetInputSplit] = {
     
    -    import FilteringParquetRowInputFormat.blockLocationCache
    -
    -    val cacheMetadata = configuration.getBoolean(SQLConf.PARQUET_CACHE_METADATA, false)
    -
    +    // Use task side strategy by default
    +    val taskSideMetaData = configuration.getBoolean(ParquetInputFormat.TASK_SIDE_METADATA, true)
    --- End diff --
    
    yes, in parquet the client-side metadata strategy is the default one, but as I mentioned earlier, don't we want the task-side strategy by default, given its inherent advantage of lower memory usage on the client side?
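    A tiny illustrative sketch (Python, standing in for Hadoop's `Configuration.getBoolean`; the key string is assumed to be what `ParquetInputFormat.TASK_SIDE_METADATA` resolves to in parquet-mr) of why passing `true` as the fallback makes task-side the effective default whenever the user has not set the flag:

    ```python
    def get_boolean(conf, key, default):
        # Mirrors Configuration.getBoolean: return the caller-supplied
        # default when the key is unset, otherwise parse the stored string.
        value = conf.get(key)
        return default if value is None else value.strip().lower() == "true"

    conf = {}  # user did not configure anything
    task_side = get_boolean(conf, "parquet.task.side.metadata", True)
    print(task_side)  # True: the task-side strategy wins by default

    # An explicit opt-out still works, so the behavior stays configurable.
    conf["parquet.task.side.metadata"] = "false"
    print(get_boolean(conf, "parquet.task.side.metadata", True))  # False
    ```

    The point is that the library-level default and Spark's effective default can differ without any library change: Spark simply supplies its own fallback value at the lookup site.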




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-61140219
  
    I talked to some Twitter people and they were pretty excited about the task-side metadata reading, because with big datasets they were seeing lots of OOMs before even starting the jobs.  It could also be pretty good for S3 if we can avoid doing so much work serially on the driver.  That said, it seems like it would make features like merging multiple unique schemas impossible, and it's newer / less tested.  So we'll want to be able to configure this easily.





[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60491361
  
    Spaces are required before comments:
    ```
    [error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:113:6: space.after.comment.start.message
    [error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:114:6: space.after.comment.start.message
    [error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:379:4: space.after.comment.start.message
    [error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:380:4: space.after.comment.start.message
    [error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:472:4: space.after.comment.start.message
    ```




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60908914
  
      [Test build #22448 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22448/consoleFull) for   PR 2841 at commit [`8282ba0`](https://github.com/apache/spark/commit/8282ba0752951fc9d7a7593c6f6c89815bb92b3a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-59726357
  
    This PR also fixes:
    
    https://issues.apache.org/jira/browse/SPARK-1847




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60511209
  
      [Test build #22238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22238/consoleFull) for   PR 2841 at commit [`ec53e92`](https://github.com/apache/spark/commit/ec53e929584770492e4d1eea7004dda91caa667f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19498603
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -460,29 +515,85 @@ private[parquet] class FilteringParquetRowInputFormat
           val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
           val parquetMetaData = footer.getParquetMetadata
           val blocks = parquetMetaData.getBlocks
    -      var blockLocations: Array[BlockLocation] = null
    -      if (!cacheMetadata) {
    -        blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    -      } else {
    -        blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    -          def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    -        })
    -      }
    +      totalRowGroups = totalRowGroups + blocks.size
    +      val filteredBlocks = RowGroupFilter.filterRowGroups(
    +        filter,
    +        blocks,
    +        parquetMetaData.getFileMetaData.getSchema)
    +      rowGroupsDropped = rowGroupsDropped + (blocks.size - filteredBlocks.size)
    +      
    +      if (!filteredBlocks.isEmpty){
    +          var blockLocations: Array[BlockLocation] = null
    +          if (!cacheMetadata) {
    +            blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    +          } else {
    +            blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    +              def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    +            })
    +          }
    +          splits.addAll(
    +            generateSplits.invoke(
    +              null,
    +              filteredBlocks,
    +              blockLocations,
    +              status,
    +              readContext.getRequestedSchema.toString,
    +              readContext.getReadSupportMetadata,
    +              minSplitSize,
    +              maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +        }
    +    }
    +
    +    if (rowGroupsDropped > 0 && totalRowGroups > 0){
    +      val percentDropped = ((rowGroupsDropped/totalRowGroups.toDouble) * 100).toInt
    +      logInfo(s"Dropping $rowGroupsDropped row groups that do not pass filter predicate "
    +        + s"($percentDropped %) !")
    +    }
    +    else {
    +      logInfo("There were no row groups that could be dropped due to filter predicates")
    +    }
    +    splits
    +
    +  }
    +
    +  def getTaskSideSplits(
    +    configuration: Configuration,
    +    footers: JList[Footer],
    +    maxSplitSize: JLong,
    +    minSplitSize: JLong,
    +    readContext: ReadContext): JList[ParquetInputSplit] = {
    +
    +    val splits = mutable.ArrayBuffer.empty[ParquetInputSplit]
    +    
    +    // Ugly hack, stuck with it until PR:
    +    // https://github.com/apache/incubator-parquet-mr/pull/17
    +    // is resolved
    +    val generateSplits =
    +      Class.forName("parquet.hadoop.TaskSideMetadataSplitStrategy")
    +       .getDeclaredMethods.find(_.getName == "generateTaskSideMDSplits").getOrElse(
    +         sys.error(
    +           s"Failed to reflectively invoke TaskSideMetadataSplitStrategy.generateTaskSideMDSplits"))
    +    generateSplits.setAccessible(true)
    + 
    +    for (footer <- footers) {
    +      val file = footer.getFile
    +      val fs = file.getFileSystem(configuration)
    +      val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
    +      val blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
           splits.addAll(
    --- End diff --
    
    You may need to add an `if splits.size() > 0` here, but I'm not sure. In the client-side metadata strategy it was not possible to call generateSplits with an empty list (it threw an exception in that case). 
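    The concern can be sketched with a simple guard (illustrative Python; `generate_splits` here is a hypothetical stand-in for the reflectively invoked parquet-mr method, which in the client-side strategy threw on an empty list):

    ```python
    def generate_splits(blocks):
        # Stand-in for the real split generator: assume it rejects
        # an empty block list, as the client-side strategy did.
        if not blocks:
            raise ValueError("cannot generate splits from an empty block list")
        return [f"split-{i}" for i in range(len(blocks))]

    def add_splits(splits, filtered_blocks):
        # Guard: skip the call entirely when row-group filtering
        # dropped every block for this file.
        if filtered_blocks:
            splits.extend(generate_splits(filtered_blocks))
        return splits

    print(add_splits([], []))            # no surviving blocks -> no call, no error
    print(add_splits([], ["b0", "b1"]))  # two surviving blocks -> two splits
    ```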




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60880619
  
    Hi @mateiz, thanks for the suggestions. Just a few points:
    1. We need to decide which strategy to keep as the default (currently we use a different one than the default in the parquet library).
    2. This PR adds support for the filter2 api from the parquet library, which supports row group filtering. Do we need to add tests for that? Such test cases already exist in the parquet library:
    
    https://github.com/Parquet/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/test/java/parquet/filter2/compat/TestRowGroupFilter.java
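    For readers unfamiliar with what those parquet tests cover: conceptually, row-group elimination with column min/max statistics works like the sketch below (illustrative Python only; the real logic lives in parquet-mr's `RowGroupFilter` and statistics classes):

    ```python
    def keep_row_group(stats, column, op, value):
        # stats maps column name -> (min, max) for one row group.
        lo, hi = stats[column]
        if op == "eq":
            return lo <= value <= hi   # value could occur in this group
        if op == "gt":
            return hi > value          # some row may exceed value
        if op == "lt":
            return lo < value          # some row may be below value
        return True                    # unknown predicate: conservatively keep

    groups = [{"x": (0, 9)}, {"x": (10, 19)}, {"x": (20, 29)}]
    survivors = [g for g in groups if keep_row_group(g, "x", "gt", 15)]
    print(len(survivors))  # the (0, 9) group is dropped without reading any rows
    ```

    The filter is conservative: it may keep a group that contains no matching rows, but it must never drop a group that could contain one.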




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60509148
  
      [Test build #22238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22238/consoleFull) for   PR 2841 at commit [`ec53e92`](https://github.com/apache/spark/commit/ec53e929584770492e4d1eea7004dda91caa667f).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60533736
  
    This is pretty awesome.  Thanks for working on it!  I made a few comments.
    
    @mateiz you had also looked into upgrading our version of parquet.  Is this missing anything?




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60475393
  
    ok to test




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19580900
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -460,29 +515,85 @@ private[parquet] class FilteringParquetRowInputFormat
           val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
           val parquetMetaData = footer.getParquetMetadata
           val blocks = parquetMetaData.getBlocks
    -      var blockLocations: Array[BlockLocation] = null
    -      if (!cacheMetadata) {
    -        blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    -      } else {
    -        blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    -          def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    -        })
    -      }
    +      totalRowGroups = totalRowGroups + blocks.size
    +      val filteredBlocks = RowGroupFilter.filterRowGroups(
    +        filter,
    +        blocks,
    +        parquetMetaData.getFileMetaData.getSchema)
    +      rowGroupsDropped = rowGroupsDropped + (blocks.size - filteredBlocks.size)
    +      
    +      if (!filteredBlocks.isEmpty){
    +          var blockLocations: Array[BlockLocation] = null
    +          if (!cacheMetadata) {
    +            blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    +          } else {
    +            blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    +              def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    +            })
    +          }
    +          splits.addAll(
    +            generateSplits.invoke(
    +              null,
    +              filteredBlocks,
    +              blockLocations,
    +              status,
    +              readContext.getRequestedSchema.toString,
    +              readContext.getReadSupportMetadata,
    +              minSplitSize,
    +              maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +        }
    +    }
    +
    +    if (rowGroupsDropped > 0 && totalRowGroups > 0){
    +      val percentDropped = ((rowGroupsDropped/totalRowGroups.toDouble) * 100).toInt
    +      logInfo(s"Dropping $rowGroupsDropped row groups that do not pass filter predicate "
    +        + s"($percentDropped %) !")
    +    }
    +    else {
    +      logInfo("There were no row groups that could be dropped due to filter predicates")
    +    }
    +    splits
    +
    +  }
    +
    +  def getTaskSideSplits(
    +    configuration: Configuration,
    +    footers: JList[Footer],
    +    maxSplitSize: JLong,
    +    minSplitSize: JLong,
    +    readContext: ReadContext): JList[ParquetInputSplit] = {
    +
    +    val splits = mutable.ArrayBuffer.empty[ParquetInputSplit]
    +    
    +    // Ugly hack, stuck with it until PR:
    +    // https://github.com/apache/incubator-parquet-mr/pull/17
    +    // is resolved
    +    val generateSplits =
    +      Class.forName("parquet.hadoop.TaskSideMetadataSplitStrategy")
    +       .getDeclaredMethods.find(_.getName == "generateTaskSideMDSplits").getOrElse(
    +         sys.error(
    +           s"Failed to reflectively invoke TaskSideMetadataSplitStrategy.generateTaskSideMDSplits"))
    +    generateSplits.setAccessible(true)
    + 
    +    for (footer <- footers) {
    +      val file = footer.getFile
    +      val fs = file.getFileSystem(configuration)
    +      val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
    +      val blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
           splits.addAll(
    --- End diff --
    
    Okay, sounds fine as long as you've tested it.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19383949
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -460,27 +481,43 @@ private[parquet] class FilteringParquetRowInputFormat
           val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
           val parquetMetaData = footer.getParquetMetadata
           val blocks = parquetMetaData.getBlocks
    -      var blockLocations: Array[BlockLocation] = null
    -      if (!cacheMetadata) {
    -        blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    -      } else {
    -        blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    -          def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    -        })
    -      }
    -      splits.addAll(
    -        generateSplits.invoke(
    -          null,
    -          blocks,
    -          blockLocations,
    -          status,
    -          parquetMetaData.getFileMetaData,
    -          readContext.getRequestedSchema.toString,
    -          readContext.getReadSupportMetadata,
    -          minSplitSize,
    -          maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +      totalRowGroups = totalRowGroups + blocks.size
    +      val filteredBlocks = RowGroupFilter.filterRowGroups(
    +        filter,
    +        blocks,
    +        parquetMetaData.getFileMetaData.getSchema)
    +      rowGroupsDropped = rowGroupsDropped + (blocks.size - filteredBlocks.size)
    +      
    +      if(!filteredBlocks.isEmpty){
    +          var blockLocations: Array[BlockLocation] = null
    +          if (!cacheMetadata) {
    +            blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    +          } else {
    +            blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    +              def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    +            })
    +          }
    +          splits.addAll(
    +            generateSplits.invoke(
    +              null,
    +              filteredBlocks,
    +              blockLocations,
    +              status,
    +              readContext.getRequestedSchema.toString,
    +              readContext.getReadSupportMetadata,
    +              minSplitSize,
    +              maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +        }
         }
     
    +    if(rowGroupsDropped > 0 && totalRowGroups > 0){
    +      val percentDropped = ((rowGroupsDropped/totalRowGroups.toDouble) * 100).toInt
    +      logInfo("Dropping " + rowGroupsDropped + " row groups that do not pass filter predicate! ("
    --- End diff --
    
    Use interpolation:
    
    `s"Dropping $rowGroupsDropped row groups that do not pass filter predicate! ..."`




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2841




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60713375
  
      [Test build #22339 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22339/consoleFull) for   PR 2841 at commit [`515df1c`](https://github.com/apache/spark/commit/515df1cc70fec520828f45fb4b52a2643a6e7dcb).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2841#discussion_r19580913
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala ---
    @@ -460,29 +515,85 @@ private[parquet] class FilteringParquetRowInputFormat
           val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
           val parquetMetaData = footer.getParquetMetadata
           val blocks = parquetMetaData.getBlocks
    -      var blockLocations: Array[BlockLocation] = null
    -      if (!cacheMetadata) {
    -        blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    -      } else {
    -        blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    -          def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    -        })
    -      }
    +      totalRowGroups = totalRowGroups + blocks.size
    +      val filteredBlocks = RowGroupFilter.filterRowGroups(
    +        filter,
    +        blocks,
    +        parquetMetaData.getFileMetaData.getSchema)
    +      rowGroupsDropped = rowGroupsDropped + (blocks.size - filteredBlocks.size)
    +      
    +      if (!filteredBlocks.isEmpty){
    +          var blockLocations: Array[BlockLocation] = null
    +          if (!cacheMetadata) {
    +            blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
    +          } else {
    +            blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] {
    +              def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen)
    +            })
    +          }
    +          splits.addAll(
    +            generateSplits.invoke(
    +              null,
    +              filteredBlocks,
    +              blockLocations,
    +              status,
    +              readContext.getRequestedSchema.toString,
    +              readContext.getReadSupportMetadata,
    +              minSplitSize,
    +              maxSplitSize).asInstanceOf[JList[ParquetInputSplit]])
    +        }
    +    }
    +
    +    if (rowGroupsDropped > 0 && totalRowGroups > 0){
    +      val percentDropped = ((rowGroupsDropped/totalRowGroups.toDouble) * 100).toInt
    +      logInfo(s"Dropping $rowGroupsDropped row groups that do not pass filter predicate "
    +        + s"($percentDropped %) !")
    +    }
    +    else {
    +      logInfo("There were no row groups that could be dropped due to filter predicates")
    +    }
    +    splits
    +
    +  }
    +
    +  def getTaskSideSplits(
    +    configuration: Configuration,
    +    footers: JList[Footer],
    +    maxSplitSize: JLong,
    +    minSplitSize: JLong,
    +    readContext: ReadContext): JList[ParquetInputSplit] = {
    +
    +    val splits = mutable.ArrayBuffer.empty[ParquetInputSplit]
    +    
    +    // Ugly hack, stuck with it until PR:
    +    // https://github.com/apache/incubator-parquet-mr/pull/17
    +    // is resolved
    +    val generateSplits =
    +      Class.forName("parquet.hadoop.TaskSideMetadataSplitStrategy")
    +       .getDeclaredMethods.find(_.getName == "generateTaskSideMDSplits").getOrElse(
    +         sys.error(
    +           s"Failed to reflectively invoke TaskSideMetadataSplitStrategy.generateTaskSideMDSplits"))
    +    generateSplits.setAccessible(true)
    + 
    +    for (footer <- footers) {
    +      val file = footer.getFile
    +      val fs = file.getFileSystem(configuration)
    +      val status = fileStatuses.getOrElse(file, fs.getFileStatus(file))
    +      val blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
           splits.addAll(
    --- End diff --
    
    Basically I was wondering whether generateSplits.invoke() works on an empty list.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60897965
  
    Added more tests for filtering on nullable columns.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-60581723
  
      [Test build #22297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22297/consoleFull) for   PR 2841 at commit [`f304667`](https://github.com/apache/spark/commit/f30466707e97b06f47b506699f7e9a043b1d7366).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/2841#issuecomment-61216332
  
    @marmbrus, @mateiz, thanks for all the help!
    @marmbrus, you may want to close this ticket as well:
    
    https://issues.apache.org/jira/browse/SPARK-1847

