You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2017/05/16 13:52:20 UTC

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/18000

    [SPARK-20364][SQL] Disable Parquet predicate pushdown for fields having dots in the names

    ## What changes were proposed in this pull request?
    
    This is an alternative workaround by simply avoiding the predicate pushdown for columns having dots in the names. This is an approach different with https://github.com/apache/spark/pull/17680.
    
    The downside of this PR is, literally it does not push down filters on the column having dots in Parquet files at all (both no record level and no rowgroup level) whereas the downside of the approach in that PR, it does not use the Parquet's API properly but in a hacky way to support this case.
    
    I assume we prefer a safe way here by using the Parquet API properly but this does close that PR as we are basically just avoiding here.
    
    This way looks a simple workaround and probably it is fine given the problem looks arguably rather corner cases (although it might end up with reading whole row groups under the hood but either looks not the best).
    
    Currently, if there are dots in the column name, predicate pushdown seems being failed in Parquet.
    
    **With dots**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("col.dots").write.parquet(path)
    spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    +--------+
    ```
    
    **Without dots**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("coldots").write.parquet(path)
    spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
    ```
    
    ```
    +-------+
    |coldots|
    +-------+
    |      1|
    +-------+
    ```
    
    **After**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("col.dots").write.parquet(path)
    spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    |       1|
    +--------+
    ```
    
    ## How was this patch tested?
    
    Unit tests added in `ParquetFilterSuite`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-20364-workaround

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18000.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18000
    
----
commit e94560e07dc578154e222896915fb29e98869480
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-05-16T09:39:18Z

    Disable Parquet predicate pushdown for fields having dots in the names

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76958/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117030310
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -47,39 +49,45 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2}
      *    data type is nullable.
      */
     class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext {
    +
    +  private def checkWithSelectedFilters
    +      (df: DataFrame, predicate: Predicate)
    +      (checker: (DataFrame, Seq[Filter]) => Unit): Unit = {
    +    val output = predicate.collect { case a: Attribute => a }.distinct
    +
    +    val filtered = df
    +      .select(output.map(e => Column(e)): _*)
    +      .where(Column(predicate))
    +
    +    var maybeRelation: Option[HadoopFsRelation] = None
    +    val maybeAnalyzedPredicate = filtered.queryExecution.optimizedPlan.collect {
    +      case PhysicalOperation(_, filters, LogicalRelation(relation: HadoopFsRelation, _, _)) =>
    +        maybeRelation = Some(relation)
    +        filters
    +    }.flatten.reduceLeftOption(_ && _)
    +    assert(maybeAnalyzedPredicate.isDefined, "No filter is analyzed from the given query")
    +
    +    val (_, selectedFilters, _) =
    +      DataSourceStrategy.selectFilters(maybeRelation.get, maybeAnalyzedPredicate.toSeq)
    +    assert(selectedFilters.nonEmpty, "No filter is pushed down")
    +    checker(filtered, selectedFilters)
    +  }
    +
       private def checkFilterPredicate(
           df: DataFrame,
           predicate: Predicate,
           filterClass: Class[_ <: FilterPredicate],
           checker: (DataFrame, Seq[Row]) => Unit,
           expected: Seq[Row]): Unit = {
    -    val output = predicate.collect { case a: Attribute => a }.distinct
    +    checkWithSelectedFilters(df, predicate) { case (filtered, selectedFilters) =>
    +      selectedFilters.foreach { pred =>
    +        val maybeFilter = ParquetFilters.createFilter(df.schema, pred)
    +        assert(maybeFilter.isDefined, s"Couldn't generate filter predicate for $pred")
    +      }
     
    -    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    -      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
    -        val query = df
    -          .select(output.map(e => Column(e)): _*)
    -          .where(Column(predicate))
    -
    -        var maybeRelation: Option[HadoopFsRelation] = None
    -        val maybeAnalyzedPredicate = query.queryExecution.optimizedPlan.collect {
    -          case PhysicalOperation(_, filters, LogicalRelation(relation: HadoopFsRelation, _, _)) =>
    -            maybeRelation = Some(relation)
    -            filters
    -        }.flatten.reduceLeftOption(_ && _)
    -        assert(maybeAnalyzedPredicate.isDefined, "No filter is analyzed from the given query")
    -
    -        val (_, selectedFilters, _) =
    -          DataSourceStrategy.selectFilters(maybeRelation.get, maybeAnalyzedPredicate.toSeq)
    -        assert(selectedFilters.nonEmpty, "No filter is pushed down")
    -
    -        selectedFilters.foreach { pred =>
    -          val maybeFilter = ParquetFilters.createFilter(df.schema, pred)
    -          assert(maybeFilter.isDefined, s"Couldn't generate filter predicate for $pred")
    -          // Doesn't bother checking type parameters here (e.g. `Eq[Integer]`)
    -          maybeFilter.exists(_.getClass === filterClass)
    --- End diff --
    
    Don't we need the check of `filterClass`? Why remove it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #77026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77026/testReport)** for PR 18000 at commit [`c83f1b7`](https://github.com/apache/spark/commit/c83f1b77d39db20aad60272e50f96951c234e914).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r116805464
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -168,6 +168,11 @@ private[parquet] object ParquetFilters {
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
         val dataTypeOf = getFieldMap(schema)
     
    +    // Note that, we do not push down filters for columns having dots in the names. In Parquet
    +    // 1.8.2, currently, column APIs in Parquet's `FilterApi` only allow dot-separated names
    +    // so here we simply avoid this case. See SPARK-20364.
    +    def canMakeFilterOn(name: String): Boolean = dataTypeOf.contains(name) && !name.contains(".")
    --- End diff --
    
    `private`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #77026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77026/testReport)** for PR 18000 at commit [`c83f1b7`](https://github.com/apache/spark/commit/c83f1b77d39db20aad60272e50f96951c234e914).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r116879339
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -168,6 +168,11 @@ private[parquet] object ParquetFilters {
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
         val dataTypeOf = getFieldMap(schema)
     
    +    // Note that, we do not push down filters for columns having dots in the names. In Parquet
    +    // 1.8.2, currently, column APIs in Parquet's `FilterApi` only allow dot-separated names
    +    // so here we simply avoid this case. See SPARK-20364.
    --- End diff --
    
    Sure, it looks much nicer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Sounds ok for me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    LGTM pending Jenkins


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117132177
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    --- End diff --
    
    Instead of checking dots for each predicate recursively in `ParquetFilters`, we can check `Filter.references` of the predicate at top level in `ParquetFileFormat`,  and skip `ParquetFilters.createFilter` at all.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117061432
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -47,39 +49,45 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2}
      *    data type is nullable.
      */
     class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext {
    +
    +  private def checkWithSelectedFilters
    +      (df: DataFrame, predicate: Predicate)
    +      (checker: (DataFrame, Seq[Filter]) => Unit): Unit = {
    +    val output = predicate.collect { case a: Attribute => a }.distinct
    +
    +    val filtered = df
    +      .select(output.map(e => Column(e)): _*)
    +      .where(Column(predicate))
    +
    +    var maybeRelation: Option[HadoopFsRelation] = None
    +    val maybeAnalyzedPredicate = filtered.queryExecution.optimizedPlan.collect {
    +      case PhysicalOperation(_, filters, LogicalRelation(relation: HadoopFsRelation, _, _)) =>
    +        maybeRelation = Some(relation)
    +        filters
    +    }.flatten.reduceLeftOption(_ && _)
    +    assert(maybeAnalyzedPredicate.isDefined, "No filter is analyzed from the given query")
    +
    +    val (_, selectedFilters, _) =
    +      DataSourceStrategy.selectFilters(maybeRelation.get, maybeAnalyzedPredicate.toSeq)
    +    assert(selectedFilters.nonEmpty, "No filter is pushed down")
    +    checker(filtered, selectedFilters)
    +  }
    +
       private def checkFilterPredicate(
           df: DataFrame,
           predicate: Predicate,
           filterClass: Class[_ <: FilterPredicate],
           checker: (DataFrame, Seq[Row]) => Unit,
           expected: Seq[Row]): Unit = {
    -    val output = predicate.collect { case a: Attribute => a }.distinct
    +    checkWithSelectedFilters(df, predicate) { case (filtered, selectedFilters) =>
    +      selectedFilters.foreach { pred =>
    +        val maybeFilter = ParquetFilters.createFilter(df.schema, pred)
    +        assert(maybeFilter.isDefined, s"Couldn't generate filter predicate for $pred")
    +      }
     
    -    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    -      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
    -        val query = df
    -          .select(output.map(e => Column(e)): _*)
    -          .where(Column(predicate))
    -
    -        var maybeRelation: Option[HadoopFsRelation] = None
    -        val maybeAnalyzedPredicate = query.queryExecution.optimizedPlan.collect {
    -          case PhysicalOperation(_, filters, LogicalRelation(relation: HadoopFsRelation, _, _)) =>
    -            maybeRelation = Some(relation)
    -            filters
    -        }.flatten.reduceLeftOption(_ && _)
    -        assert(maybeAnalyzedPredicate.isDefined, "No filter is analyzed from the given query")
    -
    -        val (_, selectedFilters, _) =
    -          DataSourceStrategy.selectFilters(maybeRelation.get, maybeAnalyzedPredicate.toSeq)
    -        assert(selectedFilters.nonEmpty, "No filter is pushed down")
    -
    -        selectedFilters.foreach { pred =>
    -          val maybeFilter = ParquetFilters.createFilter(df.schema, pred)
    -          assert(maybeFilter.isDefined, s"Couldn't generate filter predicate for $pred")
    -          // Doesn't bother checking type parameters here (e.g. `Eq[Integer]`)
    -          maybeFilter.exists(_.getClass === filterClass)
    --- End diff --
    
    Thanks ... 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #76987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76987/testReport)** for PR 18000 at commit [`ae8ea08`](https://github.com/apache/spark/commit/ae8ea082622d4b6c265fec5d6ae897000c02d2ef).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Based on the discussion in https://github.com/apache/parquet-mr/pull/361, it does not sound Parquet will support it in the short term. We might need to live with it for a long time. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76987/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117143600
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    --- End diff --
    
    Ok. Sounds making sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #76958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76958/testReport)** for PR 18000 at commit [`e94560e`](https://github.com/apache/spark/commit/e94560e07dc578154e222896915fb29e98869480).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117030781
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -108,6 +116,15 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
         checkFilterPredicate(df, predicate, filterClass, checkBinaryAnswer _, expected)
       }
     
    +  private def checkNoFilterPredicate(predicate: Predicate)(implicit df: DataFrame): Unit = {
    +    checkWithSelectedFilters(df, predicate) { case (_, selectedFilters) =>
    +      selectedFilters.foreach { pred =>
    +        val maybeFilter = ParquetFilters.createFilter(df.schema, pred)
    +        assert(maybeFilter.isEmpty, s"Could generate filter predicate for $pred")
    --- End diff --
    
    "Shouldn't generate filter predicate for $pred"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    cc @gatorsmile who reviewed the latter way in that PR, @cloud-fan who reviewed my previous Parquet related PR, @liancheng and @viirya who I believe is used to this code path, @ash211 who found this issue and @marmbrus who took an action on the JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77051/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    a high-level question, is it a parquet bug or Spark doesn't use parquet reader correctly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #76958 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76958/testReport)** for PR 18000 at commit [`e94560e`](https://github.com/apache/spark/commit/e94560e07dc578154e222896915fb29e98869480).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117157965
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -490,6 +516,42 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
         }
       }
     
    +  test("SPARK-20364 Do not push down filters when column names have dots") {
    +    implicit class StringToAttribute(str: String) {
    +      // Implicits for attr, $ and symbol do not handle backticks.
    +      def attribute: Attribute = UnresolvedAttribute.quotedString(str)
    --- End diff --
    
    Shall we make $ to use`UnresolvedAttribute.quotedString`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon closed the pull request at:

    https://github.com/apache/spark/pull/18000


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r116879168
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -168,6 +168,11 @@ private[parquet] object ParquetFilters {
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
         val dataTypeOf = getFieldMap(schema)
     
    +    // Note that, we do not push down filters for columns having dots in the names. In Parquet
    +    // 1.8.2, currently, column APIs in Parquet's `FilterApi` only allow dot-separated names
    +    // so here we simply avoid this case. See SPARK-20364.
    +    def canMakeFilterOn(name: String): Boolean = dataTypeOf.contains(name) && !name.contains(".")
    --- End diff --
    
    (this is a nested function)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117168546
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -47,39 +49,47 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2}
      *    data type is nullable.
      */
     class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext {
    --- End diff --
    
    Sure, I just revert it back and made a simple test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #76987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76987/testReport)** for PR 18000 at commit [`ae8ea08`](https://github.com/apache/spark/commit/ae8ea082622d4b6c265fec5d6ae897000c02d2ef).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117132718
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    --- End diff --
    
    We should log something for users because it might not be straightforward for users to know predicate pushdown is disabled for dot-columns. This is bad for performance, it seems to me that it's better to let users know what's happened.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18000#discussion_r116946732

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
* Converts data sources filters to Parquet filter predicates.
*/
def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
- val dataTypeOf = getFieldMap(schema)
+ val nameTypeMap = getFieldMap(schema)
+
+ // Parquet does not allow dots in the column name because dots are used as a column path
+ // delimiter. Since Parquet 1.8.2 (PARQUET-389), Parquet accepts the filter predicates
+ // with missing columns. The incorrect results could be got from Parquet when we push down
+ // filters for the column having dots in the names. Thus, we do not push down such filters.
--- End diff --

It seems to me that a missing column is treated like a NULL value. The results will be changed only for some predicates, e.g, `IsNull` and `IsNotNull`. For other predicates, can we still push down them?

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117145159
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    --- End diff --
    
    Not just for speed.  Also for the number of codes needed to change. But I think it is ok for me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    ^ Yea, maybe. we need to make sure onmhttps://github.com/apache/spark/blob/3bfb639cb7352aec572ef6686d3471bd78748ffa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565-L572


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117032222
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -108,6 +116,15 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
         checkFilterPredicate(df, predicate, filterClass, checkBinaryAnswer _, expected)
       }
     
    +  private def checkNoFilterPredicate(predicate: Predicate)(implicit df: DataFrame): Unit = {
    +    checkWithSelectedFilters(df, predicate) { case (_, selectedFilters) =>
    +      selectedFilters.foreach { pred =>
    +        val maybeFilter = ParquetFilters.createFilter(df.schema, pred)
    +        assert(maybeFilter.isEmpty, s"Could generate filter predicate for $pred")
    --- End diff --
    
    We should also check correctness of the results.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Just to make sure, I don't feel strongly for both comments @viirya. I am willing to fix if you feel strongly. Please let me know.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r116800201
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -168,6 +168,11 @@ private[parquet] object ParquetFilters {
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
         val dataTypeOf = getFieldMap(schema)
     
    +    // Note that, we do not push down filters for columns having dots in the names. In Parquet
    +    // 1.8.2, currently, column APIs in Parquet's `FilterApi` only allow dot-separated names
    +    // so here we simply avoid this case. See SPARK-20364.
    --- End diff --
    
    Added the link to PARQUET-389 in the JIRA SPARK-20364


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    In addition, we need to know the limitation of column names in Parquet. See the related PR in Parquet: https://github.com/apache/parquet-mr/pull/361


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

GitHub user HyukjinKwon reopened a pull request:

    https://github.com/apache/spark/pull/18000

    [SPARK-20364][SQL] Disable Parquet predicate pushdown for fields having dots in the names

    ## What changes were proposed in this pull request?
    
    This is an alternative workaround by simply avoiding the predicate pushdown for columns having dots in the names. This is an approach different with https://github.com/apache/spark/pull/17680.
    
    The downside of this PR is, literally it does not push down filters on the column having dots in Parquet files at all (both no record level and no rowgroup level) whereas the downside of the approach in that PR, it does not use the Parquet's API properly but in a hacky way to support this case.
    
    I assume we prefer a safe way here by using the Parquet API properly but this does close that PR as we are basically just avoiding here.
    
    This way looks a simple workaround and probably it is fine given the problem looks arguably rather corner cases (although it might end up with reading whole row groups under the hood but either looks not the best).
    
    Currently, if there are dots in the column name, predicate pushdown seems being failed in Parquet.
    
    **With dots**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("col.dots").write.parquet(path)
    spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    +--------+
    ```
    
    **Without dots**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("coldots").write.parquet(path)
    spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
    ```
    
    ```
    +-------+
    |coldots|
    +-------+
    |      1|
    +-------+
    ```
    
    **After**
    
    ```scala
    val path = "/tmp/abcde"
    Seq(Some(1), None).toDF("col.dots").write.parquet(path)
    spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    |       1|
    +--------+
    ```
    
    ## How was this patch tested?
    
    Unit tests added in `ParquetFilterSuite`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-20364-workaround

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18000.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18000
    
----
commit e94560e07dc578154e222896915fb29e98869480
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-05-16T09:39:18Z

    Disable Parquet predicate pushdown for fields having dots in the names

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117162950
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -47,39 +49,47 @@ import org.apache.spark.util.{AccumulatorContext, AccumulatorV2}
      *    data type is nullable.
      */
     class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext {
    --- End diff --
    
    can we just have a simple end-to-end test? The fix is actually very simple and seems not worth such complex tests to verify it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r116801592
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -168,6 +168,11 @@ private[parquet] object ParquetFilters {
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
         val dataTypeOf = getFieldMap(schema)
    --- End diff --
    
    `dataTypeOf` -> `nameTypeMap`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r116964092
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    +    // delimiter. Since Parquet 1.8.2 (PARQUET-389), Parquet accepts the filter predicates
    +    // with missing columns. The incorrect results could be got from Parquet when we push down
    +    // filters for the column having dots in the names. Thus, we do not push down such filters.
    --- End diff --
    
    In any way, I believe we should disable because it appears the pushed Parquet filter indicates another column.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    I would rather like to say it is a limitation in Parquet API. It looks there is no way to set column names having dots in Parquet filters properly. https://github.com/apache/spark/pull/17680 suggests a hacky workaround to set this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Thanks! Merging to master/2.2


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18000


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117168737
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -538,6 +538,21 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           // scalastyle:on nonascii
         }
       }
    +
    +  test("SPARK-20364: Disable Parquet predicate pushdown for fields having dots in the names") {
    --- End diff --
    
    Looks much better now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    @viirya, I addressed your comments. Could you take another look when you have sometime?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #76989 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76989/testReport)** for PR 18000 at commit [`20adf72`](https://github.com/apache/spark/commit/20adf723d717d7f3c5c64de0297c03075423b2fc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #77051 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77051/testReport)** for PR 18000 at commit [`1eae64a`](https://github.com/apache/spark/commit/1eae64aa2d0618085c4d48fd033d6ed4a4aca8b2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76989/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117026406
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    +    // delimiter. Since Parquet 1.8.2 (PARQUET-389), Parquet accepts the filter predicates
    +    // with missing columns. The incorrect results could be got from Parquet when we push down
    +    // filters for the column having dots in the names. Thus, we do not push down such filters.
    +    // See SPARK-20364.
    +    def canMakeFilterOn(name: String): Boolean = nameTypeMap.contains(name) && !name.contains(".")
    --- End diff --
    
    Do we need to consider other special characters, e.g., those in apache/parquet-mr#361?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r116950919
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    +    // delimiter. Since Parquet 1.8.2 (PARQUET-389), Parquet accepts the filter predicates
    +    // with missing columns. The incorrect results could be got from Parquet when we push down
    +    // filters for the column having dots in the names. Thus, we do not push down such filters.
    --- End diff --
    
    Yes, but the problem is, it (almost) always evaluates it with NULL when the columns have dots in the names because column paths become nested (`a.b` not `` `a.b` ``) in the Parquet predicate filter.
    
    You are right for `IsNull`. I pointed out this in https://github.com/apache/spark/pull/17680#discussion_r112285883 as they (almost) always evaluate it to `true` in Parquet-side but it is filtered in Spark-side. So, for input/output, it is not an issue in this case but I believe we should disable this for this case too.
    
    I think this example explains the case
    
    ```scala
    val dfs = Seq(
      Seq(Some(1), None).toDF("col.dots"),
      Seq(Some(1L), None).toDF("col.dots"),
      Seq(Some(1.0F), None).toDF("col.dots"),
      Seq(Some(1.0D), None).toDF("col.dots"),
      Seq(true, false).toDF("col.dots"),
      Seq("apple", null).toDF("col.dots"),
      Seq("apple", null).toDF("col.dots")
    )
     
    val predicates = Seq(
      "`col.dots` > 0",
      "`col.dots` >= 1L",
      "`col.dots` < 2.0",
      "`col.dots` <= 1.0D",
      "`col.dots` == true",
      "`col.dots` IS NOT NULL",
      "`col.dots` IS NULL"
    )
    
    dfs.zip(predicates).zipWithIndex.foreach { case ((df, predicate), i) =>
      val path = s"/tmp/abcd$i"
      df.write.mode("overwrite").parquet(path)
      spark.read.parquet(path).where(predicate).show()	
    }
    ```
    
    ```
    +--------+
    |col.dots|
    +--------+
    +--------+
    
    +--------+
    |col.dots|
    +--------+
    +--------+
    
    +--------+
    |col.dots|
    +--------+
    +--------+
    
    +--------+
    |col.dots|
    +--------+
    +--------+
    
    +--------+
    |col.dots|
    +--------+
    +--------+
    
    +--------+
    |col.dots|
    +--------+
    +--------+
    
    +--------+
    |col.dots|
    +--------+
    |    null|
    +--------+
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77026/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117143908
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    --- End diff --
    
    Hm, I expect this is a non-critical path and not executed multiple times. Also, it does not look particularly faster to call, `Filter.references` -> `Filter.findReferences` -> `Filter.references` ... . Another downside (maybe nitpicking) is, this will introduce another small code path that returns `None` for filter creation failure.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117158402
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -490,6 +516,42 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
         }
       }
     
    +  test("SPARK-20364 Do not push down filters when column names have dots") {
    +    implicit class StringToAttribute(str: String) {
    +      // Implicits for attr, $ and symbol do not handle backticks.
    +      def attribute: Attribute = UnresolvedAttribute.quotedString(str)
    --- End diff --
    
    Yea, actually my initial version in my local included the change for `symbol` and` $` to match them to `Column`. It also looks making sense per https://github.com/apache/spark/pull/7969. I believe this is an internal API - https://github.com/apache/spark/blob/e9c91badce64731ffd3e53cbcd9f044a7593e6b8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L24 so I guess it would be fine even if it introduces a behaviour change.
    
    Nevertheless, I believe some guys don't like this change much and wanted to avoid such changes here for now (it is single place it needs anyway for now ... ).



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Seems jenkins doesn't work for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Now, the fix is much safer for merging to 2.2. Will review the test case later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117162403
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    --- End diff --
    
    nit: `nameToType`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18000#discussion_r116805363

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -168,6 +168,11 @@ private[parquet] object ParquetFilters {
def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
val dataTypeOf = getFieldMap(schema)

+ // Note that, we do not push down filters for columns having dots in the names. In Parquet
+ // 1.8.2, currently, column APIs in Parquet's `FilterApi` only allow dot-separated names
+ // so here we simply avoid this case. See SPARK-20364.
--- End diff --

Parquet does not allow . in the column name because it is used as a column path delimiter. Since Parquet 1.8.2 (PARQUET-389), Parquet accepts the filter predicates with missing columns. Since it can generate the incorrect results, we block the filter pushdown when the filters have the column having dots in the names. See SPARK-20364.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117140137
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    --- End diff --
    
    Yes, it is. However, we don't already log pushed filters failed to create, e.g., `In` AFAIK. Probably, we should log in those cases across all the sources. If you don't strongly feel about this, I would like to not log here for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    LGTM, is parquet going to fix it in the future? or is there any official way to support filter push down for column names with dot?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Thank you @viirya.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #76989 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76989/testReport)** for PR 18000 at commit [`20adf72`](https://github.com/apache/spark/commit/20adf723d717d7f3c5c64de0297c03075423b2fc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    **[Test build #77051 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77051/testReport)** for PR 18000 at commit [`1eae64a`](https://github.com/apache/spark/commit/1eae64aa2d0618085c4d48fd033d6ed4a4aca8b2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18000: [SPARK-20364][SQL] Disable Parquet predicate push...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18000#discussion_r117061359
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -166,7 +166,14 @@ private[parquet] object ParquetFilters {
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: StructType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val dataTypeOf = getFieldMap(schema)
    +    val nameTypeMap = getFieldMap(schema)
    +
    +    // Parquet does not allow dots in the column name because dots are used as a column path
    +    // delimiter. Since Parquet 1.8.2 (PARQUET-389), Parquet accepts the filter predicates
    +    // with missing columns. The incorrect results could be got from Parquet when we push down
    +    // filters for the column having dots in the names. Thus, we do not push down such filters.
    +    // See SPARK-20364.
    +    def canMakeFilterOn(name: String): Boolean = nameTypeMap.contains(name) && !name.contains(".")
    --- End diff --
    
    I think we should disallow when initially loading or writing out if it is still allowed in any way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18000: [SPARK-20364][SQL] Disable Parquet predicate pushdown fo...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18000
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org