Posted to reviews@spark.apache.org by yucai <gi...@git.apache.org> on 2018/08/23 08:04:48 UTC

[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

GitHub user yucai opened a pull request:

    https://github.com/apache/spark/pull/22197

    [SPARK-25207][SQL] Case-insensitve field resolution for filter pushdown when reading Parquet

    ## What changes were proposed in this pull request?
    
    Currently, filter pushdown does not work if the Parquet schema and the Hive metastore schema use different letter cases, even when `spark.sql.caseSensitive` is false.
    
    For example:
    ```scala
    spark.range(10).write.parquet("/tmp/data")
    sql("DROP TABLE t")
    sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
    sql("select * from t where id > 0").show
    ```
    
    No filter is pushed down for the query above; this PR adds case-insensitive field resolution to the Parquet filter pushdown path.
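
    At a high level, the fix resolves the pushed-down column name against the Parquet file's field names ignoring case, and skips pushdown when the match is ambiguous. A minimal sketch of that idea (the `resolveField` helper below is illustrative, not the PR's actual code):

    ```scala
    // Illustrative sketch only: pick the Parquet field that matches the pushed-down
    // column name case-insensitively; give up when there is no match or more than one.
    object CaseInsensitiveResolutionSketch {
      def resolveField(parquetFieldNames: Seq[String], pushedName: String): Option[String] =
        parquetFieldNames.filter(_.equalsIgnoreCase(pushedName)) match {
          case Seq(single) => Some(single) // unique match: push down using the file's exact name
          case _           => None         // zero or duplicate matches: skip pushdown
        }
    }
    ```
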
    ## How was this patch tested?
    
    Added UTs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yucai/spark SPARK-25207

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22197.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22197
    
----
commit 5902afe6fb6e88f98fb4f2649e59156264bc3e4d
Author: yucai <yy...@...>
Date:   2018-08-23T07:16:42Z

    [SPARK-25207][SQL] Case-insensitve field resolution for filter pushdown when reading Parquet

----


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95428/testReport)** for PR 22197 at commit [`9142f49`](https://github.com/apache/spark/commit/9142f4925f5c127fe790c02bc8525e415aa1a60b).


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r213597570
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -366,18 +367,29 @@ class ParquetFileFormat
     
           val sharedConf = broadcastedHadoopConf.value.value
     
    -      lazy val footerFileMetaData =
    +      val footerFileMetaData =
             ParquetFileReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData
    +
    +      val parquetRequestedSchema = {
    +        val schemaString = sharedConf.get(ParquetReadSupport.SPARK_ROW_CATALYST_REQUESTED_SCHEMA)
    +        assert(schemaString != null, "Catalyst requested schema not set.")
    +        val catalystRequestedSchema = StructType.fromString(schemaString)
    +        val parquetSchema = footerFileMetaData.getSchema
    +        ParquetReadSupport.clipParquetSchema(
    +          parquetSchema, catalystRequestedSchema, isCaseSensitive)
    +      }
    +      sharedConf.set(ParquetReadSupport.SPARK_ROW_PARQUET_REQUESTED_SCHEMA,
    --- End diff --
    
    We are already on the executor side here, so why do we need to set the conf? We could even pass the `parquetRequestedSchema` to the reader via its constructor.
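
    A purely hypothetical sketch of the constructor-passing idea (`ExampleColumnarReader` is made up for illustration; the real reader classes take more arguments):

    ```scala
    import scala.collection.JavaConverters._
    import org.apache.parquet.schema.MessageType

    // Hypothetical shape, for illustration only: the clipped schema computed on the executor
    // is handed to the reader directly instead of being round-tripped through the Hadoop conf.
    class ExampleColumnarReader(parquetRequestedSchema: MessageType) {
      def requestedColumnNames: Seq[String] =
        parquetRequestedSchema.getFields.asScala.map(_.getName).toList
    }
    ```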


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95409 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95409/testReport)** for PR 22197 at commit [`512b181`](https://github.com/apache/spark/commit/512b181f61bc961731a466ba2e939b362a017401).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    retest this please


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95454/
    Test FAILed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95449/testReport)** for PR 22197 at commit [`cb03fb7`](https://github.com/apache/spark/commit/cb03fb737f1e556792c9284cd5102148a91fa483).


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r214263031
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,113 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    def createParquetFilter(caseSensitive: Boolean): ParquetFilters = {
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive)
    +    }
    +    val caseSensitiveParquetFilters = createParquetFilter(caseSensitive = true)
    +    val caseInsensitiveParquetFilters = createParquetFilter(caseSensitive = false)
    +
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.or(
    +        FilterApi.eq(intColumn("cint"), 10: Integer),
    +        FilterApi.eq(intColumn("cint"), 20: Integer)),
    +      sources.In("CINT", Array(10, 20)))
    +
    +    val dupFieldSchema = StructType(
    +      Seq(StructField("cint", IntegerType), StructField("cINT", IntegerType)))
    +    val dupParquetSchema = new SparkToParquetSchemaConverter(conf).convert(dupFieldSchema)
    +    assertResult(None) {
    +      caseInsensitiveParquetFilters.createFilter(
    +        dupParquetSchema, sources.EqualTo("CINT", 1000))
    +    }
    +  }
    +
    +  test("SPARK-25207: exception when duplicate fields in case-insensitive mode") {
    +    withTempPath { dir =>
    +      val tableName = "spark_25207"
    +      val tableDir = dir.getAbsoluteFile + "/table"
    +      withTable(tableName) {
    +        withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
    +          spark.range(10).selectExpr("id as A", "id as B", "id as b")
    +            .write.mode("overwrite").parquet(tableDir)
    +        }
    +        sql(
    +          s"""
    +             |CREATE TABLE $tableName (A LONG, B LONG) USING PARQUET LOCATION '$tableDir'
    +           """.stripMargin)
    +
    +        withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
    +          val e = intercept[SparkException] {
    +            sql(s"select a from $tableName where b > 0").collect()
    --- End diff --
    
    nit: to be consistent with the following query, I'd write this one as `select A from $tableName where B > 0` too.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    This PR is basically trying to resolve case sensitivity when the logical schema and the physical schema do not match. This sounds like a general issue across all the data sources. Could any of you do us a favor and check whether all the built-in data sources respect the conf `spark.sql.caseSensitive` in this case?


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    LGTM


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95409/
    Test FAILed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @yucai, if you don't mind, could you update the PR description? This PR doesn't generate new filters; it only changes the `field resolution` logic. Both with and without this PR, filters exist and filter push-down occurs.
    ```
    - No filter will be pushdown.
    + Wrong filters will be pushed down.
    ```


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95449/
    Test PASSed.


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212812755
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,116 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    val caseSensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = true)
    +
    +    val caseInsensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = false)
    +
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.or(
    +        FilterApi.eq(intColumn("cint"), 10: Integer),
    +        FilterApi.eq(intColumn("cint"), 20: Integer)),
    +      sources.In("CINT", Array(10, 20)))
    +
    +    val dupFieldSchema = StructType(
    +      Seq(StructField("cint", IntegerType), StructField("cINT", IntegerType)))
    +    val dupParquetSchema = new SparkToParquetSchemaConverter(conf).convert(dupFieldSchema)
    +    assertResult(None) {
    +      caseInsensitiveParquetFilters.createFilter(
    +        dupParquetSchema, sources.EqualTo("CINT", 1000))
    +    }
    +  }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet" +
    +    " - exception when duplicate fields in case-insensitive mode") {
    --- End diff --
    
    nit: We can use just `exception when duplicate fields in case-insensitive mode` as the test title. The original one is too verbose.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95264 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95264/testReport)** for PR 22197 at commit [`10c437e`](https://github.com/apache/spark/commit/10c437eaee6b767f744e5c2227b96efc4f9bc685).


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Both `catalystRequestedSchema` and `parquetSchema` are recursive structures; is there an easy way to find duplicated fields? Thanks!
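
    For the top-level fields (the only ones filter pushdown currently handles), a minimal sketch of one way to spot case-insensitive duplicates, along the lines of the `groupBy` dedup in this PR (the helper name is illustrative):

    ```scala
    import java.util.Locale

    object DuplicateFieldCheckSketch {
      // Illustrative sketch: report top-level field names that collide when compared
      // case-insensitively, e.g. Seq("cint", "cINT", "id") -> List(List("cint", "cINT")).
      def duplicateTopLevelFields(fieldNames: Seq[String]): Seq[Seq[String]] =
        fieldNames.groupBy(_.toLowerCase(Locale.ROOT)).values.filter(_.size > 1).toSeq
    }
    ```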


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r214233946
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -44,7 +45,14 @@ private[parquet] class ParquetFilters(
         pushDownTimestamp: Boolean,
         pushDownDecimal: Boolean,
         pushDownStartWith: Boolean,
    -    pushDownInFilterThreshold: Int) {
    +    pushDownInFilterThreshold: Int,
    +    caseSensitive: Boolean) {
    +
    +  private case class ParquetField(
    +      // field name in parquet file
    --- End diff --
    
    I'd just move those comments into the doc for this case class, for instance:
    
    ```
    /**
     * blabla
     * @param blabla
     */
    private case class ParquetField
    ```
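
    Filled in concretely (wording illustrative; the parameters match the `ParquetField(name, schema)` case class discussed later in this thread), it could read:

    ```scala
    // Stub for illustration only; the real ParquetSchemaType (carrying originalType,
    // primitiveTypeName, typeLength and decimalMetadata) lives in ParquetFilters.
    case class ParquetSchemaType()

    /**
     * A single primitive Parquet field that filter pushdown can target.
     *
     * @param name   the field name exactly as it appears in the Parquet file
     * @param schema the Parquet type information for the field
     */
    case class ParquetField(name: String, schema: ParquetSchemaType)
    ```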


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95410/testReport)** for PR 22197 at commit [`fe8cfce`](https://github.com/apache/spark/commit/fe8cfce5e856b33e72844692a017267f3fcd775a).


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212788741
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -377,7 +378,7 @@ class ParquetFileFormat
               // Collects all converted Parquet filter predicates. Notice that not all predicates can be
               // converted (`ParquetFilters.createFilter` returns an `Option`). That's why a `flatMap`
               // is used here.
    -          .flatMap(parquetFilters.createFilter(parquetSchema, _))
    +          .flatMap(parquetFilters.createFilter(parquetSchema, _, isCaseSensitive))
    --- End diff --
    
    can we pass this config when creating `ParquetFilters`?


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95149/
    Test PASSed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95462/
    Test PASSed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95456 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95456/testReport)** for PR 22197 at commit [`41a7b83`](https://github.com/apache/spark/commit/41a7b8385972ef504af7f88954a9e4ccecec6885).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212812673
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +356,38 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns a map, which contains parquet field name and data type, if predicate push down applies.
        */
    -  private def getFieldMap(dataType: MessageType): Map[String, ParquetSchemaType] = dataType match {
    -    case m: MessageType =>
    -      // Here we don't flatten the fields in the nested schema but just look up through
    -      // root fields. Currently, accessing to nested fields does not push down filters
    -      // and it does not support to create filters for them.
    -      m.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    -        f.getName -> ParquetSchemaType(
    -          f.getOriginalType, f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata)
    -      }.toMap
    -    case _ => Map.empty[String, ParquetSchemaType]
    +  private def getFieldMap(dataType: MessageType): Map[String, ParquetField] = {
    +    // Here we don't flatten the fields in the nested schema but just look up through
    +    // root fields. Currently, accessing to nested fields does not push down filters
    +    // and it does not support to create filters for them.
    +    val primitiveFields =
    +      dataType.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    +        f.getName -> ParquetField(f.getName,
    +          ParquetSchemaType(f.getOriginalType,
    +            f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata))
    +      }
    +    if (caseSensitive) {
    +      primitiveFields.toMap
    +    } else {
    +      // Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
    +      // mode, just skip pushdown for these fields, they will trigger Exception when reading,
    +      // See: SPARK-25132.
    --- End diff --
    
    If we don't need to consider ambiguity, can't we just lowercase `f.getName` above instead of doing dedup here?
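
    For comparison, a minimal sketch of the two options (illustrative values, not the PR's code). Simply lowercasing the keys lets `toMap` silently keep only one of any colliding pair, whereas the dedup drops both so that pushdown is skipped for them:

    ```scala
    import java.util.Locale

    object KeyHandlingSketch {
      val fields = Seq("cint" -> 1, "cINT" -> 2) // two Parquet fields that collide ignoring case

      // Option 1: lowercase the keys; toMap keeps the last entry per key,
      // so one of the colliding fields silently "wins".
      val lowercased = fields.map { case (n, v) => n.toLowerCase(Locale.ROOT) -> v }.toMap
      // -> Map("cint" -> 2)

      // Option 2: group by lowercased name and drop ambiguous names entirely.
      val deduped = fields
        .groupBy { case (n, _) => n.toLowerCase(Locale.ROOT) }
        .filter { case (_, vs) => vs.size == 1 }
        .map { case (_, vs) => vs.head }
      // -> Map()
    }
    ```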


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212801680
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1021,88 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25132: Case-insensitive field resolution for pushdown when reading parquet") {
    +    val caseSensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = true)
    +    val caseInsensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = false)
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.or(
    +        FilterApi.eq(intColumn("cint"), 10: Integer),
    +        FilterApi.eq(intColumn("cint"), 20: Integer)),
    +      sources.In("CINT", Array(10, 20)))
    +
    +    val dupFieldSchema = StructType(
    +      Seq(StructField("cint", IntegerType), StructField("cINT", IntegerType)))
    +    val dupParquetSchema = new SparkToParquetSchemaConverter(conf).convert(dupFieldSchema)
    +    assertResult(None) {
    +      caseInsensitiveParquetFilters.createFilter(
    +        dupParquetSchema, sources.EqualTo("CINT", 1000))
    +    }
    --- End diff --
    
    Added, thanks!


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95147/testReport)** for PR 22197 at commit [`5902afe`](https://github.com/apache/spark/commit/5902afe6fb6e88f98fb4f2649e59156264bc3e4d).


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @dongjoon-hyun Do you think we face the same issue in ORC?


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212788978
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
    @@ -377,7 +378,7 @@ class ParquetFileFormat
               // Collects all converted Parquet filter predicates. Notice that not all predicates can be
               // converted (`ParquetFilters.createFilter` returns an `Option`). That's why a `flatMap`
               // is used here.
    -          .flatMap(parquetFilters.createFilter(parquetSchema, _))
    +          .flatMap(parquetFilters.createFilter(parquetSchema, _, isCaseSensitive))
    --- End diff --
    
    Yes, that way is better.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    thanks, merging to master!


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    retest this please


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95256 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95256/testReport)** for PR 22197 at commit [`1ea94cc`](https://github.com/apache/spark/commit/1ea94cca149b7a4814ea617f673111ea0f26219d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95517/testReport)** for PR 22197 at commit [`e0d6196`](https://github.com/apache/spark/commit/e0d61969b13bcfd9dfc95e2a013b14e111d2b832).


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95147/
    Test FAILed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95257 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95257/testReport)** for PR 22197 at commit [`90b8717`](https://github.com/apache/spark/commit/90b8717612ab43d97228d77deaae3d6f33add055).


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95454/testReport)** for PR 22197 at commit [`04b88c5`](https://github.com/apache/spark/commit/04b88c579afb975fb0d4c8a8c4e5ff54f3c27ce2).


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r213596588
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
    @@ -65,17 +65,19 @@ private[parquet] class ParquetReadSupport(val convertTz: Option[TimeZone])
        * readers.  Responsible for figuring out Parquet requested schema used for column pruning.
        */
       override def init(context: InitContext): ReadContext = {
    +    val conf = context.getConfiguration
    +
         catalystRequestedSchema = {
    --- End diff --
    
    This is created but not used?


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95247/
    Test PASSed.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95253 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95253/testReport)** for PR 22197 at commit [`10cd89f`](https://github.com/apache/spark/commit/10cd89ff46892724d513a68bd82cf1a022387436).


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212848096
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +356,38 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns a map, which contains parquet field name and data type, if predicate push down applies.
        */
    -  private def getFieldMap(dataType: MessageType): Map[String, ParquetSchemaType] = dataType match {
    -    case m: MessageType =>
    -      // Here we don't flatten the fields in the nested schema but just look up through
    -      // root fields. Currently, accessing to nested fields does not push down filters
    -      // and it does not support to create filters for them.
    -      m.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    -        f.getName -> ParquetSchemaType(
    -          f.getOriginalType, f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata)
    -      }.toMap
    -    case _ => Map.empty[String, ParquetSchemaType]
    +  private def getFieldMap(dataType: MessageType): Map[String, ParquetField] = {
    +    // Here we don't flatten the fields in the nested schema but just look up through
    +    // root fields. Currently, accessing to nested fields does not push down filters
    +    // and it does not support to create filters for them.
    +    val primitiveFields =
    +      dataType.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    +        f.getName -> ParquetField(f.getName,
    +          ParquetSchemaType(f.getOriginalType,
    +            f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata))
    +      }
    +    if (caseSensitive) {
    +      primitiveFields.toMap
    +    } else {
    +      // Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
    +      // mode, just skip pushdown for these fields, they will trigger Exception when reading,
    +      // See: SPARK-25132.
    --- End diff --
    
    can we do the dedup before parquet filter pushdown and parquet column pruning? Then we can simplify the code in both cases.


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212814157
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +356,38 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns a map, which contains parquet field name and data type, if predicate push down applies.
        */
    -  private def getFieldMap(dataType: MessageType): Map[String, ParquetSchemaType] = dataType match {
    -    case m: MessageType =>
    -      // Here we don't flatten the fields in the nested schema but just look up through
    -      // root fields. Currently, accessing to nested fields does not push down filters
    -      // and it does not support to create filters for them.
    -      m.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    -        f.getName -> ParquetSchemaType(
    -          f.getOriginalType, f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata)
    -      }.toMap
    -    case _ => Map.empty[String, ParquetSchemaType]
    +  private def getFieldMap(dataType: MessageType): Map[String, ParquetField] = {
    +    // Here we don't flatten the fields in the nested schema but just look up through
    +    // root fields. Currently, accessing to nested fields does not push down filters
    +    // and it does not support to create filters for them.
    +    val primitiveFields =
    +      dataType.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    +        f.getName -> ParquetField(f.getName,
    +          ParquetSchemaType(f.getOriginalType,
    +            f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata))
    +      }
    +    if (caseSensitive) {
    +      primitiveFields.toMap
    +    } else {
    +      // Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
    +      // mode, just skip pushdown for these fields, they will trigger Exception when reading,
    +      // See: SPARK-25132.
    +      val dedupPrimitiveFields =
    +        primitiveFields
    +          .groupBy(_._1.toLowerCase(Locale.ROOT))
    +          .filter(_._2.size == 1)
    +          .mapValues(_.head._2)
    +      CaseInsensitiveMap(dedupPrimitiveFields)
    +    }
       }
     
       /**
        * Converts data sources filters to Parquet filter predicates.
        */
       def createFilter(schema: MessageType, predicate: sources.Filter): Option[FilterPredicate] = {
    -    val nameToType = getFieldMap(schema)
    +    val nameToParquet = getFieldMap(schema)
    --- End diff --
    
    ->  nameToParquetField


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95258 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95258/testReport)** for PR 22197 at commit [`10c437e`](https://github.com/apache/spark/commit/10c437eaee6b767f744e5c2227b96efc4f9bc685).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95264/testReport)** for PR 22197 at commit [`10c437e`](https://github.com/apache/spark/commit/10c437eaee6b767f744e5c2227b96efc4f9bc685).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212812744
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,116 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    val caseSensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = true)
    +
    +    val caseInsensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = false)
    +
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    --- End diff --
    
    nit: maybe we don't need to test against so many predicates. We just want to make sure case-insensitive resolution works.


---



[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    > Is it acceptable?
    
    apparently not...
    
    OK, let's just check duplicated field names twice: once in filter pushdown and once in column pruning. And clean it up in follow-up PRs.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @cloud-fan @HyukjinKwon It seems we cannot simply add `originalName` into `ParquetSchemaType`.
    
    Because we need the exact `ParquetSchemaType` info for type matching, like:
    ```
      private val ParquetByteType = ParquetSchemaType(INT_8, INT32, 0, null)
      private val ParquetShortType = ParquetSchemaType(INT_16, INT32, 0, null)
      private val ParquetIntegerType = ParquetSchemaType(null, INT32, 0, null)
      ...
    private val makeEq: = {
      case ParquetByteType | ParquetShortType | ParquetIntegerType =>
    ```
    
    I use a new case class `ParquetField` to reduce the two maps to one.
    ```
      private case class ParquetField(
          name: String,
          schema: ParquetSchemaType)
    ```
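
    With the single map, each pushdown case can then look up both the type and the exact Parquet column name, e.g. a sketch of one `createFilter` case (fragment only, key normalization elided):
    
    ```
    // fieldMap: Map[String, ParquetField]; the value keeps the exact Parquet column
    // name used to build the FilterPredicate, the key is the lookup name.
    case sources.IsNull(name) if canMakeFilterOn(name, null) =>
      makeEq.lift(fieldMap(name).schema).map(_(fieldMap(name).name, null))
    ```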
    
    Let me know if you are OK with this way.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95149/testReport)** for PR 22197 at commit [`2226eae`](https://github.com/apache/spark/commit/2226eaef2eab44c94d4ada0ea86e92c21eb5945d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95428/testReport)** for PR 22197 at commit [`9142f49`](https://github.com/apache/spark/commit/9142f4925f5c127fe790c02bc8525e415aa1a60b).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95524/testReport)** for PR 22197 at commit [`e0d6196`](https://github.com/apache/spark/commit/e0d61969b13bcfd9dfc95e2a013b14e111d2b832).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @cloud-fan I also think my approach changes too much in this PR.
    
    > go through the parquet schema and find duplicated field names
    
    If a user's query only touches non-duplicated fields, this way still throws an exception, like below:
    ```
    spark.range(10).selectExpr("id as a", "id as b", "id as B").write.mode("overwrite").parquet("/tmp/data")
    sql("select a from t").collect
    ```
    
    Is it acceptable? Or should we use another PR to do the refactoring first?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r213912108
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,113 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    def createParquetFilter(caseSensitive: Boolean): ParquetFilters = {
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive)
    +    }
    +    val caseSensitiveParquetFilters = createParquetFilter(caseSensitive = true)
    +    val caseInsensitiveParquetFilters = createParquetFilter(caseSensitive = false)
    +
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.or(
    +        FilterApi.eq(intColumn("cint"), 10: Integer),
    +        FilterApi.eq(intColumn("cint"), 20: Integer)),
    +      sources.In("CINT", Array(10, 20)))
    +
    +    val dupFieldSchema = StructType(
    +      Seq(StructField("cint", IntegerType), StructField("cINT", IntegerType)))
    +    val dupParquetSchema = new SparkToParquetSchemaConverter(conf).convert(dupFieldSchema)
    +    assertResult(None) {
    +      caseInsensitiveParquetFilters.createFilter(
    +        dupParquetSchema, sources.EqualTo("CINT", 1000))
    +    }
    +  }
    +
    +  test("SPARK-25207: exception when duplicate fields in case-insensitive mode") {
    +    withTempPath { dir =>
    +      val tableName = "spark_25207"
    +      val tableDir = dir.getAbsoluteFile + "/table"
    +      withTable(tableName) {
    +        withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
    +          spark.range(10).selectExpr("id as A", "id as B", "id as b")
    +            .write.mode("overwrite").parquet(tableDir)
    +        }
    +        sql(
    +          s"""
    +             |CREATE TABLE $tableName (A LONG, B LONG) USING PARQUET LOCATION '$tableDir'
    +           """.stripMargin)
    +
    +        withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
    +          val e = intercept[SparkException] {
    +            sql(s"select a from $tableName where b > 0").collect()
    --- End diff --
    
    Yes, we can, see below.
    ```
    val tableName = "test"
    val tableDir = "/tmp/data"
    spark.conf.set("spark.sql.caseSensitive", true)
    spark.range(10).selectExpr("id as A", "2 * id as B", "3 * id as b").write.mode("overwrite").parquet(tableDir)
    sql(s"DROP TABLE $tableName")
    sql(s"CREATE TABLE $tableName (A LONG, B LONG) USING PARQUET LOCATION '$tableDir'")
    scala> sql("select A from test where B > 0").show
    +---+
    |  A|
    +---+
    |  7|
    |  8|
    |  9|
    |  2|
    |  3|
    |  4|
    |  5|
    |  6|
    |  1|
    +---+
    ```
    Let me add one test case.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95258 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95258/testReport)** for PR 22197 at commit [`10c437e`](https://github.com/apache/spark/commit/10c437eaee6b767f744e5c2227b96efc4f9bc685).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95524/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95256 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95256/testReport)** for PR 22197 at commit [`1ea94cc`](https://github.com/apache/spark/commit/1ea94cca149b7a4814ea617f673111ea0f26219d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95455/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95455 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95455/testReport)** for PR 22197 at commit [`04b88c5`](https://github.com/apache/spark/commit/04b88c579afb975fb0d4c8a8c4e5ff54f3c27ce2).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95462/testReport)** for PR 22197 at commit [`41a7b83`](https://github.com/apache/spark/commit/41a7b8385972ef504af7f88954a9e4ccecec6885).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95256/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95253 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95253/testReport)** for PR 22197 at commit [`10cd89f`](https://github.com/apache/spark/commit/10cd89ff46892724d513a68bd82cf1a022387436).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95257 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95257/testReport)** for PR 22197 at commit [`90b8717`](https://github.com/apache/spark/commit/90b8717612ab43d97228d77deaae3d6f33add055).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212813302
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,116 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    val caseSensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = true)
    +
    +    val caseInsensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = false)
    +
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    --- End diff --
    
    Each test corresponds to a one-line code change in `createFilter`, like:
    ```
          case sources.IsNull(name) if canMakeFilterOn(name, null) =>
            makeEq.lift(fieldMap(name).schema).map(_(fieldMap(name).name, null))
    ```
    Together, the tests cover all my changes in `createFilter`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95147/testReport)** for PR 22197 at commit [`5902afe`](https://github.com/apache/spark/commit/5902afe6fb6e88f98fb4f2649e59156264bc3e4d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @cloud-fan I reverted to the previous version.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212850351
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,113 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    def createParquetFilter(caseSensitive: Boolean): ParquetFilters = {
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive)
    +    }
    +    val caseSensitiveParquetFilters = createParquetFilter(caseSensitive = true)
    +    val caseInsensitiveParquetFilters = createParquetFilter(caseSensitive = false)
    +
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.or(
    +        FilterApi.eq(intColumn("cint"), 10: Integer),
    +        FilterApi.eq(intColumn("cint"), 20: Integer)),
    +      sources.In("CINT", Array(10, 20)))
    +
    +    val dupFieldSchema = StructType(
    +      Seq(StructField("cint", IntegerType), StructField("cINT", IntegerType)))
    +    val dupParquetSchema = new SparkToParquetSchemaConverter(conf).convert(dupFieldSchema)
    +    assertResult(None) {
    +      caseInsensitiveParquetFilters.createFilter(
    +        dupParquetSchema, sources.EqualTo("CINT", 1000))
    +    }
    +  }
    +
    +  test("SPARK-25207: exception when duplicate fields in case-insensitive mode") {
    +    withTempDir { dir =>
    --- End diff --
    
    nit: `withTempPath`
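
    For reference, `withTempPath` hands the block a path that does not exist yet and removes it recursively afterwards; a usage sketch:
    
    ```
    withTempPath { dir =>
      // dir does not exist yet; the parquet writer creates it, and the helper
      // cleans it up once the block finishes.
      spark.range(10).write.parquet(dir.getCanonicalPath)
    }
    ```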


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22197


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r213906227
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,113 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    def createParquetFilter(caseSensitive: Boolean): ParquetFilters = {
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive)
    +    }
    +    val caseSensitiveParquetFilters = createParquetFilter(caseSensitive = true)
    +    val caseInsensitiveParquetFilters = createParquetFilter(caseSensitive = false)
    +
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.or(
    +        FilterApi.eq(intColumn("cint"), 10: Integer),
    +        FilterApi.eq(intColumn("cint"), 20: Integer)),
    +      sources.In("CINT", Array(10, 20)))
    +
    +    val dupFieldSchema = StructType(
    +      Seq(StructField("cint", IntegerType), StructField("cINT", IntegerType)))
    +    val dupParquetSchema = new SparkToParquetSchemaConverter(conf).convert(dupFieldSchema)
    +    assertResult(None) {
    +      caseInsensitiveParquetFilters.createFilter(
    +        dupParquetSchema, sources.EqualTo("CINT", 1000))
    +    }
    +  }
    +
    +  test("SPARK-25207: exception when duplicate fields in case-insensitive mode") {
    +    withTempPath { dir =>
    +      val tableName = "spark_25207"
    +      val tableDir = dir.getAbsoluteFile + "/table"
    +      withTable(tableName) {
    +        withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
    +          spark.range(10).selectExpr("id as A", "id as B", "id as b")
    +            .write.mode("overwrite").parquet(tableDir)
    +        }
    +        sql(
    +          s"""
    +             |CREATE TABLE $tableName (A LONG, B LONG) USING PARQUET LOCATION '$tableDir'
    +           """.stripMargin)
    +
    +        withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
    +          val e = intercept[SparkException] {
    +            sql(s"select a from $tableName where b > 0").collect()
    --- End diff --
    
    can we read this table in case-sensitive mode?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95428/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Seems fine to me too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @dongjoon-hyun Sorry for the late response; the description has been changed to:
    
    > Although the filter "ID < 100L" is generated by Spark, it actually fails to be pushed down into Parquet, and Spark still does a full table scan when reading.
    > This PR provides a case-insensitive field resolution to make it work.
    
    Let me know if you have any suggestions :).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r213551202
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +356,38 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns a map, which contains parquet field name and data type, if predicate push down applies.
        */
    -  private def getFieldMap(dataType: MessageType): Map[String, ParquetSchemaType] = dataType match {
    -    case m: MessageType =>
    -      // Here we don't flatten the fields in the nested schema but just look up through
    -      // root fields. Currently, accessing to nested fields does not push down filters
    -      // and it does not support to create filters for them.
    -      m.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    -        f.getName -> ParquetSchemaType(
    -          f.getOriginalType, f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata)
    -      }.toMap
    -    case _ => Map.empty[String, ParquetSchemaType]
    +  private def getFieldMap(dataType: MessageType): Map[String, ParquetField] = {
    +    // Here we don't flatten the fields in the nested schema but just look up through
    +    // root fields. Currently, accessing to nested fields does not push down filters
    +    // and it does not support to create filters for them.
    +    val primitiveFields =
    +      dataType.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    +        f.getName -> ParquetField(f.getName,
    +          ParquetSchemaType(f.getOriginalType,
    +            f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata))
    +      }
    +    if (caseSensitive) {
    +      primitiveFields.toMap
    +    } else {
    +      // Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
    +      // mode, just skip pushdown for these fields, they will trigger Exception when reading,
    +      // See: SPARK-25132.
    --- End diff --
    
    @cloud-fan, it is a great idea, thanks!
    I think it is not about "dedup"-ing before pushdown and pruning;
    maybe we should clip the Parquet schema before pushdown and pruning instead.
    If duplicated fields are detected, throw an exception;
    if not, pass the clipped Parquet schema to the Parquet lib via the Hadoop conf.
    ```
        catalystRequestedSchema = {
          val conf = context.getConfiguration
          val schemaString = conf.get(ParquetReadSupport.SPARK_ROW_REQUESTED_SCHEMA)
          assert(schemaString != null, "Parquet requested schema not set.")
          StructType.fromString(schemaString)
        }
    
        val caseSensitive = context.getConfiguration.getBoolean(SQLConf.CASE_SENSITIVE.key,
          SQLConf.CASE_SENSITIVE.defaultValue.get)
        val parquetRequestedSchema = ParquetReadSupport.clipParquetSchema(
          context.getFileSchema, catalystRequestedSchema, caseSensitive)
    ```
    I am trying this way, will update soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95410/testReport)** for PR 22197 at commit [`fe8cfce`](https://github.com/apache/spark/commit/fe8cfce5e856b33e72844692a017267f3fcd775a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @dongjoon-hyun In the **schema matched case** you listed, it is the expected behavior in current master.
    ```
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024)
    spark.range(1, 40 * 1024 * 1024, 1, 1).sortWithinPartitions("id").write.mode("overwrite").parquet("/tmp/t")
    sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/t'")
    
    // master and 2.3 have different plan for top limit (see below), that's why 28.4 MB are read in master
    sql("select * from t where id < 100L").show()
    ```
    This difference is probably introduced by #21573. @cloud-fan, current master reads more data than 2.3 for the top limit, as in https://github.com/apache/spark/pull/22197#issuecomment-416085556 ; is it a regression or not?
    
    Master:
    ![image](https://user-images.githubusercontent.com/2989575/44654328-e0639500-aa23-11e8-94c5-5ba8f9f87fd8.png)
    
    2.3 branch:
    ![image](https://user-images.githubusercontent.com/2989575/44654336-e9ecfd00-aa23-11e8-89ec-a1c0085e2c36.png)
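
    For comparing the two locally, checking the physical plan of the same query is enough, e.g. a quick sketch:
    
    ```
    // Prints the physical plan, which differs between master and 2.3 for the top limit.
    sql("select * from t where id < 100L").explain()
    ```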
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    I will treat the above case as acceptable and will add a duplicated field check for the parquet schema.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95258/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95455/testReport)** for PR 22197 at commit [`04b88c5`](https://github.com/apache/spark/commit/04b88c579afb975fb0d4c8a8c4e5ff54f3c27ce2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212813240
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,116 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    val caseSensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = true)
    +
    +    val caseInsensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = false)
    --- End diff --
    
    Good idea, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    ```
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name c1 in the table definition.
    	at org.apache.hadoop.hive.ql.metadata.Table.validateColumns(Table.java:952)
    	at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:216)
    	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:495)
    	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
    	... 88 more
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212797855
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +352,43 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns nameMap and typeMap based on different case sensitive mode, if predicate push
    --- End diff --
    
    instead of returning 2 maps, can we just add an `originalName` field to `ParquetSchemaType`?
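
    A sketch of that shape, illustrative only; as noted elsewhere in the thread, the extra field would break the exact type matches such as `ParquetIntegerType`:
    
    ```
    import org.apache.parquet.schema.{DecimalMetadata, OriginalType}
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

    // The original Parquet column name would ride along with the type information,
    // so a single map could serve both the type lookup and the name lookup.
    private case class ParquetSchemaType(
        originalType: OriginalType,
        primitiveTypeName: PrimitiveTypeName,
        length: Int,
        decimalMetadata: DecimalMetadata,
        originalName: String)
    ```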


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212798970
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +352,43 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns nameMap and typeMap based on different case sensitive mode, if predicate push
    --- End diff --
    
    Great idea!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95149/testReport)** for PR 22197 at commit [`2226eae`](https://github.com/apache/spark/commit/2226eaef2eab44c94d4ada0ea86e92c21eb5945d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212814249
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -44,7 +45,12 @@ private[parquet] class ParquetFilters(
         pushDownTimestamp: Boolean,
         pushDownDecimal: Boolean,
         pushDownStartWith: Boolean,
    -    pushDownInFilterThreshold: Int) {
    +    pushDownInFilterThreshold: Int,
    +    caseSensitive: Boolean) {
    +
    +  private case class ParquetField(
    --- End diff --
    
    Add a description of these two fields? It is confusing for a future code maintainer what `resolvedName` is.
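
    For example, something along these lines (a sketch using the `name`/`schema` naming that appears later in the thread; the wording is only illustrative):
    
    ```
    /**
     * Holds a single primitive Parquet column considered for pushdown.
     *
     * @param name   the column name exactly as it appears in the Parquet file,
     *               used to build the FilterPredicate
     * @param schema the Parquet type information used to pick the filter builder
     */
    private case class ParquetField(name: String, schema: ParquetSchemaType)
    ```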


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95456 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95456/testReport)** for PR 22197 at commit [`41a7b83`](https://github.com/apache/spark/commit/41a7b8385972ef504af7f88954a9e4ccecec6885).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95454/testReport)** for PR 22197 at commit [`04b88c5`](https://github.com/apache/spark/commit/04b88c579afb975fb0d4c8a8c4e5ff54f3c27ce2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95247 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95247/testReport)** for PR 22197 at commit [`c76189d`](https://github.com/apache/spark/commit/c76189d96fe7f67123ba6558b1d53680398697fd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95409/testReport)** for PR 22197 at commit [`512b181`](https://github.com/apache/spark/commit/512b181f61bc961731a466ba2e939b362a017401).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95517/testReport)** for PR 22197 at commit [`e0d6196`](https://github.com/apache/spark/commit/e0d61969b13bcfd9dfc95e2a013b14e111d2b832).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212797948
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +352,43 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns nameMap and typeMap based on different case sensitive mode, if predicate push
    --- End diff --
    
    +1 for avoiding returning two maps if possible.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @gatorsmile. I don't think we have this regression in the ORC data source.
    However, there was another JIRA report, [SPARK-25175](https://issues.apache.org/jira/browse/SPARK-25175), filed 5 days ago. It did not include much detail, so I'm still monitoring that issue.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95247/testReport)** for PR 22197 at commit [`c76189d`](https://github.com/apache/spark/commit/c76189d96fe7f67123ba6558b1d53680398697fd).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95456/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212814524
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +356,38 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns a map, which contains parquet field name and data type, if predicate push down applies.
        */
    -  private def getFieldMap(dataType: MessageType): Map[String, ParquetSchemaType] = dataType match {
    -    case m: MessageType =>
    -      // Here we don't flatten the fields in the nested schema but just look up through
    -      // root fields. Currently, accessing to nested fields does not push down filters
    -      // and it does not support to create filters for them.
    -      m.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    -        f.getName -> ParquetSchemaType(
    -          f.getOriginalType, f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata)
    -      }.toMap
    -    case _ => Map.empty[String, ParquetSchemaType]
    +  private def getFieldMap(dataType: MessageType): Map[String, ParquetField] = {
    +    // Here we don't flatten the fields in the nested schema but just look up through
    +    // root fields. Currently, accessing to nested fields does not push down filters
    +    // and it does not support to create filters for them.
    +    val primitiveFields =
    +      dataType.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    +        f.getName -> ParquetField(f.getName,
    +          ParquetSchemaType(f.getOriginalType,
    +            f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata))
    +      }
    +    if (caseSensitive) {
    +      primitiveFields.toMap
    +    } else {
    +      // Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
    +      // mode, just skip pushdown for these fields, they will trigger Exception when reading,
    +      // See: SPARK-25132.
    --- End diff --
    
    It is a good question!
    
    Consider the scenario below:
    1. The Parquet file has duplicate fields "a INT, A INT".
    2. The user wants to push down "A > 0".
    
    Without dedup, we might push down "a > 0" instead of "A > 0". Although that is wrong,
    it will still trigger an exception when the Parquet file is read, so with or without
    dedup we get the same result (a minimal sketch of the dedup idea is shown below).
    
    @cloud-fan, @gatorsmile any idea?
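    A minimal, hypothetical sketch of the dedup idea discussed above (plain Scala, not the actual code in this PR; the field names are made up): group the Parquet root field names case-insensitively and keep only the unambiguous ones, so a predicate on "A" is simply not pushed down when the file contains both "a" and "A".
    
    ```scala
    import java.util.Locale
    
    object DedupSketch {
      // Field names as they might appear in a Parquet footer (hypothetical).
      val parquetFieldNames = Seq("a", "A", "b")
    
      def main(args: Array[String]): Unit = {
        // Group case-insensitively; any group with more than one member is ambiguous
        // and is dropped, so a filter on "A" (or "a") is not pushed down at all.
        val pushable: Map[String, String] = parquetFieldNames
          .groupBy(_.toLowerCase(Locale.ROOT))
          .collect { case (lower, Seq(single)) => lower -> single }
    
        println(pushable) // Map(b -> b): "a"/"A" collide and are skipped
      }
    }
    ```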



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    I agree that moving `clipSchema` to the beginning is cleaner, but I feel it strays too far from our original target: making Parquet filter pushdown case-insensitive when Spark is case-insensitive.
    
    I'd expect a very small change: at the beginning, go through the Parquet schema and find duplicated field names (a sketch of that check follows below). We can do the refactoring later to keep this PR small.
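    For reference, a rough sketch of what "find duplicated field names at the beginning" could look like; it is not the code in this PR, and it only assumes parquet-mr's `MessageType`, `getFields`, and `getName`, which `getFieldMap` above already uses:
    
    ```scala
    import java.util.Locale
    import scala.collection.JavaConverters._
    import org.apache.parquet.schema.MessageType
    
    // Return the lower-cased root-level field names that collide case-insensitively.
    // Pushdown could then skip any filter that resolves to one of these names.
    def duplicatedFieldNames(fileSchema: MessageType): Set[String] =
      fileSchema.getFields.asScala
        .map(_.getName)
        .groupBy(_.toLowerCase(Locale.ROOT))
        .filter { case (_, names) => names.size > 1 }
        .keySet
        .toSet
    ```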


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95264/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @cloud-fan, the tests have passed, and I will open a follow-up PR to make it cleaner.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95253/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95449/testReport)** for PR 22197 at commit [`cb03fb7`](https://github.com/apache/spark/commit/cb03fb737f1e556792c9284cd5102148a91fa483).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r213544078
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -350,25 +356,38 @@ private[parquet] class ParquetFilters(
       }
     
       /**
    -   * Returns a map from name of the column to the data type, if predicate push down applies.
    +   * Returns a map, which contains parquet field name and data type, if predicate push down applies.
        */
    -  private def getFieldMap(dataType: MessageType): Map[String, ParquetSchemaType] = dataType match {
    -    case m: MessageType =>
    -      // Here we don't flatten the fields in the nested schema but just look up through
    -      // root fields. Currently, accessing to nested fields does not push down filters
    -      // and it does not support to create filters for them.
    -      m.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    -        f.getName -> ParquetSchemaType(
    -          f.getOriginalType, f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata)
    -      }.toMap
    -    case _ => Map.empty[String, ParquetSchemaType]
    +  private def getFieldMap(dataType: MessageType): Map[String, ParquetField] = {
    +    // Here we don't flatten the fields in the nested schema but just look up through
    +    // root fields. Currently, accessing to nested fields does not push down filters
    +    // and it does not support to create filters for them.
    +    val primitiveFields =
    +      dataType.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    +        f.getName -> ParquetField(f.getName,
    +          ParquetSchemaType(f.getOriginalType,
    +            f.getPrimitiveTypeName, f.getTypeLength, f.getDecimalMetadata))
    +      }
    +    if (caseSensitive) {
    +      primitiveFields.toMap
    +    } else {
    +      // Don't consider ambiguity here, i.e. more than one field is matched in case insensitive
    +      // mode, just skip pushdown for these fields, they will trigger Exception when reading,
    +      // See: SPARK-25132.
    --- End diff --
    
    ping @yucai


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95410/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212812771
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -44,7 +45,12 @@ private[parquet] class ParquetFilters(
         pushDownTimestamp: Boolean,
         pushDownDecimal: Boolean,
         pushDownStartWith: Boolean,
    -    pushDownInFilterThreshold: Int) {
    +    pushDownInFilterThreshold: Int,
    +    caseSensitive: Boolean) {
    +
    +  private case class ParquetField(
    +      name: String,
    --- End diff --
    
    `resolvedName`? This `name` and the `name` in the schema look confusing in the following code.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Thanks, I got it. That is definitely unrelated to this PR; it is an intentional regression caused by that revert.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212797780
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1021,88 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25132: Case-insensitive field resolution for pushdown when reading parquet") {
    +    val caseSensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = true)
    +    val caseInsensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = false)
    +    def testCaseInsensitiveResolution(
    +        schema: StructType,
    +        expected: FilterPredicate,
    +        filter: sources.Filter): Unit = {
    +      val parquetSchema = new SparkToParquetSchemaConverter(conf).convert(schema)
    +
    +      assertResult(Some(expected)) {
    +        caseInsensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +      assertResult(None) {
    +        caseSensitiveParquetFilters.createFilter(parquetSchema, filter)
    +      }
    +    }
    +
    +    val schema = StructType(Seq(StructField("cint", IntegerType)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), null.asInstanceOf[Integer]), sources.IsNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), null.asInstanceOf[Integer]),
    +      sources.IsNotNull("CINT"))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualTo("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualTo("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.eq(intColumn("cint"), 1000: Integer), sources.EqualNullSafe("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.notEq(intColumn("cint"), 1000: Integer),
    +      sources.Not(sources.EqualNullSafe("CINT", 1000)))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.lt(intColumn("cint"), 1000: Integer), sources.LessThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.ltEq(intColumn("cint"), 1000: Integer),
    +      sources.LessThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema, FilterApi.gt(intColumn("cint"), 1000: Integer), sources.GreaterThan("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.gtEq(intColumn("cint"), 1000: Integer),
    +      sources.GreaterThanOrEqual("CINT", 1000))
    +
    +    testCaseInsensitiveResolution(
    +      schema,
    +      FilterApi.or(
    +        FilterApi.eq(intColumn("cint"), 10: Integer),
    +        FilterApi.eq(intColumn("cint"), 20: Integer)),
    +      sources.In("CINT", Array(10, 20)))
    +
    +    val dupFieldSchema = StructType(
    +      Seq(StructField("cint", IntegerType), StructField("cINT", IntegerType)))
    +    val dupParquetSchema = new SparkToParquetSchemaConverter(conf).convert(dupFieldSchema)
    +    assertResult(None) {
    +      caseInsensitiveParquetFilters.createFilter(
    +        dupParquetSchema, sources.EqualTo("CINT", 1000))
    +    }
    --- End diff --
    
    Can we add a negative test with duplicate names in case-insensitive mode, for example `cInt` and `CINT`, and check that it throws an exception? (A rough sketch is shown below.)
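    A rough sketch of the kind of negative test being asked for, assuming it lives in `ParquetFilterSuite` (so `spark`, `withTempPath`, `withSQLConf`, and `SQLConf` are already in scope); the exact exception type and message depend on the read path, so the sketch only checks that some `Exception` is thrown:
    
    ```scala
    test("duplicate Parquet fields should fail at read time in case-insensitive mode") {
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        // Write the file with case sensitivity on so that both `cInt` and `CINT` survive.
        withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
          spark.range(10).selectExpr("id AS cInt", "id AS CINT").write.parquet(path)
        }
        // Read it back case-insensitively with a single lower-cased column; the ambiguity
        // should surface as an exception while reading (see SPARK-25132).
        withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
          intercept[Exception] {
            spark.read.schema("cint BIGINT").parquet(path).where("cint > 5").collect()
          }
        }
      }
    }
    ```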


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95517/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95524/testReport)** for PR 22197 at commit [`e0d6196`](https://github.com/apache/spark/commit/e0d61969b13bcfd9dfc95e2a013b14e111d2b832).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22197: [SPARK-25207][SQL] Case-insensitve field resoluti...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22197#discussion_r212812718
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -1021,6 +1022,116 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           }
         }
       }
    +
    +  test("SPARK-25207: Case-insensitive field resolution for pushdown when reading parquet") {
    +    val caseSensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = true)
    +
    +    val caseInsensitiveParquetFilters =
    +      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
    +        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
    +        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = false)
    --- End diff --
    
    nit: add a method like:
    
    ```scala
    def createParquetFilter(caseSensitive: Boolean) = {
      new ParquetFilters(conf.parquetFilterPushDownDate, conf.parquetFilterPushDownTimestamp,
        conf.parquetFilterPushDownDecimal, conf.parquetFilterPushDownStringStartWith,
        conf.parquetFilterPushDownInFilterThreshold, caseSensitive = caseSensitive)
    }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95257/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    @gatorsmile I can help check `spark.sql.caseSensitive` for all the built-in data sources.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    **[Test build #95462 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95462/testReport)** for PR 22197 at commit [`41a7b83`](https://github.com/apache/spark/commit/41a7b8385972ef504af7f88954a9e4ccecec6885).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22197
  
    One minor comment that can be addressed in a follow-up PR. LGTM.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org