Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/25 21:33:00 UTC

[GitHub] bersprockets commented on issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of partition columns if user specifies schema

URL: https://github.com/apache/spark/pull/23165#issuecomment-467192011
 
 
   Hi @gengliangwang @cloud-fan 
   
   I noticed this PR changed how mixed-case partition column names are handled when the user provides a schema.
   
   Say I have this file structure (note that each instance of ```pS``` is mixed case):
   <pre>
   bash-3.2$ find partitioned5 -type d
   partitioned5
   partitioned5/pi=2
   partitioned5/pi=2/pS=foo
   partitioned5/pi=2/pS=bar
   partitioned5/pi=1
   partitioned5/pi=1/pS=foo
   partitioned5/pi=1/pS=bar
   bash-3.2$
   </pre>
   If I load this directory with a user-provided schema in 2.4 (before this PR was committed) or 2.3, I see:
   <pre>
   
   scala> val df = spark.read.schema("intField int, pi int, ps string").parquet("partitioned5")
   df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
   scala> df.printSchema
   root
    |-- intField: integer (nullable = true)
    |-- pi: integer (nullable = true)
    |-- ps: string (nullable = true)
   scala>
   </pre>
   However, with this PR I see:
   <pre>
   scala> val df = spark.read.schema("intField int, pi int, ps string").parquet("partitioned5")
   df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
   scala> df.printSchema
   root
    |-- intField: integer (nullable = true)
    |-- pi: integer (nullable = true)
    |-- pS: string (nullable = true)
   scala>
   </pre>
   Spark is picking up the mixed-case column name ```pS``` from the directory name, not the lower-case ```ps``` from my specified schema.
   
   In all cases, ```spark.sql.caseSensitive``` is set to the default (false).
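   
   To make the difference concrete, here is a minimal sketch in plain Scala of the name resolution I would have expected. It is not Spark's actual code: `PartitionNameSketch` and `resolveNames` are hypothetical names, and the only assumption is that, when matching is case-insensitive, the user-specified field name should win over the name found in the directory path.
   
   ```scala
   // Hypothetical sketch, not Spark's implementation or API.
   object PartitionNameSketch {
     // For each partition column discovered from directory names, report the
     // matching user-specified field name if one exists; otherwise keep the
     // name exactly as it appears in the path.
     def resolveNames(
         userFields: Seq[String],
         pathColumns: Seq[String],
         caseSensitive: Boolean): Seq[String] = {
       def same(a: String, b: String): Boolean =
         if (caseSensitive) a == b else a.equalsIgnoreCase(b)
       pathColumns.map { p =>
         userFields.find(u => same(u, p)).getOrElse(p)
       }
     }
   
     def main(args: Array[String]): Unit = {
       val userFields = Seq("intField", "pi", "ps")
       val pathColumns = Seq("pi", "pS")
       // Case-insensitive matching: the schema's "ps" replaces the path's "pS".
       println(resolveNames(userFields, pathColumns, caseSensitive = false))
       // Case-sensitive matching: "pS" has no match in the schema and is kept.
       println(resolveNames(userFields, pathColumns, caseSensitive = true))
     }
   }
   ```
   
   With case-insensitive matching this yields `pi, ps`, which corresponds to the 2.3/2.4 output above; the output with this PR looks like the path name is used regardless.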
   
   Not sure if this is an issue, but it is a difference.
