Posted to issues@orc.apache.org by GitBox <gi...@apache.org> on 2022/08/29 03:18:19 UTC

[GitHub] [orc] dongjoon-hyun commented on issue #1237: The result is strange when casting `string` to `date` in ORC reading via spark.

dongjoon-hyun commented on issue #1237:
URL: https://github.com/apache/orc/issues/1237#issuecomment-1229705741

   Hi, @sinkinben. You are trying `Schema Evolution (Upcasting)`.
   
   Both the Apache Spark and Apache ORC communities recommend using the explicit SQL `CAST` syntax instead of depending on a data source's `Schema Evolution`. There are three reasons.
   
   - First of all, if you use the explicit `CAST` syntax, you will get the expected result.
   ```scala
   scala> sql("select cast('2022-01-32' as DATE)").show()
   +------------------------+
   |CAST(2022-01-32 AS DATE)|
   +------------------------+
   |                    null|
   +------------------------+
   
   
   scala> sql("select cast('9808-02-30' as DATE)").show()
   +------------------------+
   |CAST(9808-02-30 AS DATE)|
   +------------------------+
   |                    null|
   +------------------------+
   
   
   scala> sql("select cast('2022-06-31' as DATE)").show()
   +------------------------+
   |CAST(2022-06-31 AS DATE)|
   +------------------------+
   |                    null|
   +------------------------+
   ```
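
   The same idea applies when the date strings already live in a file: read the column with its original `string` type and cast it explicitly at the DataFrame level. A minimal sketch, reusing the `date_str` column and the `/tmp/df` path from the reproduction snippet at the end of this comment:
   ```scala
   // Read the column as a plain string, then cast explicitly; invalid
   // dates become null, matching the SQL CAST behavior shown above.
   // `/tmp/df` and `date_str` come from the snippet at the end of this comment.
   val raw = spark.read.format("orc").load("/tmp/df")
   raw.select(raw("date_str").cast("date").as("date_str")).show(false)
   ```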
   
   - Second, Spark provides many data sources like CSV/Avro/Parquet/ORC, and their schema evolution capabilities are heterogeneous. In other words, you cannot expect a consistent result when you change the file-based data source format: you will get different results from other data sources like Parquet (see the sketch after the excerpt below). FYI, the Apache Spark community has test coverage for that feature parity issue and has been tracking it.
    
   https://github.com/apache/spark/blob/146f187342140635b83bfe775b6c327755edfbe1/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala#L40-L49
   ```
    * The reader schema is said to be evolved (or projected) when it changed after the data is
    * written by writers. The followings are supported in file-based data sources.
    * Note that partition columns are not maintained in files. Here, `column` means non-partition
    * column.
    *
    *   1. Add a column
    *   2. Hide a column
    *   3. Change a column position
    *   4. Change a column type (Upcast)
    *
    * Here, we consider safe changes without data loss. For example, data type changes should be
    * from small types to larger types like `int`-to-`long`, not vice versa.
    *
    * So far, file-based data sources have the following coverages.
    *
    *   | File Format  | Coverage     | Note                                                   |
    *   | ------------ | ------------ | ------------------------------------------------------ |
    *   | TEXT         | N/A          | Schema consists of a single string column.             |
    *   | CSV          | 1, 2, 4      |                                                        |
    *   | JSON         | 1, 2, 3, 4   |                                                        |
    *   | ORC          | 1, 2, 3, 4   | Native vectorized ORC reader has the widest coverage.  |
    *   | PARQUET      | 1, 2, 3      |                                                        |
    *   | AVRO         | 1, 2, 3      |                                                        |
   ```
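
   For example, you can make the disparity concrete by writing the same rows in both formats and reading each back with an evolved `date` schema. A hedged sketch (the exact Parquet behavior depends on your Spark version and reader configuration; the paths `/tmp/cmp_orc` and `/tmp/cmp_parquet` are made up for illustration):
   ```scala
   // Write identical data as ORC and Parquet, then read both back with
   // the upcast `date` schema. The two formats do not behave the same:
   // one may return values while the other returns nulls or fails.
   val data = Seq(("", "2022-01-32"), ("", "9808-02-30"), ("", "2022-06-31"))
   val df = spark.createDataFrame(data).toDF("str", "date_str").repartition(1)
   df.write.format("orc").mode("overwrite").save("/tmp/cmp_orc")
   df.write.format("parquet").mode("overwrite").save("/tmp/cmp_parquet")

   spark.read.schema("date_str date").orc("/tmp/cmp_orc").show(false)
   try {
     spark.read.schema("date_str date").parquet("/tmp/cmp_parquet").show(false)
   } catch {
     case e: Exception => println(s"Parquet rejected the upcast: ${e.getMessage}")
   }
   ```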
   
   - Last but not least, Apache Spark has three ORC readers. For your use case, you can set `spark.sql.orc.impl=hive` to get a correct result if you really must depend on Apache ORC's Schema Evolution.
   ```
   scala> sql("set spark.sql.orc.impl=hive")
   
   scala> :paste
   // Entering paste mode (ctrl-D to finish)
   
   val data = Seq(
       ("", "2022-01-32"),  // pay attention: the native reader returns null here
       ("", "9808-02-30"),  // pay attention: the native reader returns 9808-02-29
       ("", "2022-06-31"),  // pay attention: the native reader returns 2022-06-30
   )
   val cols = Seq("str", "date_str")
   val df = spark.createDataFrame(data).toDF(cols:_*).repartition(1)
   df.write.format("orc").mode("overwrite").save("/tmp/df")
   spark.read.format("orc").schema("date_str date").load("/tmp/df").show(false)
   
   // Exiting paste mode, now interpreting.
   
   +--------+
   |date_str|
   +--------+
   |null    |
   |null    |
   |null    |
   +--------+
   ```
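
   For comparison, switching back to the default `native` reader and rerunning the same read reproduces the surprising results noted in the comments above (null, 9808-02-29, 2022-06-30):
   ```scala
   // `native` is the default value of spark.sql.orc.impl; the native
   // vectorized reader is what produced the strange results in this issue.
   sql("set spark.sql.orc.impl=native")
   spark.read.format("orc").schema("date_str date").load("/tmp/df").show(false)
   ```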

