Posted to commits@hudi.apache.org by "Yann Byron (Jira)" <ji...@apache.org> on 2022/02/05 09:01:00 UTC

[jira] [Commented] (HUDI-2972) Support different Spark internal Timestamp and Date types

    [ https://issues.apache.org/jira/browse/HUDI-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487436#comment-17487436 ] 

Yann Byron commented on HUDI-2972:
----------------------------------

[~ryanpife] Can you retry with the Hudi master branch, which includes [HUDI-3125|https://github.com/apache/hudi/pull/4471]?

> Support different Spark internal Timestamp and Date types
> ---------------------------------------------------------
>
>                 Key: HUDI-2972
>                 URL: https://issues.apache.org/jira/browse/HUDI-2972
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark-sql
>            Reporter: Ryan Pifer
>            Priority: Critical
>
> In Spark 3, a configuration, {{spark.sql.datetime.java8API.enabled}}, was added that can change the internal Row type of Timestamp and Date values to *Instant* and *LocalDate*, respectively.
> https://issues.apache.org/jira/browse/SPARK-27008
> In Spark 3.1 this is enabled by default through spark-sql, which breaks writes that use Timestamps. It could also become the default across all of Spark in the future, at which point this would be a breaking issue.
> Right now AvroConversionHelper ([ref|https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L301-L304]) and SqlKeyGenerator ([ref|https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/SqlKeyGenerator.scala]) cannot handle these types properly.
> When partitioned by Timestamp
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Invalid format: "2021-05-07T00:00:00Z" is malformed at "T00:00:00Z"
>     at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
>     at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826)
>     at org.apache.spark.sql.hudi.command.SqlKeyGenerator.$anonfun$convertPartitionPathToSqlType$1(SqlKeyGenerator.scala:94)
>     at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>     at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at org.apache.spark.sql.hudi.command.SqlKeyGenerator.convertPartitionPathToSqlType(SqlKeyGenerator.scala:85)
>     at org.apache.spark.sql.hudi.command.SqlKeyGenerator.getPartitionPath(SqlKeyGenerator.scala:115)
>     at org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:777)
> {code}
> Inserts with type Timestamp
> {code:java}
> 21/10/21 18:14:17 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (ip-10-71-235-164.ec2.internal executor 20): java.lang.ClassCastException: java.time.Instant cannot be cast to java.sql.Timestamp
>     at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8(AvroConversionHelper.scala:304)
>     at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8$adapted(AvroConversionHelper.scala:304)
>     at scala.Option.map(Option.scala:230)
>     at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$7(AvroConversionHelper.scala:304)
>     at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
>     at org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
>  {code}
>  
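For illustration only, the ClassCastException above comes from casting a value that may be either a java.sql.Timestamp or a java.time.Instant. A minimal sketch of a converter that accepts both representations could look like the following. The class and method names (DateTimeValueSketch, toEpochMicros, toEpochDays) are hypothetical and are not part of Hudi or Spark; this is not the fix applied in HUDI-3125, and it assumes the target encoding is microseconds since the epoch for timestamps and days since the epoch for dates.

```java
import java.sql.Date;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDate;

// Hypothetical helper, not Hudi or Spark API: accepts both the legacy
// java.sql types and the java.time types that Spark hands out when
// spark.sql.datetime.java8API.enabled is true.
final class DateTimeValueSketch {

    private DateTimeValueSketch() {}

    // Converts a timestamp value to microseconds since 1970-01-01T00:00:00Z.
    static long toEpochMicros(Object value) {
        final Instant instant;
        if (value instanceof Instant) {
            instant = (Instant) value;
        } else if (value instanceof Timestamp) {
            instant = ((Timestamp) value).toInstant();
        } else {
            throw new IllegalArgumentException("Unsupported timestamp value: " + value);
        }
        // getNano() is always non-negative, so this also holds before the epoch.
        long micros = Math.multiplyExact(instant.getEpochSecond(), 1_000_000L);
        return Math.addExact(micros, instant.getNano() / 1_000L);
    }

    // Converts a date value to days since 1970-01-01.
    static int toEpochDays(Object value) {
        if (value instanceof LocalDate) {
            return Math.toIntExact(((LocalDate) value).toEpochDay());
        }
        if (value instanceof Date) {
            // java.sql.Date.toLocalDate() uses the JVM default time zone.
            return Math.toIntExact(((Date) value).toLocalDate().toEpochDay());
        }
        throw new IllegalArgumentException("Unsupported date value: " + value);
    }
}
```

With such a helper, DateTimeValueSketch.toEpochMicros(row.get(i)) would work regardless of the java8API setting, since the branch is chosen from the runtime type instead of an unconditional cast.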



--
This message was sent by Atlassian Jira
(v8.20.1#820001)