Posted to dev@hudi.apache.org by lrz <36...@qq.com> on 2021/04/01 04:03:42 UTC

Discussion for timestamp support

Hi, I want to discuss support for the timestamp data type.
As we know, Hudi currently saves the timestamp type as long, which leads to several problems when a table includes a timestamp column:
1) At bootstrap, if the original parquet file was written by a Spark application, Spark will by default save timestamps as int96 (see spark.sql.parquet.int96AsTimestamp), and bootstrap will fail because Hudi cannot read the int96 type yet. (This can be solved by upgrading parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true, please check https://github.com/apache/parquet-mr/pull/831/files; see the config sketch at the end of this mail.)

2) After bootstrap, upsert will fail because we use the Hoodie schema to read the original parquet file. The schemas do not match, because the Hoodie schema treats timestamp as long while the original file stores it as int96.

3) After bootstrap, a partial update of a parquet file will fail, because we copy the old record and save it with the Hoodie schema (we are missing a convertFixedToLong operation like the one Spark does; see the conversion sketch at the end of this mail).

4) If we set hoodie.datasource.hive_sync.support_timestamp=true, we will hit a type-conversion exception when reading the realtime (rt) view, because we are missing a conversion from LongWritable to TimestampWritableV2 in HoodieRealtimeRecordReaderUtils (see the sketch at the end of this mail).

To solve these issues, we need to upgrade the parquet version and add some configs. Please help find a good solution, thank you very much!
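
For point 1, a minimal sketch of the knobs involved (assuming the parquet 1.12.0 upgrade; spark.sql.parquet.outputTimestampType and the Avro read flag from the PR above are the configs I know of, the exact wiring into Hudi is an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.sql.SparkSession;

    // Writer side: Spark stores timestamps as int96 unless told otherwise,
    // which is how the int96 data ends up in the bootstrap source files.
    SparkSession spark = SparkSession.builder()
        .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") // default is INT96
        .getOrCreate();

    // Reader side, after upgrading to parquet-avro 1.12.0: surface int96
    // values as a 12-byte fixed instead of failing outright.
    Configuration conf = new Configuration();
    conf.setBoolean("parquet.avro.readInt96AsFixed", true);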
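
For points 2 and 3, the missing piece is essentially the conversion Spark's parquet reader performs for int96: split the 12-byte fixed into nanos-of-day plus Julian day, and rebase onto the Unix epoch. A sketch of such a convertFixedToLong (the method name is taken from this mail; the byte layout is the standard Impala/Spark int96 encoding):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public final class Int96Utils {
      // Julian day number of the Unix epoch, 1970-01-01.
      private static final long JULIAN_DAY_OF_EPOCH = 2440588L;
      private static final long MICROS_PER_DAY = 86_400_000_000L;

      // int96 layout: 8 bytes nanos-of-day followed by 4 bytes Julian day,
      // both little-endian. Returns microseconds since the epoch.
      public static long convertFixedToLong(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        long julianDay = buf.getInt();
        return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY
            + nanosOfDay / 1_000L;
      }
    }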
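
For point 4, the fix would presumably be an extra branch in HoodieRealtimeRecordReaderUtils that wraps the long into Hive's writable. A rough sketch only, assuming Hive 3.x classes and that the long holds epoch microseconds (the helper name is hypothetical):

    import org.apache.hadoop.hive.common.type.Timestamp;
    import org.apache.hadoop.hive.serde2.io.TimestampWritableV2;
    import org.apache.hadoop.io.LongWritable;

    // Hypothetical helper: convert the LongWritable Hudi produces into the
    // TimestampWritableV2 Hive expects on the rt view when
    // hoodie.datasource.hive_sync.support_timestamp=true.
    static TimestampWritableV2 toTimestampWritable(LongWritable value) {
      long micros = value.get(); // assumption: epoch micros, as in Avro timestamp-micros
      // Note: ofEpochMilli truncates sub-millisecond precision; fine for a sketch.
      return new TimestampWritableV2(Timestamp.ofEpochMilli(micros / 1_000L));
    }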

Re: Discussion for timestamp support

Posted by Danny Chan <da...@apache.org>.
I think the read path is very much engine-specific, because Hoodie does not
define its own parquet reader yet; for example, the Flink reader can read
int96 as timestamp based on the declared precision.

Best,
Danny Chan
