Posted to commits@hudi.apache.org by "Istvan Darvas (Jira)" <ji...@apache.org> on 2022/02/23 12:25:00 UTC

[jira] [Comment Edited] (HUDI-3490) Timestamp conversion (parquet)

    [ https://issues.apache.org/jira/browse/HUDI-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496713#comment-17496713 ] 

Istvan Darvas edited comment on HUDI-3490 at 2/23/22, 12:24 PM:
----------------------------------------------------------------

DeltaStreamer from Kafka/Json => S3/Hudi table

config:

  hoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS

file based target schema:

{
 "name": "report_time",
  "type": {
  "type": "long",
  "logicalType": "timestamp-millis"
 }
},

---

Schema of the written parquet file, inspected with parquet-tools:

 
############ Column(receive_time) ############
name: receive_time
path: receive_time
max_definition_level: 0
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS

 

It seems the config is not respected: neither the hoodie conf nor the avro target schema takes effect.


was (Author: JIRAUSER282551):
DeltaStreamer from Kafka/Json => S3/Hudi table

config:

  hoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS

file based target schema:

{
 "name": "report_time",
  "type": {
  "type": "long",
  "logicalType": "timestamp-millis"
 }
},

---

sinked parquet file schema inspect: (parquet tools)

 

############ Column(receive_time) ############
name: receive_time
path: receive_time
max_definition_level: 0
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS

 

it seems it does not respect the config

> Timestamp conversion (parquet)
> ------------------------------
>
>                 Key: HUDI-3490
>                 URL: https://issues.apache.org/jira/browse/HUDI-3490
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Istvan Darvas
>            Priority: Major
>
> Hi Guys!
>  
> My Env is Hudi 0.8.0 AWS EMR 6.4
>  
> Timestamp conversion seems very confusing and inconsistent across the tools:
> 1.) The DeltaStreamer default seems to be TIMESTAMP_MILLIS
> 2.) The PySpark/HUDI API default seems to be TIMESTAMP_MICROS
>  
> But the real issue for me is that I cannot control this.
>  
> Neither in DeltaStreamer:
>  --hoodie-conf hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS
> Nor in PySpark:
> {"hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS"}
>  
> So I am not able to set a common default across systems. Of course I can convert it myself, and I will do so as a workaround, but it would be great to have this convenient feature.
>  
> One more suggestion / idea:
> I do not know whether it is possible, but maybe this parameter (hoodie.parquet.outputtimestamptype) could be removed everywhere, and the framework could use the high-level contract from the Spark framework, which is
>    spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS / TIMESTAMP_MICROS
>    the storage is INT96, which is not compatible with avro, but here I think you could do some automatic conversion which would be well documented :)
>  
> In summary, I am confused and unable to rely on the automatic timestamp conversion across the systems, so this should be standardized.
>  
> Thanks,
>  Darvi
>  
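
The manual conversion mentioned as a workaround in the description could look roughly like the following. This is only a minimal sketch, assuming the values are plain INT64 epoch timestamps in UTC; the function names are hypothetical and are not part of Hudi or Spark.

```python
from datetime import datetime, timezone

MICROS_PER_MILLI = 1_000


def micros_to_millis(ts_micros: int) -> int:
    """Convert epoch microseconds to epoch milliseconds (sub-millisecond part is dropped)."""
    return ts_micros // MICROS_PER_MILLI


def millis_to_micros(ts_millis: int) -> int:
    """Convert epoch milliseconds to epoch microseconds (lossless)."""
    return ts_millis * MICROS_PER_MILLI


def millis_to_utc(ts_millis: int) -> datetime:
    """Render epoch milliseconds as a timezone-aware UTC datetime for eyeballing values."""
    seconds, millis = divmod(ts_millis, 1_000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base.replace(microsecond=millis * 1_000)
```

Normalizing every writer to one unit before the data reaches the table sidesteps the per-tool defaults, at the cost of losing sub-millisecond precision when going from micros to millis.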



--
This message was sent by Atlassian Jira
(v8.20.1#820001)