Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/28 04:06:23 UTC

[GitHub] [hudi] neerajpadarthi opened a new issue, #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

neerajpadarthi opened a new issue, #6232:
URL: https://github.com/apache/hudi/issues/6232

   Hi Team,
   
   Using the configs below, I see that Hudi is truncating the fractional-second precision of timestamp columns while ingesting data. We are currently on 0.9V and have observed this issue with that version, but the same write works on 0.11V.
   
   Do I need to add any other configuration to make this work on 0.9V without migrating to 0.11V? Any help on how to avoid this issue would be greatly appreciated.
   
   Configs
   
   db_name = 'tst_db'
   tableName = 'tst_tb'
   pk = 'id'
   de_dup = 'last_updated'
   commonConfig = {
       'hoodie.datasource.hive_sync.database': db_name,
       'hoodie.table.name': tableName,
       'hoodie.datasource.hive_sync.support_timestamp': 'true',
       'hoodie.datasource.write.recordkey.field': pk,
       'hoodie.datasource.write.precombine.field': de_dup,
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.table': tableName
   }
   nonPartitionConfig = {
       'hoodie.datasource.hive_sync.partition_extractor_class':
           'org.apache.hudi.hive.NonPartitionedExtractor',
       'hoodie.datasource.write.keygenerator.class':
           'org.apache.hudi.keygen.NonpartitionedKeyGenerator'
   }
   config = {
       'hoodie.bulkinsert.shuffle.parallelism': 10,
       'hoodie.datasource.write.operation': 'bulk_insert'
   }
   S3Location = 's3://<>/hudi/tst_tb'
   combinedConf = {**commonConfig, **nonPartitionConfig, **config}
   df.write.format('org.apache.hudi').options(
       **combinedConf).mode('overwrite').save(S3Location)
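   
   To verify what lands in the table, a minimal read-back sketch (assuming the same spark session and S3Location; older Hudi versions may need a glob path instead of the base path):
   
   # Read the Hudi table back and print full timestamp values
   # (truncate=False prevents show() from cutting off the columns).
   spark.read.format('org.apache.hudi') \
       .load(S3Location) \
       .show(truncate=False)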
   
   
   
    Environment Description
   
   EMR: emr-6.5.0
   Hudi version : 0.9
   Spark version : Spark 3.1.2
   Hive version : Hive 3.1.2
   Hadoop version :
   Storage (HDFS/S3/GCS..) : S3
   Running on Docker? (yes/no) : no
   
   
   
   Source Data
   
   +----------+--------------------------+--------------------------+
   |id        |creation_date             |last_updated              |
   +----------+--------------------------+--------------------------+
   |7cb15b859e|2021-11-07 08:48:25.000232|2021-11-08 08:50:35.000359|
   |60ab5da73a|2022-07-02 19:48:27.000891|2022-07-03 20:05:19.000364|
   |abb663a826|2015-07-12 15:35:14       |2015-08-01 15:38:07       |
   |c92aaeedc1|2021-05-10 16:47:10.000455|2021-05-30 16:49:29.00063 |
   +----------+--------------------------+--------------------------+
   
   Source Schema
   
   root
    |-- id: string (nullable = true)
    |-- creation_date: timestamp (nullable = true)
    |-- last_updated: timestamp (nullable = true)
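   
   For reproduction, a minimal sketch of how a source DataFrame like this can be built (the rows are illustrative, taken from the sample above):
   
   from pyspark.sql.functions import to_timestamp
   
   df = spark.createDataFrame(
       [('7cb15b859e', '2021-11-07 08:48:25.000232', '2021-11-08 08:50:35.000359'),
        ('abb663a826', '2015-07-12 15:35:14', '2015-08-01 15:38:07')],
       ['id', 'creation_date', 'last_updated'])
   # Cast the string columns to timestamps; the fractional seconds are kept
   # at microsecond precision.
   df = df.withColumn('creation_date', to_timestamp('creation_date')) \
          .withColumn('last_updated', to_timestamp('last_updated'))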
   
   
   Hudi 0.9V Output
   
   +-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+----------+-------------------+-------------------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                    |id        |creation_date      |last_updated       |
   +-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+----------+-------------------+-------------------+
   |20220728035114     |20220728035114_3_2  |c92aaeedc1        |                      |1736fb90-f6b2-4282-9c77-da2ace4bf0bd-0_3-10-80_20220728035114.parquet|c92aaeedc1|2021-05-10 16:47:10|2021-05-30 16:49:29|
   |20220728035114     |20220728035114_1_3  |7cb15b859e        |                      |d650a502-386e-47b9-81f3-e72cf64b0c0e-0_1-10-78_20220728035114.parquet|7cb15b859e|2021-11-07 08:48:25|2021-11-08 08:50:35|
   |20220728035114     |20220728035114_2_1  |abb663a826        |                      |941ca621-111e-47d9-8ca1-bdc943490371-0_2-10-79_20220728035114.parquet|abb663a826|2015-07-12 15:35:14|2015-08-01 15:38:07|
   |20220728035114     |20220728035114_0_1  |60ab5da73a        |                      |2d2fb872-7775-4b2d-bd28-93c289ae12c8-0_0-8-77_20220728035114.parquet |60ab5da73a|2022-07-02 19:48:27|2022-07-03 20:05:19|
   +-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+----------+-------------------+-------------------+
   
   Hudi 0.11V Output
   
   +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+----------+--------------------------+--------------------------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id        |creation_date             |last_updated              |
   +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+----------+--------------------------+--------------------------+
   |20220728035802662  |20220728035802662_0_1|1340225           |                      |38263eea-aa5d-4adf-b7f1-f11ebd2f9142-0_0-2522-0_20220728035802662.parquet|1340225   |2017-01-24 00:02:10       |2022-02-25 07:03:54.000853|
   |20220728035802662  |20220728035802662_0_2|53773de3-9        |                      |38263eea-aa5d-4adf-b7f1-f11ebd2f9142-0_0-2522-0_20220728035802662.parquet|53773de3-9|2022-02-25 07:21:06.000037|2022-02-25 08:35:57.000877|
   |20220728035802662  |20220728035802662_0_3|722b232f-e        |                      |38263eea-aa5d-4adf-b7f1-f11ebd2f9142-0_0-2522-0_20220728035802662.parquet|722b232f-e|2022-02-22 06:02:32.000481|2022-02-25 08:54:05.00042 |
   +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+----------+--------------------------+--------------------------+
   
   




[GitHub] [hudi] neerajpadarthi commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

neerajpadarthi commented on issue #6232:
URL: https://github.com/apache/hudi/issues/6232#issuecomment-1198656143

   Hey, thanks for checking. I had tried the first option earlier, but no luck. Also, I don't see that config listed for 0.9V (https://hudi.apache.org/docs/0.9.0/configurations).
   
   In detail:
   
   Passed only the hoodie.parquet.outputtimestamptype config - precision values are truncated.
   Passed only hoodie.datasource.write.row.writer.enable set to false - precision values are truncated.
   Passed both configs - the sub-second precision values are still truncated.
   




[GitHub] [hudi] YannByron commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

YannByron commented on issue #6232:
URL: https://github.com/apache/hudi/issues/6232#issuecomment-1198839746

   @neerajpadarthi 
   I guess you are using the Spark DataFrame API; maybe you can try setting `spark.sql.parquet.outputTimestampType` to `TIMESTAMP_MICROS` when creating the `SparkSession` object.
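   
   For reference, that setting looks like this when creating the session (a sketch using the standard PySpark builder):
   
   from pyspark.sql import SparkSession
   
   # Ask Spark to write parquet timestamps as INT64 TIMESTAMP_MICROS
   # instead of the default INT96.
   spark = SparkSession.builder \
       .config('spark.sql.parquet.outputTimestampType', 'TIMESTAMP_MICROS') \
       .getOrCreate()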




[GitHub] [hudi] yihua commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

yihua commented on issue #6232:
URL: https://github.com/apache/hudi/issues/6232#issuecomment-1198731643

   @YannByron do you have any suggestions, or is this something that cannot be fixed in 0.9.0?
   @neerajpadarthi is it possible for you to upgrade to Hudi 0.10.1 to pick up the `hoodie.parquet.outputtimestamptype` flag?




[GitHub] [hudi] nsivabalan commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

nsivabalan commented on issue #6232:
URL: https://github.com/apache/hudi/issues/6232#issuecomment-1244575528

   @YannByron: a gentle reminder to look at this issue when you get a chance.




[GitHub] [hudi] neerajpadarthi commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

neerajpadarthi commented on issue #6232:
URL: https://github.com/apache/hudi/issues/6232#issuecomment-1199790011

   @yihua - I will validate with 0.10.1.
   @YannByron - Thanks for checking. I have tested with the configs below passed to the Spark session, but I still see the same issue.
   "spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS"
   "spark.sql.parquet.writeLegacyFormat", "true"




[GitHub] [hudi] yihua commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

yihua commented on issue #6232:
URL: https://github.com/apache/hudi/issues/6232#issuecomment-1198560819

   @neerajpadarthi Based on your description, the issue could be related to the output timestamp type in the bulk insert row-writing path (#4552 #4749). Could you try each of the following?
   (1) Set `hoodie.parquet.outputtimestamptype` to `TIMESTAMP_MICROS` when you bulk insert data
   (2) Disable row writing by setting `hoodie.datasource.write.row.writer.enable` to `false`
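   
   For illustration, the two options expressed against the write configs from the issue description (a sketch assuming the same combinedConf and S3Location):
   
   # Option (1): force microsecond timestamps in the parquet files written
   # by the bulk insert row writer.
   combinedConf['hoodie.parquet.outputtimestamptype'] = 'TIMESTAMP_MICROS'
   # Option (2): fall back to the non-row-writing bulk insert path.
   # combinedConf['hoodie.datasource.write.row.writer.enable'] = 'false'
   
   df.write.format('org.apache.hudi').options(**combinedConf) \
       .mode('overwrite').save(S3Location)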




[GitHub] [hudi] neerajpadarthi commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

neerajpadarthi commented on issue #6232:
URL: https://github.com/apache/hudi/issues/6232#issuecomment-1199914603

   @yihua 
   
   Hey, I have verified the same on Hudi 0.10.1, but no luck; the precision is still getting truncated. Below are the configs, Spark session details, and Spark/Hudi outputs. Could you please verify and let me know if anything is missing here? Thanks.
   
   ===Environment Details
   
   EMR: emr-6.6.0
   Hudi version : 0.10.1
   Spark version : Spark 3.2.0
   Hive version : Hive 3.1.2
   Hadoop version :
   Storage (HDFS/S3/GCS..) : S3
   Running on Docker? (yes/no) : no
   
   ===Spark Configs
   
   def create_spark_session():
       spark = SparkSession \
           .builder \
           .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
           .config("spark.sql.parquet.writeLegacyFormat", "true") \
           .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") \
           .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") \
           .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY") \
           .enableHiveSupport() \
           .getOrCreate()
       return spark
   
   ===Hudi Configs
   
   db_name = <>
   tableName = <>
   pk = <>
   de_dup = <>
   commonConfig = {
       'hoodie.datasource.hive_sync.database': db_name, 'hoodie.table.name': tableName,
       'hoodie.datasource.hive_sync.support_timestamp': 'true', 'hoodie.datasource.write.recordkey.field': pk,
       'hoodie.datasource.write.precombine.field': de_dup, 'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.table': tableName}
   nonPartitionConfig = {
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator'}
   config = {
       'hoodie.bulkinsert.shuffle.parallelism': 10, 'hoodie.datasource.write.operation': 'bulk_insert',
       'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS'
       # 'hoodie.datasource.write.row.writer.enable': 'false'  # also tried with this uncommented
   }
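   
   (The write call itself is not shown above; for completeness, a sketch assuming the same merged-dict pattern and S3Location as in the issue description:)
   
   combinedConf = {**commonConfig, **nonPartitionConfig, **config}
   df.write.format('org.apache.hudi').options(**combinedConf) \
       .mode('overwrite').save(S3Location)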
   
   ===Spark DF Output
   +----------+--------------------------+--------------------------+
   |id        |creation_date             |last_updated              |
   +----------+--------------------------+--------------------------+
   |1340225   |2017-01-24 00:02:10       |2022-02-25 07:03:54.000853|
   |722b232f-e|2022-02-22 06:02:32.000481|2022-02-25 08:54:05.00042 |
   |53773de3-9|2022-02-25 07:21:06.000037|2022-02-25 08:35:57.000877|
   +----------+--------------------------+--------------------------+
   
   ===Hudi V0.10.1 Output
   +-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+----------+-------------------+-------------------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                       |id        |creation_date      |last_updated       |
   +-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+----------+-------------------+-------------------+
   |20220729201157281  |20220729201157281_1_2|53773de3-9        |                      |55f7c820-c289-4eb7-aabc-4f079bd44536-0_1-11-10_20220729201157281.parquet|53773de3-9|2022-02-25 07:21:06|2022-02-25 08:35:57|
   |20220729201157281  |20220729201157281_2_3|722b232f-e        |                      |0dd8d6c2-9d64-40d7-a4db-bf7cf95bd02c-0_2-11-11_20220729201157281.parquet|722b232f-e|2022-02-22 06:02:32|2022-02-25 08:54:05|
   |20220729201157281  |20220729201157281_0_1|1340225           |                      |2e0cf27b-999d-4d5e-9c4e-52d27c25294e-0_0-9-9_20220729201157281.parquet  |1340225   |2017-01-24 00:02:10|2022-02-25 07:03:54|
   +-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+----------+-------------------+-------------------+




[GitHub] [hudi] Gatsby-Lee commented on issue #6232: [SUPPORT] Hudi V0.9 truncating second precision for timestamp columns

Posted by "Gatsby-Lee (via GitHub)" <gi...@apache.org>.
Gatsby-Lee commented on issue #6232:
URL: https://github.com/apache/hudi/issues/6232#issuecomment-1424657330

   Hmm, interesting.
   I am on Glue 3 + Hudi 0.10.1.
   
   While doing a bulk_insert from Hudi Table A to Hudi Table B to change the recordKey, I encountered a timestamp-related issue.
   
   Hudi Table A
   - timestamp datatype in the table schema in the AWS Glue Catalog
   - Parquet meta shows `OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1`
   
   Hudi Table B
   - timestamp datatype in the table schema in the AWS Glue Catalog
   - Parquet meta shows `OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1`
   - The data is queryable and looks valid. HOWEVER, when a new record comes in with MICROS precision into a file whose Parquet meta says `TIMESTAMP(MILLIS)`, things become super weird.
   
   The fix for me was setting `hoodie.parquet.outputtimestamptype: TIMESTAMP_MICROS` (see the sketch below).
   
   With that, I can see the sub-second precision as well, e.g. `2023-02-09 05:01:44.626`.
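   
   For anyone hitting the same mismatch, a sketch of that fix in a PySpark write (the table name and S3 path are hypothetical):
   
   hudi_options = {
       'hoodie.table.name': 'table_b',  # hypothetical table name
       'hoodie.datasource.write.operation': 'bulk_insert',
       # Keep parquet timestamps at microsecond precision so new files match
       # the TIMESTAMP(MICROS) logical type of the source table.
       'hoodie.parquet.outputtimestamptype': 'TIMESTAMP_MICROS',
   }
   df.write.format('hudi').options(**hudi_options) \
       .mode('append').save('s3://bucket/path/table_b')  # hypothetical path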
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org