Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/07 11:04:01 UTC

[GitHub] [hudi] mtami opened a new issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

mtami opened a new issue #3429:
URL: https://github.com/apache/hudi/issues/3429


   
   I am trying to upsert data into S3 that contains timestamps with microsecond precision.
   However, Hudi truncates the microseconds part when writing to the Parquet file.
   I set `hoodie.datasource.hive_sync.support_timestamp=true`, but still can't get microsecond precision.
   I used parquet-tools to inspect the data before and after ingestion.
   
   **Expected behavior**
   
   A timestamp field with microseconds precision.
   
   **Environment Description**
   
   * Hudi version : 0.80
   
   * Spark version :  2.4.3
   
   * Hadoop version : 2.8
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] mtami commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
mtami commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-894764461


   > Hudi uses an Avro logical type with millisecond precision (Spark default), so the two differ in precision by 3 digits
   
   @cdmikechen 
   How can we achieve microsecond precision?
   
   I tried to override the Spark 2.4.0 TimestampType to use microseconds, without success.
   However, if I use Spark (without Hudi) to ingest the data, the output still has microseconds.
   
   So, what am I missing here?





[GitHub] [hudi] cdmikechen commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-1010597333


   @nsivabalan I've noticed that a related PR was recently submitted for Flink: [hudi-flink support timestamp-micros](https://github.com/apache/hudi/pull/4548).
   Should we move that change to the common package level in Hudi to solve the problem uniformly?





[GitHub] [hudi] mtami commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
mtami commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-998512708


   Hi @nsivabalan 
   
   It's a timestamp string; I cast it to a timestamp.
   
   `input_df = input_df.withColumn('updated', f.to_timestamp(f.col('updated')))`
   





[GitHub] [hudi] cdmikechen edited a comment on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
cdmikechen edited a comment on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-894738466


   Hudi uses an Avro logical type with millisecond precision (Spark default), so the two differ in precision by 3 digits
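
To make the "differ in precision by 3" point concrete, here is a minimal sketch in plain Python (no Spark or Hudi involved). It round-trips the timestamp from this thread through an epoch-milliseconds value and an epoch-microseconds value. The helper names `to_epoch` and `from_epoch` are made up for illustration and are not Hudi or Avro APIs; the sketch only shows the arithmetic, not what Hudi does internally.

```python
from datetime import datetime, timezone


def to_epoch(ts: datetime, units_per_second: int) -> int:
    """Return whole units since the Unix epoch, discarding any finer fraction."""
    whole_seconds = int(ts.replace(microsecond=0).timestamp())
    fraction = ts.microsecond * units_per_second // 1_000_000
    return whole_seconds * units_per_second + fraction


def from_epoch(value: int, units_per_second: int) -> datetime:
    """Rebuild a UTC datetime from an epoch value stored in the given unit."""
    seconds, fraction = divmod(value, units_per_second)
    micros = fraction * 1_000_000 // units_per_second
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


original = datetime(2015, 1, 1, 13, 51, 39, 345397, tzinfo=timezone.utc)

# Stored as milliseconds: the last three fractional digits are lost.
as_millis = from_epoch(to_epoch(original, 1_000), 1_000)
# Stored as microseconds: the value survives unchanged.
as_micros = from_epoch(to_epoch(original, 1_000_000), 1_000_000)

print(as_millis.isoformat())
print(as_micros.isoformat())
```

Once the value has been written as milliseconds, the dropped digits cannot be recovered on read, which matches the before/after screenshots in this issue.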





[GitHub] [hudi] nsivabalan commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-999413753


   Got it. I was able to repro. 
   @cdmikechen : Do you know if there is a way we can work around this?





[GitHub] [hudi] vinothchandar edited a comment on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
vinothchandar edited a comment on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-895684669


   @mtami Could you try writing with the row writer enabled and the bulk_insert operation (it does not work with upsert atm) and see if this issue goes away? 
   
   http://hudi.apache.org/docs/configurations#enable_row_writer_opt_key 
   
   Also, could you share a snippet I can use to reproduce this locally?
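
For readers following along, a rough sketch of what that suggestion looks like as PySpark write options. The keys `hoodie.datasource.write.operation` and `hoodie.datasource.write.row.writer.enable` (the string behind `ENABLE_ROW_WRITER_OPT_KEY`) are Hudi datasource settings; the helper names, the table name `events`, and the key column `id` are assumptions for illustration only. The `updated` column is the one from this issue.

```python
def with_row_writer(base_options: dict) -> dict:
    """Return a copy of the options with bulk_insert and the row writer enabled."""
    options = dict(base_options)
    options["hoodie.datasource.write.operation"] = "bulk_insert"
    options["hoodie.datasource.write.row.writer.enable"] = "true"
    return options


def bulk_insert(df, base_path: str, options: dict) -> None:
    """Write a Spark DataFrame with Hudi; `df` must come from a Spark session
    that has the Hudi bundle on its classpath."""
    df.write.format("hudi").options(**options).mode("overwrite").save(base_path)


# Hypothetical table name and record key; "updated" is the column from this issue.
base_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated",
}
row_writer_options = with_row_writer(base_options)
print(row_writer_options)
```

Whether this path preserves microseconds depends on the Hudi version in use, so treat it as an experiment rather than a fix.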
   





[GitHub] [hudi] cdmikechen edited a comment on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
cdmikechen edited a comment on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-1001179764


   @nsivabalan 
   You can replace `Timestamp.valueOf("2015-01-01T13:51:39.345397Z")` with `Timestamp.valueOf("2015-01-01 13:51:39.345397")`
   
   The problem may be here: https://github.com/apache/hudi/blob/c81df99e50f2df84d85f08ff3a839595dad974d7/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L123-L139
   
   I think we may need to add a new configuration to support this feature (microsecond precision)
   





[GitHub] [hudi] mtami commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
mtami commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-895822058


   Hi @vinothchandar 
   Thank you for taking time to help.
   I tried `ENABLE_ROW_WRITER_OPT_KEY=true` with bulk_insert, without success.
   
   Here is my snippet if you want to reproduce locally:
   
   [https://gist.github.com/mtami/cfd100ee738d1e50090dddb427be2477](https://gist.github.com/mtami/cfd100ee738d1e50090dddb427be2477)





[GitHub] [hudi] nsivabalan commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-998442784


   @mtami : May I know what the datatype of "updated" is here? 
   I tried to cast it as a timestamp, but it fails. 
   This is in the Spark shell, though. 
   
   ```
   import spark.implicits._
   import spark.implicits._
   scala> val df = Seq(
        |   ("row1", 1, Timestamp.valueOf("2015-01-01T13:51:39.345397Z")),
        |   ("row2", 1, Timestamp.valueOf("2015-01-01T12:14:58.597216Z"))
        | ).toDF("row", "preComb","eventTime")
   java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
     at java.sql.Timestamp.valueOf(Timestamp.java:204)
     ... 66 elided
   
   ```
   I will give it a try with PySpark. I just wanted to see if I could repro in the Spark shell. 





[GitHub] [hudi] nsivabalan commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-1010101330


   @cdmikechen : thanks for the pointer. 
   
   I have filed a tracking ticket [here](https://issues.apache.org/jira/browse/HUDI-3216). One of the devs from the community will look into putting in a fix. 





[GitHub] [hudi] nsivabalan closed issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3429:
URL: https://github.com/apache/hudi/issues/3429


   





[GitHub] [hudi] cdmikechen commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-894738466


   Hudi uses an Avro logical type with millisecond precision, so the two differ in precision by 3 digits





[GitHub] [hudi] vinothchandar commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-895684669


   @mtami Could you try writing with the row writer enabled and the bulk_insert operation and see if this issue goes away? 
   
   http://hudi.apache.org/docs/configurations#enable_row_writer_opt_key 
   
   Also, could you share a snippet I can use to reproduce this locally?
   





[GitHub] [hudi] nsivabalan commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-1031435160


   @mtami : Can you try setting "hoodie.parquet.outputtimestamptype" to "TIMESTAMP_MICROS" and let us know if things work as expected? 
   CC @YannByron 
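
A minimal sketch of how that setting could be passed from PySpark, assuming an upsert like the one in the original report. `hoodie.parquet.outputtimestamptype` and `hoodie.datasource.hive_sync.support_timestamp` are the keys named in this thread; the function names, the table name `events`, and the record key `id` are hypothetical. The thread does not record whether this resolved the issue, so this only shows how one might test it.

```python
def hudi_write_options(table_name: str, record_key: str, precombine_field: str) -> dict:
    """Build Hudi write options that request microsecond timestamps in Parquet."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
        # The setting suggested in the comment above.
        "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MICROS",
        # Already set by the reporter in the original issue.
        "hoodie.datasource.hive_sync.support_timestamp": "true",
    }


def upsert(df, base_path: str, options: dict) -> None:
    """Upsert a Spark DataFrame into a Hudi table; requires the Hudi bundle
    on the Spark classpath."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical table name and record key; "updated" is the column from this issue.
options = hudi_write_options("events", "id", "updated")
print(options["hoodie.parquet.outputtimestamptype"])
```

After writing, inspecting the output file with parquet-tools, as in the original report, should show whether the `updated` column kept its microseconds.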





[GitHub] [hudi] mtami commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
mtami commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-894640254


   input data (with microseconds precision in `updated` field):
   <img width="383" alt="Screen Shot 2021-08-07 at 2 00 18 PM" src="https://user-images.githubusercontent.com/15871409/128598081-8c4243cf-c7f0-47f5-bcfa-9cd5875feb01.png">
   
   
   output data (without microseconds precision in `updated` field):
   <img width="346" alt="Screen Shot 2021-08-07 at 1 41 19 PM" src="https://user-images.githubusercontent.com/15871409/128598086-16fd0710-e092-4972-975e-633e1ab0cff4.png">
   





[GitHub] [hudi] cdmikechen commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

Posted by GitBox <gi...@apache.org>.
cdmikechen commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-1001179764


   @nsivabalan 
   You can replace `Timestamp.valueOf("2015-01-01T13:51:39.345397Z")` with `Timestamp.valueOf("2015-01-01 13:51:39.345397")`
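
The replacement above is needed because `java.sql.Timestamp.valueOf` only accepts the `yyyy-mm-dd hh:mm:ss[.fffffffff]` form, as the exception earlier in the thread says. If the input strings arrive in ISO-8601 form, a small conversion along these lines could be used. This is a sketch in plain Python; `iso_to_jdbc_timestamp` is a made-up helper name, and it assumes the input always carries fractional seconds and a trailing `Z`.

```python
from datetime import datetime, timezone


def iso_to_jdbc_timestamp(text: str) -> str:
    """Convert e.g. '2015-01-01T13:51:39.345397Z' to '2015-01-01 13:51:39.345397'."""
    # Spell the UTC designator as a numeric offset so %z can parse it.
    parsed = datetime.strptime(text.replace("Z", "+0000"), "%Y-%m-%dT%H:%M:%S.%f%z")
    # Normalise to UTC, then drop the 'T' separator and the zone suffix.
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")


print(iso_to_jdbc_timestamp("2015-01-01T13:51:39.345397Z"))
```

Note that `Timestamp.valueOf` interprets its argument in the JVM's local time zone, so the zone information removed here is not carried over.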

