Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/29 19:19:28 UTC

[GitHub] [hudi] jasondavindev opened a new issue, #5469: [SUPPORT] Upsert overwrting ordering field with invalid value

jasondavindev opened a new issue, #5469:
URL: https://github.com/apache/hudi/issues/5469

   **Describe the problem you faced**
   
   I'm writing an application that upserts records into a table. After an upsert, the ordering (precombine) column of records that exist in the base table but not in the incoming data is overwritten with an invalid value.
   For example:
   The base table has a record with `id = 1` and `createddate = 2022-04-01`
   The incoming data has a record with `id = 2` and `createddate = 2022-04-02`
   
   After the upsert, the `createddate` of the record with `id = 1` is changed to `1970-xx-xx`, while the record with `id = 2` remains intact.
   
   **To Reproduce**
   ```python
   from pyspark.sql.functions import expr
   from pyspark.sql import DataFrame, SparkSession
   
   database = 'db'
   table = 'tb'
   table_path = f'/{database}/{table}'
   
   spark = SparkSession.builder.config(
       'spark.sql.shuffle.partitions', '4').enableHiveSupport().getOrCreate()
   
   options = {
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.partitionpath.field': 'field:simple',
       'hoodie.datasource.write.precombine.field': 'createddate',
       'hoodie.payload.event.time.field': 'createddate',
       'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
       'hoodie.table.name': table,
   
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.mode': 'hms',
       'hoodie.datasource.hive_sync.support_timestamp': 'true',
       'hoodie.datasource.hive_sync.database': database,
       'hoodie.datasource.hive_sync.table': table,
       'hoodie.datasource.hive_sync.partition_fields': 'field',
   
   }
   
   full = spark.read.parquet(
       '/opt/spark/conf/full/')
   delta = spark.read.json(
       '/opt/spark/conf/delta')
   
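   # Keep the first 19 characters ('yyyy-MM-dd HH:mm:ss') and cast them to a timestamp.
   full_parse: DataFrame = full \
       .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))
   
   # Apply the same normalization to the incoming delta records.
   delta_parse: DataFrame = delta \
       .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))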
   
   full_parse \
       .write \
       .format('org.apache.hudi') \
       .options(**options) \
       .option('hoodie.datasource.write.operation', 'bulk_insert') \
       .mode('overwrite') \
       .save(table_path)
   
   delta_parse \
       .write \
       .format('org.apache.hudi') \
       .options(**options) \
       .option('hoodie.datasource.write.operation', 'upsert') \
       .mode('append') \
       .save(table_path)
   ```
   
   Example row from the full file, before the upsert
   
   ```
   +------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
   |createdbyid       |createddate        |datatype   |field              |id                |isdeleted|newvalue        |oldvalue|parentid          |
   +------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
   |0055G00000808dFQAQ|2022-03-16 16:55:13|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false    |Visita Cancelada|null    |a015G00000kpbM3QAI|
   +------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
   ```
   
   The same row after the upsert operation
   
   ```
   +------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
   |createdbyid       |createddate            |datatype   |field              |id                |isdeleted|newvalue        |oldvalue              |parentid          |
   +------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
   |0055G00000808dFQAQ|1970-01-20 01:37:29.713|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false    |Visita Cancelada|null                  |a015G00000kpbM3QAI|
   +------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
   ```
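
A note on the corrupted value, offered as an observation and not as a root cause confirmed in this thread: `1970-01-20 01:37:29.713` is exactly what results when the epoch-seconds value of `2022-03-16 16:55:13` is read as milliseconds, which would be consistent with a time-unit mismatch somewhere in the write path. The sketch below checks that arithmetic in plain Python, assuming both timestamps are displayed in UTC; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# createddate of the sample record before the upsert (assumed to be UTC).
original = datetime(2022, 3, 16, 16, 55, 13, tzinfo=timezone.utc)

# Whole seconds elapsed since the Unix epoch.
epoch_seconds = (original - EPOCH) // timedelta(seconds=1)

# Read the same integer as milliseconds instead of seconds.
misread = EPOCH + timedelta(milliseconds=epoch_seconds)

print(epoch_seconds)  # 1647449713
print(misread.isoformat(sep=' ', timespec='milliseconds'))  # 1970-01-20 01:37:29.713+00:00
```

The match only shows that the two values differ by a factor of 1000 in their time unit; it does not identify which component applies the wrong unit.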
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 3.1.2
   
   * Storage (HDFS/S3/GCS..) : Local
   
   * Running on Docker? (yes/no) : Yes
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] jasondavindev commented on issue #5469: [SUPPORT] Upsert overwrting ordering field with invalid value

Posted by GitBox <gi...@apache.org>.
jasondavindev commented on issue #5469:
URL: https://github.com/apache/hudi/issues/5469#issuecomment-1114816469

   I tried version `0.10.1` and the error still occurs.
   With version `0.11.0`, released this weekend, no records with a wrong date were found.




[GitHub] [hudi] yihua commented on issue #5469: [SUPPORT] Upsert overwrting ordering field with invalid value

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #5469:
URL: https://github.com/apache/hudi/issues/5469#issuecomment-1113873141

   @jasondavindev Thanks for reporting this issue with detailed information. I'll try to reproduce it. Have you tried the latest master of Hudi to see if the problem still exists?




[GitHub] [hudi] yihua commented on issue #5469: [SUPPORT] Upsert overwrting ordering field with invalid value

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #5469:
URL: https://github.com/apache/hudi/issues/5469#issuecomment-1169423116

   @jasondavindev to clarify, do you still see issues with BULK_INSERT in 0.11.0?




[GitHub] [hudi] yihua commented on issue #5469: [SUPPORT] Upsert overwrting ordering field with invalid value

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #5469:
URL: https://github.com/apache/hudi/issues/5469#issuecomment-1115351450

   @jasondavindev Thanks for confirming. If this is solved by the Hudi 0.11.0 release and there is no other ask for this issue, feel free to close it.




[GitHub] [hudi] jasondavindev closed issue #5469: [SUPPORT] Upsert overwrting ordering field with invalid value

Posted by GitBox <gi...@apache.org>.
jasondavindev closed issue #5469: [SUPPORT] Upsert overwrting ordering field with invalid value
URL: https://github.com/apache/hudi/issues/5469




[GitHub] [hudi] jasondavindev commented on issue #5469: [SUPPORT] Upsert overwrting ordering field with invalid value

Posted by GitBox <gi...@apache.org>.
jasondavindev commented on issue #5469:
URL: https://github.com/apache/hudi/issues/5469#issuecomment-1119873863

   When the INSERT operation is used instead of BULK_INSERT, no records are affected by the bug. Thus the bug is in BULK_INSERT.
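
Based on that report, one possible workaround on affected versions is to do the initial load with `insert` instead of `bulk_insert`. The following is a minimal sketch, not a verified fix: `initial_load` is a hypothetical helper name, `df` is assumed to be a Spark DataFrame, and `options` is assumed to be the options dictionary from the reproduction script.

```python
def initial_load(df, options, table_path, operation='insert'):
    """Write the initial snapshot with the given Hudi write operation.

    Defaults to 'insert', which the reporter found does not trigger the
    corrupted ordering field, unlike 'bulk_insert'.
    """
    (df.write
       .format('org.apache.hudi')
       .options(**options)
       .option('hoodie.datasource.write.operation', operation)
       .mode('overwrite')
       .save(table_path))
```

With the names from the reproduction script, this would be called as `initial_load(full_parse, options, table_path)`, followed by the unchanged upsert of `delta_parse`.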

