Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/22 17:56:42 UTC

[GitHub] [hudi] vicuna96 opened a new issue, #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

vicuna96 opened a new issue, #5942:
URL: https://github.com/apache/hudi/issues/5942

   
   **Describe the problem you faced**
   
   Case 1.
   We are building a partial upsert pipeline on a table with a global index (GLOBAL_BLOOM). When we set HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "true", **the columns not updated by the partial update are dropped / nullified**.
   
   Case 2.
   As an alternative, we are exploring HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "false". In this case, the metadata column `_hoodie_partition_path` is not updated, but our partition field is.
   In our unit test, the record in question ends up with `_hoodie_partition_path: partitionField=2022-05-07` and `partitionField: 2022-05-08`. We are wondering whether this has any implications. For example, if any pruning is applied on `_hoodie_partition_path`, is a record with mismatched partition information prone to inconsistencies?
   
   **To Reproduce**
   Case 1.
   Define 
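   ```
   import java.sql.{Date, Timestamp}
   import org.apache.spark.sql.functions.{col, to_date}
   import org.apache.spark.sql.types.TimestampType
   import spark.implicits._ // assumes a SparkSession named `spark` is in scope

   case class TestHudiTable(keyField: String, stringField: String, numberField: Int, precombineField: Timestamp, partitionField: Date)

   // Column-name constants (assumed to match the case class field names)
   val (keyField, stringField, numberField, precombineField, partitionField) =
     ("keyField", "stringField", "numberField", "precombineField", "partitionField")
   val targetGlobalPartition = "2022-05-08"

   val insertRecords = Seq(
     TestHudiTable("key1", "value3", 55, Timestamp.valueOf("2022-05-07 08:00:00"), Date.valueOf("2022-05-07")),
     TestHudiTable("key2", "value4", 66, Timestamp.valueOf("2022-05-07 09:00:00"), Date.valueOf("2022-05-07")),
     TestHudiTable("key3", "value4", 77, Timestamp.valueOf("2022-05-07 10:00:00"), Date.valueOf("2022-05-07")))

   val insertDF = insertRecords.toDF(keyField, stringField, numberField, precombineField, partitionField)
     .withColumn(precombineField, col(precombineField).cast(TimestampType))
     .withColumn(partitionField, to_date(col(partitionField)))
   ```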
   Then run an initial insert of these records. Finally, test the partial upsert with the following records, using org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload as the payload class:
   ```
   val partialUpdates = Seq(
     ("key1", "value5", "2022-05-07T11:00:00", "2022-05-07"),
     ("key3", "value6", "2022-05-07T12:00:00", targetGlobalPartition))
     .toDF(keyField, stringField, precombineField, partitionField)
     .withColumn(precombineField, col(precombineField).cast(TimestampType))
     .withColumn(partitionField, to_date(col(partitionField)))
   ```
   
   Hence, we are testing a partial update that sets every column except numberField, which is absent from the incoming records and is therefore null.
   ```
   **Before partial update to records corresponding to key1 and key3.**
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |20220622111807610  |20220622111807610_0_3|keyField:key1     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key1    |value3     |55         |2022-05-07 08:00:00|2022-05-07    |
   |20220622111807610  |20220622111807610_0_4|keyField:key3     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key3    |value4     |77         |2022-05-07 10:00:00|2022-05-07    |
   |20220622111813282  |20220622111813282_0_4|keyField:key2     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-220-196_20220622111813282.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   
   **After partial update to key1 and key3, with the latter also updating the partition column.**
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |20220622111818415  |20220622111818415_1_6|keyField:key3     |partitionField=2022-05-08|bce85553-9df0-4d02-a62c-0b2b81e58969-0_1-273-239_20220622111818415.parquet|key3    |value6     |null       |2022-05-07 12:00:00|2022-05-08    |
   |20220622111818415  |20220622111818415_0_5|keyField:key1     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-273-238_20220622111818415.parquet|key1    |value5     |55         |2022-05-07 11:00:00|2022-05-07    |
   |20220622111813282  |20220622111813282_0_4|keyField:key2     |partitionField=2022-05-07|80700093-395f-4915-b39a-f97ba7527688-0_0-273-238_20220622111818415.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   
   ```
   We notice that the value for numberField is dropped upon the update.
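   
   The write options used for the upsert are not included in this report. As a rough sketch only, options consistent with the output above might look like the following. The table name is hypothetical, and the key generator and hive-style partitioning settings are assumptions inferred from the `keyField:key1` record keys and `partitionField=2022-05-07` paths rather than stated anywhere in the report.
   
   ```scala
   // Sketch of Hudi write options for Case 1; values marked "assumed" are not from the report.
   object Case1WriteOptions {
     val hudiOptions: Map[String, String] = Map(
       "hoodie.table.name" -> "test_hudi_table", // hypothetical table name
       "hoodie.datasource.write.operation" -> "upsert",
       "hoodie.datasource.write.recordkey.field" -> "keyField",
       "hoodie.datasource.write.precombine.field" -> "precombineField",
       "hoodie.datasource.write.partitionpath.field" -> "partitionField",
       // assumed: record keys like "keyField:key1" suggest the complex key generator
       "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
       // assumed: paths like "partitionField=2022-05-07" suggest hive-style partitioning
       "hoodie.datasource.write.hive_style_partitioning" -> "true",
       "hoodie.datasource.write.payload.class" ->
         "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload",
       "hoodie.index.type" -> "GLOBAL_BLOOM",
       // config key behind HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE
       "hoodie.bloom.index.update.partition.path" -> "true"
     )
   }
   ```
   
   Such a map would typically be passed to the Spark writer, e.g. `insertDF.write.format("hudi").options(Case1WriteOptions.hudiOptions).mode("append").save(basePath)`, where `basePath` is a placeholder for the table location.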
   
   Case 2.
   For the second case, the setup and procedure are exactly the same, but we instead set HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "false".
   
   All the fields are updated; in particular, numberField is not dropped. However, `partitionField` and `_hoodie_partition_path` become inconsistent for key3. The results before and after the update are shown below.
   
   ```
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |20220622114109089  |20220622114109089_0_3|keyField:key1     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-220-196_20220622114117004.parquet|key1    |value3     |55         |2022-05-07 08:00:00|2022-05-07    |
   |20220622114109089  |20220622114109089_0_4|keyField:key3     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-220-196_20220622114117004.parquet|key3    |value4     |77         |2022-05-07 10:00:00|2022-05-07    |
   |20220622114117004  |20220622114117004_0_4|keyField:key2     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-220-196_20220622114117004.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |20220622114126357  |20220622114126357_0_5|keyField:key1     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-273-237_20220622114126357.parquet|key1    |value5     |55         |2022-05-07 11:00:00|2022-05-07    |
   |20220622114126357  |20220622114126357_0_6|keyField:key3     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-273-237_20220622114126357.parquet|key3    |value6     |77         |2022-05-07 12:00:00|2022-05-08    |
   |20220622114117004  |20220622114117004_0_4|keyField:key2     |partitionField=2022-05-07|3166a763-28e3-4352-a9a5-5403e2599bc5-0_0-273-237_20220622114126357.parquet|key2    |value4     |66         |2022-05-07 09:00:00|2022-05-07    |
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   
   ```
   
   
   **Expected behavior**
   
   Case 1. 
   In the first case, we expect the partial update to overwrite any columns that are not null in the incremental load, and to leave untouched any columns that are null in the incremental load but not null in hdfs (numberField is nullified in this case).
   
   Case 2.
   In this case, we want to confirm that this is expected behavior. In particular, we expected `partitionField` not to be updated, so that it would remain consistent with `_hoodie_partition_path`. If this is indeed expected behavior, we would like to know how it may affect any pruning on reads from the dataset based on the partition column.
   
   Currently, we see that a read filtered on `partitionField` returns the updated record, but we are unsure whether any other pruning on the backend uses `_hoodie_partition_path` instead, which we should be aware of:
   
   ```
       readGlobalDataset().where(col(partitionField) === targetGlobalPartition).show(false)
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path   |_hoodie_file_name                                                         |keyField|stringField|numberField|precombineField    |partitionField|
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   |20220622100847473  |20220622100847473_0_6|keyField:key3     |partitionField=2022-05-07|1f801a9d-5284-4e89-be48-c64273a4af79-0_0-273-237_20220622100847473.parquet|key3    |value6     |77         |2022-05-07 12:00:00|2022-05-08    |
   +-------------------+---------------------+------------------+-------------------------+--------------------------------------------------------------------------+--------+-----------+-----------+-------------------+--------------+
   ```
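   
   To gauge how many records are affected by this inconsistency, one option is to compare the two columns directly. The sketch below does this in plain Scala on a hypothetical `TableRow` stand-in for a row read from the table. It assumes hive-style partition paths as seen in the output above, and every name in it is illustrative rather than part of Hudi.
   
   ```scala
   // Sketch: detect rows whose data partition column disagrees with the Hudi metadata path.
   object PartitionMismatch {
     // Hypothetical stand-in for a row read from the table (only the columns of interest).
     final case class TableRow(recordKey: String, hoodiePartitionPath: String, partitionField: String)

     // Extract the value from a hive-style path segment such as "partitionField=2022-05-07".
     def pathValue(hoodiePartitionPath: String, column: String): Option[String] = {
       val prefix = column + "="
       hoodiePartitionPath.split('/').collectFirst {
         case segment if segment.startsWith(prefix) => segment.drop(prefix.length)
       }
     }

     // A row is mismatched when the path value is missing or differs from the data column.
     def isMismatched(row: TableRow): Boolean =
       !pathValue(row.hoodiePartitionPath, "partitionField").contains(row.partitionField)

     def main(args: Array[String]): Unit = {
       val rows = Seq(
         TableRow("key1", "partitionField=2022-05-07", "2022-05-07"),
         TableRow("key3", "partitionField=2022-05-07", "2022-05-08")) // Case 2 state of key3
       rows.filter(isMismatched).foreach(r => println(s"mismatch: ${r.recordKey}"))
     }
   }
   ```
   
   With the two sample rows, only key3 is flagged. On a Spark DataFrame the same comparison could be written as a filter over `_hoodie_partition_path` and `partitionField`.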
   
   
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 2.4.8
   
   * Hive version : 2.3.7
   
   * Hadoop version : 2.10.1
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : No, running on Dataproc 1.5
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] vicuna96 commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by GitBox <gi...@apache.org>.
vicuna96 commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1164752309

   Case1: Hi @nsivabalan, yes, we mean the latter: an incoming record updates _col1_ and _col3_, but _col2_ was populated earlier by a different source, so there is a non-default value in hdfs. We want to update _col1_ and _col3_ from this source, but keep any existing value of _col2_ from hdfs, so that the record written is `incremental.col1, coalesce(incremental.col2, hdfs.col2), incremental.col3`. In the before and after dataframes for case 1, the record corresponding to key1 is correctly updated with the incoming columns, and its value for numberField is also correctly taken from hdfs (as 55).
   
   So it seems to us that this "partial update" is working when the record doesn't move from one partition to another, as evidenced by the update to record with _key1_. However, for _key3_ the partition column is to be updated, and that is where the non-default numberField from hdfs gets nullified (numberField goes from 77 to null for this record after the update).
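   
   The per-column rule described above can be sketched in plain Scala as follows. `Rec` and `merge` are hypothetical names used only for illustration, with `None` standing in for a null column. This shows the outcome requested for key3, not how Hudi's payload classes are implemented.
   
   ```scala
   // Sketch of the desired per-column merge: incoming value wins, stored value fills nulls.
   object PartialMerge {
     // Hypothetical record shape; None models a null column.
     final case class Rec(
         keyField: String,
         stringField: Option[String],
         numberField: Option[Int],
         partitionField: Option[String])

     // coalesce(incremental.col, hdfs.col) applied to every non-key column.
     def merge(incoming: Rec, stored: Rec): Rec = {
       require(incoming.keyField == stored.keyField, "records must share a key")
       Rec(
         keyField = incoming.keyField,
         stringField = incoming.stringField.orElse(stored.stringField),
         numberField = incoming.numberField.orElse(stored.numberField),
         partitionField = incoming.partitionField.orElse(stored.partitionField))
     }

     def main(args: Array[String]): Unit = {
       val stored = Rec("key3", Some("value4"), Some(77), Some("2022-05-07"))
       val incoming = Rec("key3", Some("value6"), None, Some("2022-05-08"))
       println(merge(incoming, stored))
     }
   }
   ```
   
   For key3 this keeps numberField at 77 while stringField and partitionField take the incoming values.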




[GitHub] [hudi] nsivabalan commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1164703872

   Case1: I don't think we support partial updates yet. We do have an active PR in review, but I am not sure whether partial updates are supported in official releases. I assume that by partial updates you mean the following: if the table schema has col1, col2, col3 and your incoming batch has just col1 and col3, do you wish to set some default for col2? Or do you want to fetch the value of col2 from the previous version of the record?
   
   




[GitHub] [hudi] rishabhbandi commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by GitBox <gi...@apache.org>.
rishabhbandi commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1176373448

   I am also facing an issue with partial updates in Hudi. @nsivabalan, how can I reach out to you? Please share your contact email.
   I have posted the issue on the Apache Hudi Slack channel: https://apache-hudi.slack.com/archives/C4D716NPQ/p1657111605627799




[GitHub] [hudi] xushiyan commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1164641054

   @minihippo can you help look into this please? 




[GitHub] [hudi] ankur334 commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by "ankur334 (via GitHub)" <gi...@apache.org>.
ankur334 commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1632162090

   
   
   Partial update on a table with a partition column/key is not working as expected.
   
   Suppose I currently have the following event/message:
   
   { 
    "id": 1, 
    "language": "python", 
    "created": "2023-07-12", 
    "updated": "2023-07-12"
   } 
   
   Here,
   
   primaryKey = id
   deDupKey/preCombine field = updated
   partition field = created
   
   
   and I am using UPSERT as the write operation type.
   
   Now I want to apply a partial update when I receive a record from my source system/producer.
   
   The new incoming event is as follows:
   
   { 
    "id": 1, 
    "language": "scala", 
    "updated": "2023-07-13"
   } 
   
   After the partial update, I want only the `language` and `updated` columns to change. But after applying the partial update, we get `null` in the `created` column.
   
   The expected event after the merge/partial update is:
   
   { 
    "id": 1, 
    "language": "scala", 
    "created": "2023-07-12", 
    "updated": "2023-07-13"
   } 
   
   but it comes out as
   { 
    "id": 1, 
    "language": "scala", 
    "created": null, 
    "updated": "2023-07-13"
   } 
   
   This is wrong. Could you please help us here? Are we doing something wrong?
   
   Environment Description
   
   Hudi version : 0.13.1
   
   Spark version : 3.1
   
   Hive version : 3.1
   
   Storage (HDFS/S3/GCS..) : GCS
   
   Running on Docker? (yes/no) : No, running on Dataproc




[GitHub] [hudi] nsivabalan commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1229350642

   Hey, sorry. I was on vacation during July and caught up with work in August. You can post in the general Slack channel and CC me (`shivnarayan`).
   




[GitHub] [hudi] xushiyan commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1296275785

   @nsivabalan could you share the latest update on the discussion please?




[GitHub] [hudi] vicuna96 commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by GitBox <gi...@apache.org>.
vicuna96 commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1164755188

   Case 2: Yes, we would prefer to use HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "true" as well. Nonetheless, as explained in the previous comment, we noticed that partial update was working when the record was not moved across partitions, so we were testing this as a workaround to have a working partial update, since we can still run incremental on this dataset even if it doesn't lie on the correct partition. However, we wanted to double check that this inconsistency wouldn't lead to incorrect pruning of the record.




[GitHub] [hudi] nsivabalan commented on issue #5942: [SUPPORT] Partial Update on Global Index with BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #5942:
URL: https://github.com/apache/hudi/issues/5942#issuecomment-1164705689

   Case2: You are right. Hudi tries not to update any of the user-supplied fields, hence the behavior. Can you help us understand the use case where HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key() -> "false" is required? We have seen most users in the community use "true" for this config, so we are interested in understanding your use case better.

