Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/27 07:46:30 UTC

[GitHub] [hudi] mandar-mw opened a new issue, #5442: HUDI does not deduplicate within the same partition

mandar-mw opened a new issue, #5442:
URL: https://github.com/apache/hudi/issues/5442

   Hudi does not seem to deduplicate records in some cases. Below is the configuration that we use. We partition the data by customer_id and our record key is [user_id, customer_id], so we expect Hudi to enforce uniqueness within each partition, i.e., each customer_id folder. However, some customer_id folders contain two parquet files, and when we query the data in these partitions we find duplicate user_id values within the same customer_id. The _hoodie_record_key is identical for the two duplicate records, but the _hoodie_file_name is different, which makes me suspect that Hudi is enforcing uniqueness within each individual parquet file rather than across the whole customer_id folder. Can someone explain this behavior?
   
   ```
     op: "INSERT"
     target-base-path: "s3_path"
     target-table: "some_table_name"
   
     source-ordering-field: "created_at"
     transformer-class: "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer"
   
     filter-dupes: ""
     hoodie_conf:
     # source table base path
     hoodie.deltastreamer.source.dfs.root: "s3_path"
   
     # record key, partition paths and keygenerator
     hoodie.datasource.write.recordkey.field: "user_id,customer_id"
     hoodie.datasource.write.partitionpath.field: "customer_id"
     hoodie.datasource.write.keygenerator.class: "org.apache.hudi.keygen.ComplexKeyGenerator"
   
     # hive sync properties
     hoodie.datasource.hive_sync.enable: true
     hoodie.datasource.hive_sync.table: "table_name"
     hoodie.datasource.hive_sync.database: "database_name"
     hoodie.datasource.hive_sync.partition_fields: "customer_id"
     hoodie.datasource.hive_sync.partition_extractor_class: "org.apache.hudi.hive.MultiPartKeysValueExtractor"
     hoodie.datasource.write.hive_style_partitioning: true
   
     # sql transformer
     hoodie.deltastreamer.transformer.sql: "SELECT user_id, customer_id, updated_at as created_at FROM <SRC> a"
   
     # since there is no dt partition, the following config from default has to be overridden
     hoodie.deltastreamer.source.dfs.datepartitioned.selector.depth: 0
   ```
   
   Here is an example of duplicate records
   
   ```
   _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | user_record_id | created_at | org
   -- | -- | -- | -- | -- | -- | -- | --
   20220316201026 | 20220316201026_95_35511 | user_id:<redacted>,customer_id:<redacted> | customer_id=<redacted> | 4a17e6ec-8f53-4a68-8878-6c8d6c4e2583-0_95-26-3087_20220316201026.parquet | <redacted> | 2020-03-24 05:03:53.016406+00 | <redacted>
   20220315225025 | 20220315225025_81_28979 | user_id:<redacted>,customer_id:<redacted> | customer_id=<redacted> | 52631482-5c9b-4f84-97c1-1e5ab232b1de-0_81-26-8091_20220315225025.parquet | <redacted> | 2022-03-15 15:32:29.325168 | <redacted>
   ```
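
The check described above (same `_hoodie_record_key`, same partition, different `_hoodie_file_name`) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `find_cross_file_duplicates` and the sample rows are hypothetical, and it assumes the query results have already been loaded as a list of dicts keyed by the Hudi metadata column names shown in the table.

```python
from collections import defaultdict


def find_cross_file_duplicates(rows):
    """Return {(partition_path, record_key): [file names]} for keys found in more than one file."""
    files_by_key = defaultdict(set)
    for row in rows:
        key = (row["_hoodie_partition_path"], row["_hoodie_record_key"])
        files_by_key[key].add(row["_hoodie_file_name"])
    # Keep only the keys that appear in two or more distinct files.
    return {key: sorted(files) for key, files in files_by_key.items() if len(files) > 1}


# Hypothetical sample rows standing in for the redacted query output.
rows = [
    {"_hoodie_partition_path": "customer_id=c1",
     "_hoodie_record_key": "user_id:u1,customer_id:c1",
     "_hoodie_file_name": "file-a.parquet"},
    {"_hoodie_partition_path": "customer_id=c1",
     "_hoodie_record_key": "user_id:u1,customer_id:c1",
     "_hoodie_file_name": "file-b.parquet"},
    {"_hoodie_partition_path": "customer_id=c1",
     "_hoodie_record_key": "user_id:u2,customer_id:c1",
     "_hoodie_file_name": "file-a.parquet"},
]

duplicates = find_cross_file_duplicates(rows)
print(duplicates)
```

Any key reported here is stored in more than one file of the same partition, which is the symptom described in this issue.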


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] mandar-mw commented on issue #5442: HUDI does not deduplicate within the same partition

Posted by GitBox <gi...@apache.org>.
mandar-mw commented on issue #5442:
URL: https://github.com/apache/hudi/issues/5442#issuecomment-1111304219

   I am going to close this issue because we suspect it is specific to the way we bootstrapped data for this table. We are in touch with the Hudi team on Slack to debug this further.




[GitHub] [hudi] qianchutao commented on issue #5442: HUDI does not deduplicate within the same partition

Posted by GitBox <gi...@apache.org>.
qianchutao commented on issue #5442:
URL: https://github.com/apache/hudi/issues/5442#issuecomment-1110714139

   I've encountered this problem before and never found the cause. I asked committers in the community but never got an answer.




[GitHub] [hudi] yihua commented on issue #5442: HUDI does not deduplicate within the same partition

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #5442:
URL: https://github.com/apache/hudi/issues/5442#issuecomment-1111758104

   @mandar-mw To deduplicate records within and across commits for the INSERT operation, you need to set both of the following configs to `true` (they are `false` by default): [`hoodie.datasource.write.insert.drop.duplicates`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) and [`hoodie.combine.before.insert`](https://hudi.apache.org/docs/configurations#hoodiecombinebeforeinsert). Let me know if this helps solve your problem.
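
To make the difference between the two settings concrete, here is a simplified Python model of what each one is meant to achieve. It is a sketch under stated assumptions, not Hudi's implementation: the helper functions and sample records are hypothetical, and the "largest ordering value wins" rule is an assumption about how records sharing a key are combined. Only the two option names come from the comment above.

```python
# The two options named above; with a Spark datasource write they would be
# passed as writer options (assumption: shown here only as a plain dict).
DEDUP_OPTIONS = {
    "hoodie.datasource.write.insert.drop.duplicates": "true",
    "hoodie.combine.before.insert": "true",
}

KEY_FIELDS = ("user_id", "customer_id")


def record_key(record):
    """Build the composite key used to compare records."""
    return tuple(record[field] for field in KEY_FIELDS)


def combine_before_insert(batch, ordering_field):
    """Within one incoming batch, keep a single record per key (largest ordering value wins)."""
    latest = {}
    for record in batch:
        key = record_key(record)
        if key not in latest or record[ordering_field] > latest[key][ordering_field]:
            latest[key] = record
    return list(latest.values())


def drop_existing_keys(batch, existing_keys):
    """Across commits, drop incoming records whose key is already in the table."""
    return [record for record in batch if record_key(record) not in existing_keys]


# Hypothetical data: one key is already in the table, one key repeats in the batch.
existing_keys = {("u1", "c1")}
batch = [
    {"user_id": "u1", "customer_id": "c1", "created_at": 3},
    {"user_id": "u2", "customer_id": "c1", "created_at": 1},
    {"user_id": "u2", "customer_id": "c1", "created_at": 2},
]

combined = combine_before_insert(batch, ordering_field="created_at")
to_write = drop_existing_keys(combined, existing_keys)
print(to_write)
```

In this model, skipping the first step would write the repeated key twice from the same batch, and skipping the second would write a key that an earlier commit already stored, which is why both settings are needed for INSERT.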




[GitHub] [hudi] mandar-mw closed issue #5442: HUDI does not deduplicate within the same partition

Posted by GitBox <gi...@apache.org>.
mandar-mw closed issue #5442: HUDI does not deduplicate within the same partition
URL: https://github.com/apache/hudi/issues/5442




[GitHub] [hudi] yihua commented on issue #5442: HUDI does not deduplicate within the same partition

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #5442:
URL: https://github.com/apache/hudi/issues/5442#issuecomment-1111758882

   @qianchutao Feel free to open a separate GitHub issue for your problem if it still happens. We can help you promptly.

