Posted to commits@hudi.apache.org by "BalaMahesh (via GitHub)" <gi...@apache.org> on 2023/04/27 05:42:24 UTC

[GitHub] [hudi] BalaMahesh opened a new issue, #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

BalaMahesh opened a new issue, #7733:
URL: https://github.com/apache/hudi/issues/7733

   **Describe the problem you faced**
   
   We have Postgres data coming from the Debezium connector via Kafka, and we run Hudi in upsert mode on this dataset. We found around 12 records that have two versions of data for the same id, instead of the record being updated to the latest values and the old version being cleaned up.
   
   It is not yet clear to us how this happened, since the affected data is from older commits.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start Postgres Debezium Kafka connector and publish data to Kafka
   2. Run Hudi in upsert mode
   3. We are not sure whether any crashes happened during those commits.
   4. Use the configurations below.
   
   hoodie.compaction.payload.class=org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload
   hoodie.table.type=MERGE_ON_READ
   hoodie.table.metadata.partitions=
   hoodie.table.precombine.field=_event_lsn
   hoodie.table.partition.fields=
   hoodie.archivelog.folder=archived
   hoodie.timeline.layout.version=1
   hoodie.table.checksum=4134192528
   hoodie.datasource.write.drop.partition.columns=false
   hoodie.table.recordkey.fields=id
   hoodie.partition.metafile.use.base.format=false
   hoodie.populate.meta.fields=true
   hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator
   hoodie.table.base.file.format=PARQUET
   hoodie.table.version=5
   
   
   **Expected behavior**
   
   We expect only one version of each record to be available in the latest queried data.
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.2.1
   
   * Hive version : 2.3.5
   
   * Hadoop version : 2.7.7
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : yes.
   
   
   **Additional context**
   
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.partitionpath.field=
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   hoodie.clean.automatic=true
   hoodie.clean.async=true
   hoodie.cleaner.commits.retained=5
   hoodie.keep.min.commits=10
   #compaction config
   hoodie.datasource.compaction.async.enable=true
   hoodie.parquet.small.file.limit=104857600
   hoodie.compaction.target.io=50
   
   **Stacktrace**
   
   ```
   Id	updated	_hoodie_commit_time	_event_lsn
   Aa5udG	1667998354	20221109125316627	5037873812216
   Aa5udG	1667972649	20221109055102633	5028051185232
   Aa61Gb	1667998400	20221109125802500	5037878072632
   Aa61Gb	1667972837	20221109055102633	5028239838008
   Aa7hZx	1667998411	20221109125802500	5037879344768
   Aa7hZx	1667973014	20221109055102633	5028334998944
   Aa81Sq	1667998439	20221109125802500	5037897355680
   Aa81Sq	1667973345	20221109055825061	5028484902408
   AbB9sW	1668051396	20221110034051271	5045161427664
   AbB9sW	1667974610	20221109061740419	5029141615480
   OiYzUz	1672662739	20230112125716390	6287523270024
   OiYzUz	1672662739	20230112125716390	6287523270024
   XxNzFk	1667982183	20221109082337760	5031758334520
   XxNzFk	1667981380	20221109081024733	5031516715520
   YxNzFk	1667982167	20221109082337760	5031758226096
   YxNzFk	1667981376	20221109081024733	5031516565840
   YbB9sW	1668051393	20221110034051271	5045160856976
   YbB9sW	1667974609	20221109061740419	5029141513960
   ZxNzFk	1667982174	20221109082337760	5031755205544
   ZxNzFk	1667981375	20221109081024733	5031516243272
   ZanXvJ	1668051273	20221110033657677	5045153106408
   ZanXvJ	1667967621	20221109042439193	5025825527744
   ZbB9sW	1668051391	20221110034051271	5045160222496
   ZbB9sW	1667974609	20221109061740419	5029141376128
   ```
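
A minimal sketch of how the output above can be checked for duplicate record keys, using a few of the rows copied from it. It assumes `id` is the record key and `_event_lsn` is the precombine field, as stated in the table configuration, and that an upsert is expected to keep the version with the highest `_event_lsn`. The function names `find_duplicate_keys` and `expected_survivor` are hypothetical helpers for illustration and are not part of Hudi.

```python
from collections import defaultdict

# A subset of rows copied from the query output above:
# (id, updated, _hoodie_commit_time, _event_lsn)
ROWS = [
    ("Aa5udG", 1667998354, "20221109125316627", 5037873812216),
    ("Aa5udG", 1667972649, "20221109055102633", 5028051185232),
    ("OiYzUz", 1672662739, "20230112125716390", 6287523270024),
    ("OiYzUz", 1672662739, "20230112125716390", 6287523270024),
    ("ZanXvJ", 1668051273, "20221110033657677", 5045153106408),
    ("ZanXvJ", 1667967621, "20221109042439193", 5025825527744),
]


def find_duplicate_keys(rows):
    """Group rows by record key and return only the keys seen more than once."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[0]].append(row)
    return {key: versions for key, versions in by_key.items() if len(versions) > 1}


def expected_survivor(versions):
    """Pick the row with the highest _event_lsn (assumed precombine ordering)."""
    return max(versions, key=lambda row: row[3])


duplicates = find_duplicate_keys(ROWS)
for key, versions in sorted(duplicates.items()):
    keep = expected_survivor(versions)
    print(key, len(versions), "versions; expected to keep commit", keep[2])
```

Note that `OiYzUz` differs from the other keys: both of its rows are identical, including the commit time, so ordering by `_event_lsn` cannot tell them apart.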
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope closed issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
URL: https://github.com/apache/hudi/issues/7733




[GitHub] [hudi] danny0405 commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1401392014

   Do these records belong to the same file group?




[GitHub] [hudi] nsivabalan closed issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan closed issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
URL: https://github.com/apache/hudi/issues/7733




[GitHub] [hudi] nsivabalan commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1404529634

   Can you provide us the complete list of write configs? If you are setting the operation type to "bulk_insert", you could see duplicates.
   
   For a non-partitioned table, the key generator class is "org.apache.hudi.keygen.NonpartitionedKeyGenerator".
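
As a rough illustration of the points in this comment, the sketch below collects the kind of write options being asked for and flags the one case mentioned here, a `bulk_insert` operation. The option keys are standard Hudi datasource write configs. The table name, the option values, and the `check_write_options` helper are assumptions made for the example, not the reporter's actual setup.

```python
# Hypothetical write options for an upsert into a non-partitioned table.
WRITE_OPTIONS = {
    "hoodie.table.name": "example_table",  # hypothetical table name
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "_event_lsn",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}


def check_write_options(options):
    """Return warnings for settings that could explain duplicate rows."""
    warnings = []
    operation = options.get("hoodie.datasource.write.operation")
    if operation == "bulk_insert":
        # Per the comment above, bulk_insert can leave duplicates in the table.
        warnings.append("operation is bulk_insert; duplicates are possible")
    if not options.get("hoodie.datasource.write.recordkey.field"):
        warnings.append("record key field is not set")
    return warnings


print(check_write_options(WRITE_OPTIONS))
```

With PySpark, a dictionary like this would typically be passed as `df.write.format("hudi").options(**WRITE_OPTIONS).mode("append").save(base_path)`, where `df` and `base_path` are placeholders.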




[GitHub] [hudi] BalaMahesh commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "BalaMahesh (via GitHub)" <gi...@apache.org>.
BalaMahesh commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1401402454

   > Do these records belong to the same file group?
   
   Yes, these are all in the same file group.




[GitHub] [hudi] codope commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1522076493

   Closing due to inactivity, but the issue is fixed in https://github.com/apache/hudi/pull/7944
   The duplicates were caused by the non-partitioned table having a null partition value.




[GitHub] [hudi] nsivabalan commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1527022328

   The linked patch contains unit tests. I also tried reproducing this locally, including with some failures, and could not reproduce it.
   Closing the issue as no longer valid.




[GitHub] [hudi] codope commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1524743450

   Reopening to validate the fix.




[GitHub] [hudi] codope commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1455105739

   @BalaMahesh gentle reminder to share the complete list of write configs.

