Posted to commits@hudi.apache.org by "BalaMahesh (via GitHub)" <gi...@apache.org> on 2023/04/27 05:42:24 UTC
[GitHub] [hudi] BalaMahesh opened a new issue, #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
BalaMahesh opened a new issue, #7733:
URL: https://github.com/apache/hudi/issues/7733
**Describe the problem you faced**
We have Postgres data coming from the Debezium connector via Kafka. We run Hudi in upsert mode on this dataset, and we have found around 12 records that have two versions of the data for the same id; the older version was not replaced by the latest values and cleaned up.
It is not yet clear to us how this happened, since the affected data is from older commits.
**To Reproduce**
Steps to reproduce the behavior:
1. Start Postgres Debezium Kafka connector and publish data to Kafka
2. Run Hudi in upsert mode
3. We are not sure whether any crashes happened during those commits.
4. Use the configurations below.
hoodie.compaction.payload.class=org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload
hoodie.table.type=MERGE_ON_READ
hoodie.table.metadata.partitions=
hoodie.table.precombine.field=_event_lsn
hoodie.table.partition.fields=
hoodie.archivelog.folder=archived
hoodie.timeline.layout.version=1
hoodie.table.checksum=4134192528
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.recordkey.fields=id
hoodie.partition.metafile.use.base.format=false
hoodie.populate.meta.fields=true
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator
hoodie.table.base.file.format=PARQUET
hoodie.table.version=5
**Expected behavior**
We expect only one version of each record to be available in the latest queried data.
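The expectation can be written down as a small sketch: for every record key, only the row with the highest ordering value (`_event_lsn`, the precombine field in the configuration above) should survive. The function name `latest_per_key` is illustrative, and the sample rows are copied from the listing under **Stacktrace**. This only describes the expected result of an upsert; it is not how Hudi implements merging.

```python
from typing import Dict, Iterable, Tuple

# (id, _event_lsn) pairs copied from the duplicate listing in this issue.
Row = Tuple[str, int]


def latest_per_key(rows: Iterable[Row]) -> Dict[str, int]:
    """Keep, for each id, only the highest _event_lsn seen."""
    latest: Dict[str, int] = {}
    for record_id, lsn in rows:
        if record_id not in latest or lsn > latest[record_id]:
            latest[record_id] = lsn
    return latest


rows = [
    ("Aa5udG", 5037873812216),
    ("Aa5udG", 5028051185232),
    ("Aa61Gb", 5037878072632),
    ("Aa61Gb", 5028239838008),
]

result = latest_per_key(rows)
print(result)
```

With the two ids above, the expected table would contain one row per id, carrying the larger `_event_lsn` of each pair.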
**Environment Description**
* Hudi version : 0.12.1
* Spark version : 3.2.1
* Hive version : 2.3.5
* Hadoop version : 2.7.7
* Storage (HDFS/S3/GCS..) : GCS
* Running on Docker? (yes/no) : yes.
**Additional context**
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.cleaner.commits.retained=5
hoodie.keep.min.commits=10
#compaction config
hoodie.datasource.compaction.async.enable=true
hoodie.parquet.small.file.limit=104857600
hoodie.compaction.target.io=50
**Stacktrace**
```
Id updated _hoodie_commit_time _event_lsn
Aa5udG 1667998354 20221109125316627 5037873812216
Aa5udG 1667972649 20221109055102633 5028051185232
Aa61Gb 1667998400 20221109125802500 5037878072632
Aa61Gb 1667972837 20221109055102633 5028239838008
Aa7hZx 1667998411 20221109125802500 5037879344768
Aa7hZx 1667973014 20221109055102633 5028334998944
Aa81Sq 1667998439 20221109125802500 5037897355680
Aa81Sq 1667973345 20221109055825061 5028484902408
AbB9sW 1668051396 20221110034051271 5045161427664
AbB9sW 1667974610 20221109061740419 5029141615480
OiYzUz 1672662739 20230112125716390 6287523270024
OiYzUz 1672662739 20230112125716390 6287523270024
XxNzFk 1667982183 20221109082337760 5031758334520
XxNzFk 1667981380 20221109081024733 5031516715520
YxNzFk 1667982167 20221109082337760 5031758226096
YxNzFk 1667981376 20221109081024733 5031516565840
YbB9sW 1668051393 20221110034051271 5045160856976
YbB9sW 1667974609 20221109061740419 5029141513960
ZxNzFk 1667982174 20221109082337760 5031755205544
ZxNzFk 1667981375 20221109081024733 5031516243272
ZanXvJ 1668051273 20221110033657677 5045153106408
ZanXvJ 1667967621 20221109042439193 5025825527744
ZbB9sW 1668051391 20221110034051271 5045160222496
ZbB9sW 1667974609 20221109061740419 5029141376128
```
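A listing like the one above can be produced by grouping rows on the record key and keeping the keys that occur more than once. The sketch below does this in plain Python on a few rows copied from the listing; the helper names `find_duplicate_keys` and `is_exact_copy` and the row with id `UNIQUE1` are hypothetical and only there for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (id, updated, _hoodie_commit_time, _event_lsn)
Row = Tuple[str, int, str, int]


def find_duplicate_keys(rows: List[Row]) -> Dict[str, List[Row]]:
    """Group rows by id and return only the ids that appear more than once."""
    grouped: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)
    return {key: versions for key, versions in grouped.items() if len(versions) > 1}


def is_exact_copy(versions: List[Row]) -> bool:
    """True when every row for the key is identical in all columns."""
    return len(set(versions)) == 1


rows = [
    ("Aa5udG", 1667998354, "20221109125316627", 5037873812216),
    ("Aa5udG", 1667972649, "20221109055102633", 5028051185232),
    ("OiYzUz", 1672662739, "20230112125716390", 6287523270024),
    ("OiYzUz", 1672662739, "20230112125716390", 6287523270024),
    # Hypothetical non-duplicated row, not taken from the issue.
    ("UNIQUE1", 1667998000, "20221109125316627", 5037873000000),
]

duplicates = find_duplicate_keys(rows)
for key, versions in duplicates.items():
    kind = "exact copy" if is_exact_copy(versions) else "different versions"
    print(key, kind)
```

In the listing, `OiYzUz` is the only id whose two rows are identical in every column, including the commit time; all other ids show an older and a newer version written by different commits.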
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] codope closed issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
URL: https://github.com/apache/hudi/issues/7733
[GitHub] [hudi] danny0405 commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1401392014
Do these records belong to the same file group?
[GitHub] [hudi] nsivabalan closed issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan closed issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
URL: https://github.com/apache/hudi/issues/7733
[GitHub] [hudi] nsivabalan commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1404529634
Can you provide us the complete list of write configs? If you are setting the operation type to "bulk_insert", you could find duplicates.
With respect to non-partitioned tables, the key generator class is "org.apache.hudi.keygen.NonpartitionedKeyGenerator".
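As a rough illustration of the kind of write configuration being asked for, the sketch below collects the relevant options in a Python dictionary and checks the operation type. The table name `example_table` and the helper `operation_may_duplicate` are hypothetical; the option values for record key, precombine field, and table type are taken from the issue description, and the key generator class is the one named in this comment.

```python
from typing import Dict

# Write options of the kind requested in this comment.
hudi_write_options: Dict[str, str] = {
    "hoodie.table.name": "example_table",  # hypothetical table name
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "_event_lsn",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.keygenerator.class": (
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
    ),
}


def operation_may_duplicate(options: Dict[str, str]) -> bool:
    """Flag the operation type that this comment warns can leave duplicates."""
    return options.get("hoodie.datasource.write.operation") == "bulk_insert"


print(operation_may_duplicate(hudi_write_options))
```

With a Spark DataFrame, options like these would typically be passed through `df.write.format("hudi").options(**hudi_write_options)`; how the reporter's job actually sets them is not stated in the thread.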
[GitHub] [hudi] BalaMahesh commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "BalaMahesh (via GitHub)" <gi...@apache.org>.
BalaMahesh commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1401402454
> Do these records belong to the same file group?
Yes, these are all in the same file group.
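One way to check this is to compare the file IDs behind the `_hoodie_file_name` meta column of the duplicated rows. The sketch below assumes base file names of the form `<fileId>_<writeToken>_<instantTime>.parquet`; the file names shown are hypothetical (only the instant times are taken from the issue), and for rows served from log files in a MERGE_ON_READ table the column may look different.

```python
from typing import Iterable


def file_id_of(base_file_name: str) -> str:
    """Return the part before the first underscore, assumed to be the file ID."""
    return base_file_name.split("_", 1)[0]


def same_file_group(file_names: Iterable[str]) -> bool:
    """True when all given base files share one file ID."""
    return len({file_id_of(name) for name in file_names}) == 1


# Hypothetical base file names for the two versions of one record.
names = [
    "3f1c2a9e-0000-4000-8000-000000000001-0_0-21-30_20221109055102633.parquet",
    "3f1c2a9e-0000-4000-8000-000000000001-0_0-35-48_20221109125316627.parquet",
]
# Hypothetical base file from a different file group.
other = "7b2d4c10-0000-4000-8000-000000000002-0_0-21-31_20221109055102633.parquet"

print(same_file_group(names))
print(same_file_group([names[0], other]))
```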
[GitHub] [hudi] codope commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1522076493
Closing due to inactivity, but the issue is fixed in https://github.com/apache/hudi/pull/7944
This is due to the non-partitioned table having a null partition value.
[GitHub] [hudi] nsivabalan commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1527022328
The linked patch contains unit tests. I also tried reproducing the issue locally, including with some failures, and could not reproduce it.
Closing the issue as no longer valid.
[GitHub] [hudi] codope commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1524743450
Reopening to validate the fix.
[GitHub] [hudi] codope commented on issue #7733: [SUPPORT] Duplicate rows found in Hudi non partitioned table.
Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7733:
URL: https://github.com/apache/hudi/issues/7733#issuecomment-1455105739
@BalaMahesh gentle reminder to share the complete list of write configs.