Posted to commits@hudi.apache.org by "Ethan Guo (Jira)" <ji...@apache.org> on 2021/11/16 01:33:00 UTC

[jira] [Commented] (HUDI-2745) Record count does not match input after compaction is scheduled when running Hudi Kafka Connect sink

    [ https://issues.apache.org/jira/browse/HUDI-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444218#comment-17444218 ] 

Ethan Guo commented on HUDI-2745:
---------------------------------

I checked the {{MergeOnReadSnapshotRelation}} and the file index built for the kafka-connect case. It looks like the file slices containing the log files written after the pending compaction are missing.  An issue has already been filed for exactly this problem, and it is not limited to kafka-connect: HUDI-2480.  The clustering count mismatch is likely due to this as well.
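
To illustrate the suspected failure mode for the compaction case, here is a minimal toy model. It is a sketch only and not Hudi code: the {{Slice}} class, both counting methods, and the sample record counts are hypothetical, and it assumes every log record is a new insert so that counts simply add up. It compares a reader that merges log-only file slices created after a pending compaction with one that ignores them.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only. These types are hypothetical and are NOT Hudi's real
// FileSlice or file-system-view classes. Assumption: every log record is a
// new insert, so record counts simply add up.
public class PendingCompactionSliceDemo {

    // One file slice of a file group: an optional base file plus log files.
    static final class Slice {
        final String baseInstant;
        final boolean hasBaseFile;
        final long baseRecords; // records in the base file, 0 when it does not exist yet
        final long logRecords;  // records in the log files of this slice

        Slice(String baseInstant, boolean hasBaseFile, long baseRecords, long logRecords) {
            this.baseInstant = baseInstant;
            this.hasBaseFile = hasBaseFile;
            this.baseRecords = baseRecords;
            this.logRecords = logRecords;
        }
    }

    // Expected snapshot count: walk from the newest slice backwards, keep
    // adding log records, and stop at the first slice that has a base file.
    static long expectedCount(List<Slice> slicesOldestFirst) {
        long total = 0;
        for (int i = slicesOldestFirst.size() - 1; i >= 0; i--) {
            Slice s = slicesOldestFirst.get(i);
            total += s.logRecords;
            if (s.hasBaseFile) {
                total += s.baseRecords;
                break;
            }
        }
        return total;
    }

    // Suspected faulty behavior: use the newest slice that has a base file
    // and ignore any newer slice that holds only log files.
    static long countIgnoringLogOnlySlices(List<Slice> slicesOldestFirst) {
        for (int i = slicesOldestFirst.size() - 1; i >= 0; i--) {
            Slice s = slicesOldestFirst.get(i);
            if (s.hasBaseFile) {
                return s.baseRecords + s.logRecords;
            }
        }
        return 0;
    }

    // Sample file group. Compaction is scheduled at instant "003"; later
    // deltacommits write log files into the slice whose base instant is "003".
    static List<Slice> fileGroup(boolean compactionExecuted) {
        Slice first = new Slice("001", true, 100, 20);
        Slice second = compactionExecuted
            ? new Slice("003", true, 120, 30)  // base file now holds the 100 + 20 compacted records
            : new Slice("003", false, 0, 30);  // compaction still pending: log files only
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        for (boolean executed : new boolean[] {false, true}) {
            List<Slice> slices = fileGroup(executed);
            System.out.println("compaction executed=" + executed
                + " expected=" + expectedCount(slices)
                + " ignoring log-only slices=" + countIgnoringLogOnlySlices(slices));
        }
    }
}
```

With these sample numbers, the reader that ignores log-only slices sees 120 records instead of 150 while the compaction is pending, and both readers agree on 150 once the compaction has been executed. That matches the pattern in case (1) of the issue description below.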

> Record count does not match input after compaction is scheduled when running Hudi Kafka Connect sink
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-2745
>                 URL: https://issues.apache.org/jira/browse/HUDI-2745
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.10.0
>
>
> Spark Shell command to do snapshot query:
> {code:java}
> val basePath = "/tmp/hoodie/hudi-test-topic"
> val df = spark.read.format("hudi").load(basePath)
> df.createOrReplaceTempView("hudi_test_table")
> spark.sql("select count(*) from hudi_test_table").show() {code}
> Two cases of count mismatch:
> (1) Compaction scheduled, more deltacommits later on: the count does not match the input size.  After compaction is executed, the count becomes correct.
> (2) Clustering scheduled, more deltacommits later on: the count is correct, equal to the input size.  After clustering is executed, the count drops and becomes incorrect.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)