Posted to commits@hudi.apache.org by "koochiswathiTR (via GitHub)" <gi...@apache.org> on 2023/03/14 09:25:02 UTC

[GitHub] [hudi] koochiswathiTR opened a new issue, #8178: Duplicate data in MOR table Hudi

koochiswathiTR opened a new issue, #8178:
URL: https://github.com/apache/hudi/issues/8178

   We see duplicate data in our hudi dataset
   
   **Describe the problem you faced**
   
   We run a Spark Streaming application that reads from a Kinesis stream, processes the data, and stores it in Hudi.
   We have started seeing duplicates in our Hudi dataset.
   Below are our Hudi configs:
   
       DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
       DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "guid",
       DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "collectionName",
       DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "operationTime",
       HoodieCompactionConfig.INLINE_COMPACT_TRIGGER_STRATEGY.key() -> CompactionTriggerStrategy.TIME_ELAPSED.name,
       HoodieCompactionConfig.INLINE_COMPACT_TIME_DELTA_SECONDS.key() -> String.valueOf(60 * 60),
       HoodieCompactionConfig.CLEANER_POLICY.key() -> HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name(),
       HoodieCompactionConfig.CLEANER_COMMITS_RETAINED.key() -> "624", 
       HoodieCompactionConfig.MIN_COMMITS_TO_KEEP.key() -> "625",  
       HoodieCompactionConfig.MAX_COMMITS_TO_KEEP.key() -> "648", 
       HoodieCompactionConfig.ASYNC_CLEAN.key() -> "false", 
       HoodieCompactionConfig.INLINE_COMPACT.key() -> "true",
       HoodieMetricsConfig.TURN_METRICS_ON.key() -> "true",
       HoodieMetricsConfig.METRICS_REPORTER_TYPE_VALUE.key() -> MetricsReporterType.DATADOG.name(),
       HoodieMetricsDatadogConfig.API_SITE_VALUE.key() -> "US",
       HoodieMetricsDatadogConfig.METRIC_PREFIX_VALUE.key() -> "tacticalnovusingest.hudi",
       HoodieMetadataConfig.ENABLE.key() -> "false",
       HoodieWriteConfig.ROLLBACK_USING_MARKERS_ENABLE.key() -> "false",
   
   We only use upsert in our code; we never use insert.
   
           dataframe.write.format("org.apache.hudi")
             .option("hoodie.insert.shuffle.parallelism", hudiParallelism)
             .option("hoodie.upsert.shuffle.parallelism", hudiParallelism)
             .option(HoodieWriteConfig.TABLE_NAME, hudiTableName)
             .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
             .option(HoodieMetricsDatadogConfig.METRIC_TAG_VALUES.key(), s"env:$environment")
             .options(hudiOptions).mode(SaveMode.Append)
             .save(s3Location)
   
   Please help us with this.
   There are two situations in which we see duplicates:
   1. duplicates with the same Hudi commit time
   2. duplicates with different commit times
   
   I have attached the JSON files for reference.
   We tried to delete the duplicate data using the Hudi commit sequence number and our primary key, but that deletes both copies of the key.
   This is the write we used for the Hudi DELETE:
   
       dataframe.write.format("org.apache.hudi")
         .option("hoodie.insert.shuffle.parallelism", hudiParallelism)
         .option("hoodie.upsert.shuffle.parallelism", hudiParallelism)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.DELETE_OPERATION_OPT_VAL)
   
   We also tried to deduplicate with the Hudi CLI command:
   
       repair deduplicate --duplicatedPartitionPath s3://**/ --repairedOutputPath s3://**/ --sparkMemory 2G --sparkMaster yarn
   
   but it fails with java.io.FileNotFoundException.
   Please help.
   
   
   
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.2
   
   * Hive version : NA
   
   * Hadoop version : NA
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1528168418

   Hey @koochiswathiTR: can you give us any more details? We are taking another serious look at all data consistency issues, so we are interested in getting to the bottom of this one.
   


[GitHub] [hudi] nsivabalan commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1528169908

   Hey @koochiswathiTR: are you using a global index by any chance, i.e. GLOBAL_BLOOM or GLOBAL_SIMPLE as the index type? We know of a bug that can lead to duplicates when either of these indexes is used,
   and we already have a fix for it: https://github.com/apache/hudi/pull/8490
   


[GitHub] [hudi] koochiswathiTR commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "koochiswathiTR (via GitHub)" <gi...@apache.org>.

koochiswathiTR commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1470026546

   @nsivabalan Can you please check this?


[GitHub] [hudi] koochiswathiTR commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "koochiswathiTR (via GitHub)" <gi...@apache.org>.

koochiswathiTR commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1467720097

   [duplicate with different commit time_hudi.txt](https://github.com/apache/hudi/files/10966648/duplicate.with.different.commit.time_hudi.txt)
   [duplicate with same commit time_hudi.txt](https://github.com/apache/hudi/files/10966649/duplicate.with.same.commit.time_hudi.txt)
   


[GitHub] [hudi] codope commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "codope (via GitHub)" <gi...@apache.org>.

codope commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1522119191

   @koochiswathiTR Is the record key field `guid` a randomly generated ID such as a UUID? There have been known [issues](https://github.com/apache/hudi/issues/7829) with non-deterministic ID generation schemes, especially when executors are lost.
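   
   To illustrate the concern: if the key is produced with something like `UUID.randomUUID()` inside a Spark task, a retried task gives the same input record a different key, and the two attempts can land as two rows. A minimal sketch of one way to avoid that, assuming the source record carries fields that identify it uniquely (the field values passed to `recordKey` here are hypothetical, not taken from this issue), is to derive a name-based UUID from those fields:
   
   ```scala
   import java.nio.charset.StandardCharsets
   import java.util.UUID
   
   object DeterministicKey {
     // Builds a name-based (type 3) UUID from the identifying fields of a source
     // record. The same inputs always give the same key, so a retried task
     // regenerates the identical record key instead of a fresh random one.
     // Which fields identify a record is an assumption the caller must supply.
     def recordKey(identifyingFields: Seq[String]): String = {
       val name = identifyingFields.mkString("|")
       UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8)).toString
     }
   }
   ```
   
   For example, `DeterministicKey.recordKey(Seq(shardId, sequenceNumber))` returns the same string on every attempt for the same pair of values.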


[GitHub] [hudi] nsivabalan commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1473163827

   We have found some issues with Spark cache invalidation when tasks are retried,
   and we have made some fixes on that end:
   https://github.com/apache/hudi/pull/4753
   https://github.com/apache/hudi/pull/4856
   
   Can you try with 0.12.0 or 0.13.0 and let us know if you still see issues?
   


[GitHub] [hudi] codope commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "codope (via GitHub)" <gi...@apache.org>.

codope commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1522128826

   Also, our Spark streaming writes were previously not idempotent, so duplicates were possible. We have fixed that in https://github.com/apache/hudi/issues/8178


[GitHub] [hudi] nsivabalan commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1474205280

   Here is what you can probably do:
   1. Query the table to find all duplicates.
   2. Store the duplicates in some staging location (for example with df.write.parquet).
   3. Issue deletes for these records against Hudi.
   4. For the same batch, de-duplicate to pick one version of each record and ingest it into Hudi using upsert.
   
   If anything crashes in between, you always have the staging data. Staging matters because if your process crashes after the delete from the Hudi table, you would otherwise lose track of the records: a snapshot query is no longer going to return them.
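   
   The four steps could be sketched roughly as below. This is an untested outline, not a verified repair tool. It assumes a duplicate means the same `guid` within the same `collectionName` partition, that the copy with the highest `operationTime` is the one to keep, and that the delete removes every copy of a key (as reported earlier in this issue). The paths and the table name are hypothetical placeholders.
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.{col, count, lit, row_number}
   
   object RepairHudiDuplicates {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("repair-hudi-duplicates").getOrCreate()
   
       // Hypothetical locations: replace with your own table and staging paths.
       val tablePath   = "s3://my-bucket/path/to/hudi_table"
       val stagingPath = "s3://my-bucket/path/to/dupes_staging"
   
       // Key, partition and precombine fields as given in the issue description.
       val hudiOpts = Map(
         "hoodie.table.name" -> "my_table", // hypothetical; use the existing table name
         "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
         "hoodie.datasource.write.recordkey.field" -> "guid",
         "hoodie.datasource.write.partitionpath.field" -> "collectionName",
         "hoodie.datasource.write.precombine.field" -> "operationTime"
       )
   
       val keyCols = Seq("collectionName", "guid")
       val metaCols = Seq("_hoodie_commit_time", "_hoodie_commit_seqno",
         "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name")
   
       // Step 1: find every row whose key occurs more than once in its partition.
       val dupes = spark.read.format("hudi").load(tablePath)
         .withColumn("dup_count", count(lit(1)).over(Window.partitionBy(keyCols.map(col): _*)))
         .filter(col("dup_count") > 1)
         .drop("dup_count")
         .drop(metaCols: _*)
   
       // Step 2: persist the duplicates before touching the table, then work
       // only from the staged copy so a crash cannot lose them.
       dupes.write.mode(SaveMode.Overwrite).parquet(stagingPath)
       val staged = spark.read.parquet(stagingPath)
   
       // Step 3: delete the affected keys from the Hudi table.
       staged.write.format("hudi")
         .options(hudiOpts)
         .option("hoodie.datasource.write.operation", "delete")
         .mode(SaveMode.Append)
         .save(tablePath)
   
       // Step 4: keep one row per key (latest operationTime) and upsert it back.
       val latestFirst = Window.partitionBy(keyCols.map(col): _*).orderBy(col("operationTime").desc)
       val deduped = staged
         .withColumn("rn", row_number().over(latestFirst))
         .filter(col("rn") === 1)
         .drop("rn")
   
       deduped.write.format("hudi")
         .options(hudiOpts)
         .option("hoodie.datasource.write.operation", "upsert")
         .mode(SaveMode.Append)
         .save(tablePath)
   
       spark.stop()
     }
   }
   ```
   
   If the job dies after step 3, rerunning from the staged parquet data (skipping steps 1 and 2) should be enough to restore the records.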
   


[GitHub] [hudi] nsivabalan commented on issue #8178: Duplicate data in MOR table Hudi

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1474202023

   By the way, `repair deduplicate` does not work for MOR tables :(
   so unfortunately you have to write some code in the application layer to fix the duplicates. Sorry about that.
   

