Posted to commits@hudi.apache.org by "michael1991 (via GitHub)" <gi...@apache.org> on 2023/03/08 03:14:44 UTC

[GitHub] [hudi] michael1991 opened a new issue, #8121: [SUPPORT] MOR Table Duplicated Records Found

michael1991 opened a new issue, #8121:
URL: https://github.com/apache/hudi/issues/8121

   **Describe the problem you faced**
   
   We found duplicated records in partitions of a MOR table while the writer only scheduled compaction and we did not run a separate compaction job.
   After we switched from scheduled compaction to inline compaction, we found no duplicated records in partitions ingested afterwards.
   
   **Expected behavior**
   
   Regardless of how compaction is executed, record_key should be unique within a partition.
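   
   The uniqueness expectation above can be checked with a simple grouping. The following is a minimal sketch over plain Scala collections, not Hudi or Spark code; the `Row` case class and the `DuplicateCheck` object are hypothetical names introduced only for illustration.
   
   ```scala
   // Hypothetical row shape: only the partition path and record key matter for this check.
   final case class Row(partitionPath: String, recordKey: String)
   
   object DuplicateCheck {
     // Returns every (partition, key) pair that occurs more than once, with its count.
     def duplicates(rows: Seq[Row]): Map[(String, String), Int] =
       rows
         .groupBy(r => (r.partitionPath, r.recordKey))
         .map { case (k, rs) => k -> rs.size }
         .filter { case (_, n) => n > 1 }
   }
   ```
   
   On a real table, the same grouping could be applied to the `_hoodie_partition_path` and `_hoodie_record_key` metadata columns of a snapshot read; an empty result means the expectation holds.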
   
   **Environment Description**
   
   * Hudi version : 0.12.0
   
   * Spark version : 3.3.0
   
   * Hive version : not used
   
   * Hadoop version : 3.1.3
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : no
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] michael1991 commented on issue #8121: [SUPPORT] MOR Table Duplicated Records Found

Posted by "michael1991 (via GitHub)" <gi...@apache.org>.
michael1991 commented on issue #8121:
URL: https://github.com/apache/hudi/issues/8121#issuecomment-1465433971

   > This could be one of a couple of issues:
   > 
   > 1. I see you are setting COMBINE_BEFORE_UPSERT.key() -> "false". This should be set to true; otherwise, duplicate records in the incoming batch may not be deduplicated.
   > 2. It could be due to [[HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server #8079](https://github.com/apache/hudi/pull/8079), which we found recently.
   >    Can you try with the latest master, or apply the above patch, and give it a try?
   
   Thanks @nsivabalan!
   On the first point: we make sure before writing that there is only one record per record key, so we set it to false to speed up writes.
   On the second point: I will try the latest master later, since we currently have no problems with inline compaction.
   
   Anyway, thanks a lot! I will try your advice later.




[GitHub] [hudi] michael1991 commented on issue #8121: [SUPPORT] MOR Table Duplicated Records Found

Posted by "michael1991 (via GitHub)" <gi...@apache.org>.
michael1991 commented on issue #8121:
URL: https://github.com/apache/hudi/issues/8121#issuecomment-1459708223

   > Can you give the job configurations here? cc @nsivabalan , maybe you can take a look~
   
   Sure, please see the configurations below:
   ```scala
   val COMMON_HUDI_CONF_MAP = Map(RECORDKEY_FIELD.key() -> "id", PRECOMBINE_FIELD.key() -> "id",
       SCHEMA_EVOLUTION_ENABLED.key() -> "true", DATABASE_NAME.key() -> "database",
       COMBINE_BEFORE_UPSERT.key() -> "false", EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED.key() -> "true",
       INSERT_PARALLELISM_VALUE.key() -> "5", UPSERT_PARALLELISM_VALUE.key() -> "5",
       CLEANER_COMMITS_RETAINED.key() -> "2", ASYNC_CLEAN.key() -> "true", 
       PARTITIONPATH_FIELD.key() -> "date,hour", TBL_NAME.key() -> TBL_LOG_INCREMENT_DETAILS_NAME,
       TABLE_TYPE.key() -> MOR_TABLE_TYPE_OPT_VAL, OPERATION.key() -> UPSERT_OPERATION_OPT_VAL,
       WRITE_PAYLOAD_CLASS_NAME.key() -> CUSTOM_PAYLOAD_CLASS, INLINE_COMPACT.key() -> "true")
   
   // This is the current configuration. The previous one differed only in using SCHEDULE_INLINE_COMPACT instead of INLINE_COMPACT.
   ```
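   
   For readers comparing the two setups, the difference can be written out with plain string keys. This is a sketch under assumptions: it takes INLINE_COMPACT and SCHEDULE_INLINE_COMPACT to correspond to `hoodie.compact.inline` and `hoodie.compact.schedule.inline`, and it assumes the other key was left at its default in each case. The `CompactionConf` object and its field names are hypothetical.
   
   ```scala
   object CompactionConf {
     // Earlier setup described in this report: the writer only schedules a compaction plan.
     // A separate job is expected to execute that plan; no such job was running here.
     val scheduleOnly: Map[String, String] = Map(
       "hoodie.compact.schedule.inline" -> "true"
     )
   
     // Current setup: the writer both schedules and executes compaction as part of the write.
     val inline: Map[String, String] = Map(
       "hoodie.compact.inline" -> "true"
     )
   }
   ```
   
   Either map would be merged into the common options above in place of the last entry.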




[GitHub] [hudi] nsivabalan commented on issue #8121: [SUPPORT] MOR Table Duplicated Records Found

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8121:
URL: https://github.com/apache/hudi/issues/8121#issuecomment-1465084863

   This could be one of a couple of issues:
   1. I see you are setting COMBINE_BEFORE_UPSERT.key() -> "false". This should be set to true; otherwise, duplicate records in the incoming batch may not be deduplicated. 
   
   2. It could be due to https://github.com/apache/hudi/pull/8079, which we found recently. 
   Can you try with the latest master, or apply the above patch, and give it a try?
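   
   To illustrate the first point, here is a simplified model of what combining a batch before an upsert amounts to: for each record key, keep a single record chosen by an ordering value. It is a sketch over plain Scala collections, not Hudi's implementation; the `Rec` case class, its `ts` field, and the `BatchDedup` object are hypothetical, and with a custom payload class the actual merge rule is whatever that payload defines.
   
   ```scala
   // Hypothetical record: `ts` stands in for whatever field is configured as the precombine field.
   final case class Rec(key: String, ts: Long, value: String)
   
   object BatchDedup {
     // Keep one record per key: the one with the largest ordering value.
     // The result is sorted by key only to make the output deterministic.
     def combine(batch: Seq[Rec]): Seq[Rec] =
       batch.groupBy(_.key).values.map(_.maxBy(_.ts)).toSeq.sortBy(_.key)
   }
   ```
   
   If the incoming batch is already guaranteed to hold one record per key, this step returns the batch unchanged apart from ordering, which is why skipping it saves work but removes the safety net.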
   




[GitHub] [hudi] michael1991 closed issue #8121: [SUPPORT] MOR Table Duplicated Records Found

Posted by "michael1991 (via GitHub)" <gi...@apache.org>.
michael1991 closed issue #8121: [SUPPORT] MOR Table Duplicated Records Found
URL: https://github.com/apache/hudi/issues/8121




[GitHub] [hudi] danny0405 commented on issue #8121: [SUPPORT] MOR Table Duplicated Records Found

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8121:
URL: https://github.com/apache/hudi/issues/8121#issuecomment-1459689938

   Can you give the job configurations here? cc @nsivabalan , maybe you can take a look~

