Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/09 02:44:15 UTC

[GitHub] [hudi] zwj0110 opened a new issue, #6640: [SUPPORT] HUDI partition table duplicate data cow

zwj0110 opened a new issue, #6640:
URL: https://github.com/apache/hudi/issues/6640

   **Describe the problem you faced**
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   ## hudi sink config
   ```sql
       'connector' = 'hudi',
       'hoodie.table.name' = 'xxx',
       'table.type' = 'COPY_ON_WRITE',
       'path' = 'xxxx',
       'hoodie.datasource.write.keygenerator.type' = 'COMPLEX',
       'hoodie.datasource.write.recordkey.field' = 'id',
       'hoodie.cleaner.policy' = 'KEEP_LATEST_FILE_VERSIONS',
       'hoodie.cleaner.fileversions.retained' = '20',
       'hoodie.keep.min.commits' = '30',
       'hoodie.keep.max.commits' = '40',
       'hoodie.cleaner.commits.retained' = '20',
       'write.operation' = 'upsert',
       'write.commit.ack.timeout' = '60000000',
       'write.sort.memory' = '128',
       'write.task.max.size' = '1024',
       'write.merge.max_memory' = '100',
       'write.tasks' = '96',
       'write.precombine' = 'true',
       'write.precombine.field' = 'meta_es_offset',
       'index.state.ttl' = '0',
       'index.global.enabled' = 'false',
       'hive_sync.enable' = 'true',
       'hive_sync.table' = 'xxx',
       'hive_sync.auto_create_db' = 'true',
       'hive_sync.mode' = 'hms',
       'hive_sync.metastore.uris' = 'xxx',
       'hive_sync.db' = 'xxx',
       'hoodie.datasource.write.partitionpath.field' = 'year,month,day',
       'hoodie.datasource.write.hive_style_partitioning' = 'true',
       'hive_sync.partition_fields' = 'year,month,day',
       'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'index.bootstrap.enabled' = 'true'
   ```
   ## description:
   After initializing the historical data, we set `scan.startup.mode` to `timestamp` and set the timestamp ahead; duplicates then occur. If we restart the job from a checkpoint instead, the data is correct.
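   For context, a source definition along the following lines would start reading from a fixed timestamp. This is only a sketch: it assumes the upstream is the Flink Kafka SQL connector, which the report does not state. The table name, column types, topic, broker address, group id, format, and timestamp value are all hypothetical placeholders.

   ```sql
   -- Hypothetical source table; only the two scan.startup.* options
   -- relate to the behavior described in the report.
   CREATE TABLE source_table (
       id STRING,              -- record key used by the Hudi sink (type assumed)
       meta_es_offset BIGINT   -- precombine field used by the Hudi sink (type assumed)
   ) WITH (
       'connector' = 'kafka',
       'topic' = 'xxx',
       'properties.bootstrap.servers' = 'xxx',
       'properties.group.id' = 'xxx',
       'format' = 'json',
       -- Start from a point in time instead of a committed offset or checkpoint
       'scan.startup.mode' = 'timestamp',
       -- Epoch milliseconds; placeholder value
       'scan.startup.timestamp-millis' = '1662600000000'
   );
   ```

   With this startup mode, records after the chosen timestamp are read again and upserted a second time, so whether duplicates appear depends on how the sink resolves keys that already exist in the table.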
   ## data duplicate result:
   ![image](https://user-images.githubusercontent.com/44424308/189260717-03668b84-c3dd-4785-8b5c-a874f43a084f.png)
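   
   One way to narrow down where the duplicates live is to count keys per partition and across the whole table. The queries below are a sketch: `hudi_table` is a hypothetical name standing in for the synced table, and the column names are taken from the sink config above.

   ```sql
   -- Keys that appear more than once inside a single partition
   SELECT `year`, `month`, `day`, id, COUNT(*) AS cnt
   FROM hudi_table
   GROUP BY `year`, `month`, `day`, id
   HAVING COUNT(*) > 1;

   -- Keys that appear more than once anywhere in the table
   SELECT id, COUNT(*) AS cnt
   FROM hudi_table
   GROUP BY id
   HAVING COUNT(*) > 1;
   ```

   If only the second query returns rows, the copies sit in different partitions. That would be consistent with `'index.global.enabled' = 'false'`, under which the same `id` in two partitions is treated as two records, but it does not by itself confirm the cause.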
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * flink version : 1.13.1
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.1.0
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] Zhifeiyu commented on issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0 flink 1.13.1

Posted by GitBox <gi...@apache.org>.
Zhifeiyu commented on issue #6640:
URL: https://github.com/apache/hudi/issues/6640#issuecomment-1241440579

   mark




[GitHub] [hudi] nsivabalan commented on issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0 flink 1.13.1

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6640:
URL: https://github.com/apache/hudi/issues/6640#issuecomment-1287999462

   @yuzhaojing @danny0405 : gentle ping. 
   @zwj0110 : feel free to close if the issue is resolved. 
   




[GitHub] [hudi] danny0405 commented on issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0 flink 1.13.1

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #6640:
URL: https://github.com/apache/hudi/issues/6640#issuecomment-1306532320

   There are not enough details here. Can you try 0.12.1 and see whether the duplicates still happen? I will close the issue for now.




[GitHub] [hudi] danny0405 closed issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0 flink 1.13.1

Posted by GitBox <gi...@apache.org>.
danny0405 closed issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0  flink 1.13.1
URL: https://github.com/apache/hudi/issues/6640




[GitHub] [hudi] yihua commented on issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0 flink 1.13.1

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6640:
URL: https://github.com/apache/hudi/issues/6640#issuecomment-1254303659

   @yuzhaojing @danny0405 Could any one of you chime in here?

