Posted to commits@hudi.apache.org by "maheshguptags (via GitHub)" <gi...@apache.org> on 2023/03/15 13:51:21 UTC

[GitHub] [hudi] maheshguptags opened a new issue, #8195: Clustering is not happening on Flink Hudi

maheshguptags opened a new issue, #8195:
URL: https://github.com/apache/hudi/issues/8195

   
   We are trying to cluster small files into larger files to reduce IO and the number of log and parquet files.
   
   To overcome the small-file problem in Hudi, we are trying to run clustering, but the job only creates the `.replacecommit.requested` file. It never produces the `.replacecommit.inflight` and `.replacecommit` files that would complete the clustering process.
   
   
   **Expected behavior**
   
   It should create the clustered files after the clustering request is created.
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Flink version : 1.14.6
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : Yes
   
   **Config of Job**
   
   ```
   'hoodie.clean.max.commits'='3',
   'hoodie.cleaner.commits.retained' = '3',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.parquet.small.file.limit'='104857600',
   'hoodie.clustering.inline'= 'true',
   'hoodie.clustering.inline.max.commits'= '2',
   'hoodie.clustering.plan.strategy.max.bytes.per.group'= '107374182400',
   'hoodie.clustering.plan.strategy.max.num.groups'= '1' 
   ```
   
   I am attaching a screenshot for the same.
   
   ![image](https://user-images.githubusercontent.com/115445723/225327245-4e34d468-4c43-4760-836a-a4bac18aa913.png)
   
   
   
   **Stacktrace**
   
   ```No error on hudi job```
   
   As you can see in the screenshot, it creates the `.replacecommit.requested` file but does not generate the `.replacecommit.inflight` and `.replacecommit` files.
   
   This is blocking our progress toward completing the project.
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] maheshguptags commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1473411812

   @danny0405 with your suggestion it is still not working with either MOR or COW using the insert operation. Could you please share the support matrix for compaction and clustering?




[GitHub] [hudi] hbgstc123 commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1470238179

   Flink configuration differs from Spark's. Set these two options to enable clustering in a Flink job:
   
   'clustering.schedule.enabled'='true',
   'clustering.async.enabled'='true'
   
   For the other Flink clustering options, refer to the docs:
   https://hudi.apache.org/docs/configurations#clusteringasyncenabled
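   
   A minimal sketch of where these two options go in a Flink SQL table definition. The table name, columns, and path are hypothetical placeholders, and `clustering.delta_commits` is included only to show where a scheduling interval would be set; check the configuration page linked above for the exact keys and defaults in your Hudi version. Table type and write operation are deliberately left out of this fragment.
   
   ```sql
   -- Hypothetical sink table; only the clustering.* keys are the point of this sketch.
   CREATE TABLE hudi_sink (
     id STRING,
     payload STRING,
     ts TIMESTAMP(3),
     PRIMARY KEY (id) NOT ENFORCED
   ) WITH (
     'connector' = 'hudi',
     'path' = 's3a://example-bucket/hudi/hudi_sink',  -- placeholder path
     'clustering.schedule.enabled' = 'true',          -- generate clustering plans
     'clustering.async.enabled' = 'true',             -- execute the plans inside the streaming job
     'clustering.delta_commits' = '2'                 -- assumed interval: plan every 2 commits
   );
   ```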




[GitHub] [hudi] maheshguptags commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1480701240

   Hi @danny0405,
   
   Thank you for joining me on the call. I tried your suggestion and deployed the code to a k8s cluster, and it is working fine.
   It seems the issue was specific to the local IDE run.
   
   Below is the configuration that worked for me.
   ```
   'table.type' = 'COPY_ON_WRITE',
   'write.operation'='insert',
   'hoodie.datasource.write.recordkey.field' = 'x,y',
   'read.streaming.enabled'='true',
   'read.start-commit'='earliest',
   'hoodie.clean.max.commits'='3',
   'hoodie.cleaner.commits.retained' = '3',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.parquet.small.file.limit'='104857600',
   'hoodie.parquet.max.file.size'= '536870912',
   'hoodie.parquet.compression.codec'='snappy',
   'clustering.schedule.enabled'='true', 
   'clustering.async.enabled'='true', 
   'hoodie.clustering.inline'= 'true', 
   'hoodie.clustering.inline.max.commits'= '2',
   'hoodie.clustering.plan.strategy.max.bytes.per.group'= '107374182400',
   'hoodie.clustering.plan.strategy.max.num.groups'= '1',
   'write.tasks'='2'
   ```
   
   
   **DAG Plan** 
   ![image](https://user-images.githubusercontent.com/115445723/227127466-2f6cf2bb-9f19-4215-89a2-4a7d5dcd1a52.png)
   
   **timeline**
   <img width="1236" alt="image" src="https://user-images.githubusercontent.com/115445723/227127542-bd678fb1-15e0-4d91-8fa6-b71f3577095e.png">
   
   Thank you very much for all the help. I really appreciate it!
   
   Thanks 
   - Mahesh  
   
   
    




[GitHub] [hudi] danny0405 commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1471316428

   The clustering only works for `MOR` table with `INSERT` operation, what is your table type then?




[GitHub] [hudi] danny0405 commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1475545235

   There is no specific support matrix, but I can list some combinations here:
   
   ```md
   | table type | operation | table service | supported |
   | ---------- | --------- | ------------- | --------- |
   | COW        | insert    | clustering    | yes       |
   | MOR        | upsert    | compaction    | yes       |
   ```
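   
   As a rough illustration of the second row (MOR, upsert, compaction), a Flink SQL definition might look like the sketch below. The table name, columns, and path are hypothetical, and the `compaction.*` values are assumptions shown for shape only; verify the keys against the Hudi Flink configuration reference for your version.
   
   ```sql
   -- Hypothetical MOR table that relies on compaction rather than clustering.
   CREATE TABLE hudi_mor_sink (
     id STRING,
     payload STRING,
     ts TIMESTAMP(3),
     PRIMARY KEY (id) NOT ENFORCED
   ) WITH (
     'connector' = 'hudi',
     'path' = 's3a://example-bucket/hudi/hudi_mor_sink',  -- placeholder path
     'table.type' = 'MERGE_ON_READ',
     'write.operation' = 'upsert',
     'compaction.async.enabled' = 'true',  -- run compaction inside the streaming job
     'compaction.delta_commits' = '5'      -- assumed trigger: compact every 5 delta commits
   );
   ```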




[GitHub] [hudi] maheshguptags commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1475804727

   @danny0405,
   I am using the operation type `INSERT`, not `UPSERT`.
   
   I am using the config below.
   ```
   'table.type' = 'COPY_ON_WRITE',
   'hoodie.datasource.write.operation'='insert',
   'hoodie.clean.max.commits'='2',
   'hoodie.cleaner.commits.retained' = '3',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.parquet.small.file.limit'='104857600',
   'clustering.schedule.enabled'='true',
   'clustering.async.enabled'='true',
   'hoodie.clustering.inline'= 'true',
   'hoodie.clustering.inline.max.commits'= '2',
   'hoodie.clustering.plan.strategy.max.bytes.per.group'= '107374182400',
   'hoodie.clustering.plan.strategy.max.num.groups'= '1'
   ``` 
   
   thanks 




[GitHub] [hudi] maheshguptags commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1471343339

   I was trying with the `COW` table but let me try it out with the `MOR` table. 




[GitHub] [hudi] danny0405 closed issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 closed issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi 
URL: https://github.com/apache/hudi/issues/8195




[GitHub] [hudi] maheshguptags commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1475552002

   Hi @danny0405,
   I have set the operation type to `INSERT` with the `COW` table, but it is still not working (please check the `COW` config above).
   Can you please look into this?
   
   This is blocking us in production. 
   Feel free to let me know if you need anything from me.
   
   Thanks,
   Mahesh    




[GitHub] [hudi] danny0405 commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1475576491

   Can you paste the job pipeline DAG from the Flink web UI here? Let's see whether the clustering sub-pipeline has been brought up.




[GitHub] [hudi] danny0405 commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1475771895

   Hmm, from the DAG, append mode has not actually taken effect here. You are still using `UPSERT` mode, which handles updates.
   
   Make sure your table type is COW and the operation is INSERT.




[GitHub] [hudi] maheshguptags commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1471357761

   Hi @danny0405,
   I tried with the `MOR` table, but the result is the same: clustering is still not performed.
   ```
   'table.type' = 'MERGE_ON_READ',
   'hoodie.compact.inline'= 'true',
   'hoodie.compact.inline.max.delta.commits'='2',
   'hoodie.clean.max.commits'='2',
   'hoodie.cleaner.commits.retained' = '3',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.parquet.small.file.limit'='104857600',
   'clustering.schedule.enabled'='true',
   'clustering.async.enabled'='true',
   'hoodie.clustering.inline'= 'true',
   'hoodie.clustering.inline.max.commits'= '2',
   'hoodie.clustering.plan.strategy.max.bytes.per.group'= '107374182400',
   'hoodie.clustering.plan.strategy.max.num.groups'= '1'
   ``` 
   I am attaching a screenshot of this; please have a look.
   
   ![image](https://user-images.githubusercontent.com/115445723/225526212-e1d15c2d-2e54-4764-b14c-df9307a8a294.png)
   
   Thanks 
   Mahesh 
   
    




[GitHub] [hudi] maheshguptags commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1471302720

   Hi @hbgstc123,
   
   I tried your suggestion and updated the config, but I am still not getting the `.replacecommit.inflight` and `.replacecommit` files.
   
   I am also attaching the updated config and screenshot for your reference.
   
   **CONFIG**
   ```
   'hoodie.clean.max.commits'='3',
   'hoodie.cleaner.commits.retained' = '3',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.parquet.small.file.limit'='104857600',
   'clustering.schedule.enabled'='true',
   'clustering.async.enabled'='true',
   'hoodie.clustering.inline'= 'true',
   'hoodie.clustering.inline.max.commits'= '2',
   'hoodie.clustering.plan.strategy.max.bytes.per.group'= '107374182400',
   'hoodie.clustering.plan.strategy.max.num.groups'= '1' 
   ``` 
   ![image](https://user-images.githubusercontent.com/115445723/225516721-ced15071-703b-4209-ab1f-5abd81502616.png)
   
   Thanks,
   Mahesh  
   




[GitHub] [hudi] danny0405 commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1477305130

   I see. Set `write.operation` (instead of `hoodie.datasource.write.operation`) to `INSERT`; that should work.
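   
   A sketch of that key change in a Flink SQL table definition, assuming a hypothetical table name, columns, and path. Only the option keys `table.type`, `write.operation`, `clustering.schedule.enabled`, and `clustering.async.enabled` come from this thread; everything else is a placeholder.
   
   ```sql
   -- Hypothetical COW table written in insert mode so that clustering can apply.
   CREATE TABLE hudi_cow_sink (
     id STRING,
     payload STRING,
     ts TIMESTAMP(3),
     PRIMARY KEY (id) NOT ENFORCED
   ) WITH (
     'connector' = 'hudi',
     'path' = 's3a://example-bucket/hudi/hudi_cow_sink',  -- placeholder path
     'table.type' = 'COPY_ON_WRITE',
     -- Use the Flink key below rather than 'hoodie.datasource.write.operation',
     -- which in this thread left the job running in UPSERT mode.
     'write.operation' = 'insert',
     'clustering.schedule.enabled' = 'true',
     'clustering.async.enabled' = 'true'
   );
   ```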




[GitHub] [hudi] maheshguptags commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1470049511

   cc: @codope @danny0405 @XuQianJin-Stars  




[GitHub] [hudi] danny0405 commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1471670015

   > I was trying with the `COW` table but let me try it out with the `MOR` table.
   
   Sorry, my mistake, it's the `COW` table with `INSERT` operation




[GitHub] [hudi] maheshguptags commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1475723085

   Hi @danny0405,
   Please find attached the pipeline DAG and the timeline of the job.
   
   ![image](https://user-images.githubusercontent.com/115445723/226269602-712e0ba0-70fd-42a2-8681-dbe639a87065.png)
   
   ![image](https://user-images.githubusercontent.com/115445723/226269709-0e6cadd5-4a87-4b17-88cf-01f8a0d96326.png)
   
   let me know if you need any other details.
   
   Thanks,
   Mahesh  




[GitHub] [hudi] hbgstc123 commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1471465616

   Clustering only works for a COW table with the INSERT operation.  
   It seems you are using the default `write.operation`, which is UPSERT.  
   With the default state index, upsert first tries to write new data into file groups whose base file is smaller than `hoodie.parquet.small.file.limit` (default 100MB).  




[GitHub] [hudi] maheshguptags commented on issue #8195: Clustering is not happening on Flink Hudi

Posted by "maheshguptags (via GitHub)" <gi...@apache.org>.
maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1471521198

   Hi @hbgstc123 and @danny0405,
   I tried your suggestions with both `MOR` and `COW`, but clustering does not seem to work with either combination. Let me share the configurations and screenshots from what I tried.
   
   I would also like to ask for an example or use case where you have implemented inline clustering in a Flink job. Alternatively, we could set up a call to discuss this issue in more detail.
   
   **MOR config**
   ``` 
   'table.type' = 'MERGE_ON_READ',
   'hoodie.datasource.write.operation'='insert',
   'hoodie.compact.inline'= 'true', 
   'hoodie.compact.inline.max.delta.commits'='2',
   'hoodie.clean.max.commits'='2', 
   'hoodie.cleaner.commits.retained' = '3',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.parquet.small.file.limit'='104857600',
   'clustering.schedule.enabled'='true',
   'clustering.async.enabled'='true',
   'hoodie.clustering.inline'= 'true',
   'hoodie.clustering.inline.max.commits'= '2',
   'hoodie.clustering.plan.strategy.max.bytes.per.group'= '107374182400',
   'hoodie.clustering.plan.strategy.max.num.groups'= '1' 
   ```
   ![image](https://user-images.githubusercontent.com/115445723/225557911-fb337174-4b79-4f2c-a4c7-22cc12178c61.png)
   
   **COW config**
   ```
   'table.type' = 'COPY_ON_WRITE',
   'hoodie.datasource.write.operation'='insert',
   'hoodie.clean.max.commits'='2',
   'hoodie.cleaner.commits.retained' = '3',
   'hoodie.clean.async'='true',
   'hoodie.cleaner.policy' = 'KEEP_LATEST_COMMITS',
   'hoodie.parquet.small.file.limit'='104857600',
   'clustering.schedule.enabled'='true',
   'clustering.async.enabled'='true',
   'hoodie.clustering.inline'= 'true',
   'hoodie.clustering.inline.max.commits'= '2',
   'hoodie.clustering.plan.strategy.max.bytes.per.group'= '107374182400',
   'hoodie.clustering.plan.strategy.max.num.groups'= '1'
   ```
   ![image](https://user-images.githubusercontent.com/115445723/225558799-39002a89-6b95-40a2-8291-f2902a69dc74.png)
   
   Thanks 
   Mahesh 
   

