Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/28 03:11:00 UTC

[GitHub] [hudi] Guanpx opened a new issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Guanpx opened a new issue #5150:
URL: https://github.com/apache/hudi/issues/5150


   **Describe the problem you faced**
   
   Flink + Hudi COW + BUCKET index + bulk_insert:
   bucket_bulk_insert is **very slow** and generates **too many small HDFS files**.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run bulk_insert into a COW table with the Flink BUCKET index on about 5 million records (5 GB), in Flink batch mode.
   
   **Expected behavior**
   
   The data is read from Hive and written to Hudi with Flink **and** without producing too many small files.
   
   **Environment Description**
   
   * Flink version : 1.14.3
   
   * Hudi version : master-0.11.0 (2022-03-28 10:00am, UTC+8)
   
   * Hadoop version : 3.0.0
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   
   * hudi config
   ```
     'connector' = 'hudi',
     'path' = 'hdfs://nameservice-ha/hudi/dw/rds.db/xxxx',
     'hoodie.parquet.compression.codec'= 'snappy',
     'index.type' = 'BUCKET',
     'table.type' = 'COPY_ON_WRITE',
     'write.operation' = 'bulk_insert', 
     'write.tasks' = '6', 
     'hoodie.bucket.index.num.buckets' = '6', 
     'write.sort.memory' = '256', 
     'hoodie.bucket.index.hash.field' = 'id' 
   ```
   
   * bucket_bulk_insert is very slow: about 4000 records/min
   
   <img width="1515" alt="image" src="https://user-images.githubusercontent.com/29246713/160319772-8e01087a-98b6-44d8-a0fc-f2aebdd39c49.png">
   
   * too many small HDFS files
   
   <img width="1014" alt="image" src="https://user-images.githubusercontent.com/29246713/160319884-de51b98f-3099-4c5b-97af-0133426476d2.png">
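
For readers who want to reproduce the setup, the options listed under "hudi config" above can be placed in a Flink SQL `CREATE TABLE` statement. The sketch below only renders such a statement as a string. The option keys and values are copied from the report; the table name `hudi_sink`, the two columns, and the primary key declaration are hypothetical placeholders, since the original schema is not shown.

```python
# Options copied verbatim from the report; nothing here talks to Flink or Hudi.
HUDI_OPTIONS = {
    "connector": "hudi",
    "path": "hdfs://nameservice-ha/hudi/dw/rds.db/xxxx",
    "hoodie.parquet.compression.codec": "snappy",
    "index.type": "BUCKET",
    "table.type": "COPY_ON_WRITE",
    "write.operation": "bulk_insert",
    "write.tasks": "6",
    "hoodie.bucket.index.num.buckets": "6",
    "write.sort.memory": "256",
    "hoodie.bucket.index.hash.field": "id",
}


def render_create_table(table, columns, primary_key, options):
    """Render a Flink SQL CREATE TABLE statement as a plain string."""
    column_lines = [f"  {name} {sql_type}" for name, sql_type in columns]
    column_lines.append(f"  PRIMARY KEY ({primary_key}) NOT ENFORCED")
    option_lines = [f"  '{key}' = '{value}'" for key, value in options.items()]
    return (
        f"CREATE TABLE {table} (\n"
        + ",\n".join(column_lines)
        + "\n) WITH (\n"
        + ",\n".join(option_lines)
        + "\n)"
    )


# Hypothetical table name and schema: the report does not show the real ones.
ddl = render_create_table(
    table="hudi_sink",
    columns=[("id", "BIGINT"), ("name", "STRING")],
    primary_key="id",
    options=HUDI_OPTIONS,
)
print(ddl)
```

Keeping the options in one mapping makes it easy to change a single value, such as `write.tasks`, between test runs without retyping the statement.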
   
   **Stacktrace**
   ``` nothing```
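
To put the reported rate in perspective, here is a rough back-of-the-envelope estimate using only the figures given above. It assumes that "500w" means 5 million records and that the rate of 4000 records/min stays constant for the whole job; both are assumptions, not measurements.

```python
# Figures quoted in the report; the constant-rate assumption is ours.
TOTAL_RECORDS = 5_000_000    # "500w" read as 500 x 10,000 records
RECORDS_PER_MINUTE = 4_000   # reported bulk_insert throughput


def estimated_minutes(total_records: int, records_per_minute: int) -> float:
    """Minutes needed to write total_records at a constant rate."""
    return total_records / records_per_minute


minutes = estimated_minutes(TOTAL_RECORDS, RECORDS_PER_MINUTE)
hours = minutes / 60
print(f"{minutes:.0f} minutes, about {hours:.1f} hours")
```

Under these assumptions the 5 GB load would need roughly 1250 minutes, which is why the reporter describes the write as very slow.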
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080474039


   Fixed altogether in https://github.com/apache/hudi/pull/5093





[GitHub] [hudi] Guanpx commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Posted by GitBox <gi...@apache.org>.
Guanpx commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080237177


   > See the fix here: #5151
   too many small files
   <img width="1070" alt="image" src="https://user-images.githubusercontent.com/29246713/160336697-66ff7f67-e1ef-4a6e-9427-1381eacf4788.png">
   





[GitHub] [hudi] Guanpx edited a comment on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Posted by GitBox <gi...@apache.org>.
Guanpx edited a comment on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080237177


   > See the fix here: #5151
   
   
   * too many small files
   <img width="1070" alt="image" src="https://user-images.githubusercontent.com/29246713/160336697-66ff7f67-e1ef-4a6e-9427-1381eacf4788.png">
   





[GitHub] [hudi] Guanpx edited a comment on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Posted by GitBox <gi...@apache.org>.
Guanpx edited a comment on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080133387


   For this PR: https://github.com/apache/hudi/pull/5135





[GitHub] [hudi] Guanpx commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Posted by GitBox <gi...@apache.org>.
Guanpx commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080140118


   > See the fix here: #5151
   
   Thank you very much, I will try again now.





[GitHub] [hudi] danny0405 commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080138417


   See the fix here: https://github.com/apache/hudi/pull/5151





[GitHub] [hudi] danny0405 closed issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Posted by GitBox <gi...@apache.org>.
danny0405 closed issue #5150:
URL: https://github.com/apache/hudi/issues/5150


   





[GitHub] [hudi] Guanpx commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index

Posted by GitBox <gi...@apache.org>.
Guanpx commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080133387


   https://github.com/apache/hudi/pull/5135

