Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/28 03:11:00 UTC
[GitHub] [hudi] Guanpx opened a new issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Guanpx opened a new issue #5150:
URL: https://github.com/apache/hudi/issues/5150
**Describe the problem you faced**
Flink + Hudi COW + BUCKET index + bulk_insert
bucket_bulk_insert is **very slow** and generates **too many small HDFS files**
**To Reproduce**
Steps to reproduce the behavior:
1. Use bulk_insert on a COW table with the Flink BUCKET index; the data size is about 5 million records (5 GB), running in Flink batch mode.
**Expected behavior**
Data is read from Hive and written to Hudi with Flink **without** generating too many small files.
**Environment Description**
* Flink version : 1.14.3
* Hudi version : master-0.11.0 (2022-03-28 10:00am, UTC+8)
* Hadoop version : 3.0.0
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
**Additional context**
* hudi config
```
'connector' = 'hudi',
'path' = 'hdfs://nameservice-ha/hudi/dw/rds.db/xxxx',
'hoodie.parquet.compression.codec'= 'snappy',
'index.type' = 'BUCKET',
'table.type' = 'COPY_ON_WRITE',
'write.operation' = 'bulk_insert',
'write.tasks' = '6',
'hoodie.bucket.index.num.buckets' = '6',
'write.sort.memory' = '256',
'hoodie.bucket.index.hash.field' = 'id'
```
* bucket_bulk_insert is very slow: about 4000 records/min
<img width="1515" alt="image" src="https://user-images.githubusercontent.com/29246713/160319772-8e01087a-98b6-44d8-a0fc-f2aebdd39c49.png">
* too many small HDFS files
<img width="1014" alt="image" src="https://user-images.githubusercontent.com/29246713/160319884-de51b98f-3099-4c5b-97af-0133426476d2.png">
**Stacktrace**
``` nothing```
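
For readers reproducing the setup, the options listed under "hudi config" above would typically sit in the `WITH` clause of a Flink SQL table definition. The following is only a sketch under stated assumptions: the table names `hudi_sink` and `hive_source` and the column list are hypothetical, since the issue does not show the actual schema or job; only the option keys and values come from the report.

```sql
-- Run the job in batch mode, as described in the reproduction step.
SET 'execution.runtime-mode' = 'batch';

-- Hypothetical sink table: the name and columns are placeholders,
-- the WITH options are copied from the issue.
CREATE TABLE hudi_sink (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://nameservice-ha/hudi/dw/rds.db/xxxx',
  'hoodie.parquet.compression.codec' = 'snappy',
  'index.type' = 'BUCKET',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'bulk_insert',
  'write.tasks' = '6',
  'hoodie.bucket.index.num.buckets' = '6',
  'write.sort.memory' = '256',
  'hoodie.bucket.index.hash.field' = 'id'
);

-- Hypothetical source: assumes a Hive table registered as hive_source.
INSERT INTO hudi_sink
SELECT id, name FROM hive_source;
```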
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080474039
Fixed altogether in https://github.com/apache/hudi/pull/5093
[GitHub] [hudi] Guanpx commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Posted by GitBox <gi...@apache.org>.
Guanpx commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080237177
> See the fix here: #5151
Still too many small files:
<img width="1070" alt="image" src="https://user-images.githubusercontent.com/29246713/160336697-66ff7f67-e1ef-4a6e-9427-1381eacf4788.png">
[GitHub] [hudi] Guanpx edited a comment on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Posted by GitBox <gi...@apache.org>.
Guanpx edited a comment on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080237177
> See the fix here: #5151
* Still too many small files:
<img width="1070" alt="image" src="https://user-images.githubusercontent.com/29246713/160336697-66ff7f67-e1ef-4a6e-9427-1381eacf4788.png">
[GitHub] [hudi] Guanpx edited a comment on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Posted by GitBox <gi...@apache.org>.
Guanpx edited a comment on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080133387
For this PR: https://github.com/apache/hudi/pull/5135
[GitHub] [hudi] Guanpx commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Posted by GitBox <gi...@apache.org>.
Guanpx commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080140118
> See the fix here: #5151
Thank you very much, I will try again now.
[GitHub] [hudi] danny0405 commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080138417
See the fix here: https://github.com/apache/hudi/pull/5151
[GitHub] [hudi] danny0405 closed issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Posted by GitBox <gi...@apache.org>.
danny0405 closed issue #5150:
URL: https://github.com/apache/hudi/issues/5150
[GitHub] [hudi] Guanpx commented on issue #5150: [SUPPORT] bucket_bulk_insert so slow and generate too many hdfs small flie with Flink BUCKET index
Posted by GitBox <gi...@apache.org>.
Guanpx commented on issue #5150:
URL: https://github.com/apache/hudi/issues/5150#issuecomment-1080133387
https://github.com/apache/hudi/pull/5135