Posted to commits@hudi.apache.org by "huyuanfeng2018 (via GitHub)" <gi...@apache.org> on 2023/03/23 03:23:07 UTC

[GitHub] [hudi] huyuanfeng2018 opened a new issue, #8274: [SUPPORT] Append Mode should support close the bloom filter option

huyuanfeng2018 opened a new issue, #8274:
URL: https://github.com/apache/hudi/issues/8274

   When writing in insert (append) mode, Hudi still builds a bloom filter on the record key at the same time. I think an option to turn this off would increase write throughput.
   
   I could not find a corresponding setting in the 0.13 branch, so it appears to be enabled by default.
   ```
     private static HoodieRowDataFileWriter newParquetInternalRowFileWriter(
         Path path, HoodieWriteConfig writeConfig, RowType rowType, HoodieTable table)
         throws IOException {
       BloomFilter filter = BloomFilterFactory.createBloomFilter(
           writeConfig.getBloomFilterNumEntries(),
           writeConfig.getBloomFilterFPP(),
           writeConfig.getDynamicBloomFilterMaxNumEntries(),
           writeConfig.getBloomFilterType());
       HoodieRowDataParquetWriteSupport writeSupport =
           new HoodieRowDataParquetWriteSupport(table.getHadoopConf(), rowType, filter);
       return new HoodieRowDataParquetWriter(
           path, new HoodieParquetConfig<>(
           writeSupport,
           writeConfig.getParquetCompressionCodec(),
           writeConfig.getParquetBlockSize(),
           writeConfig.getParquetPageSize(),
           writeConfig.getParquetMaxFileSize(),
           writeSupport.getHadoopConf(),
           writeConfig.getParquetCompressionRatio(),
           writeConfig.parquetDictionaryEnabled()));
     }
   ```
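
One possible shape for such an option, as a minimal sketch with stand-in types rather than real Hudi classes. The flag `bloomFilterEnabled` and the classes `SketchWriteConfig` and `SketchBloomFilter` are hypothetical and do not exist in Hudi; the sizing values are only illustrative.

```java
import java.util.Optional;

/** Sketch only: none of these classes or flags exist in Hudi. */
final class OptionalBloomFilterSketch {

  /** Creates a filter only when the hypothetical flag is enabled. */
  static Optional<SketchBloomFilter> maybeCreateFilter(SketchWriteConfig config) {
    if (!config.bloomFilterEnabled) {
      return Optional.empty(); // an append-only writer skips the filter entirely
    }
    return Optional.of(
        new SketchBloomFilter(config.bloomFilterNumEntries, config.bloomFilterFpp));
  }

  public static void main(String[] args) {
    SketchWriteConfig appendOnly = new SketchWriteConfig(false, 60000, 0.000000001);
    SketchWriteConfig upsertable = new SketchWriteConfig(true, 60000, 0.000000001);
    System.out.println("append-only has filter: " + maybeCreateFilter(appendOnly).isPresent());
    System.out.println("upsertable has filter: " + maybeCreateFilter(upsertable).isPresent());
  }
}

/** Stand-in for the writer configuration; bloomFilterEnabled is a hypothetical flag. */
final class SketchWriteConfig {
  final boolean bloomFilterEnabled;
  final int bloomFilterNumEntries;
  final double bloomFilterFpp;

  SketchWriteConfig(boolean bloomFilterEnabled, int bloomFilterNumEntries, double bloomFilterFpp) {
    this.bloomFilterEnabled = bloomFilterEnabled;
    this.bloomFilterNumEntries = bloomFilterNumEntries;
    this.bloomFilterFpp = bloomFilterFpp;
  }
}

/** Stand-in for a bloom filter; it only records its sizing parameters. */
final class SketchBloomFilter {
  final int numEntries;
  final double fpp;

  SketchBloomFilter(int numEntries, double fpp) {
    this.numEntries = numEntries;
    this.fpp = fpp;
  }
}
```

In a real change, the write support would also have to tolerate a missing filter. That part is not shown here.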
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] huyuanfeng2018 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1480626101

   > You can supply a PR though, let's see how much gain we can get for the write throughput
   
   Yes. I recently tested the performance of massive real-time writes with Iceberg and Hudi. The append-mode logic of the two seems basically the same, but Hudi's throughput seems much lower, so I want to find out what causes it. I'm running some comparisons, and I think I can first try turning off bloom filter writing to see how much it improves. Perhaps you can give me some hints about what else could cause a significant difference between the two. Thank you.




[GitHub] [hudi] huyuanfeng2018 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1482544908

   > You can supply a PR though, let's see how much gain we can get for the write throughput.
   
   I ran the write with the bloom filter and without it on two different days, starting at 21:00 each day, during our business peak. I believe this reflects Hudi's peak consumption rate. The results are as follows:
   
   1. Turn on the bloom filter:
   ![image](https://user-images.githubusercontent.com/40817998/227490546-7a971a62-2e5a-4bfa-8b7e-6e7524cd31e7.png)
   2. Turn off the bloom filter:
   ![image](https://user-images.githubusercontent.com/40817998/227490665-a2fa7c74-9a7a-42e6-8292-e0247144d403.png)
   
   So I think the bloom filter has some impact on write throughput, and turning it off may bring a considerable benefit.
   @danny0405 




[GitHub] [hudi] huyuanfeng2018 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1484369028

   > > a certain impact on write throughput
   > 
   > I'm confused why turning off the BF increased the write throughput.
   
   I think that a BF structure is also populated while the data is written, which increases the write time.
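
To make the extra per-record work concrete, here is a minimal textbook Bloom filter sketch. It is not Hudi's implementation: the class name, the FNV-1a hashing, and the sizes are assumptions chosen for illustration. It only shows that every written record key costs several hash and bit-set operations on top of the file write itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Textbook Bloom filter sketch; not Hudi's implementation. */
final class BloomFilterCostSketch {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;
  private long bitUpdates = 0; // counts the extra per-record work

  BloomFilterCostSketch(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  /** Called once per written record, in addition to the actual file write. */
  void add(String recordKey) {
    byte[] bytes = recordKey.getBytes(StandardCharsets.UTF_8);
    int h1 = fnv1a(bytes, 0x811C9DC5);
    int h2 = fnv1a(bytes, 0x01000193) | 1; // odd step for double hashing
    for (int i = 0; i < numHashes; i++) {
      bits.set(Math.floorMod(h1 + i * h2, numBits));
      bitUpdates++;
    }
  }

  /** Returns false only if the key was definitely never added. */
  boolean mightContain(String recordKey) {
    byte[] bytes = recordKey.getBytes(StandardCharsets.UTF_8);
    int h1 = fnv1a(bytes, 0x811C9DC5);
    int h2 = fnv1a(bytes, 0x01000193) | 1;
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
        return false;
      }
    }
    return true;
  }

  long bitUpdates() {
    return bitUpdates;
  }

  /** 32-bit FNV-1a over the key bytes, starting from the given seed. */
  private static int fnv1a(byte[] data, int seed) {
    int hash = seed;
    for (byte b : data) {
      hash ^= (b & 0xFF);
      hash *= 16777619;
    }
    return hash;
  }

  public static void main(String[] args) {
    BloomFilterCostSketch filter = new BloomFilterCostSketch(1 << 20, 7);
    for (int i = 0; i < 100_000; i++) {
      filter.add("key-" + i);
    }
    System.out.println("extra bit updates: " + filter.bitUpdates());
  }
}
```

Whether this overhead is noticeable in practice depends on the workload, so it still needs to be measured.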
   
   




[GitHub] [hudi] danny0405 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1486238560

   I'm just confused by your screenshot because from the picture the performance with BF enabled seems better.






[GitHub] [hudi] huyuanfeng2018 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1480548998

   cc @danny0405 




[GitHub] [hudi] danny0405 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1486339837

   Okay, so that is the ballpark number for the performance gain from disabling the BF?




[GitHub] [hudi] danny0405 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1483712177

   > a certain impact on write throughput
   
   I'm confused why turning off the BF increased the write throughput.




[GitHub] [hudi] kazdy commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1480907317

   Hudi does a lot of additional work compared to Iceberg. For example, in my case metadata table maintenance alone takes as long as the read, transform, and write part (around 3m in CoW with inline metadata table maintenance). Have you tried disabling the metadata table or using the async metadata table service? In my case microbatch processing time dropped from 7m to 3m once I enabled it.
   If you want higher throughput, you can also disable cleaning, compaction, and clustering and run them in a separate job.
   Have you tried these things?
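
As a rough sketch of what those suggestions could look like for a Spark writer, the snippet below collects the relevant write options in a map. The class name is hypothetical, and the keys are the Hudi write configs as I understand them, so they should be checked against the configuration reference for the version in use. The Flink writer discussed earlier in this thread uses different option names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: write options that keep table services out of the ingestion job. */
final class LeanWriterOptionsSketch {

  static Map<String, String> leanWriterOptions() {
    Map<String, String> options = new LinkedHashMap<>();
    // Skip metadata table maintenance inside the write (assumed key, verify per version).
    options.put("hoodie.metadata.enable", "false");
    // Do not run cleaning automatically after each commit.
    options.put("hoodie.clean.automatic", "false");
    // Do not run compaction or clustering inline with the write.
    options.put("hoodie.compact.inline", "false");
    options.put("hoodie.clustering.inline", "false");
    return options;
  }

  public static void main(String[] args) {
    leanWriterOptions().forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```

These would be passed as writer options, with cleaning, compaction, and clustering then scheduled from a separate job, as suggested above.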




[GitHub] [hudi] huyuanfeng2018 closed issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 closed issue #8274: [SUPPORT] Append Mode should support close the bloom filter option
URL: https://github.com/apache/hudi/issues/8274




[GitHub] [hudi] huyuanfeng2018 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1486122206

   > Then why does `turning off the BF` increase the performance?
   
   I think we may be talking about different kinds of write performance. What I mean is the performance of the overall process of ingesting data into the lake, not just the performance of writing the parquet file. What I turned off is the writing of the BF data structure that happens along with the parquet write, and with it turned off the overall performance certainly improves. That is separate from the performance of writing the parquet data itself. In theory the two are unrelated.




[GitHub] [hudi] danny0405 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1480615959

   Yeah, we can do that if we are sure the bloom filter is not needed, but this is also risky because you have no idea whether the table could be updated in the future.
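
The risk can be illustrated with a small sketch of key lookup during an update. With a per-file filter, files that definitely do not contain a key can be skipped. Without one, every file stays a candidate. Everything here (`FilePruningSketch`, `DataFile`, the hashing, the bit count) is a hypothetical stand-in and not Hudi's index code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Sketch: how a per-file key filter narrows the files an update must check. */
final class FilePruningSketch {
  static final int BITS = 1 << 16;

  /** Three bit positions derived from the key (simple double hashing). */
  static int[] positions(String key) {
    int h1 = key.hashCode();
    int h2 = Integer.reverse(h1) | 1;
    return new int[] {
      Math.floorMod(h1, BITS), Math.floorMod(h1 + h2, BITS), Math.floorMod(h1 + 2 * h2, BITS)
    };
  }

  /** Returns the names of files that cannot be ruled out for the given key. */
  static List<String> candidateFiles(List<DataFile> files, String key) {
    List<String> candidates = new ArrayList<>();
    for (DataFile file : files) {
      if (file.mightContain(key)) {
        candidates.add(file.name);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    List<DataFile> withoutFilters = Arrays.asList(
        new DataFile("file-a", false, Arrays.asList("k1", "k2")),
        new DataFile("file-b", false, Arrays.asList("k3", "k4")));
    System.out.println("files to check for k3: " + candidateFiles(withoutFilters, "k3"));
  }

  /** Hypothetical stand-in for a data file with an optional key filter. */
  static final class DataFile {
    final String name;
    final BitSet filter; // null when the file was written without a filter

    DataFile(String name, boolean withFilter, List<String> keys) {
      this.name = name;
      this.filter = withFilter ? new BitSet(BITS) : null;
      if (withFilter) {
        for (String key : keys) {
          for (int position : positions(key)) {
            filter.set(position);
          }
        }
      }
    }

    boolean mightContain(String key) {
      if (filter == null) {
        return true; // no filter: the file can never be ruled out
      }
      for (int position : positions(key)) {
        if (!filter.get(position)) {
          return false;
        }
      }
      return true;
    }
  }
}
```

Files written without a filter always remain candidates, so a table that later receives updates would pay for the missing filters at lookup time.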




[GitHub] [hudi] danny0405 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1485050732

   Then why does `turning off the BF` increase the performance?




[GitHub] [hudi] huyuanfeng2018 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1486120280

   
   > Then why does `turning off the BF` increase the performance?
   
   




[GitHub] [hudi] huyuanfeng2018 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1486251767

   > I'm just confused by your screenshot because from the picture the performance with BF enabled seems better.
   
   Sorry, I got them backwards 😓






[GitHub] [hudi] huyuanfeng2018 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.
huyuanfeng2018 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1486379272

   > Okay, so that is the ballpark number for the performance gain from disabling the BF?
   
   In our scenario, probably yes
   
   




[GitHub] [hudi] danny0405 commented on issue #8274: [SUPPORT] Append Mode should support close the bloom filter option

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8274:
URL: https://github.com/apache/hudi/issues/8274#issuecomment-1480617067

   You can supply a PR though, let's see how much gain we can get for the write throughput.

