Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/26 03:23:14 UTC

[GitHub] [hudi] CodeCooker17 opened a new issue, #6791: [SUPPORT] Optimized the way to get HoodieBaseFile of loadColumnRangesFromFiles of Bloom Index

CodeCooker17 opened a new issue, #6791:
URL: https://github.com/apache/hudi/issues/6791

   **Describe the problem you faced**
   When the Bloom Index runs loadColumnRangesFromFiles during the tagLocation process, it currently obtains each HoodieBaseFile by requesting it from the driver. When the data volume is large and the parallelism is high, these requests become a network bottleneck, which makes tagLocation very slow.
   
   However, the HoodieBaseFile can be obtained directly through HoodieIndexUtils.getLatestBaseFilesForAllPartitions() inside loadColumnRangesFromFiles(), which can noticeably improve the tagLocation performance of the Bloom Index.
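   
   The idea can be pictured with a minimal, self-contained sketch. It is plain Java and does not call any Hudi API: the `BaseFile` class, the `sampleTable()` data and both method names are hypothetical stand-ins, and the parallel stream only imitates distributed tasks. Each task resolves the latest base file of every file group in its own partition, instead of asking a central driver for each file. The actual change is the one tracked in the PR for this issue, not this code.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;
   
   final class BaseFileLookupSketch {
   
       /** Hypothetical stand-in for a base file: file group id plus the commit time that wrote it. */
       static final class BaseFile {
           final String fileId;
           final String commitTime;
   
           BaseFile(String fileId, String commitTime) {
               this.fileId = fileId;
               this.commitTime = commitTime;
           }
       }
   
       /** Keeps only the newest base file of each file group within one partition. */
       static List<BaseFile> latestBaseFiles(List<BaseFile> filesInPartition) {
           Map<String, BaseFile> latest = new HashMap<>();
           for (BaseFile f : filesInPartition) {
               // Assumption: commit times are fixed-width timestamps, so string order is time order.
               latest.merge(f.fileId, f, (a, b) -> a.commitTime.compareTo(b.commitTime) >= 0 ? a : b);
           }
           List<BaseFile> result = new ArrayList<>(latest.values());
           result.sort((a, b) -> a.fileId.compareTo(b.fileId));
           return result;
       }
   
       /**
        * Resolves "partition/fileId@commitTime" entries for all partitions.
        * The parallel stream stands in for distributed tasks: each one works on
        * its own partition locally rather than querying a central coordinator.
        */
       static List<String> latestBaseFilesForAllPartitions(Map<String, List<BaseFile>> table) {
           return table.entrySet().parallelStream()
               .flatMap(e -> latestBaseFiles(e.getValue()).stream()
                   .map(f -> e.getKey() + "/" + f.fileId + "@" + f.commitTime))
               .sorted()
               .collect(Collectors.toList());
       }
   
       /** Made-up sample data: two partitions, file group f1 written twice. */
       static Map<String, List<BaseFile>> sampleTable() {
           Map<String, List<BaseFile>> table = new HashMap<>();
           table.put("2022/09/25", Arrays.asList(
               new BaseFile("f1", "20220925010000"),
               new BaseFile("f1", "20220925020000"),
               new BaseFile("f2", "20220925010000")));
           table.put("2022/09/26", Arrays.asList(new BaseFile("f3", "20220926010000")));
           return table;
       }
   
       public static void main(String[] args) {
           latestBaseFilesForAllPartitions(sampleTable()).forEach(System.out::println);
       }
   }
   ```
   
   In this sketch the older copy of `f1` is dropped and only the three latest base files remain, which is the set that column ranges would then be loaded from.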
   
   **Environment Description**
   
   * Hudi version : 0.12.0
   
   * Spark version : 3.1.2
   
   * Hive version : 2.3.1
   
   * Hadoop version : 2.6.5
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua closed issue #6791: [SUPPORT] Optimized the way to get HoodieBaseFile of loadColumnRangesFromFiles of Bloom Index

Posted by GitBox <gi...@apache.org>.
yihua closed issue #6791: [SUPPORT] Optimized the way to get HoodieBaseFile of loadColumnRangesFromFiles of Bloom Index
URL: https://github.com/apache/hudi/issues/6791




[GitHub] [hudi] yihua commented on issue #6791: [SUPPORT] Optimized the way to get HoodieBaseFile of loadColumnRangesFromFiles of Bloom Index

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6791:
URL: https://github.com/apache/hudi/issues/6791#issuecomment-1258153174

   @CodeCooker17 Thanks for raising this!  I saw that a Jira ticket, HUDI-4917, has been created and you put up a PR.  Let's track it there.

