Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2019/10/27 06:48:00 UTC
[jira] [Created] (HUDI-315) Reimplement statistics/workload profile
collected during writes using Spark 2.x custom accumulators
Vinoth Chandar created HUDI-315:
-----------------------------------
Summary: Reimplement statistics/workload profile collected during writes using Spark 2.x custom accumulators
Key: HUDI-315
URL: https://issues.apache.org/jira/browse/HUDI-315
Project: Apache Hudi (incubating)
Issue Type: Improvement
Components: Performance, Write Client
Reporter: Vinoth Chandar
https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
In Hudi, there are two places where we need to obtain statistics on the input data:
- HoodieBloomIndex: to determine which partitions need to be loaded and checked against (whether this is still needed with the timeline server enabled is a separate question)
- Workload profile: to get a sense of the number of updates and inserts going to each partition/file group
Both of them issue their own groupBy or shuffle computation today. This can be avoided by using an accumulator, which collects the counts as a side effect of a pass over the input instead of a separate shuffle.
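
As a rough illustration of the idea (not Hudi's actual code), the sketch below mirrors the method contract of Spark 2.x's AccumulatorV2 (isZero, copy, reset, add, merge) in plain Java so it runs without a Spark dependency. The class name WorkloadStatAccumulator, the two-argument add, and the inserts/updates getters are hypothetical; in a real job the class would extend org.apache.spark.util.AccumulatorV2, take a single input value in add, and be registered through SparkContext.register.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: same method shape as Spark's AccumulatorV2,
// but standalone so it can be compiled and run without Spark.
final class WorkloadStatAccumulator {
    // Per-partition counters: index 0 = inserts, index 1 = updates.
    private final Map<String, long[]> counts = new HashMap<>();

    // True when nothing has been accumulated yet.
    boolean isZero() {
        return counts.isEmpty();
    }

    // Deep copy, so each task can work on its own instance.
    WorkloadStatAccumulator copy() {
        WorkloadStatAccumulator c = new WorkloadStatAccumulator();
        c.merge(this);
        return c;
    }

    void reset() {
        counts.clear();
    }

    // Called once per record while the write path iterates the input.
    void add(String partitionPath, boolean isUpdate) {
        long[] c = counts.computeIfAbsent(partitionPath, k -> new long[2]);
        c[isUpdate ? 1 : 0]++;
    }

    // Combines a task-local accumulator into this one (driver side).
    void merge(WorkloadStatAccumulator other) {
        other.counts.forEach((partition, c) -> {
            long[] mine = counts.computeIfAbsent(partition, k -> new long[2]);
            mine[0] += c[0];
            mine[1] += c[1];
        });
    }

    long inserts(String partitionPath) {
        long[] c = counts.get(partitionPath);
        return c == null ? 0L : c[0];
    }

    long updates(String partitionPath) {
        long[] c = counts.get(partitionPath);
        return c == null ? 0L : c[1];
    }
}
```

Because merge only sums counters, task-local results can be combined in any order. One caveat to keep in mind: Spark only guarantees exactly-once accumulator updates for updates made inside actions, so counts gathered inside a transformation may be applied more than once if a task or stage is re-executed.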
--
This message was sent by Atlassian Jira
(v8.3.4#803005)