Posted to commits@hudi.apache.org by "Thirumalai Raj R (Jira)" <ji...@apache.org> on 2021/06/10 04:51:00 UTC

[jira] [Commented] (HUDI-1628) Improve data locality during ingestion

    [ https://issues.apache.org/jira/browse/HUDI-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360549#comment-17360549 ] 

Thirumalai Raj R commented on HUDI-1628:
----------------------------------------

Hi [~vinoth] / [~satishkotha], is anyone working on this feature? When we tried to insert data into a Hudi COW table with drop duplicates enabled, using Spark Streaming (DStreams), the pipeline wasn't scaling: the min/max pruning in HoodieBloomIndex wasn't efficient, the exploded RDD size was >5x, and that caused a bottleneck in the shuffle stage.
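
For reference, a minimal sketch of the kind of write path that exercises this code; the table name, key fields and base path are placeholders, not our actual job. With drop duplicates enabled, each insert batch is tagged against the Bloom index, which is where the min/max range pruning and the shuffle above come into play.

    // Scala sketch: inserting a micro-batch into a Hudi COW table with
    // duplicate dropping and the Bloom index (min/max range pruning enabled).
    import org.apache.spark.sql.{DataFrame, SaveMode}

    def writeBatch(batchDf: DataFrame, basePath: String): Unit = {
      batchDf.write.format("hudi")
        .option("hoodie.table.name", "events_cow")                         // placeholder table name
        .option("hoodie.datasource.write.operation", "insert")
        .option("hoodie.datasource.write.insert.drop.duplicates", "true")  // forces index lookup on insert
        .option("hoodie.datasource.write.recordkey.field", "event_id")     // placeholder key field
        .option("hoodie.datasource.write.partitionpath.field", "event_date")
        .option("hoodie.datasource.write.precombine.field", "event_ts")
        .option("hoodie.index.type", "BLOOM")
        .option("hoodie.bloom.index.prune.by.ranges", "true")              // the min/max pruning mentioned above
        .mode(SaveMode.Append)
        .save(basePath)
    }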

If no one has started working on this, I would like to understand the requirements better and contribute to it.

> Improve data locality during ingestion
> --------------------------------------
>
>                 Key: HUDI-1628
>                 URL: https://issues.apache.org/jira/browse/HUDI-1628
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Writer Core
>            Reporter: satish
>            Priority: Major
>
> Today the upsert partitioner does the file sizing/bin-packing etc. for
> inserts and then sends some inserts over to existing file groups to
> maintain file size.
> We can abstract all of this into strategies and some kind of pipeline
> abstractions, and have it also consider "affinity" to an existing file
> group based on, say, information stored in the metadata table?
> See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser
> for more details.
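
To make the "affinity" idea in the description above concrete, here is a purely hypothetical sketch of what a pluggable insert-partitioning strategy could look like; the trait, class and method names are illustrative only and are not an existing Hudi API.

    // Hypothetical sketch only - not an existing Hudi interface.
    // A pluggable strategy decides which file group each incoming record
    // should land in, so one implementation could bin-pack purely by size
    // (today's behaviour) while another also weighs "affinity" signals
    // such as per-file-group key ranges.
    trait InsertPartitioningStrategy {
      /** Return the target file group id for a record key within a partition,
        * or None to start a new file group. */
      def chooseFileGroup(partitionPath: String, recordKey: String): Option[String]
    }

    // Example: prefer a file group whose key range already covers the key.
    class AffinityAwareStrategy(rangesByFileGroup: Map[String, (String, String)])
        extends InsertPartitioningStrategy {
      override def chooseFileGroup(partitionPath: String, recordKey: String): Option[String] =
        rangesByFileGroup.collectFirst {
          case (fileGroupId, (minKey, maxKey))
              if recordKey >= minKey && recordKey <= maxKey => fileGroupId
        }
    }

An implementation along these lines could source the per-file-group key ranges from the metadata table instead of the in-memory map used here.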



--
This message was sent by Atlassian Jira
(v8.3.4#803005)