
Posted to commits@hudi.apache.org by "Alexey Kudinkin (Jira)" <ji...@apache.org> on 2022/06/14 00:31:00 UTC

[jira] [Created] (HUDI-4249) Fix in-memory HoodieData implementations to operate lazily

Alexey Kudinkin created HUDI-4249:
-------------------------------------

             Summary: Fix in-memory HoodieData implementations to operate lazily
                 Key: HUDI-4249
                 URL: https://issues.apache.org/jira/browse/HUDI-4249
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.12.0


Currently, both `HoodieListData` and `HoodieMapPairData` operate eagerly on their payloads, meaning that each transformation is applied immediately.

This has the following performance drawbacks:
 # The full transformation is always executed, regardless of whether the whole sequence will ever be required, potentially wasting a significant amount of compute.
 # It can also cause OOMs when the sequence is larger than available memory (for example, where the caller relies on the assumption that it is performing stream processing).
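
The first drawback can be illustrated with a minimal sketch in plain Java. It does not use Hudi's classes: `EagerVsLazySketch`, `eagerMap`, `eagerInvocations` and `lazyInvocations` are hypothetical names introduced only for this example. It counts how often the mapping function runs when the caller needs just the first element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration; none of these names exist in Hudi.
public class EagerVsLazySketch {

  // Eager mapping: every element is transformed and materialized right away.
  static <T, R> List<R> eagerMap(List<T> input, Function<T, R> fn) {
    List<R> out = new ArrayList<>(input.size());
    for (T item : input) {
      out.add(fn.apply(item));
    }
    return out;
  }

  // Number of times the mapping function runs when only the first
  // element of the eagerly mapped result is read.
  static int eagerInvocations(List<Integer> input) {
    AtomicInteger calls = new AtomicInteger();
    List<Integer> mapped = eagerMap(input, x -> {
      calls.incrementAndGet();
      return x * 2;
    });
    Integer first = mapped.get(0); // the rest of the list was computed for nothing
    return calls.get();
  }

  // Same question for a sequential java.util.stream.Stream: findFirst()
  // short-circuits, so only the elements actually needed are mapped.
  static int lazyInvocations(List<Integer> input) {
    AtomicInteger calls = new AtomicInteger();
    Integer first = input.stream()
        .map(x -> {
          calls.incrementAndGet();
          return x * 2;
        })
        .findFirst()
        .orElseThrow(IllegalStateException::new);
    return calls.get();
  }
}
```

For a non-empty input of N elements, the eager variant invokes the function N times, while the stream-based variant invokes it once.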

 

Instead, they should be rebased to hold `Stream`s internally and provide semantics close to those of Spark's RDD container.
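
One possible shape of such a container is sketched below. This is an assumption about what "holding a `Stream` internally" could look like, not the actual fix: `LazyListData` and its methods `of`, `map`, `filter` and `collectAsList` are hypothetical names for this example and do not mirror the real `HoodieData` API.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not Hudi's actual HoodieListData.
public final class LazyListData<T> {

  // The pending pipeline; nothing is computed until a terminal operation runs.
  private final Stream<T> stream;

  private LazyListData(Stream<T> stream) {
    this.stream = stream;
  }

  public static <T> LazyListData<T> of(List<T> items) {
    return new LazyListData<>(items.stream());
  }

  // Transformation: only records the step in the pipeline.
  public <R> LazyListData<R> map(Function<? super T, ? extends R> fn) {
    return new LazyListData<>(stream.map(fn));
  }

  // Transformation: only records the step in the pipeline.
  public LazyListData<T> filter(Predicate<? super T> predicate) {
    return new LazyListData<>(stream.filter(predicate));
  }

  // Action: triggers execution of all recorded steps.
  public List<T> collectAsList() {
    return stream.collect(Collectors.toList());
  }
}
```

In this sketch `map` and `filter` play the role of RDD transformations and `collectAsList` the role of an action. One difference from an RDD to keep in mind: a Java `Stream` can be consumed only once, so a container that has to be traversed repeatedly would need to re-create the stream from its source.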



--
This message was sent by Atlassian Jira
(v8.20.7#820007)