Posted to commits@hudi.apache.org by "Alexey Kudinkin (Jira)" <ji...@apache.org> on 2022/02/09 01:44:00 UTC

[jira] [Created] (HUDI-3397) Make sure Spark RDDs triggering actual FS activity are only dereferenced once

Alexey Kudinkin created HUDI-3397:
-------------------------------------

             Summary: Make sure Spark RDDs triggering actual FS activity are only dereferenced once
                 Key: HUDI-3397
                 URL: https://issues.apache.org/jira/browse/HUDI-3397
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin


Currently, the RDD `collect()` operation is treated quite loosely: multiple flows dereference the same RDD more than once (for example, through `collect`, `count`, etc.), triggering the same operations to be carried out multiple times and occasionally duplicating output already persisted on the FS.
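For illustration, here is a minimal, self-contained sketch (hypothetical stand-in code, not the actual Hudi write path) of how dereferencing an un-cached RDD twice re-runs its side-effecting writes:

{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class DoubleDereferenceSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "double-dereference-sketch");

    // Stand-in for an RDD whose evaluation performs FS writes as a side effect
    // (analogous to the write path described above).
    JavaRDD<String> writeRdd = jsc.parallelize(Arrays.asList("rec-1", "rec-2"))
        .map(rec -> {
          System.out.println("writing " + rec + " to FS"); // side-effecting "write"
          return rec;
        });

    // First dereference: the DAG is triggered, the "writes" run once.
    long total = writeRdd.count();

    // Second dereference: the RDD is not cached, so the whole lineage
    // (side-effecting "writes" included) is recomputed, and every record
    // is "written" a second time.
    List<String> results = writeRdd.collect();

    jsc.stop();
  }
}
{code}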

Check out HUDI-3370 for a recent example.

NOTE: Even though Spark caching is supposed to ensure we aren't writing to the FS multiple times, we can't rely on caching alone to guarantee exactly-once execution: cached blocks can be evicted (for example, under memory pressure), in which case Spark silently recomputes the lineage, side-effecting writes included.
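A sketch of that failure mode, reusing the hypothetical `writeRdd` from the snippet above (here `unpersist()` stands in for cache eviction under memory pressure):

{code:java}
writeRdd.cache();
writeRdd.count();         // writes run once; results are cached
writeRdd.unpersist(true); // simulates eviction of the cached blocks
writeRdd.collect();       // lineage is recomputed: writes run a second time
{code}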

Instead, we should make sure that RDDs are dereferenced {*}once{*}, within the "commit" operation, and that all other operations rely only on _derivative_ data.
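A sketch of the proposed pattern, again reusing the hypothetical `writeRdd`: dereference exactly once within the commit flow, then compute everything else from the materialized result:

{code:java}
// Single dereference, done once within the "commit" flow: the DAG (and the
// side-effecting FS writes) runs exactly one time.
List<String> writeResults = writeRdd.collect();

// All subsequent operations work on the materialized, derivative data and
// never touch the RDD again, so the writes cannot be re-triggered.
long total = writeResults.size();
boolean hasErrors = writeResults.stream().anyMatch(r -> r.contains("error"));
{code}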



--
This message was sent by Atlassian Jira
(v8.20.1#820001)