
Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2021/04/28 17:57:00 UTC

[jira] [Created] (BEAM-12245) Memoize DataFrame operations

Brian Hulette created BEAM-12245:
------------------------------------

             Summary: Memoize DataFrame operations
                 Key: BEAM-12245
                 URL: https://issues.apache.org/jira/browse/BEAM-12245
             Project: Beam
          Issue Type: Improvement
          Components: sdk-py-core
            Reporter: Brian Hulette


Currently, performing an operation on a deferred dataframe always produces a _new_ deferred dataframe. This means a call like to_pcollection(df.mean(), df.mean()) will produce two distinct PCollections, duplicating the same computation.

This is particularly problematic for the interactive use case, where to_pcollection is used inside of ib.collect() in combination with PCollection caching. Collecting df.mean() two different times will duplicate the computation unnecessarily.

We should cache the output expressions produced by operations to prevent this.

We need to be mindful of in-place operations when implementing this:
- Two calls to df.mean() should produce the same result iff df has not been mutated in between.
- Mutating the output of one call to df.mean() must not mutate the output of another call to df.mean().
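
One possible way to satisfy both constraints is sketched below as a toy model. It is not Beam's implementation: Expr, ToyDeferredFrame, and fillna_inplace are hypothetical names invented for illustration. The sketch assumes that output expressions are immutable and can be shared, while each call hands back a fresh mutable wrapper, and that any in-place operation clears the wrapper's cache.

```python
import itertools


class Expr:
    """Hypothetical immutable node in a deferred computation graph."""

    _ids = itertools.count()

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)
        self.id = next(Expr._ids)  # a distinct id means a distinct computation


class ToyDeferredFrame:
    """Hypothetical mutable wrapper around an Expr (not Beam's real class)."""

    def __init__(self, expr):
        self._expr = expr
        self._cache = {}  # op name -> Expr derived from the current self._expr

    def _apply(self, op):
        # Reuse the cached output expression if one exists for this state.
        if op not in self._cache:
            self._cache[op] = Expr(op, [self._expr])
        # Always return a fresh wrapper, so a caller can mutate its result
        # without affecting other results that share the same Expr.
        return ToyDeferredFrame(self._cache[op])

    def mean(self):
        return self._apply("mean")

    def fillna_inplace(self):
        # An in-place operation rebinds this wrapper to a new expression,
        # so every cached output derived from the old one is stale.
        self._expr = Expr("fillna", [self._expr])
        self._cache = {}


df = ToyDeferredFrame(Expr("source"))
a, b = df.mean(), df.mean()
same_before = a._expr is b._expr       # memoized: one shared computation

a.fillna_inplace()                     # mutate one result only
b_unaffected = b._expr.op == "mean"    # the other result is untouched
c = df.mean()
cache_intact = c._expr is b._expr      # df's cache was not polluted by a

df.fillna_inplace()                    # mutate the source frame itself
d = df.mean()
recomputed = d._expr is not b._expr    # cache was invalidated
```

In this sketch the cache lives on the wrapper and is cleared wholesale on mutation, which is simple but coarse. A real implementation might instead key the cache on the identity of the underlying expression, so that no explicit invalidation step is needed.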



--
This message was sent by Atlassian Jira
(v8.3.4#803005)