Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:25:33 UTC

[jira] [Updated] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers

     [ https://issues.apache.org/jira/browse/SPARK-16921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16921:
---------------------------------
    Labels: bulk-closed  (was: )

> RDD/DataFrame persist() and cache() should return Python context managers
> -------------------------------------------------------------------------
>
>                 Key: SPARK-16921
>                 URL: https://issues.apache.org/jira/browse/SPARK-16921
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Spark Core, SQL
>            Reporter: Nicholas Chammas
>            Priority: Minor
>              Labels: bulk-closed
>
> [Context managers|https://docs.python.org/3/reference/datamodel.html#context-managers] are a natural way to capture closely related setup and teardown code in Python.
> For example, they are commonly used when doing file I/O:
> {code}
> with open('/path/to/file') as f:
>     contents = f.read()
>     ...
> {code}
> Once the program exits the {{with}} block, {{f}} is automatically closed.
> I think it makes sense to apply this pattern to persisting and unpersisting DataFrames and RDDs. There are many cases when you want to persist a DataFrame for a specific set of operations and then unpersist it immediately afterwards.
> For example, take model training. Today, you might do something like this:
> {code}
> labeled_data.persist()
> model = pipeline.fit(labeled_data)
> labeled_data.unpersist()
> {code}
> If {{persist()}} returned a context manager, you could rewrite this as follows:
> {code}
> with labeled_data.persist():
>     model = pipeline.fit(labeled_data)
> {code}
> Upon exiting the {{with}} block, {{labeled_data}} would automatically be unpersisted.
> This can be done in a backwards-compatible way: {{persist()}} would still return the parent DataFrame or RDD as it does today, while also adding two methods to the returned object: {{__enter__()}} and {{__exit__()}}.
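> A minimal sketch of what those two methods might look like (hypothetical mixin name for illustration; the actual change would live on the RDD/DataFrame classes):
> {code}
> class PersistContextMixin(object):
>     """Hypothetical sketch: persist() keeps returning self as it does today,
>     and these two methods make that return value usable in a with-statement."""
>
>     def __enter__(self):
>         # Nothing extra to do here; persist() has already been called.
>         return self
>
>     def __exit__(self, exc_type, exc_value, traceback):
>         # Unpersist on exit, even if the body of the with block raised.
>         self.unpersist()
>         return False  # do not suppress exceptions
> {code}
> Returning {{False}} from {{__exit__()}} preserves normal exception propagation, so errors raised inside the {{with}} block still surface while the data is nonetheless unpersisted.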



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org