Posted to issues@spark.apache.org by "Haejoon Lee (Jira)" <ji...@apache.org> on 2022/03/02 20:53:00 UTC

[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for pandas API on Spark

     [ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee updated SPARK-38353:
--------------------------------
    Summary: Instrument __enter__ and __exit__ magic methods for pandas API on Spark  (was: Instrument __enter__ and __exit__ magic methods for Pandas module)

> Instrument __enter__ and __exit__ magic methods for pandas API on Spark
> -----------------------------------------------------------------------
>
>                 Key: SPARK-38353
>                 URL: https://issues.apache.org/jira/browse/SPARK-38353
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Yihong He
>            Priority: Minor
>
> This ticket proposes instrumenting the __enter__ and __exit__ magic methods for the Pandas module, which can help improve the accuracy of the usage data. We are also interested in extending the Pandas usage logger to other PySpark modules in the future, so this change will help improve the accuracy of their usage data as well.
> For example, for the following code:
>  
> {code:python}
> pdf = pd.DataFrame(
>     [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
>     self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
>     self.assert_eq(
>         repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True))
>     )
> {code}
>  
> the pandas usage logger records the internal call [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] as if it were a user call, because the __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented.
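
The effect described above can be illustrated with a minimal, self-contained sketch. It is not the PySpark usage logger itself: the `instrument` decorator, the `calls` list, and the two toy classes are hypothetical names introduced only for this illustration, and the sketch assumes a logger that records only the outermost instrumented call and treats anything nested inside it as internal.

```python
import functools
import threading

_local = threading.local()
calls = []  # hypothetical stand-in for the usage logger's output


def instrument(func):
    """Record func only when it is the outermost instrumented call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if getattr(_local, "logging", False):
            # Already inside an instrumented call: treat this one as internal.
            return func(*args, **kwargs)
        _local.logging = True
        try:
            return func(*args, **kwargs)
        finally:
            calls.append(func.__qualname__)
            _local.logging = False

    return wrapper


class ToyCachedFrame:
    """Toy context manager whose magic methods are NOT instrumented."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()  # internal call, yet nothing marks it as internal

    @instrument
    def unpersist(self):
        pass


class InstrumentedToyCachedFrame:
    """Same toy context manager with __enter__ and __exit__ instrumented."""

    @instrument
    def __enter__(self):
        return self

    @instrument
    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()  # now nested inside an instrumented call

    @instrument
    def unpersist(self):
        pass


def recorded_calls(frame_cls):
    """Run an empty with-block and return the names the logger recorded."""
    calls.clear()
    with frame_cls():
        pass
    return list(calls)


print(recorded_calls(ToyCachedFrame))
print(recorded_calls(InstrumentedToyCachedFrame))
```

In this sketch the uninstrumented class reports only `unpersist`, which the user never called directly, while the instrumented class reports `__enter__` and `__exit__` and suppresses the nested `unpersist`.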


