Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2016/08/10 08:48:39 UTC

[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/14579

    [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers

    JIRA: https://issues.apache.org/jira/browse/SPARK-16921
    
    Context managers are a natural way to capture closely related setup and teardown code in Python. It can be useful to apply this pattern to persisting/unpersisting RDDs and DataFrames.
    
    This PR makes RDDs and DataFrames implement the context manager `__enter__` and `__exit__` functions, allowing code such as:
    
    ```python
    with labeled_data.persist():
        model = pipeline.fit(labeled_data)
    ```
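
    The methods themselves are simple - roughly (an illustrative sketch; see the diff for the actual change):
    
    ```python
    def __enter__(self):
        return self
    
    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()
    ```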
    
    ## How was this patch tested?
    
    New doc tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark SPARK-16921-rdd-df-ctxmgr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14579
    
----
commit 2b4e56e72bf3cd291349baf6feb197666d368b67
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-08-10T08:39:18Z

    Make RDD and DataFrame a context manager

----




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    cc @nchammas @holdenk @rxin 
    
    **Note:**
    This is implemented by adding the `__enter__` and `__exit__` methods to RDD/DataFrame directly. This allows some potentially weird usage: any instance of RDD/DF, including one returned by any method that returns `self`, can be used in a `with` statement, e.g. this works:
    
    ```python
    with rdd.map(lambda x: x) as x:
        ...
    ``` 
    
    Clearly this doesn't make a lot of sense. However, I looked at the 2 options of (a) a separate context manager wrapper class returned by `persist`; and (b) trying to dynamically add the methods (or at least `__enter__`) in `persist`.
    
    The problem with (a) is that `persist` needs to return an RDD/DF instance, so this breaks chaining behavior such as `rdd.cache().count()` etc.
    
    The problem with (b) is that the special method `__enter__` is called in the context of `with` as `type(rdd).__enter__(rdd)` (see [PEP 343](https://www.python.org/dev/peps/pep-0343/)). So it does not help to add a method dynamically to an instance, it must be done to the class. In this case, then after the first `with` statement usage, *all* existing and future instances of RDD/DF have the `__enter__` method, putting us in the same situation as the approach in this PR of having the methods defined on the class (with associated allowed "weirdness").
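
    For illustration, `with` looks up `__enter__` on the type, not the instance, so attaching the method to an instance has no effect (a minimal standalone example, not Spark code):
    
    ```python
    class Thing(object):
        pass
    
    t = Thing()
    t.__enter__ = lambda: t
    t.__exit__ = lambda *exc: None
    
    with t:  # raises AttributeError despite the instance attributes
        pass
    ```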
    
    So, if we want to avoid that, the only option I see is a variant of (a) above - adding a `cached`/`persisted` method that returns a context manager, so it would look like this:
    
    ```python
    with cached(rdd) as x:
        x.count()
    ```
    
    This is less "elegant" but more explicit.
    
    Any other smart ideas for handling option (b) above, please do shout!




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    just a gentle ping - would be cool to add this for 2.1 if we have the time :)




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @nchammas I looked at the `@contextmanager` decorator. It is an easy way to create a method that returns a context manager, but it is essentially _only_ usable in a `with` statement, as it returns a `contextlib.GeneratorContextManager`. For this use case it does not solve the issue that we need to return the RDD/DF instance from `cache`.
    
    As far as I can see, the `closing` helper function is just a [context manager itself](https://github.com/python/cpython/blob/master/Lib/contextlib.py#L163) - the same as the final variant of option (a) mentioned above, i.e. `with cached(rdd) as ...`
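
    For concreteness, a decorator-based helper would look something like this (`persisted` being a hypothetical name for it):
    
    ```python
    from contextlib import contextmanager
    
    @contextmanager
    def persisted(rdd):
        rdd.persist()
        try:
            yield rdd
        finally:
            rdd.unpersist()
    ```
    
    `persisted(rdd)` returns a generator context manager rather than the RDD, so it is only useful inside a `with` statement - which is fine for a standalone helper, but rules it out as an implementation of `cache`/`persist` themselves.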




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    yeah it would break pipelining - I don't think it will necessarily throw an error though.
    
    e.g.
    
    ```
    In [22]: rdd = sc.parallelize(["b", "a", "c"])
    
    In [23]: type(rdd)
    Out[23]: pyspark.rdd.RDD
    
    In [24]: mapped = rdd.map(lambda x: x)
    
    In [25]: type(mapped)
    Out[25]: pyspark.rdd.PipelinedRDD
    
    In [26]: mapped._is_pipelinable()
    Out[26]: True
    
    In [27]: p = mapped.cache()
    
    In [28]: type(p)
    Out[28]: pyspark.rdd.PersistedRDD
    
    In [29]: p._is_pipelinable()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-29-02496125eccd> in <module>()
    ----> 1 p._is_pipelinable()
    
    AttributeError: 'PersistedRDD' object has no attribute '_is_pipelinable'
    
    In [30]: mapped2 = p.map(lambda x: x)
    
    In [31]: type(mapped2)
    Out[31]: pyspark.rdd.PipelinedRDD
    ```
    
    So I think chaining will work, but the pipelined RDD treats `mapped2` as the 1st transformation in its stage, while it is actually the 2nd. I _think_ this is just an efficiency issue rather than a correctness issue, however.
    
    We could possibly work around it with some type checking etc but it then starts to feel like adding more complexity than the feature is worth...




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Sure - at a high level (and not exactly on point): copying data between Python and the JVM is kind of slow, so if we have multiple Python operations that can be put together in the same task, we try to do that. Since caching is handled by storing the data inside of the JVM, a cached RDD can't be pipelined, since we need to copy the result to the JVM. You can see the details in `PipelinedRDD` in rdd.py.
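
    A simplified sketch of the relevant logic in `PipelinedRDD` (paraphrased from memory - see rdd.py for the actual code):
    
    ```python
    class PipelinedRDD(RDD):
        def __init__(self, prev, func, preservesPartitioning=False):
            if not isinstance(prev, PipelinedRDD) or not prev._is_pipelinable():
                # prev can't be fused with this transformation, so this
                # transformation starts a new pipeline of Python work
                self.func = func
            else:
                # fuse prev's function with this one so both run in a single
                # Python task, avoiding an extra Python<->JVM round trip
                prev_func = prev.func
                self.func = lambda split, it: func(split, prev_func(split, it))
    
        def _is_pipelinable(self):
            # a cached (or checkpointed) RDD materializes its results in the
            # JVM, so further Python operations can't be fused onto it
            return not (self.is_cached or self.is_checkpointed)
    ```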




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    **[Test build #63520 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63520/consoleFull)** for PR 14579 at commit [`2b4e56e`](https://github.com/apache/spark/commit/2b4e56e72bf3cd291349baf6feb197666d368b67).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Ah, you're right.
    
    So if we want to avoid needing magic methods in the main RDD/DataFrame classes and avoid needing a separate utility method like `cached()`, I think one option available to us is to have separate `PersistedRDD` and `PersistedDataFrame` classes that simply wrap the base RDD and DataFrame classes and add the appropriate magic methods.
    
    `.persist()` and `.cache()` would then return instances of these classes, which should satisfy the `type(x).__enter__(x)` behavior while still preserving backwards compatibility and method chaining.
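
    A rough sketch (assuming the pyspark `RDD` base class is in scope; `PersistedRDD` here is illustrative, and the `DataFrame` analogue would mirror it):
    
    ```python
    class PersistedRDD(RDD):
        def __enter__(self):
            return self
    
        def __exit__(self, exc_type, exc_value, traceback):
            # tear down the persistence when the `with` block exits
            self.unpersist()
            return False  # don't suppress exceptions
    ```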
    
    What do you think of that?




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14579#discussion_r74813935
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri
             self._id = jrdd.id()
             self.partitioner = None
     
    +    def __enter__(self):
    --- End diff --
    
    Is that true? Doesn't it call `__enter__` on the instance of `rdd.cache().map(...)` where `is_cached` is set to False?
    
    Quick verification:
    
    ```python
    def __enter__(self):
        if self.is_cached:
            return self
        else:
            raise ValueError("r")

    with rdd.cache().map(lambda x: x) as t:
        pass
    ```
    
    raises a `ValueError`
    





[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63520/
    Test PASSed.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    I like it personally - if no one has a good reason why not it seems like a very reasonable approach.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    cc @davies




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @nchammas @holdenk @davies @rxin how about the approach of @MechCoder in https://github.com/apache/spark/pull/14579#discussion_r74813935?
    
    I think this will work well - we could raise an error to prevent (almost all, I think) usages outside of the intended pattern of `with some_rdd.cache() as x:` or `with some_rdd_already_cached as x:`




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @nchammas so if we go with the subclassing approach but keep the current cache/persist interface (e.g. no special utility function) a user could easily write something like:
    
    ```python
    magic = rdd.persist()
    with magic as awesome:
        awesome.count()
    magic.map(lambda x: x + 1)
    ```
    
    I don't _believe_ `__exit__()` could easily return a result that would update `magic` to be `rdd` (in fact `__exit__()` generally doesn't seem to return a result - instead it's expected to do the teardown logic internally).




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Just pinging to see how its going?




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Sorry, you're right, `__exit__()`'s return value is not going to be consumed anywhere. What I meant is that `unpersist()` would return the base RDD or DataFrame object.
    
    But I'm not seeing the issue with the example you posted. Quoting it here:
    
    ```python
    magic = rdd.persist()
    
    with magic as awesome:
        awesome.count()
    
    magic.map(lambda x: x + 1)
    ```
    
    Are you saying `magic.map()` will error? Why would it?
    
    `magic` would be an instance of `PersistedRDD`, which in turn is a subclass of `RDD`, which has `map()` and all of the usual methods defined, plus the magic methods we need for the context manager.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Thanks @MLnick for taking this on and for breaking down what you've found so far.
    
    I took a look through [`contextlib`](https://docs.python.org/3/library/contextlib.html) for inspiration, and I wonder if the source code for [`closing()`](https://docs.python.org/3/library/contextlib.html#contextlib.closing) offers a template we can follow that would let `persist()` return an RDD/DataFrame instance with the correct magic methods, without having to modify the class.
    
    Have you taken a look at that?




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @MLnick still interested in updating this? (Just looking over the older Python PRs) :)




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Ah, I see. I don't fully understand how `PipelinedRDD` works or how it is used so I'll have to defer to y'all on this. Does the `cached()` utility method have this same problem?
    
    > We could possibly work around it with some type checking etc but it then starts to feel like adding more complexity than the feature is worth...
    
    Agreed.
    
    At this point, actually, I'm beginning to feel this feature is not worth it.
    
    Context managers seem to work best when the objects they're working on have clear open/close-style semantics. File handles, network connections, and the like fit this pattern well.
    
    In fact, the [doc for `with`](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement) says:
    
    > This allows common `try...except...finally` usage patterns to be encapsulated for convenient reuse.
    
    RDDs and DataFrames, on the other hand, don't have a simple open/close or `try...except...finally` pattern. And when we try to map one onto persist and unpersist, we get the various side-effects we've been discussing here.
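
    For reference, the `try...finally` pattern that a persist/unpersist context manager would be encapsulating is roughly:
    
    ```python
    rdd.persist()
    try:
        rdd.count()  # actions that need the cached data
    finally:
        rdd.unpersist()
    ```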




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14579#discussion_r74307747
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -221,6 +227,21 @@ def context(self):
         def cache(self):
             """
             Persist this RDD with the default storage level (C{MEMORY_ONLY}).
    +
    +        :py:meth:`cache` can be used in a 'with' statement. The RDD will be automatically
    +        unpersisted once the 'with' block is exited. Note however that any actions on the RDD
    +        that require the RDD to be cached, should be invoked inside the 'with' block; otherwise,
    +        caching will have no effect.
    --- End diff --
    
    Agreed, especially since this is technically a new public API that we are potentially committing to for the life of the 2.x line.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Looks good to me. 👍




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14579#discussion_r74811837
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri
             self._id = jrdd.id()
             self.partitioner = None
     
    +    def __enter__(self):
    --- End diff --
    
    We could do it - but users can still then do `with rdd.cache().map(...) as x:` and it would be valid. So it doesn't fully solve the issue.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Hmm, OK I see. (Apologies, I don't understand what pipelined RDDs are for, so the examples are going a bit over my head. 😅)




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    > the subclassing of RDD approach could cause us to miss out on pipelining if the RDD was used again after it was unpersisted
    
    How so? Wouldn't `__exit__()` simply return the parent RDD or DataFrame object?




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14579




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @nchammas to answer your question above (https://github.com/apache/spark/pull/14579#issuecomment-239185438) - in short no.
    
    The semantics of the utility method will be the same as for the `closing` example: `persisted` is a small wrapper class implementing the context manager methods. When entering the `with` statement, the `__enter__` method is called, which returns the underlying `rdd` instance - this is in turn bound to the variable following `as`. So it would look something like this:
    
    ```python
    class persisted(object):
        def __init__(self, thing):
            self.thing = thing
        def __enter__(self):
            # hand back the underlying RDD/DF; it gets bound to the name after `as`
            return self.thing
        def __exit__(self, *exc_info):
            # always unpersist on exiting the `with` block
            self.thing.unpersist()
    ```
    
    If someone tries to do `persisted(rdd).map(...)` it will throw an `AttributeError`.
    
    The "file-like" version is what is currently implemented in this PR, and it works if `__enter__` returns `self` which is what `file` et al do. Of course those classes seem to not tend to have other methods that return `self` so don't suffer the same chaining issue we run into with `RDD`.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @nchammas a utility method (e.g. `cached`) - actually it will be a context manager class implemented in the same way as `closing` - will work because it only needs to return the context manager, not the RDD instance. So there is no chaining requirement, and it will only work in a `with` statement.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Hey all - sorry I haven't been able to focus on this. It shouldn't be tough to do, but it will need some tests. If we can find someone who wants to take it over I think it makes a decent starter task :)




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    One minor thing to keep in mind - the subclassing of RDD approach could cause us to miss out on pipelining if the RDD was used again after it was unpersisted - but I think that is a relatively minor issue.
    
    On the whole I think modifying the base RDD and DataFrame classes (option (a) / option 4), which is what @MLnick has implemented here, is probably one of the more reasonable options - the `with` statement doesn't add anything if the RDD/DataFrame isn't persisted but can do cleanup if it is.
    
    But if there is a better way to do this I'd be excited to find out as well :)




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    After looking at it and considering all the above, I would say the options are (1) do nothing; or (2) if we want to support this use case, then we implement a single utility method (I would say called `persisted`) that is a context manager.
    
    Even though we could _almost_ achieve things with subclassing, we do sort of break something and add too much risk/complexity vs reward of the feature IMHO.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Thanks for the quick overview. That's pretty straightforward, actually! I'll take a look at `PipelinedRDD` for the details. 👍




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    I hope to get to this soon - it's just the test cases that I need to get to also!




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14579#discussion_r74811199
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri
             self._id = jrdd.id()
             self.partitioner = None
     
    +    def __enter__(self):
    --- End diff --
    
    Is it reasonable just to raise an error saying that the context manager is meant to work only with cached RDDs (DataFrames) if `self.is_cached` is not set to True?




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Yup! Just been snowed under :( but will update with the approach above ASAP.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    gentle ping.




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14579#discussion_r75567101
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri
             self._id = jrdd.id()
             self.partitioner = None
     
    +    def __enter__(self):
    --- End diff --
    
    yes, also known as the "if you don't know what to do, raise an error" approach :p




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @MLnick Hi, are you still working on this? If so, could you fix conflicts and update please?




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    > So there is no chaining requirement, and it will only work in a with statement.
    
    @MLnick - Couldn't we also create a scenario (like @holdenk did earlier) where a user does something like this?
    
    ```python
    persisted_rdd = persisted(rdd)
    persisted_rdd.map(...).filter(...).count()
    ```
    
    This would break pipelining too, no?
    
    And I think the expectation would be for it not to break pipelining, because existing common context managers in Python don't have a requirement that they _must_ be used in a `with` block.
    
    For example, `f = open(file)` works fine, as does `s = requests.Session()`, and the resulting objects have the same behavior as they would inside a `with` block.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Do you have time to update this @MLnick or maybe would it be OK if someone else made an updated PR based on this? It would be a nice feature to have for 2.2 :)




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    **[Test build #63520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63520/consoleFull)** for PR 14579 at commit [`2b4e56e`](https://github.com/apache/spark/commit/2b4e56e72bf3cd291349baf6feb197666d368b67).




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @MLnick - or if you don't have a chance would it be ok for us to find someone (perhaps someone new to the project) to take this over and bring it to the finish line?




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14579#discussion_r74305543
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -221,6 +227,21 @@ def context(self):
         def cache(self):
             """
             Persist this RDD with the default storage level (C{MEMORY_ONLY}).
    +
    +        :py:meth:`cache` can be used in a 'with' statement. The RDD will be automatically
    +        unpersisted once the 'with' block is exited. Note however that any actions on the RDD
    +        that require the RDD to be cached, should be invoked inside the 'with' block; otherwise,
    +        caching will have no effect.
    --- End diff --
    
    Super minor documentation suggestion - but I was thinking maybe a `versionchanged` directive could be helpful to call out that it's new functionality (both in RDD and DF)?
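
    E.g. something like this in the docstring (version number assumed):
    
    ```
    .. versionchanged:: 2.1.0
       :py:meth:`cache` can now be used in a 'with' statement.
    ```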




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    None of our options seems great, but if I had to rank them I would say:
    
    1. Add new `Persisted...` classes.
    2. Make no changes.
    3. Add separate `persisted()` or `cached()` utility method.
    4. Modify base RDD and DataFrame classes.
    
    Adding new internal classes for this use-case honestly seems a bit heavy-handed to me, so if we are against that then I would lean towards not doing anything.




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14579#discussion_r75566681
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri
             self._id = jrdd.id()
             self.partitioner = None
     
    +    def __enter__(self):
    --- End diff --
    
    That's an interesting approach @MechCoder - I think that could be a way to clarify to users how the context manager is expected to be used.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Is there any reason why it is not merged yet? I personally like this too.




[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14579#discussion_r76189167
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri
             self._id = jrdd.id()
             self.partitioner = None
     
    +    def __enter__(self):
    --- End diff --
    
    hmmm, yes this does happen to work, because most operations boil down to something like `mapPartitions`, which creates a new `PipelinedRDD` that is not cached, or a new `RDD` which is again not cached.
    
    I think it will work for `DataFrame` too for a similar reason - most operations return a new `DataFrame` instance.




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    Right, I wouldn't expect it to error with subclassing - just not pipeline successfully - but only in a very long-shot corner case.
    
    I think the try/finally with persistence is not an uncommon pattern (we have something similar happen frequently inside of Spark ML/mllib, but it's in Scala code).




[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14579
  
    @nchammas to be clear - subclassing only breaks pipelining if the persisted_rdd is later unpersisted (e.g. used with a `with` statement or otherwise) - otherwise you can't pipeline on top of a persisted rdd anyways (which is why I say its a corner case).

