You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2016/06/20 07:29:36 UTC

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/13780

    [SPARK-16063][SQL] Add getStoragelevel to Dataset

    [SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check if a `Dataset` is cached and if so what the storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`.
    
    ## How was this patch tested?
    
    Updated `DatasetCacheSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark ds-storagelevel

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13780.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13780
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Sorry for the delay.  I'm going to merge this to master.  I'll update the since versions while merging.  Thanks for working on this!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67724699
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
    @@ -21,11 +21,32 @@ import scala.language.postfixOps
     
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql.test.SharedSQLContext
    +import org.apache.spark.storage.StorageLevel
     
     
     class DatasetCacheSuite extends QueryTest with SharedSQLContext {
       import testImplicits._
     
    +  test("get storage level") {
    +    val ds1 = Seq("1", "2").toDS().as("a")
    +    val ds2 = Seq(2, 3).toDS().as("b")
    +
    +    // default storage level
    +    ds1.persist()
    +    ds2.cache()
    +    assert(ds1.storageLevel() == StorageLevel.MEMORY_AND_DISK)
    +    assert(ds2.storageLevel() == StorageLevel.MEMORY_AND_DISK)
    +    // unpersist
    +    ds1.unpersist()
    +    assert(ds1.storageLevel() == StorageLevel.NONE)
    +    // non-default storage level
    +    ds1.persist(StorageLevel.MEMORY_ONLY_2)
    --- End diff --
    
    When writing black-box testing, I might just try all the levels in the test case, even we can include some customized StorageLevel, which is diffrent from the defined one.
    ```scala
        Seq(StorageLevel.OFF_HEAP, MEMORY_AND_DISK_SER_2, ..., StorageLevel.NONE).foreach { level =>
          ds1.persist(level)
          assert(ds1.storageLevel() == level)
          ds1.unpersist()
          assert(ds1.storageLevel() == StorageLevel.NONE)
        }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67646564
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2308,6 +2309,18 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
    +   *
    +   * @group basic
    +   * @since 2.0.0
    +   */
    +  def getStorageLevel(): StorageLevel = {
    --- End diff --
    
    just call it storageLevel?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67831555
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -376,24 +376,47 @@ def foreachPartition(self, f):
     
         @since(1.3)
         def cache(self):
    -        """ Persists with the default storage level (C{MEMORY_ONLY}).
    +        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
    --- End diff --
    
    @rxin I updated the default in the doc, as it was actually incorrect previously.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r72437362
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -376,24 +376,47 @@ def foreachPartition(self, f):
     
         @since(1.3)
         def cache(self):
    -        """ Persists with the default storage level (C{MEMORY_ONLY}).
    +        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
    +
    +        .. note:: the default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
             """
             self.is_cached = True
             self._jdf.cache()
             return self
     
         @since(1.3)
    -    def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
    -        """Sets the storage level to persist its values across operations
    -        after the first time it is computed. This can only be used to assign
    -        a new storage level if the RDD does not have a storage level set yet.
    -        If no storage level is specified defaults to (C{MEMORY_ONLY}).
    +    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
    --- End diff --
    
    ping @rxin on this comment?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    so just following up because I know there is some other loosely blocked on this. Do @rxin @marmbrus @davies @gatorsmile have any comments? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67846201
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
    @@ -21,11 +21,32 @@ import scala.language.postfixOps
     
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql.test.SharedSQLContext
    +import org.apache.spark.storage.StorageLevel
     
     
     class DatasetCacheSuite extends QueryTest with SharedSQLContext {
       import testImplicits._
     
    +  test("get storage level") {
    +    val ds1 = Seq("1", "2").toDS().as("a")
    +    val ds2 = Seq(2, 3).toDS().as("b")
    +
    +    // default storage level
    +    ds1.persist()
    +    ds2.cache()
    +    assert(ds1.storageLevel() == StorageLevel.MEMORY_AND_DISK)
    +    assert(ds2.storageLevel() == StorageLevel.MEMORY_AND_DISK)
    +    // unpersist
    +    ds1.unpersist()
    +    assert(ds1.storageLevel() == StorageLevel.NONE)
    +    // non-default storage level
    +    ds1.persist(StorageLevel.MEMORY_ONLY_2)
    --- End diff --
    
    I'm kinda neutral on this - it doesn't really seem necessary to me, since pretty much by definition if one storage level works then they all do.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    **[Test build #60925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60925/consoleFull)** for PR 13780 at commit [`cfe9ab7`](https://github.com/apache/spark/commit/cfe9ab7ffcb5717d60608fb366f71432c52491a0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    just some minor comments. this looks pretty good 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67652627
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2308,6 +2309,18 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
    +   *
    +   * @group basic
    +   * @since 2.0.0
    +   */
    +  def getStorageLevel(): StorageLevel = {
    --- End diff --
    
    I thought it would be more familiar for RDD users, but no strong feeling about it. Will update 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    ping @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    **[Test build #60843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60843/consoleFull)** for PR 13780 at commit [`90684bd`](https://github.com/apache/spark/commit/90684bdc9da819354f886b3ec851c0962a544858).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r73519433
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -376,24 +376,47 @@ def foreachPartition(self, f):
     
         @since(1.3)
         def cache(self):
    -        """ Persists with the default storage level (C{MEMORY_ONLY}).
    +        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
    +
    +        .. note:: the default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
             """
             self.is_cached = True
             self._jdf.cache()
             return self
     
         @since(1.3)
    -    def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
    -        """Sets the storage level to persist its values across operations
    -        after the first time it is computed. This can only be used to assign
    -        a new storage level if the RDD does not have a storage level set yet.
    -        If no storage level is specified defaults to (C{MEMORY_ONLY}).
    +    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
    --- End diff --
    
    Yes, this is true. The issue is that the deser versions were deprecated in #10092 and made to equal the ser versions, and so can't actually be specified in Python any more. Hence we have a discrepancy now between Python RDDs (always stored ser) and DataFrames (stored deser in tungsten/spark sql binary format by default, but it is possible to store ser though AFAIK that will always be non-optimal so should certainly be discouraged).
    
    cc @gatorsmile @davies @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    **[Test build #60843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60843/consoleFull)** for PR 13780 at commit [`90684bd`](https://github.com/apache/spark/commit/90684bdc9da819354f886b3ec851c0962a544858).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    ping @rxin @marmbrus @davies @gatorsmile for comment on the Python storage level issue I mention at https://github.com/apache/spark/pull/13780#discussion_r67833027


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    @marmbrus thanks for merging this. For me there is still an open question around handling of deser storage levels on the PySpark side (see my comments https://github.com/apache/spark/pull/13780/files#r67833027). Would like to get your thoughts on that.
    
    What is blocked on this by the way? (Just to understand).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    **[Test build #60922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60922/consoleFull)** for PR 13780 at commit [`93af5ec`](https://github.com/apache/spark/commit/93af5ec1c142e81d9b2df88d2c067151432426c1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    **[Test build #60836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60836/consoleFull)** for PR 13780 at commit [`8b4a490`](https://github.com/apache/spark/commit/8b4a4908f6fd5f8d5c019254390821250f9e4f43).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67832026
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2308,6 +2308,18 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
    +   *
    +   * @group basic
    +   * @since 2.0.0
    +   */
    +  def storageLevel(): StorageLevel = {
    --- End diff --
    
    remove parenthesis?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60836/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    **[Test build #60922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60922/consoleFull)** for PR 13780 at commit [`93af5ec`](https://github.com/apache/spark/commit/93af5ec1c142e81d9b2df88d2c067151432426c1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r68719350
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -376,24 +376,47 @@ def foreachPartition(self, f):
     
         @since(1.3)
         def cache(self):
    -        """ Persists with the default storage level (C{MEMORY_ONLY}).
    +        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
    +
    +        .. note:: the default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
             """
             self.is_cached = True
             self._jdf.cache()
             return self
     
         @since(1.3)
    -    def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
    -        """Sets the storage level to persist its values across operations
    -        after the first time it is computed. This can only be used to assign
    -        a new storage level if the RDD does not have a storage level set yet.
    -        If no storage level is specified defaults to (C{MEMORY_ONLY}).
    +    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
    --- End diff --
    
    ping @rxin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60925/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    cc @gatorsmile @marmbrus @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    **[Test build #60925 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60925/consoleFull)** for PR 13780 at commit [`cfe9ab7`](https://github.com/apache/spark/commit/cfe9ab7ffcb5717d60608fb366f71432c52491a0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67833616
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -376,24 +376,47 @@ def foreachPartition(self, f):
     
         @since(1.3)
         def cache(self):
    -        """ Persists with the default storage level (C{MEMORY_ONLY}).
    +        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
    +
    +        .. note:: the default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
             """
             self.is_cached = True
             self._jdf.cache()
             return self
     
         @since(1.3)
    -    def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
    -        """Sets the storage level to persist its values across operations
    -        after the first time it is computed. This can only be used to assign
    -        a new storage level if the RDD does not have a storage level set yet.
    -        If no storage level is specified defaults to (C{MEMORY_ONLY}).
    +    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
    --- End diff --
    
    One option is to set the default storage level here to `None` instead, and if it's not set call `_jfd.persist()` to ensure behaviour is the same as `cache`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Can you make sure Python Dataframe has this method too?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60922/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67865315
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala ---
    @@ -21,11 +21,32 @@ import scala.language.postfixOps
     
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql.test.SharedSQLContext
    +import org.apache.spark.storage.StorageLevel
     
     
     class DatasetCacheSuite extends QueryTest with SharedSQLContext {
       import testImplicits._
     
    +  test("get storage level") {
    +    val ds1 = Seq("1", "2").toDS().as("a")
    +    val ds2 = Seq(2, 3).toDS().as("b")
    +
    +    // default storage level
    +    ds1.persist()
    +    ds2.cache()
    +    assert(ds1.storageLevel() == StorageLevel.MEMORY_AND_DISK)
    +    assert(ds2.storageLevel() == StorageLevel.MEMORY_AND_DISK)
    +    // unpersist
    +    ds1.unpersist()
    +    assert(ds1.storageLevel() == StorageLevel.NONE)
    +    // non-default storage level
    +    ds1.persist(StorageLevel.MEMORY_ONLY_2)
    --- End diff --
    
    I knew. : ) That is white box testing. Normally, writing test cases should not be done by the same person who wrote the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r73432445
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -376,24 +376,47 @@ def foreachPartition(self, f):
     
         @since(1.3)
         def cache(self):
    -        """ Persists with the default storage level (C{MEMORY_ONLY}).
    +        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
    +
    +        .. note:: the default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
             """
             self.is_cached = True
             self._jdf.cache()
             return self
     
         @since(1.3)
    -    def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
    -        """Sets the storage level to persist its values across operations
    -        after the first time it is computed. This can only be used to assign
    -        a new storage level if the RDD does not have a storage level set yet.
    -        If no storage level is specified defaults to (C{MEMORY_ONLY}).
    +    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
    --- End diff --
    
    One downside of that approach is the user can't easily explicitly cache in-memory only deserialized or even cache on two machines deserialized easily.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67832007
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -376,24 +376,47 @@ def foreachPartition(self, f):
     
         @since(1.3)
         def cache(self):
    -        """ Persists with the default storage level (C{MEMORY_ONLY}).
    +        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
    +
    +        .. note:: the default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
             """
             self.is_cached = True
             self._jdf.cache()
             return self
     
         @since(1.3)
    -    def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
    -        """Sets the storage level to persist its values across operations
    -        after the first time it is computed. This can only be used to assign
    -        a new storage level if the RDD does not have a storage level set yet.
    -        If no storage level is specified defaults to (C{MEMORY_ONLY}).
    +    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
    +        """Sets the storage level to persist the contents of the :class:`DataFrame` across
    +        operations after the first time it is computed. This can only be used to assign
    +        a new storage level if the :class:`DataFrame` does not have a storage level set yet.
    +        If no storage level is specified defaults to (C{MEMORY_AND_DISK}).
    +
    +        .. note:: the default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
             """
             self.is_cached = True
             javaStorageLevel = self._sc._getJavaStorageLevel(storageLevel)
             self._jdf.persist(javaStorageLevel)
             return self
     
    +    @since(2.0)
    +    def storageLevel(self):
    --- End diff --
    
    maybe mark this as a property so we don't need to use parenthesis?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    **[Test build #60836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60836/consoleFull)** for PR 13780 at commit [`8b4a490`](https://github.com/apache/spark/commit/8b4a4908f6fd5f8d5c019254390821250f9e4f43).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60843/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add storageLevel to Dataset

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13780


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Overall LGTM, just one minor comment about the test case. BTW, maybe you can change the PR title after you updating the function name? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Can you reset the non relevant changes?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13780#discussion_r67833027
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -376,24 +376,47 @@ def foreachPartition(self, f):
     
         @since(1.3)
         def cache(self):
    -        """ Persists with the default storage level (C{MEMORY_ONLY}).
    +        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
    +
    +        .. note:: the default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0.
             """
             self.is_cached = True
             self._jdf.cache()
             return self
     
         @since(1.3)
    -    def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
    -        """Sets the storage level to persist its values across operations
    -        after the first time it is computed. This can only be used to assign
    -        a new storage level if the RDD does not have a storage level set yet.
    -        If no storage level is specified defaults to (C{MEMORY_ONLY}).
    +    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
    --- End diff --
    
    @rxin I updated the default here in `persist`, to match `cache`.
    
    But actually, it's still not quite correct - the default storage levels for Python are all serialized. But the MEMORY-based ones don't match the Scala side (which are deserialized). This was done for RDDs but doesn't quite work for DataFrames (since DF on the Scala side is cached deserialized by default).
    
    So here `df.cache()` results in `MEMORY_AND_DISK (deser)` while `df.persist()` results in `MEMORY_AND_DISK (ser)`. Ideally I'd say we don't want to encourage users to accidentally use the serialized forms for DF caching (since it is less efficient, as I understand it?). Let me know what you think


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13780: [SPARK-16063][SQL] Add getStoragelevel to Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13780
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org