You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by shahidki31 <gi...@git.apache.org> on 2018/11/12 14:12:06 UTC

[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...

GitHub user shahidki31 opened a pull request:

    https://github.com/apache/spark/pull/23016

    [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the PrefixSpan

    ## What changes were proposed in this pull request?
    Mllib's Prefixspan - run method - cached RDD stays in cache. After run is comlpeted , rdd remain in cache.
    We need to unpersist the cached RDD after run method.
    
    
    ## How was this patch tested?
    Existing tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark SPARK-26006

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23016.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23016
    
----
commit 2e3b0891b54dd12441d8c55230837bc182e11608
Author: Shahid <sh...@...>
Date:   2018-11-12T14:08:33Z

    unpersist 'dataInternalRepr' after run method

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    **[Test build #98951 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98951/testReport)** for PR 23016 at commit [`a4c5597`](https://github.com/apache/spark/commit/a4c55974a04878dfa92cdbb636823bbac10c8f64).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98728/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    **[Test build #98853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98853/testReport)** for PR 23016 at commit [`5c4949d`](https://github.com/apache/spark/commit/5c4949d675bb06689dae8a4748b83f271f176745).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by idlevi <gi...@git.apache.org>.
Github user idlevi commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Wouldn't the added count increase the execution time? did you do a benchmark?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by shahidki31 <gi...@git.apache.org>.
Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    > I am currently using it with spark 2.3 as
    > 
    > org.apache.spark
    > spark-mllib_2.11
    > 
    > How can i get this fix?
    
    You can cherry pick the commit from the master branch.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    **[Test build #98728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98728/testReport)** for PR 23016 at commit [`2e3b089`](https://github.com/apache/spark/commit/2e3b0891b54dd12441d8c55230837bc182e11608).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98950/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by idlevi <gi...@git.apache.org>.
Github user idlevi commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    I am currently using it with spark 2.3 as 
     <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-mllib_2.11</artifactId>
            </dependency>
    How can i get this fix?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98951/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by shahidki31 <gi...@git.apache.org>.
Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    @idlevi Actually, input and output of the prefix span are RDD. Earlier intermediate rdd was cached, now final rdd is cached, and materialized it. So, if you materialize the model, earlier it will compute from the intermediate level, now it directly get from the finalRdd.
     I ran all the UTs in the prefixSpanSuite, and there is hardly any time difference with/without the patch.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    **[Test build #98728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98728/testReport)** for PR 23016 at commit [`2e3b089`](https://github.com/apache/spark/commit/2e3b0891b54dd12441d8c55230837bc182e11608).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by shahidki31 <gi...@git.apache.org>.
Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Thank you @srowen 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    **[Test build #98951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98951/testReport)** for PR 23016 at commit [`a4c5597`](https://github.com/apache/spark/commit/a4c55974a04878dfa92cdbb636823bbac10c8f64).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by shahidki31 <gi...@git.apache.org>.
Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    cc @srowen Kindly review 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    **[Test build #98950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98950/testReport)** for PR 23016 at commit [`7a4ede8`](https://github.com/apache/spark/commit/7a4ede8da3076892a9302d5d1e11bf6dcc77ff05).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by shahidki31 <gi...@git.apache.org>.
Github user shahidki31 commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Thanks @srowen , I will update the PR


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Merged to master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23016#discussion_r234395721
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
    @@ -174,6 +174,10 @@ class PrefixSpan private (
         val freqSequences = results.map { case (seq: Array[Int], count: Long) =>
           new FreqSequence(toPublicRepr(seq), count)
         }
    +    // Cache the final RDD to the same storage level as input
    +    freqSequences.persist(data.getStorageLevel)
    --- End diff --
    
    The problem here is that it won't get persisted until something materializes it, and at that point its dependent RDD dataInternalRepr is already unpersisted.
    
    I'd say that _if_ the input's storage level isn't NONE, then persist freqSequences at the same level and .count() it to materialize it. Then unpersist dataInternalRepr in all events.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    **[Test build #98950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98950/testReport)** for PR 23016 at commit [`7a4ede8`](https://github.com/apache/spark/commit/7a4ede8da3076892a9302d5d1e11bf6dcc77ff05).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98853/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    **[Test build #98853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98853/testReport)** for PR 23016 at commit [`5c4949d`](https://github.com/apache/spark/commit/5c4949d675bb06689dae8a4748b83f271f176745).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...

Posted by shahidki31 <gi...@git.apache.org>.
Github user shahidki31 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23016#discussion_r234396276
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
    @@ -174,6 +174,10 @@ class PrefixSpan private (
         val freqSequences = results.map { case (seq: Array[Int], count: Long) =>
           new FreqSequence(toPublicRepr(seq), count)
         }
    +    // Cache the final RDD to the same storage level as input
    +    freqSequences.persist(data.getStorageLevel)
    --- End diff --
    
    @srowen  Yes. That is the correct approach. I updated the code. Thanks


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/23016


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/23016
  
    I'm not sure about that, because the returned PrefixSpanModel has an RDD that depends on that RDD. We could cache the final RDD instead and materialize it; that could make more sense. 
    
    In other places we have done such a thing only when the input is cached, in order to kind of follow the caller's lead, but there isn't a consistent standard for this.
    
    I'd be OK improving this to persist the final RDD only, and then unpersist the intermediate one. That makes at least more sense. You can cache at the same storage level as the input (which might be NONE)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org