You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2016/07/24 12:48:13 UTC

[GitHub] spark pull request #14333: [SPARK-16696][ML][MLLib] unused broadcast variabl...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/14333

    [SPARK-16696][ML][MLLib] unused broadcast variables do destroy call to release memory in time

    ## What changes were proposed in this pull request?
    
    update unused broadcast in KMeans/LDAOptimizer/Word2Vec,
    use destroy(false) to release memory in time.
    
    and several place destroy() update to destroy(false) so that it will be async-called,
    it will better than blocking called. 
    
    ## How was this patch tested?
    
    Existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark broadvar_unpersist_to_destroy

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14333
    
----
commit 52afc038c79ab8176bf760d65793e8d5f94d4d4a
Author: WeichenXu <we...@outlook.com>
Date:   2016-07-20T14:12:41Z

    broadvar_unpersist_to_destroy

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] fix KMeans bcNewCenters unpersi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62779/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62768/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Yes, I suppose the issue is consistency ... there are loads of places where RDDs and broadcasts aren't really cleaned up properly in the code. Maybe it's fine to take extra steps here to at least ensure that all broadcasts we know of are handled correctly. OK I think it's pretty good if you're OK with the change as is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62778/consoleFull)** for PR 14333 at commit [`f129a2b`](https://github.com/apache/spark/commit/f129a2b575ac9523672e321e64151bbff64e71c5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen 
    The `bcNewCenters` in `KMeans` has some problem.
    Check the code logic in detail, we can find that in each loop, it should destroy the broadcast var `bcNewCenters` generated in the previous loop, not the one generated in current loop. Like what
    is done to the `costs: RDD`, which use a `preCosts` var to save that generated in previous loop.
    I update the code. 
    
    The second problem, what's the meaning of `broadcast.unpersist`, eh, I think, there is another senario, suppose there is a RDD lineage, when executing in normal case, it executed successfully, and in code we can unpersist useless broadcast var in time, but, if some exception happened, the spark can recovery  from it, it need to recovery the broken RDD from the RDD lineage and in such case may re-use the broadcast var we had unpersisted. If we simply destroy it, the broadcast var cannot be recover
    so that the recovery will fail.
    
    So that I think the safe place to use `broadcast.destroy` is the place where some action to RDD has successfully executed, and the whole RDD lineage is no longer needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen 
    I check code again, the problem I mentioned above
    `But now I found another problem in BisectKMeans:
    in line 191 there is a iteration it also need this pattern \u201cpersist current step RDD, unpersist previous\u201d`
    is NOT A PROBLEM, I misread code.
    
    And I check other modifications in this PR again and it seems no problems. 
    
    So I think the PR is OK now. Thanks!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62793/consoleFull)** for PR 14333 at commit [`01f4d3a`](https://github.com/apache/spark/commit/01f4d3ab2e4c591b7b27b1ed92fbe07e1fc876d1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    I don't understand the last change. As far as I can see it can be destroyed inside the loop iteration. It's also possible to reuse the broadcast (declare outside the loop), and unpersist each iteration, and destroy afterwards. But I don't see the need to hold on to all these broadcasts? 
    
    Yes you make a fair point about recovery. I think your changes are safe in this respect, good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen yeah, the code logic here seems confusing, but I think it is right.
    Now I can explain it in a clear way:
    in essence, the logic can be expressed as following:
    A0->I1->A1->I2->A2->...
    A0 is the initial `assignments`, I1 is step-1 `indices`, A1 is step-1 `assignment`, I2 is step-2 `indices`, and so on.
    There is dependency between them as the arrows show.
    Now the key point is that when we compute I(K), we must make sure I(K-1) is persisted, and I(K-2) and older ones can be unpersisted.
    NOW, check my code logic, in fact, in each iteration, I do the following thing:
    1. unpersist I(K-1)
    2. compute I(K+1) using A(K), and because of dependency, A(K) must use I(K), And I(K) is STILL PERSISTED.
    3. compute A(K+1) using I(K+1)
    
    But now I found another problem in BisectKMeans:
    in line 191 there is a iteration it also need this pattern \u201cpersist current step RDD, unpersist previous\u201d
    and the iteration's first RDD relates to code here.
    The problem seems a little troblesome so we'd better create another PR to handle it ?
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62907/consoleFull)** for PR 14333 at commit [`dc17da8`](https://github.com/apache/spark/commit/dc17da8eec232fcf2296deefb64222a6d07a0983).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62769 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62769/consoleFull)** for PR 14333 at commit [`c40f7f8`](https://github.com/apache/spark/commit/c40f7f829ca03d6c0ad55cc75964e7be5c59d748).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCent...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14333


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] fix KMeans bcNewCenters unpersi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62779 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62779/consoleFull)** for PR 14333 at commit [`ecd15b2`](https://github.com/apache/spark/commit/ecd15b2b7b7777ac21a9e37f998747a752903530).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen The KMeans.initKMeansParallel already implements the pattern "persist current step RDD, and unpersist previous one", but I think an RDD persisted can also break down because of disk error or something? If we want to reach the goal "the broadcast can safely be disposed(`broadcast.destroy`) at each iteration too, rather than only at the end", I think it need to use `RDD.checkpoint` instead of `RDD.persist` ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62908/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    yeah, but the `bcSyn0Global` in Word2Vec is a difference case, it looks safe there to destroy, 
    because in each loop iteration, the RDD transform which use `bcSyn0Global` ends with a `collect`,
    after the `collect` action, we no longer need the RDD(the RDD's all computation has done, no more possible recovery) so we also can destroy the `bcSyn0Global` directly in each loop iteration.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Yeah, I think you're right, because the unpersisted RDD can still be recomputed but not a destroyed Broadcast. Hm, then isn't this also true of `bcSyn0Global`?
    
    I suppose I think we should prefer to keep it simple and correct first, and only introduce complexity to optimize while preserving correctness. If some of the current unpersist calls can't be safely changed to destroy, maybe best to leave them rather than find a way to destroy them, if we don't know that it's a problem.
    
    I think we still have an RDD-related problem here in that the intermediate RDDs aren't unpersisted, and all of them remain persisted after the loop. Kind of a separate issue, I suppose.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen
    I check the code about KMean `bcNewCenters` again, if we want to make sure the recovery of RDD will successful in any unexcepted case,  we have to keep all the `bcNewCenters` generated in each loop, until the loop is done.
    because each loop the `costs:RDD` is build using preCosts:RDD, so that it became a RDD link. and the loop will and only will keep latest two RDDs being persisted. if the last two RDDs is broken, spark will need to rebuild them from the first RDD, in such case, each historical `bcNewCenters` generated will be used.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62793/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen 
    I check `RDD.persist` referenced place:
    AFTSuvivalRegression, LinearRegression, LogisticRegression, will persist input training RDD and unpersist them when `train` return, seems OK.
    `recommend.ALS` persist many RDDs and seems unpersist them all OK.
    mllib `BisectingKMeans.run` contains a TODO "unpersist old indices", I'll check it now.
    Others seems OK.
    
    `Broadcast.persist` referenced place already checked in this PR I think they are all properly handled here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62778/consoleFull)** for PR 14333 at commit [`f129a2b`](https://github.com/apache/spark/commit/f129a2b575ac9523672e321e64151bbff64e71c5).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62793/consoleFull)** for PR 14333 at commit [`01f4d3a`](https://github.com/apache/spark/commit/01f4d3ab2e4c591b7b27b1ed92fbe07e1fc876d1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62769/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62907/consoleFull)** for PR 14333 at commit [`dc17da8`](https://github.com/apache/spark/commit/dc17da8eec232fcf2296deefb64222a6d07a0983).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    @srowen 
    The sparkContext, by default, will running a cleaner to release not referenced RDD/broadcasts on background. But, I think, we'd better to release them by ourselves because the SparkContext auto-cleaner  depends on java-gc. If gc not triggered the cleaner won't release the unused broadcasts.
    As we can see, if a driver-side broadcast has been unpersisted but not destroyed, its metadata will be keep in the context and the content of the var broadcasted is kept in driver-side, here in KMeans `bcNewCenters` is a two-dimension array so I think it is better to release them as soon as possible.
    About the overhead, I think to track these history `bcNewCenters` broadcasts only need a reference list and it can destroyed in async way so the overhead is acceptable.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62908/consoleFull)** for PR 14333 at commit [`7f042a2`](https://github.com/apache/spark/commit/7f042a2172166d0de413297351b4fe9b04168071).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    How about the same for `bcNewCenters` in `KMeans`?
    
    Yeah, it seems like it's pretty rare to want to call `unpersist` here. The only context where it seems valid are the two left in `Word2Vec` like `bcSyn0Global` where the driver's state needs to be rebroadcast on each loop. But even then you could make a new broadcast for the same variable and destroy it in the loop.
    
    I suppose it saves the overhead of new bookkeeping for a `Broadcast`. But unless I miss something it's not really worth the separate API. I wouldn't go so far as deprecating `unpersist` but doesn't look like something that would be added today if it weren't there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62768/consoleFull)** for PR 14333 at commit [`52afc03`](https://github.com/apache/spark/commit/52afc038c79ab8176bf760d65793e8d5f94d4d4a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCent...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14333#discussion_r72416823
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
    @@ -194,8 +196,9 @@ class BisectingKMeans private (
               newClusters = summarize(d, newAssignments)
               newClusterCenters = newClusters.mapValues(_.center).map(identity)
             }
    -        // TODO: Unpersist old indices.
    -        val indices = updateAssignments(assignments, divisibleIndices, newClusterCenters).keys
    +        if (preIndices != null) preIndices.unpersist()
    +        preIndices = indices
    +        indices = updateAssignments(assignments, divisibleIndices, newClusterCenters).keys
    --- End diff --
    
    You are probably ahead of me on this, but let me check something. To compute `assignments` in the next line, the current `indices` is needed, and that in turn needs the current value of `assignments`, which needs the previous copy of `indices` (`preIndices`). But that was unpersisted just above. Should `preIndices` be unpersisted later, after `indices` is materialized?
    
    But ... is there even an action here? if this just creating a large lineage then the persisting isn't helping much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62779 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62779/consoleFull)** for PR 14333 at commit [`ecd15b2`](https://github.com/apache/spark/commit/ecd15b2b7b7777ac21a9e37f998747a752903530).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62778/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] fix KMeans bcNewCenters unpersi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62908/consoleFull)** for PR 14333 at commit [`7f042a2`](https://github.com/apache/spark/commit/7f042a2172166d0de413297351b4fe9b04168071).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62769/consoleFull)** for PR 14333 at commit [`c40f7f8`](https://github.com/apache/spark/commit/c40f7f829ca03d6c0ad55cc75964e7be5c59d748).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Oh, it is indeed building up a lineage. I think it's easier to leave this broadcast as-is then unless we know that destroying them is essential for reclaiming driver resources.
    
    Here's another issue though: we now have a lineage where all of the RDDs are persisted (the cost RDDs), but I think only the last one is unpersisted. Ideally each would be persisted, materialized, and then unpersist the previous one. We had this problem and solution pattern in Word2Vec. Does that make sense?
    
    It might be worth tackling here, because, if that's implemented, then I think the broadcast can safely be disposed at each iteration too, rather than only at the end.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    OK sounds good. So maybe we're back to this: for `bcNewCenters`, is it really worth the overhead to track and destroy them? or just settle for unpersisting within the loop? I could go either way, your call.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62907/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    **[Test build #62768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62768/consoleFull)** for PR 14333 at commit [`52afc03`](https://github.com/apache/spark/commit/52afc038c79ab8176bf760d65793e8d5f94d4d4a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14333
  
    If the last problem is really pretty related to this code, then it should change here as well. However if you're not sure there's an easy fix, we can leave it for later. Are you comfortable that the current change is ready?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org