Posted to reviews@spark.apache.org by witgo <gi...@git.apache.org> on 2014/05/19 18:15:59 UTC

[GitHub] spark pull request: [WIP]Improve ALS resource usage

GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/828

    [WIP]Improve ALS resource usage 

    Currently, in the ALS algorithm, RDDs cannot be cleaned up.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark checkpoint

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/828.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #828
    
----
commit 6d7f2408a40bf4bb2889bf66fa61bced782cdefc
Author: witgo <wi...@qq.com>
Date:   2014-05-19T15:35:17Z

    ALS can't clean shuffle

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43569842
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43589674
  
    [The code](https://github.com/witgo/spark/commit/6d7f2408a40bf4bb2889bf66fa61bced782cdefc#diff-2b593e0b4bd6eddab37f04968baa826c) will make the checkpoint directory larger, and the directory is not cleared.



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43564707
  
     Merged build triggered. 



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43608181
  
    @mateiz @mengxr  
    I added a new RDD operation, `cachePoint`.



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43790944
  
    @mateiz, @mengxr 
    I am using [the code](https://github.com/witgo/spark/compare/cachePoint) to test ALS.
    A brief description of the test:
    
    | Item | Description |
    | ------------- | ----------- |
    |cluster |`3 servers`,`36 core cpus`,`2.5T HDD`,`120G memory`|
    |data| `700 million`|
    |code|`val model = ALS.trainImplicit(ratings, 25, 30, 0.065, -1, 40.0)`|
    |time|`12.5 h`|
    |shuffle write| `4.72T`|
    |largest local dir|`200G`|



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo closed the pull request at:

    https://github.com/apache/spark/pull/828



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43583620
  
    @mateiz It is not necessary to write it to the file system. After all, no other RDD reads it. I think the checkpoint data should be put into the blockManager, so performance would be much higher.



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43524803
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43840745
  
    @tdas 
    You're right. The code breaks the fault-tolerance properties of RDDs.
    The ideal solution is automatic cleanup and rebuilding of shuffle data.



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43569848
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15084/



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43564725
  
    Merged build started. 



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43656940
  
    Another [solution](https://github.com/witgo/spark/compare/cachePoint).



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43825145
  
    I don't think this cachePoint is a good idea at all. While it *can* give better performance, it fundamentally breaks the fault-tolerance properties of RDDs. If you cachePoint() an RDD with MEMORY_ONLY and the executor then dies, you have no way to recover the lost partitions, because there is no lineage information describing how that RDD was created. All Spark operations maintain this guarantee of fault tolerance despite failed workers, and breaking it is a bad idea. So this is a fundamentally unsafe operation to expose to the end user.
    
    In fact, this is the same reason why checkpoint() has been implemented using HDFS: the fault-tolerance property is maintained (data is saved to fault-tolerant storage) even if executors die.
    
    That said, there is a good middle ground here. We can do what cachePoint() does while ensuring that the data is replicated across the executors (for a better fault-tolerance guarantee), but not expose it to users (so that it does not break public API semantics). This would be an ALS-only solution.
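
For readers unfamiliar with the two mechanisms contrasted above, the following is a minimal sketch using only public Spark APIs (`SparkContext.setCheckpointDir`, `RDD.checkpoint`, `RDD.persist` with a replicated `StorageLevel`). It does not reproduce `cachePoint()`, which exists only in the proposed branch, and it is not the ALS-internal change suggested in the comment. The object name, the temporary directory, and the toy `map` are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical illustration object; not part of Spark or of the pull request.
object CheckpointVsReplicatedCache {

  // Option 1: checkpoint() saves the RDD under the checkpoint directory
  // (on a cluster this should be fault-tolerant storage such as HDFS)
  // and then drops the references to the parent RDDs.
  def withCheckpoint(sc: SparkContext, data: RDD[Int], dir: String): RDD[Int] = {
    sc.setCheckpointDir(dir)
    // Persisting first avoids recomputing the RDD when the checkpoint is written.
    val squared = data.map(x => x * x).persist(StorageLevel.MEMORY_AND_DISK)
    squared.checkpoint() // must be called before the first action on this RDD
    squared.count()      // the action triggers the actual checkpoint write
    squared
  }

  // Option 2: keep the lineage and ask for two replicas of each cached
  // partition, so losing one executor does not force a recomputation.
  def withReplicatedCache(data: RDD[Int]): RDD[Int] = {
    val squared = data.map(x => x * x).persist(StorageLevel.MEMORY_AND_DISK_2)
    squared.count()
    squared
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000, 4)
      // A local temporary directory stands in for HDFS in this sketch.
      val dir = java.nio.file.Files.createTempDirectory("spark-checkpoint").toString
      val a = withCheckpoint(sc, data, dir)
      val b = withReplicatedCache(data)
      println(s"checkpointed = ${a.isCheckpointed}, storage level = ${b.getStorageLevel}")
    } finally {
      sc.stop()
    }
  }
}
```

In local mode there are no peer executors, so the replicated storage level only shows the call; the replication itself takes effect on a real cluster.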




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-44742037
  
    This solution is not perfect, so I am closing this for now. The new pull request is #929.



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43564458
  
    Jenkins, test this please



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43575644
  
    @witgo Could you check how long a simple `model.predict(user, product)` call takes when checkpoint is used, compared to the in-memory cached version?
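
One way to take the measurement requested above is a small timing helper, sketched here under assumptions: the names `PredictTiming`, `timed`, and `averageMillis` are hypothetical, and a plain Scala computation stands in for the `model.predict(user, product)` call so that the sketch runs without a cluster.

```scala
// Hypothetical helper for comparing call latency; not part of Spark.
object PredictTiming {

  /** Runs `body` once and returns its result with the elapsed wall-clock time in milliseconds. */
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  /** Average latency over `runs` calls, after one untimed warm-up call. */
  def averageMillis[A](runs: Int)(body: => A): Double = {
    require(runs > 0, "runs must be positive")
    timed(body) // warm-up, so one-time costs (JIT, first materialization) are not counted
    val total = (1 to runs).map(_ => timed(body)._2).sum
    total / runs
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload. With a trained model, the body would instead be:
    //   model.predict(user, product)
    val avg = averageMillis(runs = 5) { (1 to 100000).map(_.toLong).sum }
    println(f"average latency: $avg%.3f ms")
  }
}
```

Running the same helper once against a model whose factors were checkpointed and once against a model whose factors are cached in memory would give the two numbers to compare.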



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-44122991
  
    I am using [the code](https://github.com/witgo/spark/compare/cleanup_checkpoint_date_als) to test ALS.
    A brief description of the test:
    
    | Item | Description |
    | ------------- | ----------- |
    |cluster |`3 servers`,`36 core cpus`,`2.5T HDD`,`120G memory`|
    |data| `700 million`|
    |code|`val model = ALS.trainImplicit(ratings, 25, 30, 0.065, -1, 40.0)`|
    |time|`18.7 h`|
    |shuffle write| `4.72T`|
    |largest local dir|`200G`|
    |checkpoint  dir|`16.6G`|
    
    @mengxr
    When checkpoint is used, ALS seems a lot slower.
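
To put a number on "a lot slower", the two run times reported in this thread (12.5 h on the `cachePoint` branch, 18.7 h on the `cleanup_checkpoint_date_als` branch, same cluster, data, and `trainImplicit` arguments) can be compared directly. The object and value names below are hypothetical; only the two durations come from the reports.

```scala
// Hypothetical names; the two durations are copied from the test reports in this thread.
object AlsBenchmarkComparison {
  val cachePointHours: Double = 12.5 // run on the cachePoint branch
  val checkpointHours: Double = 18.7 // run on the cleanup_checkpoint_date_als branch

  // Absolute and relative cost of the checkpoint-based run.
  val extraHours: Double = checkpointHours - cachePointHours
  val slowdown: Double = checkpointHours / cachePointHours

  def main(args: Array[String]): Unit = {
    println(f"extra time: $extraHours%.1f h, ratio: $slowdown%.2f")
  }
}
```

The checkpoint-based run took about 6.2 h longer, roughly 1.5 times the duration of the cachePoint run. These are single runs, so the ratio is indicative only.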



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43579554
  
    @mengxr I was testing the changes.
    The environment is as follows:
    `700 million data`,`3 servers`,`36 core cpus`,`2.5T HDD`,`96G memory`.



[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43581291
  
    @tdas CheckpointRDD is not properly cleaned up.



Re: [GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by Mark Hamstra <ma...@clearstorydata.com>.

That's more or less the definition of a checkpoint.

> On May 19, 2014, at 7:58 PM, witgo <gi...@git.apache.org> wrote:
> 
> Github user witgo commented on the pull request:
> 
>    https://github.com/apache/spark/pull/828#issuecomment-43581755
> 
>    @mateiz Why must the checkpoint data be written to the file system?
> 
> 
> 

[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/828#issuecomment-43581755
  
    @mateiz Why must the checkpoint data be written to the file system?


