You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mpjlu <gi...@git.apache.org> on 2017/10/19 11:26:37 UTC

[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

GitHub user mpjlu opened a pull request:

    https://github.com/apache/spark/pull/19536

    [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

    ## What changes were proposed in this pull request?
    
    This is a reopen of PR: https://github.com/apache/spark/pull/13891 with mima fix.
    Please reference PR13891 for the performance result and discussion.
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mpjlu/spark 6658

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19536.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19536
    
----
commit 8fb4a82be98588795454127f4281363d105146f0
Author: hqzizania <qi...@intel.com>
Date:   2016-06-24T06:26:31Z

    add dsyrk to ALS

commit 7e3d238c62269f923832e7cba237f750082450a9
Author: hqzizania <qi...@intel.com>
Date:   2016-06-24T16:28:57Z

    implicitprefs ut fails fix

commit 3607bdcd22cf03c068e777a1927ee3d5da49045a
Author: hqzizania <qi...@intel.com>
Date:   2016-06-28T13:51:55Z

    use "while" loop instead of "for"
    set stack size > 128 and comments added

commit 56194eb8dc3ea8c233dbb362eeb83af5091abcba
Author: hqzizania <qi...@intel.com>
Date:   2016-06-29T10:57:57Z

    add unit test for dostack ALS

commit dc4f4badba26635aa95c6ade5a589d4bd50ae886
Author: hqzizania <qi...@intel.com>
Date:   2016-10-21T05:34:51Z

    reset threshold values for doStack and remove UT

commit d29fd67a2a24675b7be2f7f51ba170fda11a85d7
Author: hqzizania <hq...@gmail.com>
Date:   2016-10-24T01:58:18Z

    add threshold param to ALS

commit 294164d839b0ce191fee341b0eb82b81d506d8c8
Author: hqzizania <hq...@gmail.com>
Date:   2016-10-24T02:47:50Z

    nit fix

commit 513e7915ecb807bc04ed8a17fdaa121e9ac578b5
Author: hqzizania <hq...@gmail.com>
Date:   2016-10-24T02:56:39Z

    nit

commit f56b586092a24c1010326ed7f94b6b1eb427d378
Author: hqzizania <hq...@gmail.com>
Date:   2016-10-24T04:41:14Z

    Merge remote-tracking branch 'origin/master' into ALSdsyrk

commit 1081e64c3fbd31c3d35b987b3200eae8c8c688e2
Author: hqzizania <hq...@gmail.com>
Date:   2016-10-24T04:44:52Z

    mima fix

commit a6b5a16cd78e4efe99fda40f92592c9712b04146
Author: hqzizania <hq...@gmail.com>
Date:   2016-10-24T05:04:39Z

    oops

commit 20774575ad51d2a10e47c8f8ee7581a1519ace6b
Author: hqzizania <hq...@gmail.com>
Date:   2016-10-25T17:28:45Z

    solve mima failure

commit 061df755c80bbb050dcc85605f382b11e974f872
Author: Peng Meng <pe...@intel.com>
Date:   2017-10-19T11:22:21Z

    reopen SPARK-6685

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu closed the pull request at:

    https://github.com/apache/spark/pull/19536


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82905/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    **[Test build #82905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82905/testReport)** for PR 19536 at commit [`4fdcbe0`](https://github.com/apache/spark/commit/4fdcbe01a1797800d4480a75e38942f5b2443ac4).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    cc @srowen  @mengxr @yanboliang 
    The performance data is updated, thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19536#discussion_r146511769
  
    --- Diff: project/MimaExcludes.scala ---
    @@ -1043,6 +1043,9 @@ object MimaExcludes {
           // [SPARK-21680][ML][MLLIB]optimzie Vector coompress
           ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.mllib.linalg.Vector.toSparseWithSize"),
           ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.linalg.Vector.toSparseWithSize")
    +    ) ++ Seq(
    +      // [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
    +      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.recommendation.ALS.train")
    --- End diff --
    
    I am ok to remove this, and use a loose threshold (e.g. 100), which is  helpful for most cases. How about it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19536#discussion_r145677681
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -589,6 +602,9 @@ class ALS(@Since("1.4.0") override val uid: String) extends Estimator[ALSModel]
       @Since("2.2.0")
       def setColdStartStrategy(value: String): this.type = set(coldStartStrategy, value)
     
    +  @Since("2.3.0")
    +  def setThreshold(value: Int): this.type = set(threshold, value)
    --- End diff --
    
    I'm not sure callers can meaningfully understand and set this. Can't we pick a threshold programmatically?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by hqzizania <gi...@git.apache.org>.
Github user hqzizania commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    Wow, thank you for reopening. LOL @mpjlu 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19536#discussion_r145735791
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -589,6 +602,9 @@ class ALS(@Since("1.4.0") override val uid: String) extends Estimator[ALSModel]
       @Since("2.2.0")
       def setColdStartStrategy(value: String): this.type = set(coldStartStrategy, value)
     
    +  @Since("2.3.0")
    +  def setThreshold(value: Int): this.type = set(threshold, value)
    --- End diff --
    
    Yes, my test results is better than the previous results,  especially for native BLAS.  I will update my test results here soon, and I will change this set. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    **[Test build #82905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82905/testReport)** for PR 19536 at commit [`4fdcbe0`](https://github.com/apache/spark/commit/4fdcbe01a1797800d4480a75e38942f5b2443ac4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    Hi @yanboliang , I reopen this PR per your suggestion, thanks.
    I have tested the code, the performance result is matched with @hqzizania 's results. 
    Thanks @hqzizania .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19536#discussion_r145678163
  
    --- Diff: project/MimaExcludes.scala ---
    @@ -1043,6 +1043,9 @@ object MimaExcludes {
           // [SPARK-21680][ML][MLLIB]optimzie Vector coompress
           ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.mllib.linalg.Vector.toSparseWithSize"),
           ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.linalg.Vector.toSparseWithSize")
    +    ) ++ Seq(
    +      // [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
    +      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.recommendation.ALS.train")
    --- End diff --
    
    This breaks an API unnecessarily, even though it's a dev API. I think we instead need to remove this as a user-facing param and avoid it altogether.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    
    ![image](https://user-images.githubusercontent.com/13826327/31934047-8e886360-b8dd-11e7-996a-d734ac39bc5b.png)
    
    
    100k*100k, sparsity   0.05 | old | old | New
    -- | -- | -- | --
    rank | f2j | mkl | mkl
    20 | 4 | 9 | 4
    50 | 13 | 12 | 13
    100 | 40 | 60 | 30
    200 | 156 | 90 | 66
    500 | 900 | 720 | 270
    
    
    
    
    This is the performance data of this PR.
    The date size is 100,000*100,000, the data sparsity is 0.05.
    The test environment is:
    3 workers: 35 cores/worker, 210G memory/worker, 1 executor/worker.
    The native BLAS is Intel MKL



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS

Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/19536
  
    Because I don't have the environment to continue this work, I will close it. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org