You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mpjlu <gi...@git.apache.org> on 2017/10/19 11:26:37 UTC
[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
GitHub user mpjlu opened a pull request:
https://github.com/apache/spark/pull/19536
[SPARK-6685][ML]Use DSYRK to compute AtA in ALS
## What changes were proposed in this pull request?
This is a reopen of PR: https://github.com/apache/spark/pull/13891 with mima fix.
Please reference PR13891 for the performance result and discussion.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mpjlu/spark 6658
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19536.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19536
----
commit 8fb4a82be98588795454127f4281363d105146f0
Author: hqzizania <qi...@intel.com>
Date: 2016-06-24T06:26:31Z
add dsyrk to ALS
commit 7e3d238c62269f923832e7cba237f750082450a9
Author: hqzizania <qi...@intel.com>
Date: 2016-06-24T16:28:57Z
implicitprefs ut fails fix
commit 3607bdcd22cf03c068e777a1927ee3d5da49045a
Author: hqzizania <qi...@intel.com>
Date: 2016-06-28T13:51:55Z
use "while" loop instead of "for"
set stack size > 128 and comments added
commit 56194eb8dc3ea8c233dbb362eeb83af5091abcba
Author: hqzizania <qi...@intel.com>
Date: 2016-06-29T10:57:57Z
add unit test for dostack ALS
commit dc4f4badba26635aa95c6ade5a589d4bd50ae886
Author: hqzizania <qi...@intel.com>
Date: 2016-10-21T05:34:51Z
reset threshold values for doStack and remove UT
commit d29fd67a2a24675b7be2f7f51ba170fda11a85d7
Author: hqzizania <hq...@gmail.com>
Date: 2016-10-24T01:58:18Z
add threshold param to ALS
commit 294164d839b0ce191fee341b0eb82b81d506d8c8
Author: hqzizania <hq...@gmail.com>
Date: 2016-10-24T02:47:50Z
nit fix
commit 513e7915ecb807bc04ed8a17fdaa121e9ac578b5
Author: hqzizania <hq...@gmail.com>
Date: 2016-10-24T02:56:39Z
nit
commit f56b586092a24c1010326ed7f94b6b1eb427d378
Author: hqzizania <hq...@gmail.com>
Date: 2016-10-24T04:41:14Z
Merge remote-tracking branch 'origin/master' into ALSdsyrk
commit 1081e64c3fbd31c3d35b987b3200eae8c8c688e2
Author: hqzizania <hq...@gmail.com>
Date: 2016-10-24T04:44:52Z
mima fix
commit a6b5a16cd78e4efe99fda40f92592c9712b04146
Author: hqzizania <hq...@gmail.com>
Date: 2016-10-24T05:04:39Z
oops
commit 20774575ad51d2a10e47c8f8ee7581a1519ace6b
Author: hqzizania <hq...@gmail.com>
Date: 2016-10-25T17:28:45Z
solve mima failure
commit 061df755c80bbb050dcc85605f382b11e974f872
Author: Peng Meng <pe...@intel.com>
Date: 2017-10-19T11:22:21Z
reopen SPARK-6685
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu closed the pull request at:
https://github.com/apache/spark/pull/19536
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19536
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82905/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19536
**[Test build #82905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82905/testReport)** for PR 19536 at commit [`4fdcbe0`](https://github.com/apache/spark/commit/4fdcbe01a1797800d4480a75e38942f5b2443ac4).
* This patch **fails to generate documentation**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:
https://github.com/apache/spark/pull/19536
cc @srowen @mengxr @yanboliang
The performance data is updated, thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on a diff in the pull request:
https://github.com/apache/spark/pull/19536#discussion_r146511769
--- Diff: project/MimaExcludes.scala ---
@@ -1043,6 +1043,9 @@ object MimaExcludes {
// [SPARK-21680][ML][MLLIB]optimzie Vector coompress
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.mllib.linalg.Vector.toSparseWithSize"),
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.linalg.Vector.toSparseWithSize")
+ ) ++ Seq(
+ // [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
+ ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.recommendation.ALS.train")
--- End diff --
I am ok to remove this, and use a loose threshold (e.g. 100), which is helpful for most cases. How about it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/19536#discussion_r145677681
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
@@ -589,6 +602,9 @@ class ALS(@Since("1.4.0") override val uid: String) extends Estimator[ALSModel]
@Since("2.2.0")
def setColdStartStrategy(value: String): this.type = set(coldStartStrategy, value)
+ @Since("2.3.0")
+ def setThreshold(value: Int): this.type = set(threshold, value)
--- End diff --
I'm not sure callers can meaningfully understand and set this. Can't we pick a threshold programmatically?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by hqzizania <gi...@git.apache.org>.
Github user hqzizania commented on the issue:
https://github.com/apache/spark/pull/19536
Wow, thank you for reopening. LOL @mpjlu
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on a diff in the pull request:
https://github.com/apache/spark/pull/19536#discussion_r145735791
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
@@ -589,6 +602,9 @@ class ALS(@Since("1.4.0") override val uid: String) extends Estimator[ALSModel]
@Since("2.2.0")
def setColdStartStrategy(value: String): this.type = set(coldStartStrategy, value)
+ @Since("2.3.0")
+ def setThreshold(value: Int): this.type = set(threshold, value)
--- End diff --
Yes, my test results is better than the previous results, especially for native BLAS. I will update my test results here soon, and I will change this set.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19536
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19536
**[Test build #82905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82905/testReport)** for PR 19536 at commit [`4fdcbe0`](https://github.com/apache/spark/commit/4fdcbe01a1797800d4480a75e38942f5b2443ac4).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:
https://github.com/apache/spark/pull/19536
Hi @yanboliang , I reopen this PR per your suggestion, thanks.
I have tested the code, the performance result is matched with @hqzizania 's results.
Thanks @hqzizania .
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/19536#discussion_r145678163
--- Diff: project/MimaExcludes.scala ---
@@ -1043,6 +1043,9 @@ object MimaExcludes {
// [SPARK-21680][ML][MLLIB]optimzie Vector coompress
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.mllib.linalg.Vector.toSparseWithSize"),
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.linalg.Vector.toSparseWithSize")
+ ) ++ Seq(
+ // [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
+ ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.recommendation.ALS.train")
--- End diff --
This breaks an API unnecessarily, even though it's a dev API. I think we instead need to remove this as a user-facing param and avoid it altogether.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:
https://github.com/apache/spark/pull/19536
![image](https://user-images.githubusercontent.com/13826327/31934047-8e886360-b8dd-11e7-996a-d734ac39bc5b.png)
100k*100k, sparsity 0.05 | old | old | New
-- | -- | -- | --
rank | f2j | mkl | mkl
20 | 4 | 9 | 4
50 | 13 | 12 | 13
100 | 40 | 60 | 30
200 | 156 | 90 | 66
500 | 900 | 720 | 270
This is the performance data of this PR.
The date size is 100,000*100,000, the data sparsity is 0.05.
The test environment is:
3 workers: 35 cores/worker, 210G memory/worker, 1 executor/worker.
The native BLAS is Intel MKL
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19536: [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
Posted by mpjlu <gi...@git.apache.org>.
Github user mpjlu commented on the issue:
https://github.com/apache/spark/pull/19536
Because I don't have the environment to continue this work, I will close it. Thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org