Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/12 02:39:12 UTC
[GitHub] [spark] zhengruifeng opened a new pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
zhengruifeng opened a new pull request #30009:
URL: https://github.com/apache/spark/pull/30009
### What changes were proposed in this pull request?
1. Use `blockSizeInMB` instead of `blockSize` (number of rows) to control the stacking of vectors;
2. Infer an appropriate `blockSizeInMB` from the data sparsity if it is set to 0.
### Why are the changes needed?
The performance gain depends mainly on the number of non-zero values (nnz) in each block.
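As a rough illustration of items 1 and 2 above (not the code in this PR), the sketch below stacks rows into blocks under a byte budget instead of a fixed row count. The helper names `estimated_row_bytes` and `blockify_by_size`, and the size estimate of 8 bytes per value plus 4 bytes per index for sparse rows, are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List, Sequence


def estimated_row_bytes(row: Sequence[float]) -> int:
    """Assumed size estimate: the cheaper of a dense and a sparse encoding."""
    nnz = sum(1 for v in row if v != 0.0)
    dense_bytes = 8 * len(row)   # one 8-byte double per entry
    sparse_bytes = 12 * nnz      # 8-byte value + 4-byte index per non-zero
    return min(dense_bytes, sparse_bytes)


def blockify_by_size(
    rows: Iterable[Sequence[float]], max_block_bytes: int
) -> Iterator[List[Sequence[float]]]:
    """Group consecutive rows into blocks whose estimated size stays within the budget."""
    block: List[Sequence[float]] = []
    block_bytes = 0
    for row in rows:
        row_bytes = estimated_row_bytes(row)
        # Close the current block if adding this row would exceed the budget.
        if block and block_bytes + row_bytes > max_block_bytes:
            yield block
            block, block_bytes = [], 0
        block.append(row)
        block_bytes += row_bytes
    if block:
        yield block
```

Under this estimate and a 64-byte budget, five dense rows of four features end up in blocks of 2, 2 and 1 rows, while seven rows with a single non-zero each end up in blocks of 5 and 2 rows, so the row count per block follows the sparsity.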
### Does this PR introduce _any_ user-facing change?
Yes, the param `blockSize` is replaced by `blockSizeInMB` in master.
### How was this patch tested?
Added test suites and a performance test (results attached in the [ticket](https://issues.apache.org/jira/browse/SPARK-32907)).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707644950
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34346/
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708689240
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724432836
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35431/
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709903604
**[Test build #129886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129886/testReport)** for PR 30009 at commit [`c0a734d`](https://github.com/apache/spark/commit/c0a734de5e4d4df819caa4f86634242966d5786b).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706823314
**[Test build #129653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129653/testReport)** for PR 30009 at commit [`eb0cf6b`](https://github.com/apache/spark/commit/eb0cf6b21c913545f89b7ff8cb3ba6fa65a51556).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725825105
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35566/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725912028
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130977/
Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725902842
**[Test build #130977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130977/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716268740
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707478934
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707658147
**[Test build #129740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129740/testReport)** for PR 30009 at commit [`08cf27d`](https://github.com/apache/spark/commit/08cf27d5df69c6d5a5a8d448fb3a3043a6c474a0).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708651145
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708682659
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34362/
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725960732
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35587/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725952841
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725912019
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709713528
> Have you benchmarked on other BLAS implementations besides f2jBLAS?
@WeichenXu123 Both f2jBLAS and OpenBLAS were benchmarked, and the results are recorded in the Excel results file.
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706824192
ping @WeichenXu123
@zero323 I sent a new PR here, thanks for reviewing. I tried to verify the consistency of the annotations locally, but the following command failed:
```
mypy --no-incremental --config python/mypy.ini python/pyspark
python/pyspark/ml/linalg/__init__.pyi:25: error: misplaced type annotation
```
I installed `mypy` with `sudo apt install mypy` on Ubuntu 18.04.
I am not very familiar with `mypy`; do I need to configure it somewhere?
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707025789
@zero323 Yes, that is because the version installed via `sudo apt install mypy` is too old (`0.560`).
`pip install mypy` works for me. Thank you!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725937870
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35582/
Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724541792
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720876636
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706838987
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34257/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725912019
Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725832945
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-726033937
Thanks @WeichenXu123 @mengxr @zero323 for review!
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707654735
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34348/
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725815422
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716271105
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-723879005
@mengxr I will update this PR once the following are settled:
1. the name of the parameter;
2. the choice of default value: it looks like we can adopt 1MB for both the sparse and dense cases. If so, I will remove the logic to compute `avgNNZ` in `LinearSVC`.
Thanks!
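To make the 1MB figure concrete, here is a back-of-the-envelope sketch (an illustration, not code from this PR) of how many rows would fit in one block. The function name `rows_per_block` and the size estimate (8 bytes per value, plus 4 bytes per index for sparse rows) are assumptions.

```python
def rows_per_block(num_features: int, avg_nnz: float, block_size_mb: float = 1.0) -> int:
    """Estimate how many rows fit in a block of the given size (assumed encoding costs)."""
    budget = int(block_size_mb * 1024 * 1024)
    dense_bytes = 8 * num_features   # one 8-byte double per feature
    sparse_bytes = 12 * avg_nnz      # 8-byte value + 4-byte index per non-zero
    # Use the cheaper encoding, and guard against a zero-sized row.
    row_bytes = max(1.0, min(dense_bytes, sparse_bytes))
    return max(1, int(budget // row_bytes))
```

With 1000 features, this estimate gives about 131 rows per 1MB block for fully dense rows and about 8738 rows when each row has 10 non-zeros. The same size budget therefore yields very different row counts, which may be why a single default could serve both cases.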
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725976582
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712112004
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34608/
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724488915
**[Test build #130842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130842/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706839002
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725810555
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35564/
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707622010
**[Test build #129742 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129742/testReport)** for PR 30009 at commit [`9245263`](https://github.com/apache/spark/commit/92452631c11e71b71cb99a7b61d8b8661b150c17).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725914725
**[Test build #130981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130981/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724525731
**[Test build #130842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130842/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712119800
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707658672
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706846483
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707662881
[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r519784311
##########
File path: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala
##########
@@ -562,4 +562,22 @@ trait HasBlockSize extends Params {
/** @group expertGetParam */
final def getBlockSize: Int = $(blockSize)
}
+
+/**
+ * Trait for shared param blockSizeInMB (default: 0.0). This trait may be changed or
+ * removed between minor versions.
+ */
+trait HasBlockSizeInMB extends Params {
+
+ /**
+ * Param for Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0..
+ * @group expertParam
+ */
+ final val blockSizeInMB: DoubleParam = new DoubleParam(this, "blockSizeInMB", "Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0.", ParamValidators.gtEq(0.0))
Review comment:
> a block can exceed this size
It will only exceed the limit slightly, so it does not matter.
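To illustrate why a block can end up slightly over the configured limit, here is a minimal sketch, not the PR's implementation. It assumes the size check runs after each row is appended to the buffer, so the overshoot is bounded by the size of a single row. `Item`, `sizeInBytes`, `groupByMaxBytes` and `maxBytes` are hypothetical names used only for this illustration.

```scala
object BlockifySketch {
  // Hypothetical stand-in for an instance: only its estimated size matters here.
  final case class Item(id: Int, sizeInBytes: Long)

  // Groups items into blocks and closes a block once its accumulated size
  // reaches maxBytes. The check happens AFTER an item is appended, so a block
  // may overshoot maxBytes by less than the size of its last item.
  def groupByMaxBytes(items: Iterator[Item], maxBytes: Long): Iterator[Vector[Item]] = {
    require(maxBytes > 0, "maxBytes must be positive")
    new Iterator[Vector[Item]] {
      def hasNext: Boolean = items.hasNext

      def next(): Vector[Item] = {
        if (!items.hasNext) throw new NoSuchElementException("no more blocks")
        val buf = Vector.newBuilder[Item]
        var used = 0L
        while (used < maxBytes && items.hasNext) {
          val item = items.next()
          buf += item
          used += item.sizeInBytes
        }
        buf.result()
      }
    }
  }

  // Five items of 40 bytes each, grouped with a 100-byte limit.
  val blocks: Vector[Vector[Item]] =
    groupByMaxBytes(Iterator.tabulate(5)(i => Item(i, 40L)), maxBytes = 100L).toVector

  // Total size of each block, to compare against the limit.
  val blockSizes: Vector[Long] = blocks.map(_.map(_.sizeInBytes).sum)
}
```

With these inputs the first block holds three items (120 bytes, 20 over the limit) and the second holds the remaining two (80 bytes), which is the kind of small overshoot the comment refers to.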
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712119252
**[Test build #130001 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130001/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725848161
**[Test build #130969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130969/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725884551
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724409079
**[Test build #130822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130822/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707622010
**[Test build #129742 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129742/testReport)** for PR 30009 at commit [`9245263`](https://github.com/apache/spark/commit/92452631c11e71b71cb99a7b61d8b8661b150c17).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708862751
**[Test build #129786 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129786/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725937840
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35582/
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725816547
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707475324
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709943526
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709943505
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34492/
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724526178
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724437534
Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707470333
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34330/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708651145
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709903604
**[Test build #129886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129886/testReport)** for PR 30009 at commit [`c0a734d`](https://github.com/apache/spark/commit/c0a734de5e4d4df819caa4f86634242966d5786b).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708601573
**[Test build #129756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129756/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716271105
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725848161
**[Test build #130969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130969/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724488915
**[Test build #130842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130842/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725914725
**[Test build #130981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130981/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725806330
**[Test build #130960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130960/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706839002
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716270883
**[Test build #130251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130251/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r503843833
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,85 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ new Iterator[InstanceBlock] {
+ private var numCols = -1L
+ private val buff = mutable.ArrayBuilder.make[Instance]
+ private var buffCnt = 0L
+ private var buffNnz = 0L
+ private var buffUnitWeight = true
+ private var block = Option.empty[InstanceBlock]
Review comment:
private var block: Option[InstanceBlock] = None
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,85 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ new Iterator[InstanceBlock] {
+ private var numCols = -1L
+ private val buff = mutable.ArrayBuilder.make[Instance]
+ private var buffCnt = 0L
+ private var buffNnz = 0L
+ private var buffUnitWeight = true
+ private var block = Option.empty[InstanceBlock]
+
+ private def flush(): Unit = {
+ block = Some(InstanceBlock.fromInstances(buff.result()))
+ buff.clear()
+ buffCnt = 0L
+ buffNnz = 0L
+ buffUnitWeight = true
+ }
+
+ private def blockify(): Unit = {
+ block = None
+
+ while (block.isEmpty && iterator.hasNext) {
+ val instance = iterator.next()
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val nnz = instance.features.numNonzeros
+
+ // Check if enough memory remains to add this instance to the block.
+ if (getBlockMemUsage(numCols, buffCnt + 1L, buffNnz + nnz,
+ buffUnitWeight && (instance.weight == 1)) > maxMemUsage) {
+ // Check if this instance is too large
+ require(buffCnt > 0, s"instance $instance exceeds memory limit $maxMemUsage, " +
+ s"please increase block size")
+ flush()
+ }
+
+ buff += instance
+ buffCnt += 1L
+ buffNnz += nnz
+ buffUnitWeight &&= (instance.weight == 1)
Review comment:
After `flush()`, `buffCnt`/`buffNnz` are reset to 0, but then you increment them for the current instance and exit the loop. So for the next batch, the initial `buffCnt`/`buffNnz` won't be 0.
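The carry-over described in this comment can be illustrated with a small standalone sketch. It is not the PR's code: `Row`, `memUsage` (12 bytes per non-zero) and `groupByMemUsage` are hypothetical stand-ins for `Instance`, `getBlockMemUsage` and `blokifyWithMaxMemUsage`. It shows one way the counters can legitimately be non-zero at the start of the next batch, because the instance that triggered the flush is kept as the first element of the following block.

```scala
object BlockifySketch {
  // Hypothetical stand-in for Spark's Instance: only the non-zero count matters here.
  final case class Row(nnz: Long)

  // Assumed cost model (not Spark's getBlockMemUsage): 12 bytes per non-zero.
  def memUsage(totalNnz: Long): Long = totalNnz * 12L

  def groupByMemUsage(rows: Iterator[Row], maxMemUsage: Long): Iterator[Seq[Row]] =
    new Iterator[Seq[Row]] {
      private val buff = scala.collection.mutable.ArrayBuffer.empty[Row]
      private var buffNnz = 0L

      // A carried-over row counts as pending output even if the input is exhausted.
      override def hasNext: Boolean = buff.nonEmpty || rows.hasNext

      override def next(): Seq[Row] = {
        if (!hasNext) throw new NoSuchElementException("no more blocks")
        var block: Option[Seq[Row]] = None
        while (block.isEmpty && rows.hasNext) {
          val row = rows.next()
          if (buff.nonEmpty && memUsage(buffNnz + row.nnz) > maxMemUsage) {
            // Emit the full buffer and restart the counters from zero.
            block = Some(buff.toList)
            buff.clear()
            buffNnz = 0L
          }
          // The row that triggered the flush starts the next block,
          // so the counters include it when the loop exits.
          buff += row
          buffNnz += row.nnz
        }
        block.getOrElse {
          // Input exhausted: emit whatever is left in the buffer.
          val rest = buff.toList
          buff.clear()
          buffNnz = 0L
          rest
        }
      }
    }
}
```

With five rows of one non-zero each and a limit of 24 bytes, this sketch yields blocks of 2, 2 and 1 rows. A single row that exceeds the limit on its own is emitted as a one-row block rather than rejected, which is a design choice of the sketch and differs from the `require` in the diff above.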
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725976582
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724527891
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35451/
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708601573
**[Test build #129756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129756/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).
[GitHub] [spark] WeichenXu123 commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709705904
@mengxr Do you want to take a look?
[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r518645766
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ var numCols = -1L
+ val buff = mutable.ArrayBuilder.make[Instance]
+ var buffCnt = 0L
+ var buffNnz = 0L
+ var buffUnitWeight = true
+
+ iterator.flatMap { instance =>
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val nnz = instance.features.numNonzeros
+ buff += instance
+ buffCnt += 1L
+ buffNnz += nnz
+ buffUnitWeight &&= (instance.weight == 1)
+
+ if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ buff.clear()
+ buffCnt = 0L
+ buffNnz = 0L
+ buffUnitWeight = true
+ Iterator.single(block)
+ } else Iterator.empty
+ } ++ {
+ if (buffCnt > 0) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ Iterator.single(block)
+ } else Iterator.empty
+ }
+ }
+
+ def blokifyWithMaxMemUsage(
+ instances: RDD[Instance],
+ maxMemUsage: Long): RDD[InstanceBlock] = {
+ require(maxMemUsage > 0)
+ instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+ }
+
+
+ /**
+ * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+ *
+ * @param dim size of vector.
+ * @param avgNNZ average nnz of vectors.
+ * @param blasLevel level of BLAS operation.
+ */
+ def inferBlockSizeInMB(
+ dim: Int,
+ avgNNZ: Double,
+ blasLevel: Int = 2): Double = {
+ if (dim <= avgNNZ * 3) {
+ // When the dataset is relatively dense, Spark will use netlib-java for optimised numerical
+ // processing, which will try to use nativeBLAS implementations (like OpenBLAS, Intel MKL),
+ // and fallback to the Java implementation (f2jBLAS) if necessary.
+ // The suggested value for dense cases is 0.25.
+ 0.25
Review comment:
We may also change it to 1.0 for dense cases (to use 1.0 as the default value for all cases); the speedup at 1.0MB is only a little lower than that at 0.25MB.
There was [another performance test](https://issues.apache.org/jira/browse/SPARK-31714) on the implementations of prediction in training, which may be worth referring to:
```
test("performance: gemv vs foreachNonZero(std)") {
  for (numRows <- Seq(16, 64, 256, 1024, 4096); numCols <- Seq(16, 64, 256, 1024, 4096)) {
    val rng = new Random(123)
    val matrix = Matrices.dense(numRows, numCols,
      Array.fill(numRows * numCols)(rng.nextDouble)).toDense
    val vectors = matrix.rowIter.toArray
    val coefVec = Vectors.dense(Array.fill(numCols)(rng.nextDouble))
    val coefArr = coefVec.toArray
    val stdVec = Vectors.dense(Array.fill(numCols)(rng.nextDouble))
    val stdArr = stdVec.toArray

    // Blocked path: one matrix-vector multiplication (gemv) per iteration.
    val start1 = System.nanoTime
    Seq.range(0, 100).foreach { _ => matrix.multiply(coefVec) }
    val dur1 = System.nanoTime - start1

    // Row-by-row path: a dot product over the non-zeros of each vector.
    val start2 = System.nanoTime
    Seq.range(0, 100).foreach { _ =>
      vectors.map { vector =>
        var sum = 0.0
        vector.foreachNonZero { (i, v) =>
          val std = stdArr(i)
          if (std != 0) sum += coefArr(i) * v
        }
        sum
      }
    }
    val dur2 = System.nanoTime - start2

    println(s"numRows=$numRows, numCols=$numCols, gemv: $dur1, foreachNonZero(std): $dur2, " +
      s"foreachNonZero(std)/gemv: ${dur2.toDouble / dur1}")
  }
}
```
output:
```
numRows=16, numCols=16, gemv: 543897, foreachNonZero(std): 4683864, foreachNonZero(std)/gemv: 8.611674636925741
numRows=16, numCols=64, gemv: 274878, foreachNonZero(std): 2996356, foreachNonZero(std)/gemv: 10.90067593623353
numRows=16, numCols=256, gemv: 771816, foreachNonZero(std): 9081260, foreachNonZero(std)/gemv: 11.76609450957223
numRows=16, numCols=1024, gemv: 1537698, foreachNonZero(std): 23386693, foreachNonZero(std)/gemv: 15.208898626388276
numRows=16, numCols=4096, gemv: 5577804, foreachNonZero(std): 87389503, foreachNonZero(std)/gemv: 15.667367121541023
numRows=64, numCols=16, gemv: 173518, foreachNonZero(std): 1384669, foreachNonZero(std)/gemv: 7.979973259258405
numRows=64, numCols=64, gemv: 313941, foreachNonZero(std): 4403461, foreachNonZero(std)/gemv: 14.026396679630887
numRows=64, numCols=256, gemv: 981895, foreachNonZero(std): 19443231, foreachNonZero(std)/gemv: 19.801741530408037
numRows=64, numCols=1024, gemv: 3908960, foreachNonZero(std): 88985415, foreachNonZero(std)/gemv: 22.764473159101144
numRows=64, numCols=4096, gemv: 16075758, foreachNonZero(std): 366740675, foreachNonZero(std)/gemv: 22.81327418588909
numRows=256, numCols=16, gemv: 329479, foreachNonZero(std): 5341171, foreachNonZero(std)/gemv: 16.210960334346044
numRows=256, numCols=64, gemv: 948949, foreachNonZero(std): 17126600, foreachNonZero(std)/gemv: 18.047966750584067
numRows=256, numCols=256, gemv: 3947789, foreachNonZero(std): 81207071, foreachNonZero(std)/gemv: 20.570266293360664
numRows=256, numCols=1024, gemv: 14635992, foreachNonZero(std): 350779742, foreachNonZero(std)/gemv: 23.96692632791819
numRows=256, numCols=4096, gemv: 71265609, foreachNonZero(std): 1423000813, foreachNonZero(std)/gemv: 19.967566866649523
numRows=1024, numCols=16, gemv: 916645, foreachNonZero(std): 19432942, foreachNonZero(std)/gemv: 21.200074183571612
numRows=1024, numCols=64, gemv: 3479825, foreachNonZero(std): 66857430, foreachNonZero(std)/gemv: 19.212871336920678
numRows=1024, numCols=256, gemv: 13680423, foreachNonZero(std): 312189763, foreachNonZero(std)/gemv: 22.82018348409256
numRows=1024, numCols=1024, gemv: 68880268, foreachNonZero(std): 1401019163, foreachNonZero(std)/gemv: 20.339920323771096
numRows=1024, numCols=4096, gemv: 293455450, foreachNonZero(std): 5744847994, foreachNonZero(std)/gemv: 19.576559215376644
numRows=4096, numCols=16, gemv: 3714086, foreachNonZero(std): 82488401, foreachNonZero(std)/gemv: 22.20960984748334
numRows=4096, numCols=64, gemv: 14273712, foreachNonZero(std): 279946980, foreachNonZero(std)/gemv: 19.612766461870606
numRows=4096, numCols=256, gemv: 70000687, foreachNonZero(std): 1311574476, foreachNonZero(std)/gemv: 18.73659434228124
numRows=4096, numCols=1024, gemv: 289944283, foreachNonZero(std): 5695483201, foreachNonZero(std)/gemv: 19.643371278336257
numRows=4096, numCols=4096, gemv: 1169773295, foreachNonZero(std): 23019445987, foreachNonZero(std)/gemv: 19.678553173843827
- performance: gemv vs foreachNonZero(std)
```
The new implementations are based on `gemv`, while the implementations in branch-3.0 are based on `foreachNonZero(std)`.
I think that a blockSize larger than 64X256 may be acceptable.
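To relate a "64X256" block to a size in MB, the arithmetic can be sketched as follows. This assumes "64X256" means 64 rows by 256 columns of dense `Double` values at 8 bytes each and ignores JVM object overhead; `DenseBlockSize`, `denseBytes` and `toMB` are illustrative names, not Spark APIs.

```scala
object DenseBlockSize {
  // Bytes needed for the values of a dense numRows x numCols matrix of Doubles
  // (8 bytes each); object headers and other JVM overhead are ignored.
  def denseBytes(numRows: Long, numCols: Long): Long = numRows * numCols * 8L

  // Converts bytes to MB, taking 1 MB = 1024 * 1024 bytes.
  def toMB(bytes: Long): Double = bytes.toDouble / (1024 * 1024)

  def main(args: Array[String]): Unit = {
    println(toMB(denseBytes(64, 256)))    // 0.125
    println(toMB(denseBytes(1024, 1024))) // 8.0
  }
}
```

Under these assumptions a 64X256 block holds 0.125 MB of values, so the 0.25 MB value suggested for dense data is twice that size.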
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708884988
[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r518485738
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ var numCols = -1L
+ val buff = mutable.ArrayBuilder.make[Instance]
+ var buffCnt = 0L
+ var buffNnz = 0L
+ var buffUnitWeight = true
+
+ iterator.flatMap { instance =>
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val nnz = instance.features.numNonzeros
+ buff += instance
+ buffCnt += 1L
+ buffNnz += nnz
+ buffUnitWeight &&= (instance.weight == 1)
+
+ if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ buff.clear()
+ buffCnt = 0L
+ buffNnz = 0L
+ buffUnitWeight = true
+ Iterator.single(block)
+ } else Iterator.empty
+ } ++ {
+ if (buffCnt > 0) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ Iterator.single(block)
+ } else Iterator.empty
+ }
+ }
+
+ def blokifyWithMaxMemUsage(
+ instances: RDD[Instance],
+ maxMemUsage: Long): RDD[InstanceBlock] = {
+ require(maxMemUsage > 0)
+ instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+ }
+
+
+ /**
+ * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+ *
+ * @param dim size of vector.
+ * @param avgNNZ average nnz of vectors.
+ * @param blasLevel level of BLAS operation.
+ */
+ def inferBlockSizeInMB(
+ dim: Int,
+ avgNNZ: Double,
+ blasLevel: Int = 2): Double = {
+ if (dim <= avgNNZ * 3) {
+ // When the dataset is relatively dense, Spark will use netlib-java for optimised numerical
+ // processing, which will try to use nativeBLAS implementations (like OpenBLAS, Intel MKL),
+ // and fallback to the Java implementation (f2jBLAS) if necessary.
+ // The suggested value for dense cases is 0.25.
+ 0.25
+ } else {
+ // When the dataset is sparse, Spark will use its own Scala implementation.
+ // The suggested value for sparse cases is 64.0.
+ 64.0
Review comment:
I agree that 64MB will be too big for k-means with a large `k`. For k-means and multi-class logistic regression, I added a `blasLevel` param; maybe we also need to add a param `k`. But for now we may leave it alone. I agree that we can use a conservative value of 1MB here.
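A minimal sketch of the heuristic with the conservative value agreed on here, assuming the density rule from the diff stays unchanged and only the sparse default moves from 64.0 to 1.0. It is an illustration of this comment, not the merged code; `InferBlockSizeSketch` is a hypothetical name and the `blasLevel` parameter is left out.

```scala
object InferBlockSizeSketch {
  // Suggested block size in MB, following the density rule quoted in the diff.
  def inferBlockSizeInMB(dim: Int, avgNNZ: Double): Double =
    if (dim <= avgNNZ * 3) {
      // Relatively dense: at least a third of the entries are non-zero on average.
      0.25
    } else {
      // Sparse: the conservative 1 MB discussed in this thread, instead of 64.0.
      1.0
    }
}
```

For example, `inferBlockSizeInMB(100, 50.0)` falls in the dense branch and `inferBlockSizeInMB(100000, 10.0)` in the sparse branch.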
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725831455
**[Test build #130960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130960/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `trait HasMaxBlockSizeInMB extends Params `
* `class HasMaxBlockSizeInMB(Params):`
[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r506018249
##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
##########
@@ -199,14 +193,11 @@ class LinearSVC @Since("2.2.0") (
instr.logNamedValue("lowestLabelWeight", labelSummarizer.histogram.min.toString)
instr.logNamedValue("highestLabelWeight", labelSummarizer.histogram.max.toString)
instr.logSumOfWeights(summarizer.weightSum)
- if ($(blockSize) > 1) {
- val scale = 1.0 / summarizer.count / numFeatures
- val sparsity = 1 - summarizer.numNonzeros.toArray.map(_ * scale).sum
- instr.logNamedValue("sparsity", sparsity.toString)
- if (sparsity > 0.5) {
- instr.logWarning(s"sparsity of input dataset is $sparsity, " +
- s"which may hurt performance in high-level BLAS.")
- }
+ if (actualBlockSizeInMB == 0) {
+ val avgNNZ = summarizer.numNonzeros.activeIterator.map(_._2 / summarizer.count).sum
Review comment:
Will the additional summarizer consume extra time?
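For reference, the `avgNNZ` expression in the diff can be read as "sum of per-feature non-zero counts, divided by the row count". The sketch below reproduces that arithmetic on a plain array, together with the `dim <= avgNNZ * 3` density rule from `inferBlockSizeInMB`. `AvgNnzSketch` and its method names are hypothetical, and the sketch says nothing about the cost of the summarizer pass itself.

```scala
object AvgNnzSketch {
  // nnzPerFeature(j) = number of rows whose feature j is non-zero.
  // Dividing each count by numRows and summing gives the average
  // number of non-zeros per row.
  def avgNnz(nnzPerFeature: Array[Double], numRows: Long): Double =
    nnzPerFeature.iterator.map(_ / numRows).sum

  // Density rule from the diff: the data is treated as relatively dense
  // when dim <= avgNNZ * 3.
  def isRelativelyDense(dim: Int, avgNnz: Double): Boolean = dim <= avgNnz * 3
}
```

With 4 rows and per-feature counts `Array(4.0, 2.0, 0.0, 2.0)`, the average is 2.0 non-zeros per row, so a 4-dimensional dataset counts as relatively dense under this rule (4 <= 6), while a 100-dimensional one does not.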
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,62 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ new Iterator[InstanceBlock] {
+ private var numCols = -1L
+ private val buff = mutable.ArrayBuilder.make[Instance]
+
+ override def hasNext: Boolean = iterator.hasNext
+
+ override def next(): InstanceBlock = {
+ buff.clear()
+ var buffCnt = 0L
+ var buffNnz = 0L
+ var buffUnitWeight = true
+ var blockMemUsage = 0L
+
+ while (iterator.hasNext && blockMemUsage < maxMemUsage) {
+ val instance = iterator.next()
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val nnz = instance.features.numNonzeros
+
+ buff += instance
+ buffCnt += 1L
+ buffNnz += nnz
+ buffUnitWeight &&= (instance.weight == 1)
+ blockMemUsage = getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight)
+ }
+
+ // the block mem usage may slightly exceed threshold, not a big issue.
+ // and this ensure even if one row exceed block limit, each block has one row
+ InstanceBlock.fromInstances(buff.result())
+ }
+ }
+ }
+
+ def blokifyWithMaxMemUsage(
+ instances: RDD[Instance],
+ maxMemUsage: Long): RDD[InstanceBlock] = {
+ require(maxMemUsage > 0)
+ instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+ }
+
+ def inferBlockSizeInMB(
+ dim: Int,
+ avgNNZ: Double,
+ blasLevel: Int = 2): Double = {
+ if (dim <= avgNNZ * 3) {
+ 0.25
+ } else {
+ 64.0
+ }
Review comment:
Could you document why these values were chosen?
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725937860
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708884988
[GitHub] [spark] mengxr commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
mengxr commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r517854961
##########
File path: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala
##########
@@ -562,4 +562,22 @@ trait HasBlockSize extends Params {
/** @group expertGetParam */
final def getBlockSize: Int = $(blockSize)
}
+
+/**
+ * Trait for shared param blockSizeInMB (default: 0.0). This trait may be changed or
+ * removed between minor versions.
+ */
+trait HasBlockSizeInMB extends Params {
+
+ /**
+ * Param for Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0..
+ * @group expertParam
+ */
+ final val blockSizeInMB: DoubleParam = new DoubleParam(this, "blockSizeInMB", "Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0.", ParamValidators.gtEq(0.0))
Review comment:
Shall we call it `maxBlockSizeMB`? The current name suggests that we try to match the block size. Calling it `maxBlockSizeMB` would leave us some space for optimization.
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ var numCols = -1L
+ val buff = mutable.ArrayBuilder.make[Instance]
+ var buffCnt = 0L
+ var buffNnz = 0L
+ var buffUnitWeight = true
+
+ iterator.flatMap { instance =>
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val nnz = instance.features.numNonzeros
+ buff += instance
+ buffCnt += 1L
+ buffNnz += nnz
+ buffUnitWeight &&= (instance.weight == 1)
+
+ if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ buff.clear()
+ buffCnt = 0L
+ buffNnz = 0L
+ buffUnitWeight = true
+ Iterator.single(block)
+ } else Iterator.empty
+ } ++ {
+ if (buffCnt > 0) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ Iterator.single(block)
+ } else Iterator.empty
+ }
+ }
+
+ def blokifyWithMaxMemUsage(
+ instances: RDD[Instance],
+ maxMemUsage: Long): RDD[InstanceBlock] = {
+ require(maxMemUsage > 0)
+ instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+ }
+
+
+ /**
+ * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+ *
+ * @param dim size of vector.
+ * @param avgNNZ average nnz of vectors.
+ * @param blasLevel level of BLAS operation.
+ */
+ def inferBlockSizeInMB(
+ dim: Int,
+ avgNNZ: Double,
+ blasLevel: Int = 2): Double = {
Review comment:
`blasLevel` is not used.
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ var numCols = -1L
+ val buff = mutable.ArrayBuilder.make[Instance]
+ var buffCnt = 0L
+ var buffNnz = 0L
+ var buffUnitWeight = true
+
+ iterator.flatMap { instance =>
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val nnz = instance.features.numNonzeros
+ buff += instance
+ buffCnt += 1L
+ buffNnz += nnz
+ buffUnitWeight &&= (instance.weight == 1)
+
+ if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ buff.clear()
+ buffCnt = 0L
+ buffNnz = 0L
+ buffUnitWeight = true
+ Iterator.single(block)
+ } else Iterator.empty
+ } ++ {
+ if (buffCnt > 0) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ Iterator.single(block)
+ } else Iterator.empty
+ }
+ }
+
+ def blokifyWithMaxMemUsage(
+ instances: RDD[Instance],
+ maxMemUsage: Long): RDD[InstanceBlock] = {
+ require(maxMemUsage > 0)
+ instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+ }
+
+
+ /**
+ * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+ *
+ * @param dim size of vector.
+ * @param avgNNZ average nnz of vectors.
+ * @param blasLevel level of BLAS operation.
+ */
+ def inferBlockSizeInMB(
+ dim: Int,
+ avgNNZ: Double,
+ blasLevel: Int = 2): Double = {
+ if (dim <= avgNNZ * 3) {
+ // When the dataset is relatively dense, Spark will use netlib-java for optimised numerical
+ // processing, which will try to use nativeBLAS implementations (like OpenBLAS, Intel MKL),
+ // and fallback to the Java implementation (f2jBLAS) if necessary.
+ // The suggested value for dense cases is 0.25.
+ 0.25
+ } else {
+ // When the dataset is sparse, Spark will use its own Scala implementation.
+ // The suggested value for sparse cases is 64.0.
+ 64.0
Review comment:
It's a little surprising to see the default suddenly jump from 0.25MB to 64MB. This is very risky because 64MB of sparse data could generate a much bigger dense result, e.g., in multi-class logistic regression or k-means, if we eventually blockify their implementations. In your benchmark, it seems we start observing speed-ups at 1MB. I would be very conservative here.
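The risk can be made concrete with a rough estimate. The sketch assumes a sparse row costs about 12 bytes per non-zero (an 8-byte value plus a 4-byte index) and that a blockified algorithm produces one dense `Double` per row and per class or cluster. Both the cost model and the names `DenseOutputEstimate`, `rowsInBlock` and `denseOutputBytes` are assumptions for illustration, not Spark's accounting.

```scala
object DenseOutputEstimate {
  // Assumed sparse cost: 8-byte value + 4-byte column index per non-zero.
  val BytesPerNnz = 12L

  // Rows that fit in a sparse block of the given size, ignoring per-row overhead.
  def rowsInBlock(blockBytes: Long, avgNnz: Long): Long =
    blockBytes / (avgNnz * BytesPerNnz)

  // Bytes of a dense numRows x k Double matrix, e.g. per-class margins
  // or distances to k cluster centres.
  def denseOutputBytes(numRows: Long, k: Long): Long = numRows * k * 8L
}
```

Under these assumptions, a 64MB block with 10 non-zeros per row holds 559240 rows, and with `k = 100` the dense result takes 447392000 bytes, about 427MB. A 1MB block holds 8738 rows and yields about 6.7MB for the same `k`.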
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ var numCols = -1L
+ val buff = mutable.ArrayBuilder.make[Instance]
+ var buffCnt = 0L
+ var buffNnz = 0L
+ var buffUnitWeight = true
+
+ iterator.flatMap { instance =>
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val nnz = instance.features.numNonzeros
+ buff += instance
+ buffCnt += 1L
+ buffNnz += nnz
+ buffUnitWeight &&= (instance.weight == 1)
+
+ if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ buff.clear()
+ buffCnt = 0L
+ buffNnz = 0L
+ buffUnitWeight = true
+ Iterator.single(block)
+ } else Iterator.empty
+ } ++ {
+ if (buffCnt > 0) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ Iterator.single(block)
+ } else Iterator.empty
+ }
+ }
+
+ def blokifyWithMaxMemUsage(
+ instances: RDD[Instance],
+ maxMemUsage: Long): RDD[InstanceBlock] = {
+ require(maxMemUsage > 0)
+ instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+ }
+
+
+ /**
+ * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
Review comment:
Could you link to the JIRA or this PR that has the performance tests?
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemUsage(
+ iterator: Iterator[Instance],
+ maxMemUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemUsage > 0)
+
+ var numCols = -1L
+ val buff = mutable.ArrayBuilder.make[Instance]
+ var buffCnt = 0L
+ var buffNnz = 0L
+ var buffUnitWeight = true
+
+ iterator.flatMap { instance =>
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val nnz = instance.features.numNonzeros
+ buff += instance
+ buffCnt += 1L
+ buffNnz += nnz
+ buffUnitWeight &&= (instance.weight == 1)
+
+ if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ buff.clear()
+ buffCnt = 0L
+ buffNnz = 0L
+ buffUnitWeight = true
+ Iterator.single(block)
+ } else Iterator.empty
+ } ++ {
+ if (buffCnt > 0) {
+ val block = InstanceBlock.fromInstances(buff.result())
+ Iterator.single(block)
+ } else Iterator.empty
+ }
+ }
+
+ def blokifyWithMaxMemUsage(
+ instances: RDD[Instance],
+ maxMemUsage: Long): RDD[InstanceBlock] = {
+ require(maxMemUsage > 0)
+ instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+ }
+
+
+ /**
+ * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+ *
+ * @param dim size of vector.
+ * @param avgNNZ average nnz of vectors.
+ * @param blasLevel level of BLAS operation.
+ */
+ def inferBlockSizeInMB(
+ dim: Int,
+ avgNNZ: Double,
+ blasLevel: Int = 2): Double = {
+ if (dim <= avgNNZ * 3) {
+ // When the dataset is relatively dense, Spark will use netlib-java for optimised numerical
+ // processing, which will try to use nativeBLAS implementations (like OpenBLAS, Intel MKL),
+ // and fallback to the Java implementation (f2jBLAS) if necessary.
+ // The suggested value for dense cases is 0.25.
+ 0.25
Review comment:
We need more comments to explain how this default value was picked. 0.25 MB is roughly a 180x180 matrix of doubles, which seems quite small to me.
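The "180x180" figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes 1 MB means 1024 * 1024 bytes and 8 bytes per double, and the object and function names are made up for the example.

```scala
object BlockSizeArithmetic {
  val bytesPerDouble = 8L

  // Number of doubles that fit in a block of the given size (1 MB = 1024 * 1024 bytes).
  def doublesInMB(sizeInMB: Double): Long =
    (sizeInMB * 1024 * 1024 / bytesPerDouble).toLong

  // Side length of the largest square matrix of doubles that fits in the block.
  def squareSide(sizeInMB: Double): Int =
    math.sqrt(doublesInMB(sizeInMB).toDouble).toInt
}
```

With these assumptions, `doublesInMB(0.25)` is 32768 and `squareSide(0.25)` is 181, which matches the rough estimate in the comment.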
[GitHub] [spark] zhengruifeng edited a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-722762550
@mengxr Thanks for reviewing!
> Does your benchmark code count pre-processing time?
yes, pre-processing time is taken into account.
> Could you paste your benchmark code and environment specs?
Dataset: [Epsilon](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2)
numInstances: 100,000; numFeatures: 2,000
env: ubuntu 18.04
cmd: bin/spark-shell --driver-memory=64G --conf spark.driver.maxResultSize=10g
code:
```
import scala.util.Random
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.regression._
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel
val df = spark.read.option("numFeatures", "2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t").withColumn("aftcensor", (col("label")+1)/2).withColumn("aftlabel", (col("label")+2)/2).withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count
def getSparseUDF(dim: Int) = {
val rng = new Random(123)
val newIndices = rng.shuffle(Seq.range(0, dim)).take(2000).toArray.sorted
udf { vec: Vector =>
Vectors.sparse(dim, newIndices, vec.toArray).compressed
}
}
new LinearSVC().setMaxIter(20).fit(df)
val svc = new LinearSVC().setMaxIter(100).setTol(0)
for (dim <- Seq(2000, 3000, 4000, 5000, 10000, 20000, 200000); size <- Seq(0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0)) {
Thread.sleep(60000)
val ds = if (dim == 2000) { df } else { val sparseUDF = getSparseUDF(dim); df.withColumn("features", sparseUDF(col("features"))) }
val start = System.currentTimeMillis
val model = svc.setBlockSizeInMB(size).fit(ds)
val end = System.currentTimeMillis
println((model.uid, dim, size, end - start, model.coefficients.toString.take(100)))
}
// for branch-3.0
for (dim <- Seq(2000, 3000, 4000, 5000, 10000, 20000, 200000)) {
Thread.sleep(60000)
val ds = if (dim == 2000) { df } else { val sparseUDF = getSparseUDF(dim); df.withColumn("features", sparseUDF(col("features"))) }
val start = System.currentTimeMillis
val model = svc.fit(ds)
val end = System.currentTimeMillis
println((model.uid, dim, -1, end - start, model.coefficients.toString.take(100)))
}
```
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708689240
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725832945
Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725952841
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707460083
**[Test build #129724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129724/testReport)** for PR 30009 at commit [`9cd1053`](https://github.com/apache/spark/commit/9cd10535b962b18421a6a21a79564fe6e7fae157).
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-717647271
also ping @srowen
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712053130
**[Test build #130001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130001/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708849480
retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707666392
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725871798
**[Test build #130969 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130969/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `trait HasMaxBlockSizeInMB extends Params `
* `class HasMaxBlockSizeInMB(Params):`
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725884551
Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725795212
**[Test build #130958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130958/testReport)** for PR 30009 at commit [`a82e5f5`](https://github.com/apache/spark/commit/a82e5f51722309afdc8b746d032f631b9421d64f).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725832933
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35566/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707654611
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707475316
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34330/
[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r506021907
##########
File path: mllib/src/test/scala/org/apache/spark/ml/feature/InstanceSuite.scala
##########
@@ -74,4 +74,36 @@ class InstanceSuite extends SparkFunSuite{
}
}
+ test("InstanceBlock: blokify with max memory usage") {
+ val instance1 = Instance(19.0, 2.0, Vectors.dense(1.0, 7.0))
+ val instance2 = Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse)
+ val instances = Seq(instance1, instance2)
+
+ val blocks = InstanceBlock
+ .blokifyWithMaxMemUsage(Iterator.apply(instance1, instance2), 128).toArray
+ require(blocks.length == 1)
+ val block = blocks.head
+ assert(block.size === 2)
+ assert(block.numFeatures === 2)
+ block.instanceIterator.zipWithIndex.foreach {
+ case (instance, i) =>
+ assert(instance.label === instances(i).label)
+ assert(instance.weight === instances(i).weight)
+ assert(instance.features.toArray === instances(i).features.toArray)
+ }
+ Seq(0, 1).foreach { i =>
+ val nzIter = block.getNonZeroIter(i)
+ val vec = Vectors.sparse(2, nzIter.toSeq)
+ assert(vec.toArray === instances(i).features.toArray)
+ }
+
+ // instances larger than maxMemUsage
+ val bigInstance = Instance(-1.0, 2.0, Vectors.dense(Array.fill(10000)(1.0)))
+ InstanceBlock.blokifyWithMaxMemUsage(Iterator.fill(10)(bigInstance), 64).size
+
+ // different numFeatures
+ intercept[IllegalArgumentException] {
+ InstanceBlock.blokifyWithMaxMemUsage(Iterator.apply(instance1, bigInstance), 64).size
+ }
+ }
Review comment:
Add a test:
* Generate a mixed list of sparse and dense instances (a list in which some segments are dense and others are very sparse), and verify that each block's size does not exceed the blockMem limit by too much (for example: (actual block mem size) / config <= 1.1?).
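A self-contained sketch of what such a check could look like is given below. It does not use Spark's `InstanceBlock` API: `Row`, `BlockifySketch`, `estimateMemUsage`, `blockUsage` and `mixedRows` are hypothetical stand-ins, and the cost model (dense at 8 bytes per cell, or sparse at 12 bytes per non-zero plus a 4-byte row pointer, whichever is smaller) is an assumption rather than the PR's `getBlockMemUsage`. Only the greedy grouping mirrors the loop in the diff.

```scala
// Simplified stand-in for an instance: only what the size estimate needs.
final case class Row(numCols: Int, nnz: Int)

object BlockifySketch {
  // Hypothetical cost model (NOT Spark's getBlockMemUsage): a block is stored
  // dense (8 bytes per cell) or sparse (12 bytes per non-zero plus a 4-byte
  // row pointer per row), whichever is smaller.
  def estimateMemUsage(numCols: Long, numRows: Long, nnz: Long): Long = {
    val dense = 8L * numCols * numRows
    val sparse = 12L * nnz + 4L * (numRows + 1L)
    math.min(dense, sparse)
  }

  // Estimated usage of an already-built, non-empty block.
  def blockUsage(block: Seq[Row]): Long =
    estimateMemUsage(block.head.numCols.toLong, block.size.toLong,
      block.map(_.nnz.toLong).sum)

  // Greedy grouping: append rows until the estimate reaches the limit, then emit.
  def blockify(rows: Seq[Row], maxMemUsage: Long): Seq[Seq[Row]] = {
    require(maxMemUsage > 0)
    val blocks = Seq.newBuilder[Seq[Row]]
    var buff = Vector.empty[Row]
    var nnz = 0L
    rows.foreach { row =>
      buff = buff :+ row
      nnz += row.nnz
      if (estimateMemUsage(row.numCols.toLong, buff.size.toLong, nnz) >= maxMemUsage) {
        blocks += buff
        buff = Vector.empty[Row]
        nnz = 0L
      }
    }
    // Flush whatever is left over as a final, possibly smaller block.
    if (buff.nonEmpty) blocks += buff
    blocks.result()
  }

  // Mixed input: two dense segments surrounding a very sparse segment.
  def mixedRows(numCols: Int): Seq[Row] =
    Seq.fill(200)(Row(numCols, numCols)) ++
      Seq.fill(5000)(Row(numCols, 2)) ++
      Seq.fill(200)(Row(numCols, numCols))
}
```

A test could then assert that `blockUsage(block).toDouble / limit <= 1.1` for every block produced from `mixedRows(100)` with a limit such as 65536 bytes. Note that a bound like 1.1 can only hold when a single row is small relative to the limit; an instance larger than the limit necessarily ends up in a block that exceeds it.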
##########
File path: mllib/src/test/scala/org/apache/spark/ml/feature/InstanceSuite.scala
##########
@@ -74,4 +74,36 @@ class InstanceSuite extends SparkFunSuite{
}
}
+ test("InstanceBlock: blokify with max memory usage") {
+ val instance1 = Instance(19.0, 2.0, Vectors.dense(1.0, 7.0))
+ val instance2 = Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse)
+ val instances = Seq(instance1, instance2)
+
+ val blocks = InstanceBlock
+ .blokifyWithMaxMemUsage(Iterator.apply(instance1, instance2), 128).toArray
+ require(blocks.length == 1)
+ val block = blocks.head
+ assert(block.size === 2)
+ assert(block.numFeatures === 2)
+ block.instanceIterator.zipWithIndex.foreach {
+ case (instance, i) =>
+ assert(instance.label === instances(i).label)
+ assert(instance.weight === instances(i).weight)
+ assert(instance.features.toArray === instances(i).features.toArray)
+ }
+ Seq(0, 1).foreach { i =>
+ val nzIter = block.getNonZeroIter(i)
+ val vec = Vectors.sparse(2, nzIter.toSeq)
+ assert(vec.toArray === instances(i).features.toArray)
+ }
+
+ // instances larger than maxMemUsage
+ val bigInstance = Instance(-1.0, 2.0, Vectors.dense(Array.fill(10000)(1.0)))
+ InstanceBlock.blokifyWithMaxMemUsage(Iterator.fill(10)(bigInstance), 64).size
Review comment:
Verify that each block contains 1 row.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725884545
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35575/
[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r506032899
##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
##########
@@ -199,14 +193,11 @@ class LinearSVC @Since("2.2.0") (
instr.logNamedValue("lowestLabelWeight", labelSummarizer.histogram.min.toString)
instr.logNamedValue("highestLabelWeight", labelSummarizer.histogram.max.toString)
instr.logSumOfWeights(summarizer.weightSum)
- if ($(blockSize) > 1) {
- val scale = 1.0 / summarizer.count / numFeatures
- val sparsity = 1 - summarizer.numNonzeros.toArray.map(_ * scale).sum
- instr.logNamedValue("sparsity", sparsity.toString)
- if (sparsity > 0.5) {
- instr.logWarning(s"sparsity of input dataset is $sparsity, " +
- s"which may hurt performance in high-level BLAS.")
- }
+ if (actualBlockSizeInMB == 0) {
+ val avgNNZ = summarizer.numNonzeros.activeIterator.map(_._2 / summarizer.count).sum
Review comment:
Yes, one more metric, `numNonZeros`, will be computed.
Since it still needs only one pass, I think the additional time should not be significant.
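As a rough illustration of why this stays a single pass, the sketch below accumulates the row count and the per-feature non-zero counts together, then derives the average number of non-zeros per row in the same way as the `avgNNZ` line in the diff. It works on plain arrays rather than Spark's summarizer; `AvgNnzSketch`, `summarize` and `isRelativelyDense` are hypothetical names, and only the `dim <= avgNNZ * 3` rule is taken from the diff under review.

```scala
object AvgNnzSketch {
  // One pass over the rows: the row count and the per-feature non-zero
  // counts are accumulated together.
  def summarize(rows: Iterator[Array[Double]], dim: Int): (Long, Array[Long]) = {
    val nnzPerFeature = new Array[Long](dim)
    var count = 0L
    rows.foreach { row =>
      require(row.length == dim)
      var j = 0
      while (j < dim) {
        if (row(j) != 0.0) nnzPerFeature(j) += 1L
        j += 1
      }
      count += 1L
    }
    (count, nnzPerFeature)
  }

  // Average number of non-zeros per row: sum over features of (nnz_j / count).
  def avgNNZ(count: Long, nnzPerFeature: Array[Long]): Double =
    nnzPerFeature.map(_.toDouble / count).sum

  // Density rule as written in the diff under review.
  def isRelativelyDense(dim: Int, avgNNZ: Double): Boolean = dim <= avgNNZ * 3
}
```

For two 4-dimensional rows with one non-zero each, `avgNNZ` is 1.0 and the data counts as sparse under this rule, since 4 > 1.0 * 3.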
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712112035
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724440823
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725806330
**[Test build #130960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130960/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725920282
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35582/
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707617758
**[Test build #129740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129740/testReport)** for PR 30009 at commit [`08cf27d`](https://github.com/apache/spark/commit/08cf27d5df69c6d5a5a8d448fb3a3043a6c474a0).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708650667
**[Test build #129756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129756/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709937125
**[Test build #129886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129886/testReport)** for PR 30009 at commit [`c0a734d`](https://github.com/apache/spark/commit/c0a734de5e4d4df819caa4f86634242966d5786b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720851948
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725815422
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708893156
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725874235
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35575/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724526178
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725795212
**[Test build #130958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130958/testReport)** for PR 30009 at commit [`a82e5f5`](https://github.com/apache/spark/commit/a82e5f51722309afdc8b746d032f631b9421d64f).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712112035
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707662881
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716253408
**[Test build #130251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130251/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716268740
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724436653
**[Test build #130822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130822/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708884714
**[Test build #129786 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129786/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724541792
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707478699
**[Test build #129724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129724/testReport)** for PR 30009 at commit [`9cd1053`](https://github.com/apache/spark/commit/9cd10535b962b18421a6a21a79564fe6e7fae157).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708888321
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34393/
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725816530
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35564/
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725872196
[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r506167739
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,62 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    new Iterator[InstanceBlock] {
+      private var numCols = -1L
+      private val buff = mutable.ArrayBuilder.make[Instance]
+
+      override def hasNext: Boolean = iterator.hasNext
+
+      override def next(): InstanceBlock = {
+        buff.clear()
+        var buffCnt = 0L
+        var buffNnz = 0L
+        var buffUnitWeight = true
+        var blockMemUsage = 0L
+
+        while (iterator.hasNext && blockMemUsage < maxMemUsage) {
+          val instance = iterator.next()
+          if (numCols < 0L) numCols = instance.features.size
+          require(numCols == instance.features.size)
+          val nnz = instance.features.numNonzeros
+
+          buff += instance
+          buffCnt += 1L
+          buffNnz += nnz
+          buffUnitWeight &&= (instance.weight == 1)
+          blockMemUsage = getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight)
+        }
+
+        // the block mem usage may slightly exceed threshold, not a big issue.
+        // and this ensure even if one row exceed block limit, each block has one row
+        InstanceBlock.fromInstances(buff.result())
+      }
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+  def inferBlockSizeInMB(
+      dim: Int,
+      avgNNZ: Double,
+      blasLevel: Int = 2): Double = {
+    if (dim <= avgNNZ * 3) {
+      0.25
+    } else {
+      64.0
+    }
Review comment:
The current strategy is quite simple. I think we may use a more complex cost model in the future if necessary.
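
To make the two branches concrete, here is a minimal standalone sketch of the heuristic from the hunk above. It is an illustration only, not code from the patch: the object name `BlockSizeHeuristic` is hypothetical, and the `blasLevel` parameter is dropped because the body shown in the diff does not reference it.

```scala
object BlockSizeHeuristic {
  // Same comparison as in the diff: when the feature dimension is at most
  // three times the average number of nonzeros per row, the data is treated
  // as relatively dense and gets a small block (0.25 MB); otherwise it is
  // treated as sparse and gets a large block (64 MB).
  def inferBlockSizeInMB(dim: Int, avgNNZ: Double): Double =
    if (dim <= avgNNZ * 3) 0.25 else 64.0
}
```

Under this rule, a dataset with `dim = 100` and `avgNNZ = 50.0` falls in the first branch, while one with `dim = 1000000` and `avgNNZ = 10.0` falls in the second.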
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724437534
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720851384
retest this please
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706834138
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34257/
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707666392
[GitHub] [spark] WeichenXu123 commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725305176
@zhengruifeng
* merge my update PR (fix 2.13 scala issue) https://github.com/apache/spark/pull/30327
* change param name to be `maxBlockSizeInMB`
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725816547
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708689222
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34362/
[GitHub] [spark] WeichenXu123 commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-726014822
Merged to master. Thanks!
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720851948
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725899630
retest this please
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725911880
**[Test build #130977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130977/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `trait HasMaxBlockSizeInMB extends Params `
* `class HasMaxBlockSizeInMB(Params):`
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725831725
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724482556
retest this please
[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r503159247
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -100,6 +102,23 @@ private[spark] case class InstanceBlock(
 private[spark] object InstanceBlock {
+  private def getBlockSize(
+      numCols: Long,
+      numRows: Long,
+      nnz: Long,
+      allUnitWeight: Boolean): Long = {
+    val doubleBytes = java.lang.Double.BYTES
+    val arrayHeader = 12L
+    val denseSize = Matrices.getDenseSize(numCols, numRows)
+    val sparseSize = Matrices.getSparseSize(nnz, numRows + 1)
+    val matrixSize = math.min(denseSize, sparseSize)
+    if (allUnitWeight) {
+      matrixSize + doubleBytes * numRows + arrayHeader * 2
Review comment:
Should this be + 1x `arrayHeader`?
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,65 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemoryUsage(
+      iterator: Iterator[Instance],
+      maxMemoryUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemoryUsage > 0)
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var numCols = -1L
+    var count = 0L
+    var nnz = 0L
Review comment:
nnz => buffNnz
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,65 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemoryUsage(
+      iterator: Iterator[Instance],
+      maxMemoryUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemoryUsage > 0)
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var numCols = -1L
+    var count = 0L
+    var nnz = 0L
+    var allUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val n = instance.features.numNonzeros
+      var block = Option.empty[InstanceBlock]
+      // Check if enough memory remains to add this instance to the block.
+      if (getBlockSize(numCols, count + 1L, nnz + n,
+        allUnitWeight && (instance.weight == 1)) > maxMemoryUsage) {
+        // Check if this instance is too large
+        require(count > 0, s"instance $instance exceeds memory limit $maxMemoryUsage, " +
+          s"please increase block size")
+
+        block = Some(InstanceBlock.fromInstances(buff.result()))
+        buff.clear()
+        count = 0L
+        nnz = 0L
+        allUnitWeight = true
+      }
+      buff += instance
+      count += 1L
+      nnz += n
+      allUnitWeight &&= (instance.weight == 1)
+      block.iterator
+    } ++ {
+      val instances = buff.result()
+      if (instances.nonEmpty) {
+        Iterator.single(InstanceBlock.fromInstances(instances))
+      } else Iterator.empty
+    }
+  }
Review comment:
This iterator logic would be clearer if written as a for loop with `yield`.
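As one possible way to make the control flow easier to follow, the same grouping idea can be sketched as an explicit `Iterator` over a buffered input, without Spark dependencies. `Item`, `cost`, and `groupByBudget` are hypothetical names introduced only for this illustration, and the cost model is a toy stand-in for the block memory estimate in the diff.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Instance: only the nonzero count matters here.
final case class Item(nnz: Long)

object BudgetGrouping {
  // Assumed toy cost model: fixed per-row overhead plus a per-nonzero cost.
  def cost(rows: Long, nnz: Long): Long = 16L * rows + 12L * nnz

  // Groups consecutive items so each group's estimated cost stays within maxCost.
  def groupByBudget(items: Iterator[Item], maxCost: Long): Iterator[List[Item]] = {
    require(maxCost > 0)
    new Iterator[List[Item]] {
      private val in = items.buffered

      override def hasNext: Boolean = in.hasNext

      override def next(): List[Item] = {
        if (!in.hasNext) throw new NoSuchElementException("next on empty iterator")
        val buff = ArrayBuffer.empty[Item]
        var buffNnz = 0L
        // Keep taking items while the next one still fits in the budget.
        while (in.hasNext && cost(buff.size + 1L, buffNnz + in.head.nnz) <= maxCost) {
          val item = in.next()
          buff += item
          buffNnz += item.nnz
        }
        // An empty buffer here means a single item alone exceeds the budget.
        require(buff.nonEmpty, s"item ${in.head} exceeds memory limit $maxCost")
        buff.toList
      }
    }
  }
}
```

Each call to `next()` builds exactly one group, so the per-group state (`buff`, `buffNnz`) is local and never needs to be reset by hand.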
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,65 @@ private[spark] object InstanceBlock {
def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
}
+
+ def blokifyWithMaxMemoryUsage(
+ iterator: Iterator[Instance],
+ maxMemoryUsage: Long): Iterator[InstanceBlock] = {
+ require(maxMemoryUsage > 0)
+ val buff = mutable.ArrayBuilder.make[Instance]
+ var numCols = -1L
+ var count = 0L
+ var nnz = 0L
+ var allUnitWeight = true
+
+ iterator.flatMap { instance =>
+ if (numCols < 0L) numCols = instance.features.size
+ require(numCols == instance.features.size)
+ val n = instance.features.numNonzeros
Review comment:
n => nnz
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -100,6 +102,23 @@ private[spark] case class InstanceBlock(
private[spark] object InstanceBlock {
+ private def getBlockSize(
Review comment:
To be semantically accurate, rename this to `getBlockMemUsage`.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724437536
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130822/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708862751
**[Test build #129786 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129786/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712101113
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34608/
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724440815
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35431/
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712053130
**[Test build #130001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130001/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720876636
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716252855
retest this please
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725832950
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35566/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-722762550
@mengxr Thanks for reviewing!
> Does your benchmark code count pre-processing time?
yes, pre-processing time is taken into account.
> Could you paste your benchmark code and environment specs?
Dataset: [Epsilon](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2)
numInstances: 100,000; numFeatures: 2,000
env: ubuntu 18.04
cmd: bin/spark-shell --driver-memory=64G --conf spark.driver.maxResultSize=10g
code:
```
import scala.util.Random
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.regression._
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel
val df = spark.read.option("numFeatures", "2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t").withColumn("aftcensor", (col("label")+1)/2).withColumn("aftlabel", (col("label")+2)/2).withColumn("label", (col("label")+1)/2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count
def getSparseUDF(dim: Int) = {
  val rng = new Random(123)
  val newIndices = rng.shuffle(Seq.range(0, dim)).take(2000).toArray.sorted
  udf { vec: Vector =>
    Vectors.sparse(dim, newIndices, vec.toArray).compressed
  }
}
new LinearSVC().setMaxIter(20).fit(df)
val svc = new LinearSVC().setMaxIter(100).setTol(0)
for (dim <- Seq(2000, 3000, 4000, 5000, 10000, 20000, 200000); size <- Seq(0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0)) {
  Thread.sleep(60000)
  val ds = if (dim == 2000) { df } else { val sparseUDF = getSparseUDF(dim); df.withColumn("features", sparseUDF(col("features"))) }
  val start = System.currentTimeMillis
  val model = svc.setBlockSizeInMB(size).fit(ds)
  val end = System.currentTimeMillis
  println((model.uid, dim, size, end - start, model.coefficients.toString.take(100)))
}
// for branch-3.0
for (dim <- Seq(2000, 3000, 4000, 5000, 10000, 20000, 200000)) {
  Thread.sleep(60000)
  val ds = if (dim == 2000) { df } else { val sparseUDF = getSparseUDF(dim); df.withColumn("features", sparseUDF(col("features"))) }
  val start = System.currentTimeMillis
  val model = svc.fit(ds)
  val end = System.currentTimeMillis
  println((model.uid, dim, -1, end - start, model.coefficients.toString.take(100)))
}
```
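The `getSparseUDF` step in the benchmark keeps the 2,000 original values of each row and scatters them over `dim` columns at a fixed, sorted set of random positions, so density falls as `dim` grows while nnz stays constant. A minimal Spark-free sketch of that idea follows; `SparseEmbedding`, `pickIndices`, and `embed` are hypothetical names used only for illustration, and the result is materialized as a dense array purely so it can be inspected.

```scala
import scala.util.Random

object SparseEmbedding {
  // Picks k distinct target positions in [0, dim), sorted ascending.
  // A fixed seed means every row is mapped to the same positions.
  def pickIndices(dim: Int, k: Int, seed: Long): Array[Int] =
    new Random(seed).shuffle(Seq.range(0, dim)).take(k).toArray.sorted

  // Places values(i) at position indices(i) of a length-dim array; all other entries stay 0.
  def embed(dim: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "one index per value")
    val out = new Array[Double](dim)
    var i = 0
    while (i < indices.length) {
      out(indices(i)) = values(i)
      i += 1
    }
    out
  }
}
```

With `k` fixed, the fraction of nonzero entries is at most `k / dim`, which is the quantity the benchmark varies across runs.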
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716268731
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34851/
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725913955
retest this please
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725976554
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35587/
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716263315
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34851/
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725952328
**[Test build #130981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130981/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `trait HasMaxBlockSizeInMB extends Params `
* `class HasMaxBlockSizeInMB(Params):`
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725847488
retest this please
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707658672
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724440823
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707654594
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34346/
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707666376
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34348/
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709943526
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707475324
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724409079
**[Test build #130822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130822/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r503191442
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -100,6 +102,23 @@ private[spark] case class InstanceBlock(
private[spark] object InstanceBlock {
+ private def getBlockSize(
+ numCols: Long,
+ numRows: Long,
+ nnz: Long,
+ allUnitWeight: Boolean): Long = {
+ val doubleBytes = java.lang.Double.BYTES
+ val arrayHeader = 12L
+ val denseSize = Matrices.getDenseSize(numCols, numRows)
+ val sparseSize = Matrices.getSparseSize(nnz, numRows + 1)
+ val matrixSize = math.min(denseSize, sparseSize)
+ if (allUnitWeight) {
+ matrixSize + doubleBytes * numRows + arrayHeader * 2
Review comment:
There are still two arrays (the weight array is `Array.emptyDoubleArray`), so shouldn't there still be two array headers?
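To make the byte accounting in this exchange concrete, here is a rough sketch of the estimate. The dense and sparse formulas below are assumptions for illustration and are not necessarily what `Matrices.getDenseSize` and `Matrices.getSparseSize` compute; the non-unit-weight branch is also assumed, since it is not shown in the diff. `BlockMemEstimate` and its members are hypothetical names.

```scala
object BlockMemEstimate {
  val DoubleBytes: Long = java.lang.Double.BYTES.toLong
  val IntBytes: Long = java.lang.Integer.BYTES.toLong
  val ArrayHeader: Long = 12L // same constant as in the diff

  // Assumed: a dense matrix is one double array of numRows * numCols values.
  def denseSize(numCols: Long, numRows: Long): Long =
    DoubleBytes * numCols * numRows + ArrayHeader

  // Assumed: a sparse matrix is a double array of values, an int array of
  // indices (both of length nnz), and an int array of numPtrs pointers.
  def sparseSize(nnz: Long, numPtrs: Long): Long =
    (DoubleBytes + IntBytes) * nnz + IntBytes * numPtrs + ArrayHeader * 3L

  def blockMemUsage(numCols: Long, numRows: Long, nnz: Long, allUnitWeight: Boolean): Long = {
    val matrixSize = math.min(denseSize(numCols, numRows), sparseSize(nnz, numRows + 1L))
    // Labels: numRows doubles plus one array header.
    val labels = DoubleBytes * numRows + ArrayHeader
    // Weights: an empty array still costs a header; otherwise numRows doubles (assumed).
    val weights = if (allUnitWeight) ArrayHeader else DoubleBytes * numRows + ArrayHeader
    matrixSize + labels + weights
  }
}
```

Under these assumptions the unit-weight case comes to `matrixSize + doubleBytes * numRows + arrayHeader * 2`, matching the line under review: one header for the label array and one for the empty weight array.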
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707662416
**[Test build #129742 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129742/testReport)** for PR 30009 at commit [`9245263`](https://github.com/apache/spark/commit/92452631c11e71b71cb99a7b61d8b8661b150c17).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708245249
retest this please
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725872196
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725902842
**[Test build #130977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130977/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725831725
----------------------------------------------------------------
[GitHub] [spark] zero323 commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706917827
> @zero323 I sent a new PR here, thanks for reviewing. I tried to verify the consistency of the annotations locally, but the following command failed:
>
> ```
> mypy --no-incremental --config python/mypy.ini python/pyspark
> python/pyspark/ml/linalg/__init__.pyi:25: error: misplaced type annotation
> ```
>
> I installed `mypy` with `sudo apt install mypy` on Ubuntu 18.04.
> I am not very familiar with `mypy`; do I need to configure it somewhere?
No additional configuration should be required, but the version packaged for Ubuntu is pretty old, and at first glance it doesn't support error codes (the `[import]` part).
Personally I'd recommend either [venv](https://docs.python.org/3/library/venv.html) or miniconda, but if you want a quick fix, installing pip and doing a user install should do the trick:
```
sudo apt purge mypy
sudo apt install python3-pip
pip install mypy
```
I've checked things on my side (mypy 0.790, current stable), for both master and this PR, and things look good.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712119800
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707478934
[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r503848595
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,85 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    new Iterator[InstanceBlock] {
+      private var numCols = -1L
+      private val buff = mutable.ArrayBuilder.make[Instance]
+      private var buffCnt = 0L
+      private var buffNnz = 0L
+      private var buffUnitWeight = true
+      private var block = Option.empty[InstanceBlock]
+
+      private def flush(): Unit = {
+        block = Some(InstanceBlock.fromInstances(buff.result()))
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+      }
+
+      private def blockify(): Unit = {
+        block = None
+
+        while (block.isEmpty && iterator.hasNext) {
+          val instance = iterator.next()
+          if (numCols < 0L) numCols = instance.features.size
+          require(numCols == instance.features.size)
+          val nnz = instance.features.numNonzeros
+
+          // Check if enough memory remains to add this instance to the block.
+          if (getBlockMemUsage(numCols, buffCnt + 1L, buffNnz + nnz,
+            buffUnitWeight && (instance.weight == 1)) > maxMemUsage) {
+            // Check if this instance is too large
+            require(buffCnt > 0, s"instance $instance exceeds memory limit $maxMemUsage, " +
+              s"please increase block size")
+            flush()
+          }
+
+          buff += instance
+          buffCnt += 1L
+          buffNnz += nnz
+          buffUnitWeight &&= (instance.weight == 1)
Review comment:
After `flush()`, `buffCnt`/`buffNnz` are reset to 0, but then you increment them for the current instance and exit the loop. So at the start of the next batch, `buffCnt`/`buffNnz` won't be 0.
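To make the buffering behaviour under discussion concrete, here is a minimal Python sketch of greedy blocking under a memory budget. It is not the PR's Scala code: the `(nnz, weight)` row tuples and the toy cost model `estimate_block_bytes` are hypothetical stand-ins for `Instance` and `getBlockMemUsage`, and unlike the quoted diff it checks every row on its own before buffering it. The point it illustrates is that the row which triggers a flush is carried over as the first element of the next block, so the counters are expected to be non-zero when the next batch starts.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for an Instance: (number of non-zeros, weight).
Row = Tuple[int, float]


def estimate_block_bytes(count: int, nnz: int, unit_weight: bool) -> int:
    """Toy cost model (an assumption, not Spark's getBlockMemUsage)."""
    # 12 bytes per stored non-zero (8-byte value + 4-byte index), 8 bytes per
    # label, and 8 more bytes per row when weights must be stored explicitly.
    per_row = 8 if unit_weight else 16
    return 12 * nnz + per_row * count


def blockify(rows: Iterable[Row], max_bytes: int) -> Iterator[List[Row]]:
    """Greedily group rows so each block's estimate stays within max_bytes."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    buff: List[Row] = []
    buff_nnz = 0
    unit_weight = True
    for nnz, weight in rows:
        is_unit = weight == 1.0
        # A row that does not fit even in an empty block can never be placed.
        if estimate_block_bytes(1, nnz, is_unit) > max_bytes:
            raise ValueError(f"row with nnz={nnz} exceeds memory limit {max_bytes}")
        would_be = estimate_block_bytes(
            len(buff) + 1, buff_nnz + nnz, unit_weight and is_unit)
        if would_be > max_bytes:
            yield buff
            # Reset the counters; the current row is appended below and so
            # becomes the first element of the next block.
            buff, buff_nnz, unit_weight = [], 0, True
        buff.append((nnz, weight))
        buff_nnz += nnz
        unit_weight = unit_weight and is_unit
    if buff:
        yield buff  # final, possibly partial, block
```

With three unit-weight rows of two non-zeros each and a 70-byte budget, `list(blockify([(2, 1.0)] * 3, 70))` groups the first two rows together and starts a second block with the third row.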
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708893138
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34393/
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709935090
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34492/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725937860
Merged build finished. Test FAILed.
[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r518482624
##########
File path: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala
##########
@@ -562,4 +562,22 @@ trait HasBlockSize extends Params {
   /** @group expertGetParam */
   final def getBlockSize: Int = $(blockSize)
 }
+
+/**
+ * Trait for shared param blockSizeInMB (default: 0.0). This trait may be changed or
+ * removed between minor versions.
+ */
+trait HasBlockSizeInMB extends Params {
+
+  /**
+   * Param for Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0..
+   * @group expertParam
+   */
+  final val blockSizeInMB: DoubleParam = new DoubleParam(this, "blockSizeInMB", "Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0.", ParamValidators.gtEq(0.0))
Review comment:
Or `maxBlockSizeInMB`, to keep in line with the existing [`maxMemoryInMB`](https://github.com/apache/spark/blob/bc7885901dd99de21ecbf269d72fa37a393b2ffc/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L121) in `treeParams.scala`?
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709937564
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706846174
**[Test build #129653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129653/testReport)** for PR 30009 at commit [`eb0cf6b`](https://github.com/apache/spark/commit/eb0cf6b21c913545f89b7ff8cb3ba6fa65a51556).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait HasBlockSizeInMB extends Params`
  * `class HasBlockSizeInMB(Params):`
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725884560
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35575/
Test FAILed.
[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r518626371
##########
File path: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala
##########
@@ -562,4 +562,22 @@ trait HasBlockSize extends Params {
   /** @group expertGetParam */
   final def getBlockSize: Int = $(blockSize)
 }
+
+/**
+ * Trait for shared param blockSizeInMB (default: 0.0). This trait may be changed or
+ * removed between minor versions.
+ */
+trait HasBlockSizeInMB extends Params {
+
+  /**
+   * Param for Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0..
+   * @group expertParam
+   */
+  final val blockSizeInMB: DoubleParam = new DoubleParam(this, "blockSizeInMB", "Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0.", ParamValidators.gtEq(0.0))
Review comment:
In the current PR, a block can exceed this size. I think `maxBlockSize...` would suggest that a block must not be larger than this value.
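The naming concern can be illustrated with a small Python sketch of a soft limit. It rests on an assumption that is not confirmed by the diff quoted above (which fails with `require` instead): a single row larger than the target is still accepted and emitted as a block of its own, which is one way a block could end up larger than the configured size. The function name `blockify_soft` and the row sizes are made up for illustration.

```python
from typing import Iterable, Iterator, List


def blockify_soft(row_bytes: Iterable[int], target_bytes: int) -> Iterator[List[int]]:
    """Group row sizes into blocks of roughly target_bytes (a soft limit)."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    block: List[int] = []
    used = 0
    for size in row_bytes:
        # Close the current block if adding this row would pass the target.
        if block and used + size > target_bytes:
            yield block
            block, used = [], 0
        # Assumption: an oversized row is still accepted, alone in its block.
        block.append(size)
        used += size
    if block:
        yield block
```

For example, `list(blockify_soft([40, 40, 150, 30], 100))` puts the 150-byte row in a block by itself, so that block is larger than the 100-byte target. Under such semantics the value acts as a target rather than a hard maximum, which is the distinction a `max...` prefix would blur.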
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707654611
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708893156
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724541763
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35451/
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707460083
**[Test build #129724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129724/testReport)** for PR 30009 at commit [`9cd1053`](https://github.com/apache/spark/commit/9cd10535b962b18421a6a21a79564fe6e7fae157).
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706823314
**[Test build #129653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129653/testReport)** for PR 30009 at commit [`eb0cf6b`](https://github.com/apache/spark/commit/eb0cf6b21c913545f89b7ff8cb3ba6fa65a51556).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709937564
[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706846483
[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707617758
**[Test build #129740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129740/testReport)** for PR 30009 at commit [`08cf27d`](https://github.com/apache/spark/commit/08cf27d5df69c6d5a5a8d448fb3a3043a6c474a0).
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716253408
**[Test build #130251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130251/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).
[GitHub] [spark] WeichenXu123 closed pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
WeichenXu123 closed pull request #30009:
URL: https://github.com/apache/spark/pull/30009
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724440829
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35431/
Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725815167
**[Test build #130958 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130958/testReport)** for PR 30009 at commit [`a82e5f5`](https://github.com/apache/spark/commit/a82e5f51722309afdc8b746d032f631b9421d64f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.