Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/12 02:39:12 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

zhengruifeng opened a new pull request #30009:
URL: https://github.com/apache/spark/pull/30009


   ### What changes were proposed in this pull request?
   1. Use `blockSizeInMB` instead of `blockSize` (number of rows) to control the stacking of vectors.
   2. Infer an appropriate `blockSizeInMB` from the data sparsity when it is set to 0.
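
A minimal sketch of stacking vectors under a memory budget rather than a fixed row count. This is an illustration, not the code in this PR: the names `estimated_bytes` and `blockify_by_memory` are hypothetical, and the size estimate (8 bytes per dense value, 12 bytes per sparse nonzero) is an assumption.

```python
from typing import Iterable, Iterator, List, Tuple, Union

# A dense vector is a list of floats; a sparse vector is (size, indices, values).
Dense = List[float]
Sparse = Tuple[int, List[int], List[float]]
Vector = Union[Dense, Sparse]


def estimated_bytes(vec: Vector) -> int:
    """Assumed cost: 8 bytes per dense value, 12 bytes (4 index + 8 value) per nonzero."""
    if isinstance(vec, tuple):
        _, indices, _ = vec
        return 12 * len(indices)
    return 8 * len(vec)


def blockify_by_memory(vectors: Iterable[Vector], max_block_bytes: int) -> Iterator[List[Vector]]:
    """Greedily group vectors so that each block stays within max_block_bytes.

    A vector larger than the budget still forms a block on its own.
    """
    block: List[Vector] = []
    used = 0
    for vec in vectors:
        size = estimated_bytes(vec)
        # Close the current block before it would exceed the budget.
        if block and used + size > max_block_bytes:
            yield block
            block, used = [], 0
        block.append(vec)
        used += size
    if block:
        yield block
```

With the same byte budget, sparse rows with few nonzeros end up in blocks holding many more rows than dense rows do, which is the effect a row-count `blockSize` cannot express.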
   
   ### Why are the changes needed?
   The performance gain is mainly related to the number of nonzeros (nnz) in each block.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the param `blockSize` is replaced by `blockSizeInMB` in master.
   
   
   ### How was this patch tested?
   Added test suites and a performance test (results attached in the [ticket](https://issues.apache.org/jira/browse/SPARK-32907)).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707644950


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34346/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708689240








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724432836


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35431/
   




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709903604


   **[Test build #129886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129886/testReport)** for PR 30009 at commit [`c0a734d`](https://github.com/apache/spark/commit/c0a734de5e4d4df819caa4f86634242966d5786b).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706823314


   **[Test build #129653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129653/testReport)** for PR 30009 at commit [`eb0cf6b`](https://github.com/apache/spark/commit/eb0cf6b21c913545f89b7ff8cb3ba6fa65a51556).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725825105


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35566/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725912028


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130977/
   Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725902842


   **[Test build #130977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130977/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716268740








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707478934








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707658147


   **[Test build #129740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129740/testReport)** for PR 30009 at commit [`08cf27d`](https://github.com/apache/spark/commit/08cf27d5df69c6d5a5a8d448fb3a3043a6c474a0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708651145








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708682659


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34362/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725960732


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35587/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725952841








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725912019








[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709713528


   > Have you benchmarked on other BLAS implementations besides f2jBLAS?
   
   @WeichenXu123 Both f2jBLAS and OpenBLAS were benchmarked, and the results are recorded in the Excel results file.




[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706824192


   ping @WeichenXu123 
   
   @zero323 I sent a new PR here, thanks for reviewing. I tried to verify the consistency of the annotations locally, but the following command failed:
   
   ```
   mypy --no-incremental --config python/mypy.ini python/pyspark
   python/pyspark/ml/linalg/__init__.pyi:25: error: misplaced type annotation
   ``` 
   
   I installed `mypy` with `sudo apt install mypy` on Ubuntu 18.04.
   I am not very familiar with `mypy`; do I need to configure it somewhere?




[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707025789


   @zero323 Yes, that is because the version installed via `sudo apt install mypy` is too old (`0.560`).  
   `pip install mypy` works for me. Thank you!




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725937870


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35582/
   Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724541792








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720876636








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706838987


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34257/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725912019


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725832945








[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-726033937


   Thanks @WeichenXu123 @mengxr @zero323 for review!




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707654735


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34348/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725815422








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716271105








[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-723879005


   @mengxr I will update this PR once the following are settled:
   1. The name of the parameter.
   2. The default value: it looks like we can adopt 1MB for both sparse and dense cases. If so, I will remove the logic to compute `avgNNZ` in `linearSVC`.
   Thanks!
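
To make the 1MB default concrete, here is a rough back-of-the-envelope sketch of how many rows one block would hold. It is not taken from this PR: `rows_per_block` is a hypothetical helper, and the per-entry costs (8 bytes per dense value, 12 bytes per sparse nonzero) are assumptions.

```python
BYTES_PER_MB = 1024 * 1024


def rows_per_block(block_size_mb: float, stored_entries_per_row: float, sparse: bool) -> int:
    """Estimate how many rows fit into one block of block_size_mb megabytes.

    stored_entries_per_row is the number of features for dense rows,
    or the average number of nonzeros for sparse rows.
    """
    # Assumed storage cost per stored entry (hypothetical, for illustration only).
    bytes_per_entry = 12 if sparse else 8
    bytes_per_row = bytes_per_entry * max(1.0, stored_entries_per_row)
    # Always keep at least one row per block, even if a row exceeds the budget.
    return max(1, int(block_size_mb * BYTES_PER_MB // bytes_per_row))
```

Under these assumptions, a 1MB block holds about 1310 dense rows with 100 features, or about 8738 sparse rows averaging 10 nonzeros, so a single size in MB adapts the row count to the sparsity on its own.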




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725976582








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712112004


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34608/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724488915


   **[Test build #130842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130842/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706839002








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725810555


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35564/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707622010


   **[Test build #129742 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129742/testReport)** for PR 30009 at commit [`9245263`](https://github.com/apache/spark/commit/92452631c11e71b71cb99a7b61d8b8661b150c17).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725914725


   **[Test build #130981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130981/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724525731


   **[Test build #130842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130842/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712119800








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707658672








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706846483








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707662881








[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r519784311



##########
File path: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala
##########
@@ -562,4 +562,22 @@ trait HasBlockSize extends Params {
   /** @group expertGetParam */
   final def getBlockSize: Int = $(blockSize)
 }
+
+/**
+ * Trait for shared param blockSizeInMB (default: 0.0). This trait may be changed or
+ * removed between minor versions.
+ */
+trait HasBlockSizeInMB extends Params {
+
+  /**
+   * Param for Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be &gt;= 0..
+   * @group expertParam
+   */
+  final val blockSizeInMB: DoubleParam = new DoubleParam(this, "blockSizeInMB", "Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0.", ParamValidators.gtEq(0.0))

Review comment:
       > a block can exceed this size
   A block will exceed the limit only slightly, so this should not matter.
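
To illustrate why the overshoot stays small, here is a minimal sketch of greedy, memory-bounded grouping. It is not the code from this PR: `Item`, `sizeInBytes`, and `BlockifySketch` are hypothetical stand-ins, and the rule for closing a block (check the accumulated size before adding the next item) is an assumption made for the example. Under that assumption, a block can exceed the limit by less than the size of one item.

```scala
// Hypothetical stand-in for an instance: only its estimated size matters here.
final case class Item(sizeInBytes: Long)

object BlockifySketch {
  /** Greedily groups items; a block is closed once its accumulated size reaches maxMemUsage. */
  def blockify(items: Iterator[Item], maxMemUsage: Long): Iterator[Vector[Item]] = {
    require(maxMemUsage > 0, "maxMemUsage must be positive")
    new Iterator[Vector[Item]] {
      def hasNext: Boolean = items.hasNext

      def next(): Vector[Item] = {
        if (!items.hasNext) throw new NoSuchElementException("no more blocks")
        val buff = Vector.newBuilder[Item]
        var used = 0L
        // The limit is checked before each addition, so the last item added
        // may push the block past maxMemUsage by less than its own size.
        while (items.hasNext && used < maxMemUsage) {
          val item = items.next()
          buff += item
          used += item.sizeInBytes
        }
        buff.result()
      }
    }
  }
}

// Five items of 40 bytes each with a 100-byte limit.
val blocks = BlockifySketch.blockify(Iterator.fill(5)(Item(40L)), 100L).toList
val blockBytes = blocks.map(_.map(_.sizeInBytes).sum)
```

With these inputs the first block holds three items (120 bytes, 20 over the limit) and the second holds the remaining two (80 bytes).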






[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712119252


   **[Test build #130001 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130001/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725848161


   **[Test build #130969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130969/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725884551








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724409079


   **[Test build #130822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130822/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707622010


   **[Test build #129742 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129742/testReport)** for PR 30009 at commit [`9245263`](https://github.com/apache/spark/commit/92452631c11e71b71cb99a7b61d8b8661b150c17).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708862751


   **[Test build #129786 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129786/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725937840


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35582/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725816547








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707475324








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709943526








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709943505


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34492/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724526178








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724437534


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707470333


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34330/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708651145








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709903604


   **[Test build #129886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129886/testReport)** for PR 30009 at commit [`c0a734d`](https://github.com/apache/spark/commit/c0a734de5e4d4df819caa4f86634242966d5786b).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708601573


   **[Test build #129756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129756/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716271105








[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725848161


   **[Test build #130969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130969/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724488915


   **[Test build #130842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130842/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725914725


   **[Test build #130981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130981/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725806330


   **[Test build #130960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130960/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706839002






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716270883


   **[Test build #130251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130251/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r503843833



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,85 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    new Iterator[InstanceBlock] {
+      private var numCols = -1L
+      private val buff = mutable.ArrayBuilder.make[Instance]
+      private var buffCnt = 0L
+      private var buffNnz = 0L
+      private var buffUnitWeight = true
+      private var block = Option.empty[InstanceBlock]

Review comment:
       private var block: Option[InstanceBlock] = None

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,85 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    new Iterator[InstanceBlock] {
+      private var numCols = -1L
+      private val buff = mutable.ArrayBuilder.make[Instance]
+      private var buffCnt = 0L
+      private var buffNnz = 0L
+      private var buffUnitWeight = true
+      private var block = Option.empty[InstanceBlock]
+
+      private def flush(): Unit = {
+        block = Some(InstanceBlock.fromInstances(buff.result()))
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+      }
+
+      private def blockify(): Unit = {
+        block = None
+
+        while (block.isEmpty && iterator.hasNext) {
+          val instance = iterator.next()
+          if (numCols < 0L) numCols = instance.features.size
+          require(numCols == instance.features.size)
+          val nnz = instance.features.numNonzeros
+
+          // Check if enough memory remains to add this instance to the block.
+          if (getBlockMemUsage(numCols, buffCnt + 1L, buffNnz + nnz,
+            buffUnitWeight && (instance.weight == 1)) > maxMemUsage) {
+            // Check if this instance is too large
+            require(buffCnt > 0, s"instance $instance exceeds memory limit $maxMemUsage, " +
+              s"please increase block size")
+            flush()
+          }
+
+          buff += instance
+          buffCnt += 1L
+          buffNnz += nnz
+          buffUnitWeight &&= (instance.weight == 1)

Review comment:
       After `flush()`, `buffCnt`/`buffNnz` are reset to 0, but then the current instance is added and the loop exits. So for the next batch, the initial `buffCnt`/`buffNnz` won't be 0.
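
   To illustrate the state being discussed, here is a minimal sketch in plain Scala, not the PR's code: `groupByBudget`, the `Long` item costs and the simple sum check are assumptions standing in for `Instance`, `getBlockMemUsage` and `maxMemUsage`. It shows that the item which triggers a flush is carried over as the first member of the next group, so the counters are non-zero when the next call begins.

   ```scala
   import scala.collection.mutable.ArrayBuffer

   object BudgetGrouping {
     // Hypothetical helper: groups item costs so that each group's total stays within `budget`.
     def groupByBudget(costs: Iterator[Long], budget: Long): Iterator[Seq[Long]] = {
       require(budget > 0)
       new Iterator[Seq[Long]] {
         private val buff = ArrayBuffer.empty[Long]
         private var buffCost = 0L

         // A carried-over item counts as pending output even when the input is exhausted.
         override def hasNext: Boolean = buff.nonEmpty || costs.hasNext

         override def next(): Seq[Long] = {
           var out = Option.empty[Seq[Long]]
           while (out.isEmpty && costs.hasNext) {
             val c = costs.next()
             if (buffCost + c > budget) {
               require(buff.nonEmpty, s"item of cost $c exceeds budget $budget")
               // Flush the full group and reset the counters.
               out = Some(buff.toList)
               buff.clear()
               buffCost = 0L
             }
             // The item that did not fit becomes the first member of the next group,
             // so the counters are non-zero when the following call to next() starts.
             buff += c
             buffCost += c
           }
           out.getOrElse {
             // Input exhausted: emit whatever is still buffered.
             if (buff.isEmpty) throw new NoSuchElementException("next on empty iterator")
             val last = buff.toList
             buff.clear()
             buffCost = 0L
             last
           }
         }
       }
     }
   }
   ```

   With costs `3, 3, 3, 3, 3` and a budget of `7`, this sketch yields the groups `(3, 3)`, `(3, 3)` and `(3)`.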






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725976582








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724527891


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35451/
   




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708601573


   **[Test build #129756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129756/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).




[GitHub] [spark] WeichenXu123 commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709705904


   @mengxr Do you want to take a look?




[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r518645766



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    var numCols = -1L
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var buffCnt = 0L
+    var buffNnz = 0L
+    var buffUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val nnz = instance.features.numNonzeros
+      buff += instance
+      buffCnt += 1L
+      buffNnz += nnz
+      buffUnitWeight &&= (instance.weight == 1)
+
+      if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+        Iterator.single(block)
+      } else Iterator.empty
+    } ++ {
+      if (buffCnt > 0) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        Iterator.single(block)
+      } else Iterator.empty
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+
+  /**
+   * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+   *
+   * @param dim size of vector.
+   * @param avgNNZ average nnz of vectors.
+   * @param blasLevel level of BLAS operation.
+   */
+  def inferBlockSizeInMB(
+      dim: Int,
+      avgNNZ: Double,
+      blasLevel: Int = 2): Double = {
+    if (dim <= avgNNZ * 3) {
+      // When the dataset is relatively dense, Spark will use netlib-java for optimised numerical
+      // processing, which will try to use nativeBLAS implementations (like OpenBLAS, Intel MKL),
+      // and fallback to the Java implementation (f2jBLAS) if necessary.
+      // The suggested value for dense cases is 0.25.
+      0.25

Review comment:
       We could also change it to 1.0 for dense cases (so that 1.0 is the default value for all cases); the speedup at 1.0MB is only slightly lower than that at 0.25MB.
   
   
   There was [another performance test](https://issues.apache.org/jira/browse/SPARK-31714) on the implementations of prediction during training, which may be worth referring to:
   
   ```
   test("performance: gemv vs foreachNonZero(std)") {
     for (numRows <- Seq(16, 64, 256, 1024, 4096); numCols <- Seq(16, 64, 256, 1024, 4096)) {
       val rng = new Random(123)
       val matrix = Matrices.dense(numRows, numCols,
         Array.fill(numRows * numCols)(rng.nextDouble)).toDense
       val vectors = matrix.rowIter.toArray
       val coefVec = Vectors.dense(Array.fill(numCols)(rng.nextDouble))
       val coefArr = coefVec.toArray
       val stdVec = Vectors.dense(Array.fill(numCols)(rng.nextDouble))
       val stdArr = stdVec.toArray
   
       val start1 = System.nanoTime
       Seq.range(0, 100).foreach { _ => matrix.multiply(coefVec) }
       val dur1 = System.nanoTime - start1
   
       val start2 = System.nanoTime
       Seq.range(0, 100).foreach { _ =>
         vectors.map { vector =>
           var sum = 0.0
           vector.foreachNonZero { (i, v) =>
             val std = stdArr(i)
             if (std != 0) sum += coefArr(i) * v
           }
           sum
         }
       }
       val dur2 = System.nanoTime - start2
   
       println(s"numRows=$numRows, numCols=$numCols, gemv: $dur1, foreachNonZero(std): $dur2, " +
         s"foreachNonZero(std)/gemv: ${dur2.toDouble / dur1}")
     }
   }
   ```
   
   output:
   ```
   numRows=16, numCols=16, gemv: 543897, foreachNonZero(std): 4683864, foreachNonZero(std)/gemv: 8.611674636925741
   numRows=16, numCols=64, gemv: 274878, foreachNonZero(std): 2996356, foreachNonZero(std)/gemv: 10.90067593623353
   numRows=16, numCols=256, gemv: 771816, foreachNonZero(std): 9081260, foreachNonZero(std)/gemv: 11.76609450957223
   numRows=16, numCols=1024, gemv: 1537698, foreachNonZero(std): 23386693, foreachNonZero(std)/gemv: 15.208898626388276
   numRows=16, numCols=4096, gemv: 5577804, foreachNonZero(std): 87389503, foreachNonZero(std)/gemv: 15.667367121541023
   numRows=64, numCols=16, gemv: 173518, foreachNonZero(std): 1384669, foreachNonZero(std)/gemv: 7.979973259258405
   numRows=64, numCols=64, gemv: 313941, foreachNonZero(std): 4403461, foreachNonZero(std)/gemv: 14.026396679630887
   numRows=64, numCols=256, gemv: 981895, foreachNonZero(std): 19443231, foreachNonZero(std)/gemv: 19.801741530408037
   numRows=64, numCols=1024, gemv: 3908960, foreachNonZero(std): 88985415, foreachNonZero(std)/gemv: 22.764473159101144
   numRows=64, numCols=4096, gemv: 16075758, foreachNonZero(std): 366740675, foreachNonZero(std)/gemv: 22.81327418588909
   numRows=256, numCols=16, gemv: 329479, foreachNonZero(std): 5341171, foreachNonZero(std)/gemv: 16.210960334346044
   numRows=256, numCols=64, gemv: 948949, foreachNonZero(std): 17126600, foreachNonZero(std)/gemv: 18.047966750584067
   numRows=256, numCols=256, gemv: 3947789, foreachNonZero(std): 81207071, foreachNonZero(std)/gemv: 20.570266293360664
   numRows=256, numCols=1024, gemv: 14635992, foreachNonZero(std): 350779742, foreachNonZero(std)/gemv: 23.96692632791819
   numRows=256, numCols=4096, gemv: 71265609, foreachNonZero(std): 1423000813, foreachNonZero(std)/gemv: 19.967566866649523
   numRows=1024, numCols=16, gemv: 916645, foreachNonZero(std): 19432942, foreachNonZero(std)/gemv: 21.200074183571612
   numRows=1024, numCols=64, gemv: 3479825, foreachNonZero(std): 66857430, foreachNonZero(std)/gemv: 19.212871336920678
   numRows=1024, numCols=256, gemv: 13680423, foreachNonZero(std): 312189763, foreachNonZero(std)/gemv: 22.82018348409256
   numRows=1024, numCols=1024, gemv: 68880268, foreachNonZero(std): 1401019163, foreachNonZero(std)/gemv: 20.339920323771096
   numRows=1024, numCols=4096, gemv: 293455450, foreachNonZero(std): 5744847994, foreachNonZero(std)/gemv: 19.576559215376644
   numRows=4096, numCols=16, gemv: 3714086, foreachNonZero(std): 82488401, foreachNonZero(std)/gemv: 22.20960984748334
   numRows=4096, numCols=64, gemv: 14273712, foreachNonZero(std): 279946980, foreachNonZero(std)/gemv: 19.612766461870606
   numRows=4096, numCols=256, gemv: 70000687, foreachNonZero(std): 1311574476, foreachNonZero(std)/gemv: 18.73659434228124
   numRows=4096, numCols=1024, gemv: 289944283, foreachNonZero(std): 5695483201, foreachNonZero(std)/gemv: 19.643371278336257
   numRows=4096, numCols=4096, gemv: 1169773295, foreachNonZero(std): 23019445987, foreachNonZero(std)/gemv: 19.678553173843827
   - performance: gemv vs foreachNonZero(std)
   ```
   
   The new implementations are based on `gemv`, while the implementations in branch-3.0 are based on `foreachNonZero(std)`.
   
   I think that a blockSize larger than 64X256 may be acceptable.
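
   As a rough check of what a 64x256 block means in memory terms, the following sketch computes the size of a dense block of `Double` values. The helper name `denseBlockSizeInMB` is hypothetical, and it counts only the 8-byte values, ignoring labels, weights and object overhead, so it is not the PR's `getBlockMemUsage`.

   ```scala
   object DenseBlockSize {
     // Hypothetical helper: memory of a dense rows x cols block of Doubles, in MB.
     // Only the 8-byte values are counted; labels, weights and JVM overhead are ignored.
     def denseBlockSizeInMB(rows: Long, cols: Long): Double = {
       require(rows > 0 && cols > 0)
       rows * cols * 8.0 / (1024.0 * 1024.0)
     }
   }
   ```

   Under these assumptions a 64x256 dense block takes 0.125MB, and a 1024x1024 block takes 8MB.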






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708884988








[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r518485738



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    var numCols = -1L
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var buffCnt = 0L
+    var buffNnz = 0L
+    var buffUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val nnz = instance.features.numNonzeros
+      buff += instance
+      buffCnt += 1L
+      buffNnz += nnz
+      buffUnitWeight &&= (instance.weight == 1)
+
+      if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+        Iterator.single(block)
+      } else Iterator.empty
+    } ++ {
+      if (buffCnt > 0) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        Iterator.single(block)
+      } else Iterator.empty
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+
+  /**
+   * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+   *
+   * @param dim size of vector.
+   * @param avgNNZ average nnz of vectors.
+   * @param blasLevel level of BLAS operation.
+   */
+  def inferBlockSizeInMB(
+      dim: Int,
+      avgNNZ: Double,
+      blasLevel: Int = 2): Double = {
+    if (dim <= avgNNZ * 3) {
+      // When the dataset is relatively dense, Spark will use netlib-java for optimised numerical
+      // processing, which will try to use nativeBLAS implementations (like OpenBLAS, Intel MKL),
+      // and fallback to the Java implementation (f2jBLAS) if necessary.
+      // The suggested value for dense cases is 0.25.
+      0.25
+    } else {
+      // When the dataset is sparse, Spark will use its own Scala implementation.
+      // The suggested value for sparse cases is 64.0.
+      64.0

Review comment:
       I agree that 64MB will be too big for k-means with a large `k`. For k-means and multi-class logistic regression, I added a `blasLevel`; maybe we also need to add a param `k`. But for now we can leave it alone. I agree that we can use a conservative value of 1MB here.






[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725831455


   **[Test build #130960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130960/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `trait HasMaxBlockSizeInMB extends Params `
     * `class HasMaxBlockSizeInMB(Params):`




[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r506018249



##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
##########
@@ -199,14 +193,11 @@ class LinearSVC @Since("2.2.0") (
     instr.logNamedValue("lowestLabelWeight", labelSummarizer.histogram.min.toString)
     instr.logNamedValue("highestLabelWeight", labelSummarizer.histogram.max.toString)
     instr.logSumOfWeights(summarizer.weightSum)
-    if ($(blockSize) > 1) {
-      val scale = 1.0 / summarizer.count / numFeatures
-      val sparsity = 1 - summarizer.numNonzeros.toArray.map(_ * scale).sum
-      instr.logNamedValue("sparsity", sparsity.toString)
-      if (sparsity > 0.5) {
-        instr.logWarning(s"sparsity of input dataset is $sparsity, " +
-          s"which may hurt performance in high-level BLAS.")
-      }
+    if (actualBlockSizeInMB == 0) {
+      val avgNNZ = summarizer.numNonzeros.activeIterator.map(_._2 / summarizer.count).sum

Review comment:
       Will the additional summarizer consume time?

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,62 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    new Iterator[InstanceBlock] {
+      private var numCols = -1L
+      private val buff = mutable.ArrayBuilder.make[Instance]
+
+      override def hasNext: Boolean = iterator.hasNext
+
+      override def next(): InstanceBlock = {
+        buff.clear()
+        var buffCnt = 0L
+        var buffNnz = 0L
+        var buffUnitWeight = true
+        var blockMemUsage = 0L
+
+        while (iterator.hasNext && blockMemUsage < maxMemUsage) {
+          val instance = iterator.next()
+          if (numCols < 0L) numCols = instance.features.size
+          require(numCols == instance.features.size)
+          val nnz = instance.features.numNonzeros
+
+          buff += instance
+          buffCnt += 1L
+          buffNnz += nnz
+          buffUnitWeight &&= (instance.weight == 1)
+          blockMemUsage = getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight)
+        }
+
+        // the block mem usage may slightly exceed threshold, not a big issue.
+        // and this ensure even if one row exceed block limit, each block has one row
+        InstanceBlock.fromInstances(buff.result())
+      }
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+  def inferBlockSizeInMB(
+      dim: Int,
+      avgNNZ: Double,
+      blasLevel: Int = 2): Double = {
+    if (dim <= avgNNZ * 3) {
+      0.25
+    } else {
+      64.0
+    }

Review comment:
       Could you document why these values were chosen?






[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725937860








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708884988








[GitHub] [spark] mengxr commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
mengxr commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r517854961



##########
File path: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala
##########
@@ -562,4 +562,22 @@ trait HasBlockSize extends Params {
   /** @group expertGetParam */
   final def getBlockSize: Int = $(blockSize)
 }
+
+/**
+ * Trait for shared param blockSizeInMB (default: 0.0). This trait may be changed or
+ * removed between minor versions.
+ */
+trait HasBlockSizeInMB extends Params {
+
+  /**
+   * Param for Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be &gt;= 0..
+   * @group expertParam
+   */
+  final val blockSizeInMB: DoubleParam = new DoubleParam(this, "blockSizeInMB", "Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0.", ParamValidators.gtEq(0.0))

Review comment:
       Shall we call it `maxBlockSizeMB`? The current name suggests that we try to match the block size. Calling it `maxBlockSizeMB` would leave us some space for optimization.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    var numCols = -1L
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var buffCnt = 0L
+    var buffNnz = 0L
+    var buffUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val nnz = instance.features.numNonzeros
+      buff += instance
+      buffCnt += 1L
+      buffNnz += nnz
+      buffUnitWeight &&= (instance.weight == 1)
+
+      if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+        Iterator.single(block)
+      } else Iterator.empty
+    } ++ {
+      if (buffCnt > 0) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        Iterator.single(block)
+      } else Iterator.empty
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+
+  /**
+   * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+   *
+   * @param dim size of vector.
+   * @param avgNNZ average nnz of vectors.
+   * @param blasLevel level of BLAS operation.
+   */
+  def inferBlockSizeInMB(
+      dim: Int,
+      avgNNZ: Double,
+      blasLevel: Int = 2): Double = {

Review comment:
       Not used.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    var numCols = -1L
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var buffCnt = 0L
+    var buffNnz = 0L
+    var buffUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val nnz = instance.features.numNonzeros
+      buff += instance
+      buffCnt += 1L
+      buffNnz += nnz
+      buffUnitWeight &&= (instance.weight == 1)
+
+      if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+        Iterator.single(block)
+      } else Iterator.empty
+    } ++ {
+      if (buffCnt > 0) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        Iterator.single(block)
+      } else Iterator.empty
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+
+  /**
+   * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+   *
+   * @param dim size of vector.
+   * @param avgNNZ average nnz of vectors.
+   * @param blasLevel level of BLAS operation.
+   */
+  def inferBlockSizeInMB(
+      dim: Int,
+      avgNNZ: Double,
+      blasLevel: Int = 2): Double = {
+    if (dim <= avgNNZ * 3) {
+      // When the dataset is relatively dense, Spark will use netlib-java for optimised numerical
+      // processing, which will try to use nativeBLAS implementations (like OpenBLAS, Intel MKL),
+      // and fallback to the Java implementation (f2jBLAS) if necessary.
+      // The suggested value for dense cases is 0.25.
+      0.25
+    } else {
+      // When the dataset is sparse, Spark will use its own Scala implementation.
+      // The suggested value for sparse cases is 64.0.
+      64.0

Review comment:
       It is a little surprising to see the default suddenly jump from 0.25 MB to 64 MB. This is very risky, because 64 MB of sparse data could generate a much bigger dense result, e.g., in multi-class logistic regression or k-means, if we eventually blockify their implementations. In your benchmark, the speed-up seems to start at 1 MB. I would be very conservative here.
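
   The concern can be illustrated with a back-of-the-envelope sketch. It assumes a sparse block costs about 12 bytes per nonzero (an 8-byte value plus a 4-byte index) and that a blockified multi-class model materializes one dense double per row and class. Both are simplifying assumptions rather than Spark's actual accounting, and the object and method names are hypothetical.

   ```scala
   object SparseBlowUp {
     private val BytesPerNonzero = 12L // assumed: 8-byte value + 4-byte index
     private val BytesPerDouble = 8L

     // Approximate number of rows that fit in a sparse block of `blockBytes` bytes.
     def rowsInBlock(blockBytes: Long, avgNNZ: Int): Long =
       blockBytes / (BytesPerNonzero * avgNNZ)

     // Size in bytes of a dense rows x numClasses output (e.g. per-class margins).
     def denseOutputBytes(blockBytes: Long, avgNNZ: Int, numClasses: Int): Long =
       rowsInBlock(blockBytes, avgNNZ) * numClasses * BytesPerDouble
   }
   ```

   Under these assumptions, a 64 MB block with 10 nonzeros per row and 100 classes holds about 559,000 rows and yields a dense output of roughly 447 MB, about 6.7 times the size of the input block.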

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    var numCols = -1L
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var buffCnt = 0L
+    var buffNnz = 0L
+    var buffUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val nnz = instance.features.numNonzeros
+      buff += instance
+      buffCnt += 1L
+      buffNnz += nnz
+      buffUnitWeight &&= (instance.weight == 1)
+
+      if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+        Iterator.single(block)
+      } else Iterator.empty
+    } ++ {
+      if (buffCnt > 0) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        Iterator.single(block)
+      } else Iterator.empty
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+
+  /**
+   * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.

Review comment:
       Could you link to the JIRA or this PR that has the performance tests?

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,74 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    var numCols = -1L
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var buffCnt = 0L
+    var buffNnz = 0L
+    var buffUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val nnz = instance.features.numNonzeros
+      buff += instance
+      buffCnt += 1L
+      buffNnz += nnz
+      buffUnitWeight &&= (instance.weight == 1)
+
+      if (getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight) >= maxMemUsage) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+        Iterator.single(block)
+      } else Iterator.empty
+    } ++ {
+      if (buffCnt > 0) {
+        val block = InstanceBlock.fromInstances(buff.result())
+        Iterator.single(block)
+      } else Iterator.empty
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+
+  /**
+   * Suggested value for BlockSizeInMB, based on performance tests of BLAS operation.
+   *
+   * @param dim size of vector.
+   * @param avgNNZ average nnz of vectors.
+   * @param blasLevel level of BLAS operation.
+   */
+  def inferBlockSizeInMB(
+      dim: Int,
+      avgNNZ: Double,
+      blasLevel: Int = 2): Double = {
+    if (dim <= avgNNZ * 3) {
+      // When the dataset is relatively dense, Spark will use netlib-java for optimised numerical
+      // processing, which will try to use nativeBLAS implementations (like OpenBLAS, Intel MKL),
+      // and fallback to the Java implementation (f2jBLAS) if necessary.
+      // The suggested value for dense cases is 0.25.
+      0.25

Review comment:
       We need more comments explaining how this default value was picked. It is roughly a 180x180 block of doubles, which seems quite small to me.
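
   For reference, the arithmetic behind the 180x180 estimate can be sketched as follows, assuming 8 bytes per double and 1 MB = 1024 * 1024 bytes. The object and method names are illustrative only and are not part of Spark.

   ```scala
   object DenseBlockSide {
     // Number of doubles that fit in a block of the given size (8 bytes each).
     def numDoubles(blockSizeInMB: Double): Long =
       (blockSizeInMB * 1024 * 1024 / 8).toLong

     // Side length of the largest square matrix of doubles that fits in the block.
     def squareSide(blockSizeInMB: Double): Int =
       math.floor(math.sqrt(numDoubles(blockSizeInMB).toDouble)).toInt
   }
   ```

   For 0.25 MB this gives 32768 doubles, i.e. a square of about 181 x 181.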






[GitHub] [spark] zhengruifeng edited a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-722762550


   @mengxr  Thanks for reviewing!
   
   > Does your benchmark code count pre-processing time?
   
   yes, pre-processing time is taken into account.
   
   > Could you paste your benchmark code and environment specs? 
   
   Dataset: [Epsilon](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2)
   numInstances: 100,000; numFeatures: 2,000
    
   env:  ubuntu 18.04
   cmd: bin/spark-shell --driver-memory=64G --conf spark.driver.maxResultSize=10g 
   
   code:
   ```
   import scala.util.Random
    
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.classification._
   import org.apache.spark.ml.regression._
   import org.apache.spark.sql.functions._
   import org.apache.spark.storage.StorageLevel
    
   val df = spark.read.option("numFeatures", "2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t").withColumn("aftcensor", (col("label")+1)/2).withColumn("aftlabel", (col("label")+2)/2).withColumn("label", (col("label")+1)/2)
   df.persist(StorageLevel.MEMORY_AND_DISK)
   df.count
    
   // Scatter the 2000 original feature values across `dim` columns to simulate a sparser dataset.
   def getSparseUDF(dim: Int) = {
     val rng = new Random(123)
     val newIndices = rng.shuffle(Seq.range(0, dim)).take(2000).toArray.sorted
     udf { vec: Vector =>
       Vectors.sparse(dim, newIndices, vec.toArray).compressed
     }
   }
    
   new LinearSVC().setMaxIter(20).fit(df)
    
   val svc = new LinearSVC().setMaxIter(100).setTol(0)
    
   for (dim <- Seq(2000, 3000, 4000, 5000, 10000, 20000, 200000); size <- Seq(0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0)) {
     Thread.sleep(60000)
     val ds = if (dim == 2000) { df } else { val sparseUDF = getSparseUDF(dim); df.withColumn("features", sparseUDF(col("features"))) }
     val start = System.currentTimeMillis
     val model = svc.setBlockSizeInMB(size).fit(ds)
     val end = System.currentTimeMillis
     println((model.uid, dim, size, end - start, model.coefficients.toString.take(100)))
   }
    
    
   // for branch-3.0
   for (dim <- Seq(2000, 3000, 4000, 5000, 10000, 20000, 200000)) {
     Thread.sleep(60000)
     val ds = if (dim == 2000) { df } else { val sparseUDF = getSparseUDF(dim); df.withColumn("features", sparseUDF(col("features"))) }
     val start = System.currentTimeMillis
     val model = svc.fit(ds)
     val end = System.currentTimeMillis
     println((model.uid, dim, -1, end - start, model.coefficients.toString.take(100)))
   }
   ```
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708689240








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725832945


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725952841








[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707460083


   **[Test build #129724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129724/testReport)** for PR 30009 at commit [`9cd1053`](https://github.com/apache/spark/commit/9cd10535b962b18421a6a21a79564fe6e7fae157).




[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-717647271


   also ping @srowen 




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712053130


   **[Test build #130001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130001/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).




[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708849480


   retest this please




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707666392








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725871798


   **[Test build #130969 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130969/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `trait HasMaxBlockSizeInMB extends Params `
     * `class HasMaxBlockSizeInMB(Params):`




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725884551


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725795212


   **[Test build #130958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130958/testReport)** for PR 30009 at commit [`a82e5f5`](https://github.com/apache/spark/commit/a82e5f51722309afdc8b746d032f631b9421d64f).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725832933


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35566/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707654611








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707475316


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34330/
   




[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r506021907



##########
File path: mllib/src/test/scala/org/apache/spark/ml/feature/InstanceSuite.scala
##########
@@ -74,4 +74,36 @@ class InstanceSuite extends SparkFunSuite{
     }
   }
 
+  test("InstanceBlock: blokify with max memory usage") {
+    val instance1 = Instance(19.0, 2.0, Vectors.dense(1.0, 7.0))
+    val instance2 = Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse)
+    val instances = Seq(instance1, instance2)
+
+    val blocks = InstanceBlock
+      .blokifyWithMaxMemUsage(Iterator.apply(instance1, instance2), 128).toArray
+    require(blocks.length == 1)
+    val block = blocks.head
+    assert(block.size === 2)
+    assert(block.numFeatures === 2)
+    block.instanceIterator.zipWithIndex.foreach {
+      case (instance, i) =>
+        assert(instance.label === instances(i).label)
+        assert(instance.weight === instances(i).weight)
+        assert(instance.features.toArray === instances(i).features.toArray)
+    }
+    Seq(0, 1).foreach { i =>
+      val nzIter = block.getNonZeroIter(i)
+      val vec = Vectors.sparse(2, nzIter.toSeq)
+      assert(vec.toArray === instances(i).features.toArray)
+    }
+
+    // instances larger than maxMemUsage
+    val bigInstance = Instance(-1.0, 2.0, Vectors.dense(Array.fill(10000)(1.0)))
+    InstanceBlock.blokifyWithMaxMemUsage(Iterator.fill(10)(bigInstance), 64).size
+
+    // different numFeatures
+    intercept[IllegalArgumentException] {
+      InstanceBlock.blokifyWithMaxMemUsage(Iterator.apply(instance1, bigInstance), 64).size
+    }
+  }

Review comment:
       Add a test:
   * Generate a list that mixes sparse and dense instances (some segments dense, others very sparse), and verify that no block exceeds the block memory limit by much (for example, (actual block memory size) / (configured limit) <= 1.1?).
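
   A self-contained sketch of such a check is shown below. It does not call Spark: `Row`, its cost model (the cheaper of 8 bytes per dense entry and 12 bytes per nonzero) and `blockify` are hypothetical stand-ins for the instance type, `getBlockMemUsage` and the greedy flush-when-full loop in `blokifyWithMaxMemUsage`.

   ```scala
   object BlockifySketch {
     // One instance, described only by its vector size and number of nonzeros.
     final case class Row(size: Int, nnz: Int) {
       // Assumed cost model: the cheaper of dense and sparse storage.
       def bytes: Long = math.min(8L * size, 12L * nnz)
     }

     // Greedily append rows to a buffer and flush it once it reaches maxBytes.
     def blockify(rows: Seq[Row], maxBytes: Long): Seq[Vector[Row]] = {
       require(maxBytes > 0)
       val blocks = Seq.newBuilder[Vector[Row]]
       var buff = Vector.empty[Row]
       var used = 0L
       rows.foreach { row =>
         buff = buff :+ row
         used += row.bytes
         if (used >= maxBytes) {
           blocks += buff
           buff = Vector.empty[Row]
           used = 0L
         }
       }
       // Emit the trailing, possibly under-filled, block.
       if (buff.nonEmpty) blocks += buff
       blocks.result()
     }

     // Total assumed memory of one block.
     def blockBytes(block: Seq[Row]): Long = block.map(_.bytes).sum
   }
   ```

   With a greedy loop of this shape, a block can overshoot the limit by at most the size of its last instance, so a ratio bound such as 1.1 would hold only when a single instance is under 10% of the configured limit. A test could therefore assert `blockBytes(block) < maxBytes + largestInstanceBytes` for every block.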

##########
File path: mllib/src/test/scala/org/apache/spark/ml/feature/InstanceSuite.scala
##########
@@ -74,4 +74,36 @@ class InstanceSuite extends SparkFunSuite{
     }
   }
 
+  test("InstanceBlock: blokify with max memory usage") {
+    val instance1 = Instance(19.0, 2.0, Vectors.dense(1.0, 7.0))
+    val instance2 = Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse)
+    val instances = Seq(instance1, instance2)
+
+    val blocks = InstanceBlock
+      .blokifyWithMaxMemUsage(Iterator.apply(instance1, instance2), 128).toArray
+    require(blocks.length == 1)
+    val block = blocks.head
+    assert(block.size === 2)
+    assert(block.numFeatures === 2)
+    block.instanceIterator.zipWithIndex.foreach {
+      case (instance, i) =>
+        assert(instance.label === instances(i).label)
+        assert(instance.weight === instances(i).weight)
+        assert(instance.features.toArray === instances(i).features.toArray)
+    }
+    Seq(0, 1).foreach { i =>
+      val nzIter = block.getNonZeroIter(i)
+      val vec = Vectors.sparse(2, nzIter.toSeq)
+      assert(vec.toArray === instances(i).features.toArray)
+    }
+
+    // instances larger than maxMemUsage
+    val bigInstance = Instance(-1.0, 2.0, Vectors.dense(Array.fill(10000)(1.0)))
+    InstanceBlock.blokifyWithMaxMemUsage(Iterator.fill(10)(bigInstance), 64).size

Review comment:
       Verify that each block contains exactly 1 row.






[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725884545


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35575/
   




[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r506032899



##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala
##########
@@ -199,14 +193,11 @@ class LinearSVC @Since("2.2.0") (
     instr.logNamedValue("lowestLabelWeight", labelSummarizer.histogram.min.toString)
     instr.logNamedValue("highestLabelWeight", labelSummarizer.histogram.max.toString)
     instr.logSumOfWeights(summarizer.weightSum)
-    if ($(blockSize) > 1) {
-      val scale = 1.0 / summarizer.count / numFeatures
-      val sparsity = 1 - summarizer.numNonzeros.toArray.map(_ * scale).sum
-      instr.logNamedValue("sparsity", sparsity.toString)
-      if (sparsity > 0.5) {
-        instr.logWarning(s"sparsity of input dataset is $sparsity, " +
-          s"which may hurt performance in high-level BLAS.")
-      }
+    if (actualBlockSizeInMB == 0) {
+      val avgNNZ = summarizer.numNonzeros.activeIterator.map(_._2 / summarizer.count).sum

Review comment:
       Yes, one more metric, `numNonZeros`, will be computed.
   Since it still needs only one pass, the additional time should not be significant.
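
   A minimal single-machine sketch of this step, assuming instances are given as plain arrays of doubles: it counts nonzeros and rows in one pass, derives `avgNNZ`, and applies the density rule and suggested values from this revision of `inferBlockSizeInMB` (`dim <= avgNNZ * 3`, 0.25 versus 64.0). The object and helper names are hypothetical; the actual code reads these statistics from the summarizer.

   ```scala
   object AvgNnzSketch {
     // Average number of nonzeros per row, computed in a single pass over the rows.
     def avgNNZ(rows: Iterator[Array[Double]]): Double = {
       var count = 0L
       var nnz = 0L
       rows.foreach { row =>
         count += 1L
         nnz += row.count(_ != 0.0)
       }
       require(count > 0, "empty dataset")
       nnz.toDouble / count
     }

     // Density rule and suggested values as they appear in this revision of the PR.
     def inferBlockSizeInMB(dim: Int, avgNNZ: Double): Double =
       if (dim <= avgNNZ * 3) 0.25 else 64.0
   }
   ```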






[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712112035








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724440823








[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725806330


   **[Test build #130960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130960/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725920282


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35582/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707617758


   **[Test build #129740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129740/testReport)** for PR 30009 at commit [`08cf27d`](https://github.com/apache/spark/commit/08cf27d5df69c6d5a5a8d448fb3a3043a6c474a0).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708650667


   **[Test build #129756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129756/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709937125


   **[Test build #129886 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129886/testReport)** for PR 30009 at commit [`c0a734d`](https://github.com/apache/spark/commit/c0a734de5e4d4df819caa4f86634242966d5786b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720851948








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725815422








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708893156








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725874235


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35575/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724526178








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725795212


   **[Test build #130958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130958/testReport)** for PR 30009 at commit [`a82e5f5`](https://github.com/apache/spark/commit/a82e5f51722309afdc8b746d032f631b9421d64f).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712112035








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707662881








[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716253408


   **[Test build #130251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130251/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716268740








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724436653


   **[Test build #130822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130822/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708884714


   **[Test build #129786 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129786/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724541792








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707478699


   **[Test build #129724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129724/testReport)** for PR 30009 at commit [`9cd1053`](https://github.com/apache/spark/commit/9cd10535b962b18421a6a21a79564fe6e7fae157).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708888321


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34393/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725816530


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35564/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725872196








[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r506167739



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,62 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    new Iterator[InstanceBlock] {
+      private var numCols = -1L
+      private val buff = mutable.ArrayBuilder.make[Instance]
+
+      override def hasNext: Boolean = iterator.hasNext
+
+      override def next(): InstanceBlock = {
+        buff.clear()
+        var buffCnt = 0L
+        var buffNnz = 0L
+        var buffUnitWeight = true
+        var blockMemUsage = 0L
+
+        while (iterator.hasNext && blockMemUsage < maxMemUsage) {
+          val instance = iterator.next()
+          if (numCols < 0L) numCols = instance.features.size
+          require(numCols == instance.features.size)
+          val nnz = instance.features.numNonzeros
+
+          buff += instance
+          buffCnt += 1L
+          buffNnz += nnz
+          buffUnitWeight &&= (instance.weight == 1)
+          blockMemUsage = getBlockMemUsage(numCols, buffCnt, buffNnz, buffUnitWeight)
+        }
+
+        // the block mem usage may slightly exceed threshold, not a big issue.
+        // and this ensure even if one row exceed block limit, each block has one row
+        InstanceBlock.fromInstances(buff.result())
+      }
+    }
+  }
+
+  def blokifyWithMaxMemUsage(
+      instances: RDD[Instance],
+      maxMemUsage: Long): RDD[InstanceBlock] = {
+    require(maxMemUsage > 0)
+    instances.mapPartitions(iter => blokifyWithMaxMemUsage(iter, maxMemUsage))
+  }
+
+  def inferBlockSizeInMB(
+      dim: Int,
+      avgNNZ: Double,
+      blasLevel: Int = 2): Double = {
+    if (dim <= avgNNZ * 3) {
+      0.25
+    } else {
+      64.0
+    }

Review comment:
       The current strategy is quite simple. I think we may use a more complex cost model in the future if necessary.
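
The approach in the diff above can be illustrated with a small, self-contained sketch that does not depend on Spark. Everything here is an assumption made for illustration: `Row` stands in for Spark's `Instance`, and `estimateBytes` is a hypothetical cost model (12 bytes per non-zero plus 8 bytes per row), since the real `getBlockMemUsage` is not shown in this hunk. Only the 3x threshold and the 0.25 / 64.0 values in `inferBlockSizeInMB` are taken from the diff.

```scala
// Simplified stand-in for Spark's Instance: only the non-zero count matters here.
final case class Row(nnz: Int)

object BlockifySketch {
  // Hypothetical cost model (NOT the real getBlockMemUsage):
  // 12 bytes per non-zero (8-byte value + 4-byte index) plus 8 bytes per row.
  def estimateBytes(rows: Long, nnz: Long): Long = nnz * 12L + rows * 8L

  // Greedily stack rows into blocks until the estimated size reaches maxBytes.
  def blockify(rows: Iterator[Row], maxBytes: Long): Iterator[Vector[Row]] = {
    require(maxBytes > 0, "maxBytes must be positive")
    new Iterator[Vector[Row]] {
      override def hasNext: Boolean = rows.hasNext

      override def next(): Vector[Row] = {
        if (!rows.hasNext) throw new NoSuchElementException("no more rows")
        val buf = Vector.newBuilder[Row]
        var cnt = 0L
        var nnz = 0L
        var used = 0L
        // The size is checked before each append, so a block always holds at
        // least one row and may end slightly above the limit.
        while (rows.hasNext && used < maxBytes) {
          val r = rows.next()
          buf += r
          cnt += 1L
          nnz += r.nnz
          used = estimateBytes(cnt, nnz)
        }
        buf.result()
      }
    }
  }

  // Same thresholds as in the diff: a small block when the dimension is at
  // most three times the average nnz, a large block otherwise.
  def inferBlockSizeInMB(dim: Int, avgNNZ: Double): Double =
    if (dim <= avgNNZ * 3) 0.25 else 64.0
}
```

Under this assumed cost model, each row with 10 non-zeros costs 128 bytes, so five such rows with a 256-byte limit yield blocks of 2, 2 and 1 rows; a single row that is larger than the limit still forms its own block.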






[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724437534








[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720851384


   retest this please




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706834138


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34257/
   




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707666392








[GitHub] [spark] WeichenXu123 commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725305176


   @zhengruifeng 
   
   * Please merge my update PR (fixes a Scala 2.13 issue): https://github.com/apache/spark/pull/30327
   * Please rename the param to `maxBlockSizeInMB`




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725816547








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708689222


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34362/
   




[GitHub] [spark] WeichenXu123 commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-726014822


   Merged to master. Thanks!




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720851948








[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725899630


   retest this please




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725911880


   **[Test build #130977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130977/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `trait HasMaxBlockSizeInMB extends Params `
     * `class HasMaxBlockSizeInMB(Params):`




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725831725








[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724482556


   retest this please




[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r503159247



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -100,6 +102,23 @@ private[spark] case class InstanceBlock(
 
 private[spark] object InstanceBlock {
 
+  private def getBlockSize(
+      numCols: Long,
+      numRows: Long,
+      nnz: Long,
+      allUnitWeight: Boolean): Long = {
+    val doubleBytes = java.lang.Double.BYTES
+    val arrayHeader = 12L
+    val denseSize = Matrices.getDenseSize(numCols, numRows)
+    val sparseSize = Matrices.getSparseSize(nnz, numRows + 1)
+    val matrixSize = math.min(denseSize, sparseSize)
+    if (allUnitWeight) {
+      matrixSize + doubleBytes * numRows + arrayHeader * 2

Review comment:
       Should this be `+ arrayHeader` (1x) rather than `+ arrayHeader * 2`?

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,65 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemoryUsage(
+      iterator: Iterator[Instance],
+      maxMemoryUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemoryUsage > 0)
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var numCols = -1L
+    var count = 0L
+    var nnz = 0L

Review comment:
       Rename `nnz` to `buffNnz`.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,65 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemoryUsage(
+      iterator: Iterator[Instance],
+      maxMemoryUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemoryUsage > 0)
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var numCols = -1L
+    var count = 0L
+    var nnz = 0L
+    var allUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val n = instance.features.numNonzeros
+      var block = Option.empty[InstanceBlock]
+      // Check if enough memory remains to add this instance to the block.
+      if (getBlockSize(numCols, count + 1L, nnz + n,
+        allUnitWeight && (instance.weight == 1)) > maxMemoryUsage) {
+        // Check if this instance is too large
+        require(count > 0, s"instance $instance exceeds memory limit $maxMemoryUsage, " +
+          s"please increase block size")
+
+        block = Some(InstanceBlock.fromInstances(buff.result()))
+        buff.clear()
+        count = 0L
+        nnz = 0L
+        allUnitWeight = true
+      }
+      buff += instance
+      count += 1L
+      nnz += n
+      allUnitWeight &&= (instance.weight == 1)
+      block.iterator
+    } ++ {
+      val instances = buff.result()
+      if (instances.nonEmpty) {
+        Iterator.single(InstanceBlock.fromInstances(instances))
+      } else Iterator.empty
+    }
+  }

Review comment:
       This iterator logic would be better written as a `for` loop with `yield`; it would be clearer to read.
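
One possible reading of this suggestion, as a minimal Spark-free sketch: greedy blocking under a budget written as an explicit `Iterator`, so that no trailing flush step is needed. `Item`, `cost`, and `groupByBudget` are hypothetical names used only for illustration. The sketch also assumes each item has a fixed, additive cost, whereas the code under review estimates block memory from `numCols`, the row count, and nnz, which is not simply additive.

```scala
// Hypothetical stand-in for Instance: only the per-item cost matters here.
final case class Item(id: Int, cost: Long)

// Greedily packs consecutive items into groups whose total cost stays within `budget`.
def groupByBudget(items: Iterator[Item], budget: Long): Iterator[Vector[Item]] = {
  require(budget > 0, "budget must be positive")
  val buffered = items.buffered
  new Iterator[Vector[Item]] {
    def hasNext: Boolean = buffered.hasNext
    def next(): Vector[Item] = {
      val group = Vector.newBuilder[Item]
      var used = 0L
      var count = 0
      // Keep taking items while the next one still fits in the remaining budget.
      while (buffered.hasNext && used + buffered.head.cost <= budget) {
        val item = buffered.next()
        group += item
        used += item.cost
        count += 1
      }
      // If nothing was taken, the next item alone exceeds the budget.
      require(count > 0, s"item ${buffered.head} exceeds budget $budget")
      group.result()
    }
  }
}
```

Each call to `next()` yields one complete group, and the last partial group is returned by the same code path as every other group. An item that cannot fit on its own fails the `require`, mirroring the "please increase block size" check in the diff.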

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,65 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemoryUsage(
+      iterator: Iterator[Instance],
+      maxMemoryUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemoryUsage > 0)
+    val buff = mutable.ArrayBuilder.make[Instance]
+    var numCols = -1L
+    var count = 0L
+    var nnz = 0L
+    var allUnitWeight = true
+
+    iterator.flatMap { instance =>
+      if (numCols < 0L) numCols = instance.features.size
+      require(numCols == instance.features.size)
+      val n = instance.features.numNonzeros

Review comment:
       Rename `n` to `nnz`.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -100,6 +102,23 @@ private[spark] case class InstanceBlock(
 
 private[spark] object InstanceBlock {
 
+  private def getBlockSize(

Review comment:
       To be semantically accurate, rename this to `getBlockMemUsage`.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724437536


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130822/




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708862751


   **[Test build #129786 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129786/testReport)** for PR 30009 at commit [`df02e98`](https://github.com/apache/spark/commit/df02e98b1dca2e2f1b5da2fa30ba62d62a52dd47).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712101113


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34608/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724440815


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35431/
   




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712053130


   **[Test build #130001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130001/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-720876636








[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716252855


   retest this please




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725832950


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35566/




[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-722762550


   @mengxr Thanks for reviewing!

   > Does your benchmark code count pre-processing time?

   Yes, pre-processing time is included.

   > Could you paste your benchmark code and environment specs?

   Dataset: [Epsilon](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2)
   numInstances: 100,000; numFeatures: 2,000

   Env: Ubuntu 18.04
   Cmd: `bin/spark-shell --driver-memory=64G --conf spark.driver.maxResultSize=10g`

   Code:
   ```
   import scala.util.Random

   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.classification._
   import org.apache.spark.ml.regression._
   import org.apache.spark.sql.functions._
   import org.apache.spark.storage.StorageLevel

   // Load Epsilon, derive the label columns, and cache the data before timing.
   val df = spark.read.option("numFeatures", "2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t").withColumn("aftcensor", (col("label") + 1) / 2).withColumn("aftlabel", (col("label") + 2) / 2).withColumn("label", (col("label") + 1) / 2)
   df.persist(StorageLevel.MEMORY_AND_DISK)
   df.count

   // Scatter each vector's values over 2000 seeded random positions in a
   // `dim`-sized sparse vector, so a larger `dim` means sparser input.
   def getSparseUDF(dim: Int) = {
     val rng = new Random(123)
     val newIndices = rng.shuffle(Seq.range(0, dim)).take(2000).toArray.sorted
     udf { vec: Vector =>
       Vectors.sparse(dim, newIndices, vec.toArray).compressed
     }
   }

   // Untimed fit before the timed runs.
   new LinearSVC().setMaxIter(20).fit(df)

   val svc = new LinearSVC().setMaxIter(100).setTol(0)

   // Time one fit per (dim, blockSizeInMB) pair, sleeping 60 s between runs.
   for (dim <- Seq(2000, 3000, 4000, 5000, 10000, 20000, 200000); size <- Seq(0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0)) {
     Thread.sleep(60000)
     val ds = if (dim == 2000) { df } else { val sparseUDF = getSparseUDF(dim); df.withColumn("features", sparseUDF(col("features"))) }
     val start = System.currentTimeMillis
     val model = svc.setBlockSizeInMB(size).fit(ds)
     val end = System.currentTimeMillis
     println((model.uid, dim, size, end - start, model.coefficients.toString.take(100)))
   }

   // for branch-3.0 (block size is not set; -1 is printed in its place)
   for (dim <- Seq(2000, 3000, 4000, 5000, 10000, 20000, 200000)) {
     Thread.sleep(60000)
     val ds = if (dim == 2000) { df } else { val sparseUDF = getSparseUDF(dim); df.withColumn("features", sparseUDF(col("features"))) }
     val start = System.currentTimeMillis
     val model = svc.fit(ds)
     val end = System.currentTimeMillis
     println((model.uid, dim, -1, end - start, model.coefficients.toString.take(100)))
   }
   ```
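
For readers without a Spark shell at hand, the way the benchmark varies sparsity can be sketched in plain Scala. This is only an illustration under the assumption that every input vector carries 2,000 stored values; `scatterIndices` and `density` are hypothetical helper names, not part of the benchmark.

```scala
import scala.util.Random

// Picks `numValues` distinct, sorted positions in [0, dim) using a fixed seed,
// in the same way the benchmark's UDF chooses where the original values go.
def scatterIndices(dim: Int, numValues: Int, seed: Long): Array[Int] = {
  require(numValues <= dim, "cannot place more values than there are positions")
  new Random(seed).shuffle(Seq.range(0, dim)).take(numValues).toArray.sorted
}

// Fraction of positions that hold a stored value after scattering.
def density(dim: Int, numValues: Int): Double = numValues.toDouble / dim
```

Under that assumption, `dim = 2000` keeps the data fully dense, while `dim = 200000` leaves roughly 1% of the positions occupied.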
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716268731


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34851/
   




[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725913955


   retest this please




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725976554


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35587/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716263315


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34851/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725952328


   **[Test build #130981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130981/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `trait HasMaxBlockSizeInMB extends Params `
     * `class HasMaxBlockSizeInMB(Params):`




[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725847488


   retest this please




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707658672








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724440823


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707654594


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34346/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707666376


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34348/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709943526








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707475324








[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724409079


   **[Test build #130822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130822/testReport)** for PR 30009 at commit [`150f7da`](https://github.com/apache/spark/commit/150f7da6c0165c7e5b70b052055b4c7eab10e2f7).




[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r503191442



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -100,6 +102,23 @@ private[spark] case class InstanceBlock(
 
 private[spark] object InstanceBlock {
 
+  private def getBlockSize(
+      numCols: Long,
+      numRows: Long,
+      nnz: Long,
+      allUnitWeight: Boolean): Long = {
+    val doubleBytes = java.lang.Double.BYTES
+    val arrayHeader = 12L
+    val denseSize = Matrices.getDenseSize(numCols, numRows)
+    val sparseSize = Matrices.getSparseSize(nnz, numRows + 1)
+    val matrixSize = math.min(denseSize, sparseSize)
+    if (allUnitWeight) {
+      matrixSize + doubleBytes * numRows + arrayHeader * 2

Review comment:
       There are still two arrays: the weight array is `Array.emptyDoubleArray`, so there are still two array headers, right?
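
To make the accounting in this exchange concrete, here is a rough, self-contained sketch. The formulas in `denseBytes` and `sparseBytes` are assumptions for illustration and are not taken from `Matrices.getDenseSize` or `Matrices.getSparseSize`. The non-unit-weight branch is likewise an assumption, since the quoted diff only shows the `allUnitWeight` case. Only the 12-byte `arrayHeader` and the unit-weight total come from the diff.

```scala
val doubleBytes = 8L
val intBytes = 4L
val arrayHeader = 12L // array header size used in the diff

// Assumed dense layout: a single double array with numRows * numCols values.
def denseBytes(numRows: Long, numCols: Long): Long =
  doubleBytes * numRows * numCols + arrayHeader

// Assumed compressed sparse layout: nnz values, nnz indices, numRows + 1 pointers.
def sparseBytes(numRows: Long, nnz: Long): Long =
  doubleBytes * nnz + intBytes * nnz + intBytes * (numRows + 1) + arrayHeader * 3

def blockBytes(numRows: Long, numCols: Long, nnz: Long, allUnitWeight: Boolean): Long = {
  // The block is assumed to use whichever matrix layout is smaller.
  val matrix = math.min(denseBytes(numRows, numCols), sparseBytes(numRows, nnz))
  // One label per row, stored in its own array.
  val labels = doubleBytes * numRows + arrayHeader
  // An empty weight array holds no elements but still has an array header.
  val weights =
    if (allUnitWeight) arrayHeader
    else doubleBytes * numRows + arrayHeader // assumed: one weight per row
  matrix + labels + weights
}
```

With unit weights the total reduces to `matrix + doubleBytes * numRows + arrayHeader * 2`, which matches the expression in the diff: one header for the label array and one for the empty weight array.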






[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707662416


   **[Test build #129742 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129742/testReport)** for PR 30009 at commit [`9245263`](https://github.com/apache/spark/commit/92452631c11e71b71cb99a7b61d8b8661b150c17).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] zhengruifeng commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708245249


   retest this please




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725872196








[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725902842


   **[Test build #130977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130977/testReport)** for PR 30009 at commit [`a69ca83`](https://github.com/apache/spark/commit/a69ca83c393f63a1fed13393ee5b3e04cffa384f).




[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725831725








[GitHub] [spark] zero323 commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706917827


   
   > @zero323 I sent a new PR here, thanks for reviewing. I tried to verify the consistency of the annotations locally, but the following command failed:
   > 
   > ```
   > mypy --no-incremental --config python/mypy.ini python/pyspark
   > python/pyspark/ml/linalg/__init__.pyi:25: error: misplaced type annotation
   > ```
   > 
   > I installed `mypy` with `sudo apt install mypy` on Ubuntu 18.04.
   > I am not very familiar with `mypy`; do I need to configure it somewhere?
   
   No additional configuration should be required, but the version packaged for Ubuntu is pretty old, and at first glance it doesn't support error codes (the `[import]` part).
   
   Personally I'd recommend either [venv](https://docs.python.org/3/library/venv.html) or miniconda, but if you want a quick fix, installing pip and doing a user install should do the trick:
   
   ```
   sudo apt purge mypy
   sudo apt install python3-pip
   pip install mypy
   ```
   
   I've checked things on my side (mypy 0.790, current stable), for both master and this PR, and things look good.
   
   
   
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-712119800








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707478934








[GitHub] [spark] WeichenXu123 commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r503848595



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala
##########
@@ -114,6 +133,85 @@ private[spark] object InstanceBlock {
   def blokify(instances: RDD[Instance], blockSize: Int): RDD[InstanceBlock] = {
     instances.mapPartitions(_.grouped(blockSize).map(InstanceBlock.fromInstances))
   }
+
+  def blokifyWithMaxMemUsage(
+      iterator: Iterator[Instance],
+      maxMemUsage: Long): Iterator[InstanceBlock] = {
+    require(maxMemUsage > 0)
+
+    new Iterator[InstanceBlock] {
+      private var numCols = -1L
+      private val buff = mutable.ArrayBuilder.make[Instance]
+      private var buffCnt = 0L
+      private var buffNnz = 0L
+      private var buffUnitWeight = true
+      private var block = Option.empty[InstanceBlock]
+
+      private def flush(): Unit = {
+        block = Some(InstanceBlock.fromInstances(buff.result()))
+        buff.clear()
+        buffCnt = 0L
+        buffNnz = 0L
+        buffUnitWeight = true
+      }
+
+      private def blockify(): Unit = {
+        block = None
+
+        while (block.isEmpty && iterator.hasNext) {
+          val instance = iterator.next()
+          if (numCols < 0L) numCols = instance.features.size
+          require(numCols == instance.features.size)
+          val nnz = instance.features.numNonzeros
+
+          // Check if enough memory remains to add this instance to the block.
+          if (getBlockMemUsage(numCols, buffCnt + 1L, buffNnz + nnz,
+            buffUnitWeight && (instance.weight == 1)) > maxMemUsage) {
+            // Check if this instance is too large
+            require(buffCnt > 0, s"instance $instance exceeds memory limit $maxMemUsage, " +
+              s"please increase block size")
+            flush()
+          }
+
+          buff += instance
+          buffCnt += 1L
+          buffNnz += nnz
+          buffUnitWeight &&= (instance.weight == 1)

Review comment:
       After `flush()`, `buffCnt`/`buffNnz` are reset to 0, but then you increment them for the current instance and exit the loop. So for the next batch, the initial `buffCnt`/`buffNnz` won't be 0.
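
One reading of the loop above is that the instance which triggers a flush is deliberately carried over as the first element of the next block, which is why the counters are non-zero when the next batch starts. The following is a minimal, self-contained sketch of that reading, not the PR's code: `groupByBudget`, `cost`, and `budget` are hypothetical names, and a plain per-item cost stands in for `getBlockMemUsage`.

```scala
import scala.collection.mutable.ArrayBuffer

object BudgetGrouping {
  // Groups items into blocks, flushing the buffer when the next item
  // would push its summed cost over `budget`.
  def groupByBudget[A](items: Iterator[A], budget: Long)(cost: A => Long): Iterator[Vector[A]] = {
    require(budget > 0, "budget must be positive")

    new Iterator[Vector[A]] {
      private val buff = ArrayBuffer.empty[A]
      private var buffCost = 0L

      // A carried-over item in the buffer still counts as pending output.
      override def hasNext: Boolean = buff.nonEmpty || items.hasNext

      override def next(): Vector[A] = {
        if (!hasNext) throw new NoSuchElementException("no more blocks")
        var block = Option.empty[Vector[A]]

        while (block.isEmpty && items.hasNext) {
          val item = items.next()
          val c = cost(item)
          if (buffCost + c > budget) {
            // Fails only if the buffer is empty when an item does not fit.
            require(buff.nonEmpty, s"item $item exceeds budget $budget")
            block = Some(buff.toVector)
            buff.clear()
            buffCost = 0L
          }
          // Also runs right after a flush: the current item seeds the next
          // block, so the counters are non-zero when the loop exits.
          buff += item
          buffCost += c
        }

        // Input exhausted without a flush: emit whatever is buffered.
        block.getOrElse {
          val last = buff.toVector
          buff.clear()
          buffCost = 0L
          last
        }
      }
    }
  }
}
```

In this sketch no item is dropped or duplicated across blocks, because the carried-over item is accounted for in both the buffer and its running cost before the next call to `next()`.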






[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708893138


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34393/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709935090


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34492/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725937860


   Merged build finished. Test FAILed.




[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r518482624



##########
File path: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala
##########
@@ -562,4 +562,22 @@ trait HasBlockSize extends Params {
   /** @group expertGetParam */
   final def getBlockSize: Int = $(blockSize)
 }
+
+/**
+ * Trait for shared param blockSizeInMB (default: 0.0). This trait may be changed or
+ * removed between minor versions.
+ */
+trait HasBlockSizeInMB extends Params {
+
+  /**
+   * Param for Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be &gt;= 0..
+   * @group expertParam
+   */
+  final val blockSizeInMB: DoubleParam = new DoubleParam(this, "blockSizeInMB", "Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0.", ParamValidators.gtEq(0.0))

Review comment:
       Or `maxBlockSizeInMB`, to keep in line with the existing [`maxMemoryInMB`](https://github.com/apache/spark/blob/bc7885901dd99de21ecbf269d72fa37a393b2ffc/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L121) in `treeParams.scala`?






[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709937564








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706846174


   **[Test build #129653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129653/testReport)** for PR 30009 at commit [`eb0cf6b`](https://github.com/apache/spark/commit/eb0cf6b21c913545f89b7ff8cb3ba6fa65a51556).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `trait HasBlockSizeInMB extends Params `
     * `class HasBlockSizeInMB(Params):`




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725884560


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35575/
   Test FAILed.




[GitHub] [spark] zhengruifeng commented on a change in pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #30009:
URL: https://github.com/apache/spark/pull/30009#discussion_r518626371



##########
File path: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala
##########
@@ -562,4 +562,22 @@ trait HasBlockSize extends Params {
   /** @group expertGetParam */
   final def getBlockSize: Int = $(blockSize)
 }
+
+/**
+ * Trait for shared param blockSizeInMB (default: 0.0). This trait may be changed or
+ * removed between minor versions.
+ */
+trait HasBlockSizeInMB extends Params {
+
+  /**
+   * Param for Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be &gt;= 0..
+   * @group expertParam
+   */
+  final val blockSizeInMB: DoubleParam = new DoubleParam(this, "blockSizeInMB", "Maximum memory in MB for stacking input data in blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. If 0, try to infer an appropriate value based on the statistics of dataset. Must be >= 0.", ParamValidators.gtEq(0.0))

Review comment:
       In the current PR, a block can exceed this size. I guess `maxBlockSize...` may suggest that a block must not be larger than this value.






[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707654611








[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-708893156








[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724541763


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35451/
   




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707460083


   **[Test build #129724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129724/testReport)** for PR 30009 at commit [`9cd1053`](https://github.com/apache/spark/commit/9cd10535b962b18421a6a21a79564fe6e7fae157).




[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706823314


   **[Test build #129653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129653/testReport)** for PR 30009 at commit [`eb0cf6b`](https://github.com/apache/spark/commit/eb0cf6b21c913545f89b7ff8cb3ba6fa65a51556).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-709937564








[GitHub] [spark] AmplabJenkins commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-706846483








[GitHub] [spark] SparkQA removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-707617758


   **[Test build #129740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129740/testReport)** for PR 30009 at commit [`08cf27d`](https://github.com/apache/spark/commit/08cf27d5df69c6d5a5a8d448fb3a3043a6c474a0).




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-716253408


   **[Test build #130251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130251/testReport)** for PR 30009 at commit [`fc1bc87`](https://github.com/apache/spark/commit/fc1bc87faf48d84c968fb8ca54309ad9bb35fd78).




[GitHub] [spark] WeichenXu123 closed pull request #30009: [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
WeichenXu123 closed pull request #30009:
URL: https://github.com/apache/spark/pull/30009


   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-724440829


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35431/
   Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #30009: [SPARK-32907][ML] adaptively blockify instances - LinearSVC

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30009:
URL: https://github.com/apache/spark/pull/30009#issuecomment-725815167


   **[Test build #130958 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130958/testReport)** for PR 30009 at commit [`a82e5f5`](https://github.com/apache/spark/commit/a82e5f51722309afdc8b746d032f631b9421d64f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

