Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/01/08 09:20:22 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #31090: [SPARK-34047][ML] save tree model in single partition

zhengruifeng opened a new pull request #31090:
URL: https://github.com/apache/spark/pull/31090


   ### What changes were proposed in this pull request?
   Save a tree model in a single partition, like the other implementations.
   
   
   ### Why are the changes needed?
   The current model saving may generate too many small files.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing test suites.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761962052


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38757/
   




[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759178976


   @srowen Reasonable.
   I just created a `RandomForestClassificationModel` with numTrees=100 and depth=20, and found that the model size is 226M. So I think for RF and GBT, we should keep the current behavior.
   But for a DecisionTree, whose size is definitely small enough (about 2.3M in the above RF model), I think it is safe to use a single partition.




[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-764190859


   Merged to master, thanks @srowen for reviewing!




[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763530382


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134265/
   




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762590209


   **[Test build #134209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134209/testReport)** for PR 31090 at commit [`8d5b076`](https://github.com/apache/spark/commit/8d5b0765a6a6c7911a72852581ce41a71b929ce2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759179996


   > Do we do this for other models?
   
   Yes, most classification and regression models are saved in a single partition.




[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761978689








[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756647946


   **[Test build #133831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133831/testReport)** for PR 31090 at commit [`ef555de`](https://github.com/apache/spark/commit/ef555de743fa6cfffc0d7e757b15f7d122064002).




[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-757590339


   ping @srowen 




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763521998


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38851/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763493557


   **[Test build #134265 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134265/testReport)** for PR 31090 at commit [`08733c8`](https://github.com/apache/spark/commit/08733c80d42b23b16e71b3c10de6bfaf4de73f35).




[GitHub] [spark] zhengruifeng commented on a change in pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31090:
URL: https://github.com/apache/spark/pull/31090#discussion_r559870512



##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
##########
@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
       DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
       val (nodeData, _) = NodeData.build(instance.rootNode, 0)
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
+      // 2,000,000 nodes is about 40MB
+      val numDataParts = (instance.numNodes / 2000000.0).ceil.toInt

Review comment:
       ok, I will increase this
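
   The code comment in the hunk above equates 2,000,000 nodes with about 40MB, which works out to roughly 20 bytes per saved node. A minimal sketch of how a larger threshold could be derived from a target file size. The object name, the helper, and the 128MB target in the usage note are assumptions for illustration, not values taken from this PR:

   ```scala
   // Hypothetical sketch, not Spark code: derive a nodes-per-partition
   // threshold from a target output file size.
   object ThresholdSketch {
     // From the diff comment: 2,000,000 nodes is about 40MB
     // (taken here as 40 * 10^6 bytes, which is an assumption).
     val bytesPerNode: Double = 40e6 / 2000000.0

     // Number of nodes that fit in one file of the given target size.
     def nodesPerPartition(targetBytes: Long): Long = {
       require(targetBytes > 0, "targetBytes must be positive")
       (targetBytes / bytesPerNode).toLong
     }
   }
   ```

   Under these assumptions, `ThresholdSketch.nodesPerPartition(128000000L)` gives 6,400,000 nodes per partition, while a 40MB target gives back the 2,000,000 used in the hunk.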






[GitHub] [spark] srowen commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760247175


   Hm, the description says this is all to make GBT/DT consistent with other impls that save in 1 partition? That's a fine reason to make this change. I'm saying that seems like fine logic. Basing it on node count also seems healthy if you want to change all implementations of tree models to work that way.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761978689








[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762589048








[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756669909


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38420/
   




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761966300


   **[Test build #134173 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134173/testReport)** for PR 31090 at commit [`c4a77bc`](https://github.com/apache/spark/commit/c4a77bc3bae41343163208538c56d4e976352490).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756647946


   **[Test build #133831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133831/testReport)** for PR 31090 at commit [`ef555de`](https://github.com/apache/spark/commit/ef555de743fa6cfffc0d7e757b15f7d122064002).




[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756700562








[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760034587


   @srowen For RF&GBT, maybe we can use `repartition((numTrees/10.0).ceil.toInt)` ?
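
   A minimal sketch of what that formula yields, written as a standalone function so it can be checked without a Spark session. The object and method names are hypothetical; only the expression itself comes from the comment above:

   ```scala
   // Hypothetical sketch, not Spark code: partition count based on tree count.
   object TreeCountSketch {
     // Expression quoted in the comment: one partition per 10 trees, rounded up.
     def partitionsByTrees(numTrees: Int): Int = (numTrees / 10.0).ceil.toInt
   }
   ```

   For the 100-tree forest measured earlier in this thread (226M on disk), this would give 10 partitions. Because the count depends only on the number of trees, it does not account for how deep, and therefore how large, each tree is.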




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759203614


   **[Test build #133992 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133992/testReport)** for PR 31090 at commit [`9f6dffa`](https://github.com/apache/spark/commit/9f6dffa67ca6a217161c8cc95935de8ed4e9d746).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763557496


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38851/
   




[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760122648


   I just created another RF model with 10 trees and 2,789,824 nodes in total:
   ```
   scala> rfcm.trees.length
   res3: Int = 10
   
   scala> rfcm.trees.map(_.numNodes).sum
   res4: Int = 2789824
   
   scala> rfcm.save("/tmp/rfcm")
   ```
   
   Saved to disk, its size is 49M.
   ```
   du -sh /tmp/rfcm 
   49M	/tmp/rfcm
   ```
   
   Since the model size is proportional to the number of nodes, what about determining the number of partitions with a formula like `numNodes / 1,000,000`?
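
   A minimal sketch of that proposal as a standalone function. The object name, the method name, and the floor of one partition are assumptions added for illustration; only the `numNodes / 1,000,000` ratio comes from the comment:

   ```scala
   // Hypothetical sketch, not Spark code: partition count based on node count.
   object PartitionSketch {
     // Rounds up so that no partition holds more than nodesPerPartition nodes,
     // and never returns fewer than one partition (an assumption, so that
     // small models still produce a single output file).
     def numPartitionsFor(numNodes: Long, nodesPerPartition: Long = 1000000L): Int = {
       require(numNodes >= 0, "numNodes must be non-negative")
       require(nodesPerPartition > 0, "nodesPerPartition must be positive")
       math.max(1, math.ceil(numNodes.toDouble / nodesPerPartition).toInt)
     }
   }
   ```

   For the model above, `PartitionSketch.numPartitionsFor(2789824L)` returns 3, so the 49M model would be split into three files, while a small decision tree would be written as a single file.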




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763530382


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134265/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762601528


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134209/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759184590


   **[Test build #133992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133992/testReport)** for PR 31090 at commit [`9f6dffa`](https://github.com/apache/spark/commit/9f6dffa67ca6a217161c8cc95935de8ed4e9d746).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756700562








[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759207765


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38580/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759216351








[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761969597


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38757/
   




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763529928


   **[Test build #134265 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134265/testReport)** for PR 31090 at commit [`08733c8`](https://github.com/apache/spark/commit/08733c80d42b23b16e71b3c10de6bfaf4de73f35).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759197681


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38580/
   




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762582892


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38794/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762572659


   **[Test build #134209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134209/testReport)** for PR 31090 at commit [`8d5b076`](https://github.com/apache/spark/commit/8d5b0765a6a6c7911a72852581ce41a71b929ce2).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756700562








[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763541676


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38851/
   




[GitHub] [spark] srowen commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759183123


   Sounds like a reasonable heuristic to me.




[GitHub] [spark] srowen commented on a change in pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31090:
URL: https://github.com/apache/spark/pull/31090#discussion_r560399388



##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
##########
@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
       DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
       val (nodeData, _) = NodeData.build(instance.rootNode, 0)
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
+      // 7,280,000 nodes is about 128MB
+      val numDataParts = (instance.numNodes / 7280000.0).ceil.toInt

Review comment:
       Is there any easy place to expose a small shared method for this rather than duplicate it in several places?
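
One possible shape for such a shared helper, sketched in plain Scala. The names `TreePartitioning` and `numDataParts` are hypothetical and not taken from the PR; the 7,280,000 threshold is copied from the diff comment above, and clamping the result to at least one partition is an added assumption.

```scala
// Hypothetical shared helper; object and method names are illustrative only.
object TreePartitioning {
  // From the diff comment: roughly 7,280,000 nodes correspond to about 128MB.
  val NodesPerPartition: Long = 7280000L

  /** Number of output partitions for a model with `numNodes` tree nodes. */
  def numDataParts(numNodes: Long, nodesPerPartition: Long = NodesPerPartition): Int = {
    require(numNodes >= 0, "numNodes must be non-negative")
    require(nodesPerPartition > 0, "nodesPerPartition must be positive")
    // Integer ceiling division, clamped so there is always at least one partition
    // (the clamp is an assumption, not something stated in the PR).
    math.max(1L, (numNodes + nodesPerPartition - 1) / nodesPerPartition).toInt
  }
}
```

A writer could pass the result to `repartition` before `write.parquet`; whether the PR does exactly this is not shown in the quoted diff.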






[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756700562








[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763557496


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38851/
   




[GitHub] [spark] srowen commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760121155


   I think that's kind of arbitrary. I suppose, if anything, we should follow suit and save 1 partition per tree, by this logic. I'd simply favor making whatever change improves consistency.




[GitHub] [spark] zhengruifeng closed pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #31090:
URL: https://github.com/apache/spark/pull/31090


   




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756680234


   **[Test build #133831 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133831/testReport)** for PR 31090 at commit [`ef555de`](https://github.com/apache/spark/commit/ef555de743fa6cfffc0d7e757b15f7d122064002).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762572659


   **[Test build #134209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134209/testReport)** for PR 31090 at commit [`8d5b076`](https://github.com/apache/spark/commit/8d5b0765a6a6c7911a72852581ce41a71b929ce2).




[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761942382


   I prefer to determine numParts from numNodes; I will update the description and the PR.




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761948942


   **[Test build #134173 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134173/testReport)** for PR 31090 at commit [`c4a77bc`](https://github.com/apache/spark/commit/c4a77bc3bae41343163208538c56d4e976352490).




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763493557


   **[Test build #134265 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134265/testReport)** for PR 31090 at commit [`08733c8`](https://github.com/apache/spark/commit/08733c80d42b23b16e71b3c10de6bfaf4de73f35).




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759184590


   **[Test build #133992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133992/testReport)** for PR 31090 at commit [`9f6dffa`](https://github.com/apache/spark/commit/9f6dffa67ca6a217161c8cc95935de8ed4e9d746).




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756687306


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38420/
   




[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-764190859


   Merged to master, thanks @srowen for reviewing!




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759216351








[GitHub] [spark] zhengruifeng edited a comment on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760034587


   @srowen For RF and GBT, maybe we can determine the number of partitions from the total number of tree nodes?




[GitHub] [spark] srowen commented on a change in pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31090:
URL: https://github.com/apache/spark/pull/31090#discussion_r559645279



##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
##########
@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
       DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
       val (nodeData, _) = NodeData.build(instance.rootNode, 0)
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
+      // 2,000,000 nodes is about 40MB
+      val numDataParts = (instance.numNodes / 2000000.0).ceil.toInt

Review comment:
       OK - my rule of thumb about partition sizes is "128MB" going back to the days of Hadoop. Any number in that range is about as good as the next, but I might increase this.
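
As a rough illustration of how a node threshold could be rescaled to a different target partition size, here is a minimal sketch in plain Scala. It assumes on-disk size grows linearly with node count, which compressed Parquet output only approximates; `PartitionSizing` and `nodesPerTarget` are hypothetical names, not part of Spark or this PR.

```scala
// Hypothetical helper; extrapolates linearly from a single size measurement.
object PartitionSizing {
  /** Estimated number of nodes that fit in `targetBytes`, given one measured sample. */
  def nodesPerTarget(sampleNodes: Long, sampleBytes: Long, targetBytes: Long): Long = {
    require(sampleNodes > 0 && sampleBytes > 0 && targetBytes > 0, "all inputs must be positive")
    // BigInt avoids overflow in the intermediate product.
    (BigInt(targetBytes) * BigInt(sampleNodes) / BigInt(sampleBytes)).toLong
  }
}

val mb = 1024L * 1024L
// Using the figure from the diff comment: 2,000,000 nodes is about 40MB.
val estimate = PartitionSizing.nodesPerTarget(2000000L, 40L * mb, 128L * mb)
println(estimate) // 6400000
```

Under that linear assumption, a 128MB target corresponds to roughly 6.4 million nodes; a fresh measurement on real models may give a different figure.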






[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760125006


   The current implementation doesn't save one tree per partition. Do you mean changing `sql.createDataFrame(nodeDataRDD).write.parquet(dataPath)` to `sql.createDataFrame(nodeDataRDD).write.partitionBy("treeID").parquet(dataPath)`? @srowen




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762589036


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38794/
   




[GitHub] [spark] zhengruifeng edited a comment on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition

Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759178976


   @srowen reasonable.
   I just created a `RandomForestClassificationModel` with numTrees=100 and depth=20 and found that the model size is 226M, so I think we should keep the current behavior for RF and GBT.
   But for a DecisionTree, whose size is definitely small enough (I also created a decision tree with depth=30; its size is 3.9M), I think it is safe to use a single partition.




[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756647946


   **[Test build #133831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133831/testReport)** for PR 31090 at commit [`ef555de`](https://github.com/apache/spark/commit/ef555de743fa6cfffc0d7e757b15f7d122064002).




[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756647946








[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762589048


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38794/
   






[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761948942


   **[Test build #134173 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134173/testReport)** for PR 31090 at commit [`c4a77bc`](https://github.com/apache/spark/commit/c4a77bc3bae41343163208538c56d4e976352490).

