Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/01/08 09:20:22 UTC
[GitHub] [spark] zhengruifeng opened a new pull request #31090: [SPARK-34047][ML] save tree model in single partition
zhengruifeng opened a new pull request #31090:
URL: https://github.com/apache/spark/pull/31090
### What changes were proposed in this pull request?
Save a tree model in a single partition, like other implementations.
### Why are the changes needed?
current model saving may generate too many small files
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suites.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761962052
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38757/
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759178976
@srowen reasonable.
I just created a `RandomForestClassificationModel` with numTrees=100 and depth=20, and found that the model size is 226M. So I think for RF and GBT, we should keep the current behavior.
But for a DecisionTree, whose size is definitely small enough (about 2.3M per tree in the above RF model), I think it is safe to use a single partition.
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-764190859
Merged to master, thanks @srowen for reviewing!
[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763530382
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134265/
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762590209
**[Test build #134209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134209/testReport)** for PR 31090 at commit [`8d5b076`](https://github.com/apache/spark/commit/8d5b0765a6a6c7911a72852581ce41a71b929ce2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759179996
> Do we do this for other models?
Yes, for most classification and regression models, we save them in a single partition.
[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761978689
[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756647946
**[Test build #133831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133831/testReport)** for PR 31090 at commit [`ef555de`](https://github.com/apache/spark/commit/ef555de743fa6cfffc0d7e757b15f7d122064002).
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-757590339
ping @srowen
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763521998
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38851/
[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763493557
**[Test build #134265 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134265/testReport)** for PR 31090 at commit [`08733c8`](https://github.com/apache/spark/commit/08733c80d42b23b16e71b3c10de6bfaf4de73f35).
[GitHub] [spark] zhengruifeng commented on a change in pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31090:
URL: https://github.com/apache/spark/pull/31090#discussion_r559870512
##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
##########
@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
val (nodeData, _) = NodeData.build(instance.rootNode, 0)
val dataPath = new Path(path, "data").toString
- sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
+ // 2,000,000 nodes is about 40MB
+ val numDataParts = (instance.numNodes / 2000000.0).ceil.toInt
Review comment:
ok, I will increase this
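To see what the added expression in the hunk above yields, here is a minimal standalone sketch in plain Scala (no Spark needed). The object and method names are hypothetical; only the expression `(numNodes / 2000000.0).ceil.toInt` comes from the diff.

```scala
object DiffFormulaSketch {
  // Threshold taken from the diff hunk: about 2,000,000 nodes per partition.
  val NodesPerPart: Double = 2000000.0

  // Same expression as the added line in the hunk, applied to a plain Int.
  def numDataParts(numNodes: Int): Int =
    (numNodes / NodesPerPart).ceil.toInt

  def main(args: Array[String]): Unit = {
    // Print the partition count at a few boundary values.
    Seq(1, 2000000, 2000001, 2789824).foreach { n =>
      println(s"numNodes=$n -> numDataParts=${numDataParts(n)}")
    }
  }
}
```

A count of zero would yield 0 partitions under this expression. That should not arise for a fitted tree, which has at least a root node, but wrapping the result in `math.max(1, ...)` would be a cheap defensive addition (a suggestion here, not part of the diff).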
[GitHub] [spark] srowen commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760247175
Hm, the description says this is all to make GBT/DT consistent with other impls that save in 1 partition? That's a fine reason to make this change. I'm saying that seems like fine logic. Basing it on node count also seems healthy if you want to change all implementations of tree models to work that way.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761978689
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762589048
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756669909
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38420/
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761966300
**[Test build #134173 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134173/testReport)** for PR 31090 at commit [`c4a77bc`](https://github.com/apache/spark/commit/c4a77bc3bae41343163208538c56d4e976352490).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756647946
**[Test build #133831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133831/testReport)** for PR 31090 at commit [`ef555de`](https://github.com/apache/spark/commit/ef555de743fa6cfffc0d7e757b15f7d122064002).
[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756700562
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760034587
@srowen For RF&GBT, maybe we can use `repartition((numTrees/10.0).ceil.toInt)` ?
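A minimal sketch of what that tree-count rule would produce, in plain Scala. The object name, method name, and the `treesPerPart` parameter are hypothetical; only the expression `(numTrees/10.0).ceil.toInt` comes from the comment above.

```scala
object TreeCountPartitioning {
  // One partition per `treesPerPart` trees, rounded up.
  // With the default of 10 this mirrors (numTrees/10.0).ceil.toInt.
  def numParts(numTrees: Int, treesPerPart: Int = 10): Int = {
    require(numTrees > 0, "numTrees must be positive")
    require(treesPerPart > 0, "treesPerPart must be positive")
    (numTrees / treesPerPart.toDouble).ceil.toInt
  }

  def main(args: Array[String]): Unit = {
    // Print the partition count for a few ensemble sizes.
    Seq(1, 10, 11, 100).foreach { n =>
      println(s"numTrees=$n -> partitions=${numParts(n)}")
    }
  }
}
```

Note that a rule based only on tree count ignores tree depth, so two ensembles with the same number of trees but very different node counts would get the same number of partitions.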
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759203614
**[Test build #133992 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133992/testReport)** for PR 31090 at commit [`9f6dffa`](https://github.com/apache/spark/commit/9f6dffa67ca6a217161c8cc95935de8ed4e9d746).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763557496
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38851/
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760122648
I just created another RF model with 10 trees and 2,789,824 nodes in total:
```
scala> rfcm.trees.length
res3: Int = 10
scala> rfcm.trees.map(_.numNodes).sum
res4: Int = 2789824
scala> rfcm.save("/tmp/rfcm")
```
Saved to disk, its size is 49M.
```
du -sh /tmp/rfcm
49M /tmp/rfcm
```
Since the model size is proportional to the number of nodes, what about determining the number of partitions by a formula like `numNodes / 1,000,000`?
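As a rough illustration of that proposal, the sketch below derives an approximate bytes-per-node figure from the single measurement above and applies a node-count rule. It is plain Scala with hypothetical names; rounding up and the minimum of one partition are assumptions, since the formula as written does not say how to round, and "49M" is read as MiB.

```scala
object NodeCountPartitioning {
  // Measurements quoted in the message above.
  val measuredNodes: Long = 2789824L
  val measuredSizeMiB: Double = 49.0 // assumption: `du -sh` "49M" read as MiB

  // Rough on-disk cost per node implied by that one measurement.
  def bytesPerNode: Double = measuredSizeMiB * 1024 * 1024 / measuredNodes

  // One partition per `nodesPerPart` nodes, rounded up (assumption),
  // and never fewer than one partition (assumption).
  def numParts(numNodes: Long, nodesPerPart: Long = 1000000L): Int = {
    require(nodesPerPart > 0, "nodesPerPart must be positive")
    math.max(1, math.ceil(numNodes.toDouble / nodesPerPart).toInt)
  }

  def main(args: Array[String]): Unit = {
    println(f"approx. bytes per node: $bytesPerNode%.1f")
    println(s"partitions for measured model: ${numParts(measuredNodes)}")
  }
}
```

Under these assumptions the measured model would be written as 3 partitions, each on the order of 16M if nodes are spread evenly.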
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763530382
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134265/
[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762601528
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134209/
[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759184590
**[Test build #133992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133992/testReport)** for PR 31090 at commit [`9f6dffa`](https://github.com/apache/spark/commit/9f6dffa67ca6a217161c8cc95935de8ed4e9d746).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756700562
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759207765
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38580/
[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759216351
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761969597
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38757/
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763529928
**[Test build #134265 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134265/testReport)** for PR 31090 at commit [`08733c8`](https://github.com/apache/spark/commit/08733c80d42b23b16e71b3c10de6bfaf4de73f35).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759197681
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38580/
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762582892
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38794/
[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762572659
**[Test build #134209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134209/testReport)** for PR 31090 at commit [`8d5b076`](https://github.com/apache/spark/commit/8d5b0765a6a6c7911a72852581ce41a71b929ce2).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756700562
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763541676
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38851/
[GitHub] [spark] srowen commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759183123
Sounds like a reasonable heuristic to me.
[GitHub] [spark] srowen commented on a change in pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31090:
URL: https://github.com/apache/spark/pull/31090#discussion_r560399388
##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
##########
@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
val (nodeData, _) = NodeData.build(instance.rootNode, 0)
val dataPath = new Path(path, "data").toString
- sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
+ // 7,280,000 nodes is about 128MB
+ val numDataParts = (instance.numNodes / 7280000.0).ceil.toInt
Review comment:
Is there any easy place to expose a small shared method for this rather than duplicate it in several places?
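One way to read this suggestion, as a minimal sketch only: the object and method names below are hypothetical and do not come from the PR. The threshold is the one quoted in the diff comment above (about 7,280,000 nodes per roughly 128MB), and summing node counts across trees for ensembles is an assumption based on the idea raised elsewhere in this thread.

```scala
// Hypothetical shared helper; the object and method names are illustrative
// and are not taken from the PR.
object TreeModelSaveUtils {
  // Threshold from the diff comment: about 7,280,000 nodes per ~128MB.
  val NodesPerPart: Double = 7280000.0

  // Number of partitions for a given node count, mirroring the expression
  // in the diff: (numNodes / 7280000.0).ceil.toInt
  def numDataParts(numNodes: Long): Int =
    (numNodes / NodesPerPart).ceil.toInt

  // Assumption: for an ensemble, size the output by the total node count
  // summed over all trees.
  def numDataPartsForEnsemble(treeNodeCounts: Seq[Int]): Int =
    numDataParts(treeNodeCounts.map(_.toLong).sum)
}
```

Each model writer could then call the helper instead of repeating the constant, which keeps the threshold in one place if it is tuned later.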
[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756700562
[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763557496
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38851/
[GitHub] [spark] srowen commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760121155
I think that's kind of arbitrary. I suppose if anything we should follow suit and save 1 partition per tree, by this logic. I'd simply favor making whatever change improves consistency.
[GitHub] [spark] zhengruifeng closed pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #31090:
URL: https://github.com/apache/spark/pull/31090
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756680234
**[Test build #133831 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133831/testReport)** for PR 31090 at commit [`ef555de`](https://github.com/apache/spark/commit/ef555de743fa6cfffc0d7e757b15f7d122064002).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762572659
**[Test build #134209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134209/testReport)** for PR 31090 at commit [`8d5b076`](https://github.com/apache/spark/commit/8d5b0765a6a6c7911a72852581ce41a71b929ce2).
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761942382
I prefer to determine numParts from numNodes. I will update the description and the PR.
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761948942
**[Test build #134173 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134173/testReport)** for PR 31090 at commit [`c4a77bc`](https://github.com/apache/spark/commit/c4a77bc3bae41343163208538c56d4e976352490).
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-763493557
**[Test build #134265 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134265/testReport)** for PR 31090 at commit [`08733c8`](https://github.com/apache/spark/commit/08733c80d42b23b16e71b3c10de6bfaf4de73f35).
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759184590
**[Test build #133992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133992/testReport)** for PR 31090 at commit [`9f6dffa`](https://github.com/apache/spark/commit/9f6dffa67ca6a217161c8cc95935de8ed4e9d746).
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756687306
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38420/
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-764190859
Merged to master, thanks @srowen for reviewing!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759216351
[GitHub] [spark] zhengruifeng edited a comment on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760034587
@srowen For RF&GBT, maybe we can determine the number of partitions by the total number of tree nodes?
[GitHub] [spark] srowen commented on a change in pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31090:
URL: https://github.com/apache/spark/pull/31090#discussion_r559645279
##########
File path: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
##########
@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
val (nodeData, _) = NodeData.build(instance.rootNode, 0)
val dataPath = new Path(path, "data").toString
- sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
+ // 2,000,000 nodes is about 40MB
+ val numDataParts = (instance.numNodes / 2000000.0).ceil.toInt
Review comment:
OK - my rule of thumb about partition sizes is "128MB" going back to the days of Hadoop. Any number in that range is about as good as the next, but I might increase this.
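The arithmetic behind raising the threshold can be sketched as follows. This is an illustration only: it assumes file size scales linearly with node count and takes the figures in the diff comment above (about 2,000,000 nodes per 40MB) at face value; the object and method names are hypothetical.

```scala
// Illustrative arithmetic only; names are hypothetical and the linear
// scaling between node count and file size is an assumption.
object PartitionSizeEstimate {
  // Figures quoted in the diff comment: roughly 2,000,000 nodes per 40MB.
  val MeasuredNodes: Long = 2000000L
  val MeasuredMB: Long = 40L

  // Nodes that would fit in a partition of the given target size,
  // scaled linearly from the measured figures.
  def nodesForTarget(targetMB: Long): Long =
    MeasuredNodes * targetMB / MeasuredMB
}

// For the 128MB rule of thumb this gives 6,400,000 nodes per partition.
println(PartitionSizeEstimate.nodesForTarget(128L)) // prints 6400000
```

Another revision of the diff quoted in this thread uses 7,280,000 nodes for about 128MB; the thread does not say how that figure was derived, so the estimate above should be read as a rough cross-check rather than the author's calculation.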
[GitHub] [spark] zhengruifeng commented on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-760125006
The current implementation doesn't save one tree per partition. Do you mean changing `sql.createDataFrame(nodeDataRDD).write.parquet(dataPath)` to `sql.createDataFrame(nodeDataRDD).write.partitionBy("treeID").parquet(dataPath)`? @srowen
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762589036
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38794/
[GitHub] [spark] zhengruifeng edited a comment on pull request #31090: [SPARK-34047][ML] save decisiontree model in single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-759178976
@srowen reasonable.
I just created a `RandomForestClassificationModel` with numTrees=100 and depth=20 and found that the model size is 226M. So I think that for RF and GBT we should keep the current behavior.
But for a DecisionTree, whose size is definitely small enough (I also created a decision tree with depth=30, and its size is 3.9M), I think it is safe to use a single partition.
[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756647946
**[Test build #133831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133831/testReport)** for PR 31090 at commit [`ef555de`](https://github.com/apache/spark/commit/ef555de743fa6cfffc0d7e757b15f7d122064002).
[GitHub] [spark] SparkQA commented on pull request #31090: [SPARK-34047][ML] save tree model in single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-756647946
[GitHub] [spark] AmplabJenkins commented on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-762589048
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38794/
[GitHub] [spark] SparkQA removed a comment on pull request #31090: [SPARK-34047][ML] tree models saving: compute numParts according to numNodes
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31090:
URL: https://github.com/apache/spark/pull/31090#issuecomment-761948942
**[Test build #134173 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134173/testReport)** for PR 31090 at commit [`c4a77bc`](https://github.com/apache/spark/commit/c4a77bc3bae41343163208538c56d4e976352490).