You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/07 04:03:56 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #29018: [SPARK-32202][ML] tree models auto infer compact integer type

zhengruifeng opened a new pull request #29018:
URL: https://github.com/apache/spark/pull/29018


   ### What changes were proposed in this pull request?
   auto infer the compact integer type, for example if `maxBins < Byte.MaxValue`, use `Byte` instead of `Int` to store binned features.
   
   ### Why are the changes needed?
   save RAM
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   existing testsuites and added testsuites
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657505193


   I remove the specialization in methods and kryo registration, since they had ignorable impact on the performance.
   
   Then I retest on more config, and it seems that this regression comes from two parts:
   1, a constant regression;
   2, a bit regresion due to convert `Byte` to `Int` in training;
   
   Test Result:
   |Durations| GBT(depth=10;iter=10) | GBT(depth=10;iter=20) | RF(depth=10;iter=100) | RF(depth=20;iter=100)  |
   |------|----------|------------|----------|------------|
   |This PR|559299|1028470|149084|956002|
   |Master|515832|997470|135395|942217|
   |Regression|8.4%|3.1%|10.1%|1.4%|
   
   When the  computation of tree growth dominates the whole procedure, the regression (mainly due to part 2) can be tiny (<4%).
   
   
   ping @srowen 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657993178


   Despite the significant memory save, I'm also not sure how often memory is an issue when training the model. I guess CPU is more important here. This win only happens when maxBins is less, so sounds it is for limited cases, but the perf regression happens for all cases?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657984250


   @huaxingao @WeichenXu123 @viirya How do you think about saving ~70% (Array[Int] -> Array[Byte]) RAM at the cost of somewhat regression (1% ~ 10%)?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654602602






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-727668811


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-670404444


   **[Test build #127192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127192/testReport)** for PR 29018 at commit [`19e4bcf`](https://github.com/apache/spark/commit/19e4bcfac6d876afd289b736bc19008032564c2d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-670481210


   **[Test build #127192 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127192/testReport)** for PR 29018 at commit [`19e4bcf`](https://github.com/apache/spark/commit/19e4bcfac6d876afd289b736bc19008032564c2d).
    * This patch **fails PySpark unit tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654586806






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654602602


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657394198






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654586548


   **[Test build #125174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125174/testReport)** for PR 29018 at commit [`331023f`](https://github.com/apache/spark/commit/331023f09304eb8f895e1b4d460b9d0a9ec711a9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657394110


   **[Test build #125753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125753/testReport)** for PR 29018 at commit [`19e4bcf`](https://github.com/apache/spark/commit/19e4bcfac6d876afd289b736bc19008032564c2d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657394198






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-658066066


   @viirya Thanks for reviewing!
   
   > This win only happens when maxBins is less
   
   Yes, but in most cases, maxBin(default=32) < 128
   
   > the perf regression happens for all cases
   
   Yes, I think so.
   
   > I'm also not sure how often memory is an issue when training the model
   
   It will make sense if there is no enough memory for orginal treePoint(Array[Int]). I personally think it maybe worthwhile if the regression is small enough, but I am not sure whether current performance results are OK.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654586806






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-670481911






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-670404444


   **[Test build #127192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127192/testReport)** for PR 29018 at commit [`19e4bcf`](https://github.com/apache/spark/commit/19e4bcfac6d876afd289b736bc19008032564c2d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-658478295


   Seems to me that the significant memory saving outweighs the performance degradation, but I am also not sure if user benefits more from memory saving or from performance gain.    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-670481911


   Build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-656487627


   I had removed the specialization in those methods, expect in `TreePoint`, but there seems no significant improvement. I need more time to figure out while it is slower.
   
   > I honestly don't know how often the training is limited by memory vs CPU here?
   
   It is a trade-off between RAM vs CPU.
   I think CPU is a bit more important, since the training dataset tend to fit in memory.
   But if it can reduce 70% RAM usage at the cost of <5% slower, I think it worthwhile.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657452051


   **[Test build #125753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125753/testReport)** for PR 29018 at commit [`19e4bcf`](https://github.com/apache/spark/commit/19e4bcfac6d876afd289b736bc19008032564c2d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657453017






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #29018:
URL: https://github.com/apache/spark/pull/29018#discussion_r452627763



##########
File path: mllib/src/main/scala/org/apache/spark/ml/util/MLUtils.scala
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import org.apache.spark.SparkConf
+import org.apache.spark.ml.tree.impl._
+import org.apache.spark.util.Utils
+
+private[spark] object MLUtils {
+
+  private[this] var kryoRegistered: Boolean = false
+
+  def registerKryoClasses(conf: SparkConf): Unit = {

Review comment:
       I think mark it synchronized is good, here follows same impl in `GraphXUtils` and `SquaredEuclideanSilhouette`. I think we can make them `synchronized` in the future.
   
   > Do these really have to get serialized to make it work?
   
   Since current `TreePoint` was registered in `object KryoSerializer`, so I tried to also register `TreePoint[B]`, otherwise, I may need to swith to a new treepoint class.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657394110


   **[Test build #125753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125753/testReport)** for PR 29018 at commit [`19e4bcf`](https://github.com/apache/spark/commit/19e4bcfac6d876afd289b736bc19008032564c2d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657600128


   I'm a little concerned about the complexity and perf regression, despite the good memory win. Are we usually that memory limited vs CPU limited?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-657453017






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654602609


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125174/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29018:
URL: https://github.com/apache/spark/pull/29018


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654586548


   **[Test build #125174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125174/testReport)** for PR 29018 at commit [`331023f`](https://github.com/apache/spark/commit/331023f09304eb8f895e1b4d460b9d0a9ec711a9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #29018:
URL: https://github.com/apache/spark/pull/29018#discussion_r450796336



##########
File path: mllib/src/main/scala/org/apache/spark/ml/util/MLUtils.scala
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import org.apache.spark.SparkConf
+import org.apache.spark.ml.tree.impl._
+import org.apache.spark.util.Utils
+
+private[spark] object MLUtils {
+
+  private[this] var kryoRegistered: Boolean = false
+
+  def registerKryoClasses(conf: SparkConf): Unit = {

Review comment:
       synchronized?
   This is the only part that worries me. I don't know how stable the class names are or if there are other implications of registering them (probably not). Do these really have to get serialized to make it work?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-670481919


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127192/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654602492


   **[Test build #125174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125174/testReport)** for PR 29018 at commit [`331023f`](https://github.com/apache/spark/commit/331023f09304eb8f895e1b4d460b9d0a9ec711a9).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #29018:
URL: https://github.com/apache/spark/pull/29018#issuecomment-654587121


   test code:
   ```scala
   import org.apache.spark.ml.classification._
   import org.apache.spark.storage.StorageLevel
   
   
   val df = spark.read.option("numFeatures", "2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t").withColumn("label", (col("label")+1)/2)
   df.persist(StorageLevel.MEMORY_AND_DISK)
   df.count
   
   new RandomForestClassifier().fit(df)
   
   
   val dt = new RandomForestClassifier().setMaxDepth(10).setNumTrees(100)
   
   val start = System.currentTimeMillis; Seq.range(0, 10).foreach{_ => val model = dt.fit(df)}; val end = System.currentTimeMillis; end - start
   
   ```
   
   
   results:
   this PR:
   duration = 1356042
   ![image](https://user-images.githubusercontent.com/7322292/86707054-2212ed00-c04a-11ea-8806-67d94c73dae5.png)
   
   Master:
   duration = 1245078
   ![image](https://user-images.githubusercontent.com/7322292/86707096-2b03be80-c04a-11ea-8ace-8e0dfd6e4d4a.png)
   
   
   The RAM to persisit training datset was reduced from 777M to 237M. However, there was a performance regression, this PR is about 8% slower than master in above test.
   
   @srowen @WeichenXu123 @huaxingao How do you think about this? and is there some way to avoid this performace regression?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org