Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/12/22 16:43:58 UTC

[jira] [Resolved] (SPARK-17801) [ML]Random Forest Regression fails for large input

     [ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-17801.
-------------------------------
    Resolution: Not A Problem

I think this is just attributable to extremely high maxBins, and not a bug.
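
As a rough illustration of why a very large maxBins is costly, the sketch below estimates an upper bound on the bin statistics held for a single tree node. It is back-of-envelope arithmetic, not Spark's actual memory accounting. It assumes three double-precision statistics per bin for variance impurity (count, sum, sum of squares) and a hypothetical feature count of 10, which is not taken from the report. The real number of bins per feature can be lower, since it also depends on the data.

```python
# Back-of-envelope upper bound on per-node bin statistics for tree training.
# Assumptions (not from the report): 3 doubles per bin for variance impurity
# (count, sum, sum of squares) and a hypothetical feature count of 10.

BYTES_PER_DOUBLE = 8
STATS_PER_BIN = 3  # assumed statistics kept per bin for regression


def node_stats_bytes(max_bins, num_features, stats_per_bin=STATS_PER_BIN):
    """Upper bound, in bytes, on the bin statistics for one tree node."""
    return max_bins * num_features * stats_per_bin * BYTES_PER_DOUBLE


def to_mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    hypothetical_num_features = 10  # illustrative only
    settings = [
        ("Spark default maxBins", 32),
        ("maxBins from the report", 7_477_383),
    ]
    for label, bins in settings:
        size = node_stats_bytes(bins, hypothetical_num_features)
        print(f"{label} ({bins}): about {to_mib(size):.3f} MiB per node")
```

Under these assumptions the cost grows linearly with maxBins, and a forest evaluates many nodes at once, so the reported setting is several orders of magnitude more expensive per node than the default of 32.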

> [ML]Random Forest Regression fails for large input
> --------------------------------------------------
>
>                 Key: SPARK-17801
>                 URL: https://issues.apache.org/jira/browse/SPARK-17801
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 14.04
>            Reporter: samkit
>            Priority: Minor
>
> Random Forest Regression
> Data: https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip
> Parameters:
> NumTrees: 500    Maximum Bins: 7477383    MaxDepth: 27
> MinInstancesPerNode: 8648    SamplingRate: 1.0
> Java Options:
>   "-Xms16384M"
>   "-Xmx16384M"
>   "-Dspark.locality.wait=0s"
>   "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g -XX:SurvivorRatio=3 -DnumPartitions=36"
>   "-Dspark.submit.deployMode=cluster"
>   "-Dspark.speculation=true"
>   "-Dspark.speculation.multiplier=2"
>   "-Dspark.driver.memory=16g"
>   "-Dspark.speculation.interval=300ms"
>   "-Dspark.speculation.quantile=0.5"
>   "-Dspark.akka.frameSize=768"
>   "-Dspark.driver.supervise=false"
>   "-Dspark.executor.cores=6"
>   "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps"
>   "-Dspark.rpc.askTimeout=10"
>   "-Dspark.executor.memory=40g"
>   "-Dspark.driver.maxResultSize=3g"
>   "-Xss10240k"
>   "-XX:+PrintGCDetails"
>   "-XX:+PrintGCTimeStamps"
>   "-XX:+PrintTenuringDistribution"
>   "-XX:+UseConcMarkSweepGC"
>   "-XX:+UseParNewGC"
>   "-XX:ParallelGCThreads=2"
>   "-XX:-UseAdaptiveSizePolicy"
>   "-XX:ConcGCThreads=2"
>   "-XX:-UseGCOverheadLimit"
>   "-XX:CMSInitiatingOccupancyFraction=75"
>   "-XX:NewSize=8g"
>   "-XX:MaxNewSize=8g"
>   "-XX:SurvivorRatio=3"
>   "-DnumPartitions=36"
> Partial Driver StackTrace:
> org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740)
>   org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525)
>   org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160)
>   org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209)
>   org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197)
>   org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
>   org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
>   org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
> For the complete executor and driver error logs, see:
> https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6
