Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:43:12 UTC
[jira] [Resolved] (SPARK-24866) Artifactual ROC scores when scaling up Random Forest classifier
[ https://issues.apache.org/jira/browse/SPARK-24866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24866.
----------------------------------
Resolution: Incomplete
> Artifactual ROC scores when scaling up Random Forest classifier
> ---------------------------------------------------------------
>
> Key: SPARK-24866
> URL: https://issues.apache.org/jira/browse/SPARK-24866
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Evan Zamir
> Priority: Minor
> Labels: bulk-closed
>
> I'm encountering very strange behavior that I can't explain other than as a bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core instance. On these models, I have consistently been getting ROC scores (during CV) of ~0.55-0.60 (not good models, obviously, but that's not the point here). After learning that Spark 2.3 introduced a parallelism parameter for the CV class, I decided to use it and see whether increasing the number of Core instances could also help speed up model training (which takes several hours, sometimes up to a full day). To make a long story short, on some of my datasets, simply increasing the number of Core instances (e.g. to 2) raises the ROC scores (*bestValidationMetric*) tremendously, to the range of 0.85-0.95. For the life of me, I can't figure out why simply increasing the number of instances (with absolutely no changes to the code) would have this effect. I don't know if this is a Spark problem or somehow an EMR one, but I figured I'd post here and see if anyone has an idea.
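One way to check whether a jump like this is real is to recompute the ROC score outside Spark from a collected sample of (label, score) pairs for a fold. The sketch below is plain Python rather than Spark code. The function name roc_auc and the sample data are assumptions made up for illustration and do not come from the report. It uses the pairwise definition of the area under the ROC curve, which may not match Spark's own computation exactly.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    with ties counted as one half.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    # Quadratic in the sample size, so intended for small samples only.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up (label, score) pairs standing in for one fold's predictions.
    labels = [1, 0, 1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
    print(roc_auc(labels, scores))
```

If the value recomputed this way stays near 0.55-0.60 while the cluster reports 0.85-0.95, that would point at how the metric is produced rather than at the models themselves.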
--
This message was sent by Atlassian Jira
(v8.3.4#803005)