Posted to issues@spark.apache.org by "Evan Zamir (JIRA)" <ji...@apache.org> on 2018/07/19 21:58:00 UTC

[jira] [Created] (SPARK-24866) Artifactual ROC scores when scaling up Random Forest classifier

Evan Zamir created SPARK-24866:
----------------------------------

             Summary: Artifactual ROC scores when scaling up Random Forest classifier
                 Key: SPARK-24866
                 URL: https://issues.apache.org/jira/browse/SPARK-24866
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
            Reporter: Evan Zamir


I'm seeing very strange behavior that I can't explain other than as a bug somewhere. I'm training RF models on Amazon EMR, normally with 1 Core instance. For these models I have consistently been getting ROC scores (during CV) of ~0.55-0.60 (obviously not good models, but that's not the point here). After learning that Spark 2.3 introduced a parallelism parameter for the CV class, I decided to use it and see whether increasing the number of Core instances would also speed up training (which takes several hours, sometimes up to a full day). To make a long story short, on some of my datasets simply increasing the number of Core instances (e.g. to 2) raises the ROC scores dramatically, to the range of 0.85-0.95. For the life of me I can't figure out why increasing the number of instances, with absolutely no changes to the code, would have this effect. I don't know whether this is a Spark problem or somehow an EMR one, but I figured I'd post here and see if anyone has an idea.
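
One way to check whether the higher scores are real or an artifact would be to recompute the area under the ROC curve outside Spark, from (label, score) pairs collected from a model's predictions on held-out data. The sketch below is not part of the original report. It assumes binary 0/1 labels and one score per row, the function name `auc_from_scores` is hypothetical, and it uses the rank-based (Mann-Whitney) formulation, so the result may differ slightly from the value Spark reports.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) statistic.

    labels: iterable of 0/1 ground-truth labels (assumed binary).
    scores: iterable of model scores, higher meaning "more likely positive".
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run [i, j) of rows that share the same score.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # 1-based ranks i+1 .. j have this average.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the report.
    print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

If the value computed this way on the same held-out predictions stays near 0.55-0.60 on both the 1-instance and 2-instance clusters, the difference would more likely lie in how the CV metric is produced than in the models themselves.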



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
