Posted to issues@spark.apache.org by "Seth Hendrickson (JIRA)" <ji...@apache.org> on 2016/04/14 00:02:25 UTC

[jira] [Commented] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

    [ https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240105#comment-15240105 ] 

Seth Hendrickson commented on SPARK-14610:
------------------------------------------

One thing to note is that fixing this actually uncovers a bug of sorts. There is an assertion in this method to verify that more than zero splits are produced, but because of the extra split being returned, that assertion previously did nothing. With the fix, training will fail when a continuous feature is constant, so the PR will also remove the assertion and handle constant continuous features appropriately.
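
As a rough illustration only (this is a simplified sketch, not the actual logic in RandomForest.findSplitsForContinuousFeature, which buckets values by their counts), one way to handle a constant continuous feature is to return an empty split array rather than asserting that at least one split exists:

{code:title=sketch.scala|borderStyle=solid}
// Simplified sketch: derive candidate thresholds from the distinct values of a
// continuous feature. A constant feature yields zero splits rather than
// tripping an assertion.
def findSplitsSketch(featureSamples: Array[Double]): Array[Double] = {
  val distinctValues = featureSamples.distinct.sorted
  if (distinctValues.length <= 1) {
    // Constant feature: no threshold can separate the samples.
    Array.empty[Double]
  } else {
    // One threshold between each pair of adjacent distinct values,
    // i.e. (number of distinct values - 1) splits in total.
    distinctValues.sliding(2).map(pair => (pair(0) + pair(1)) / 2.0).toArray
  }
}
{code}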

I can submit a PR for this soon.

> Remove superfluous split from random forest findSplitsForContinousFeature
> -------------------------------------------------------------------------
>
>                 Key: SPARK-14610
>                 URL: https://issues.apache.org/jira/browse/SPARK-14610
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Seth Hendrickson
>
> Currently, the method findSplitsForContinuousFeature in random forest produces an unnecessary split. For example, if a continuous feature has unique values {1, 2, 3}, then the splits generated by this method are:
> {1|2,3}, {1,2|3} and {1,2,3|}. The last split separates nothing, so the following unit test asserts one split too many (a short sketch of the expected count follows the quoted block):
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
> val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
> assert(splits.length === 3)
> {code}
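
For illustration (this sketch is added here and is not part of the quoted issue text), the expected number of splits follows directly from the number of distinct values:

{code:title=expectedSplits.scala|borderStyle=solid}
// With distinct values {1, 2, 3} only two thresholds separate anything:
// {1 | 2,3} and {1,2 | 3}. The trailing {1,2,3 | } split is superfluous.
val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
val expectedNumSplits = featureSamples.distinct.length - 1  // 2, not 3
{code}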



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org