Posted to issues@spark.apache.org by "Olivier Sannier (JIRA)" <ji...@apache.org> on 2018/11/16 16:31:00 UTC

[jira] [Commented] (SPARK-23709) BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights

    [ https://issues.apache.org/jira/browse/SPARK-23709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689633#comment-16689633 ] 

Olivier Sannier commented on SPARK-23709:
-----------------------------------------

Well, sure, but nobody answered on the mailing list and I'm convinced that the documentation should be updated to provide the answer to this question.

So I'm left with no viable solution here, and that's pretty annoying.

> BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23709
>                 URL: https://issues.apache.org/jira/browse/SPARK-23709
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Olivier Sannier
>            Priority: Critical
>
> When using a bagging method like RandomForest, the theory dictates that the source dataset is copied over with a subsample of rows.
> To avoid excessive memory usage, Spark uses the BaggedPoint concept, where each row is associated with a weight for the final dataset, i.e. one weight for each tree requested from the RandomForest.
> RandomForest requires that the dataset for each tree is a random draw with replacement from the source data, that has the same size as the source data.
> However, during our investigation, we found that the count value used to compute the variance is not always equal to the source data count: it is sometimes less, sometimes more.
> I went digging in the source and found the BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a Poisson distribution to assign a weight to each row. This distribution does not guarantee that the total of weights for a given tree is equal to the source dataset count.
> Looking around in here, it seems this is done for performance reasons because the approximation it gives is good enough, especially when dealing with very large datasets.
> However, I could not find any documentation that clearly explains this. Would you have any link on the subject?
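
The behaviour described above can be reproduced outside Spark with a small simulation. The sketch below is not Spark's code: the helper names (poisson_sample, bagged_weights, totals_per_tree) are hypothetical, the Poisson mean of 1.0 is an assumption standing in for "sample as many rows as the source has", and the sampler is Knuth's textbook multiplication method rather than whatever Spark uses internally. It only illustrates that independent per-row Poisson weights give a per-tree total that fluctuates around the row count instead of matching it exactly.

```python
import math
import random


def poisson_sample(rng, lam=1.0):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    p = 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def bagged_weights(num_rows, num_trees, seed=42):
    """Assign each row one independent Poisson(1) weight per tree.

    Assumption: a mean of 1.0 approximates sampling with replacement
    at the same size as the source data.
    """
    rng = random.Random(seed)
    return [[poisson_sample(rng) for _ in range(num_trees)]
            for _ in range(num_rows)]


def totals_per_tree(weights):
    """Sum the weights column by column: one total per tree."""
    return [sum(column) for column in zip(*weights)]


if __name__ == "__main__":
    n = 10_000
    for tree, total in enumerate(totals_per_tree(bagged_weights(n, 5))):
        print(f"tree {tree}: total weight {total} (source rows: {n})")
```

Because the sum of N independent Poisson(1) draws is itself Poisson(N), each total has mean N and standard deviation sqrt(N), so the relative deviation shrinks like 1/sqrt(N). That is consistent with the approximation being considered good enough for very large datasets, while being clearly visible on small ones.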



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
