Posted to issues@spark.apache.org by "Olivier Sannier (JIRA)" <ji...@apache.org> on 2018/05/14 15:46:00 UTC

[jira] [Reopened] (SPARK-23709) BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights

     [ https://issues.apache.org/jira/browse/SPARK-23709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olivier Sannier reopened SPARK-23709:
-------------------------------------

I'm sorry, but I already asked this question on the mailing list and did not receive an answer.

I waited a couple of weeks before creating the issue here.

To me, this is an issue, at least a documentation one: the theory dictates that every per-tree dataset has the same size as the source data, which is not the case with the current implementation.

> BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23709
>                 URL: https://issues.apache.org/jira/browse/SPARK-23709
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Olivier Sannier
>            Priority: Critical
>
> When using a bagging method like RandomForest, the theory dictates that the source dataset is copied over with a subsample of rows.
> To avoid excessive memory usage, Spark uses the BaggedPoint concept, where each row is associated with one weight per final dataset, i.e. one weight for each tree requested for the RandomForest.
> RandomForest requires that the dataset for each tree is a random draw with replacement from the source data, that has the same size as the source data.
> However, during our investigations, we found that the count value used to compute the variance is not always equal to the source data count; it is sometimes lower, sometimes higher.
> I went digging in the source and found the BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a Poisson distribution to assign a weight to each row. This distribution does not guarantee that the weights for a given tree sum to the source dataset count.
> Looking around in here, it seems this is done for performance reasons because the approximation it gives is good enough, especially when dealing with very large datasets.
> However, I could not find any documentation that clearly explains this. Do you have a link on the subject?
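
To illustrate the behaviour described above, here is a minimal Python sketch, not the Spark code itself. It assumes each row's weight for each tree is an independent Poisson draw with mean 1.0 (the mean is my assumption; the description only says a Poisson distribution is used). The function names (poisson_draw, poisson_bag_weights, exact_bag_weights, tree_totals) are hypothetical and exist only for this example. Under that assumption the total weight per tree is itself Poisson distributed with mean n, so it typically deviates from n by about sqrt(n), whereas a true draw of n rows with replacement always sums to exactly n.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw one Poisson(mean) variate with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_bag_weights(num_rows, num_trees, mean=1.0, seed=42):
    """One list of per-tree weights per row, each weight drawn independently.

    The mean of 1.0 is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return [[poisson_draw(rng, mean) for _ in range(num_trees)]
            for _ in range(num_rows)]


def exact_bag_weights(num_rows, seed=42):
    """Weights for one tree from a true draw of num_rows rows with replacement."""
    rng = random.Random(seed)
    weights = [0] * num_rows
    for _ in range(num_rows):
        weights[rng.randrange(num_rows)] += 1
    return weights


def tree_totals(weights):
    """Sum of weights per tree, i.e. the effective dataset size each tree sees."""
    return [sum(column) for column in zip(*weights)]


if __name__ == "__main__":
    n = 10_000
    print("Poisson totals per tree:", tree_totals(poisson_bag_weights(n, num_trees=5)))
    print("typical deviation, sqrt(n):", round(math.sqrt(n)))
    print("exact draw total:", sum(exact_bag_weights(n)))
```

The relative deviation shrinks like 1/sqrt(n), which is consistent with the remark that the approximation is good enough for very large datasets. The per-row draws are also independent of one another, which the exact draw is not.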



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org