Posted to user@spark.apache.org by 姜鑫 <ji...@gmail.com> on 2022/10/02 04:00:31 UTC

Re: Spark ML VarianceThresholdSelector Unexpected Results

Thank you so much for the reply. You are right. It might still be worth mentioning this in the docs, because some other ML libraries, e.g. sklearn, use population variance.
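For anyone who finds this thread later, here is a minimal Python sketch of the difference. It only uses the numbers quoted in this thread plus two assumptions of mine: that the docs example has n = 6 rows (which matches the ratio between the two sets of numbers) and that the example's threshold is 8.0. The toy column is hypothetical and is not the data from the docs.

```python
import math
import statistics

# Hypothetical toy column, only to show the two definitions side by side.
toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sample_var = statistics.variance(toy)        # divides by n - 1
population_var = statistics.pvariance(toy)   # divides by n
assert math.isclose(population_var, sample_var * (len(toy) - 1) / len(toy))

# Variances stated in the docs (sample variance, per the reply above).
docs_vars = [16.67, 0.67, 8.17, 10.17, 5.07, 11.47]
# Variances I computed by hand (population variance).
my_vars = [13.89, 0.56, 6.81, 8.47, 4.22, 9.56]

n = 6  # assumption: number of rows in the docs example

# Converting sample variance to population variance: multiply by (n - 1) / n.
converted = [v * (n - 1) / n for v in docs_vars]
for got, expected in zip(converted, my_vars):
    # The inputs are rounded to two decimals, so allow a small tolerance.
    assert abs(got - expected) < 0.01

threshold = 8.0  # assumption: threshold used in the docs example

# Keep only features whose variance is strictly greater than the threshold.
selected_sample = [i for i, v in enumerate(docs_vars) if v > threshold]
selected_population = [i for i, v in enumerate(my_vars) if v > threshold]

print(len(selected_sample), "features kept with sample variance")
print(len(selected_population), "features kept with population variance")
```

Under those assumptions the two definitions disagree about one feature (8.17 vs 6.81), which is why I expected only 3 columns.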



> On Sep 30, 2022, at 10:49 AM, Sean Owen <sr...@gmail.com> wrote:
> 
> This is sample variance, not population (i.e. divide by n-1, not n). I think that's justified as the data are notionally a sample from a population.
> 
> On Thu, Sep 29, 2022 at 9:21 PM 姜鑫 <jiangxin369@gmail.com> wrote:
> Hi folks,
> 
> Has anyone used VarianceThresholdSelector (see https://spark.apache.org/docs/latest/ml-features.html#variancethresholdselector)? The docs give an example and say `The variance for the 6 features are 16.67, 0.67, 8.17, 10.17, 5.07, and 11.47 respectively`, but when I calculated the variances myself I got 13.89, 0.56, 6.81, 8.47, 4.22, and 9.56, so only 3 columns should be selected. Am I doing something wrong, or is this a bug?
> 
> 
> Regards,
> Xin