Posted to user@spark.apache.org by 姜鑫 <ji...@gmail.com> on 2022/09/30 02:20:10 UTC

Spark ML VarianceThresholdSelector Unexpected Results

Hi folks,

Has anyone used VarianceThresholdSelector, as described at https://spark.apache.org/docs/latest/ml-features.html#variancethresholdselector ? The doc gives an example and says `The variance for the 6 features are 16.67, 0.67, 8.17, 10.17, 5.07, and 11.47 respectively`, but when I calculated it myself I got 13.89, 0.56, 6.81, 8.47, 4.22, and 9.56, so only 3 columns should be selected. Am I doing something wrong, or is this a bug?


Regards,
Xin

Re: Spark ML VarianceThresholdSelector Unexpected Results

Posted by 姜鑫 <ji...@gmail.com>.

Thank you so much for the reply. You are right. It might be worth mentioning this in the docs, because some other ML libraries, e.g. sklearn, use population variance.



> On Sep 30, 2022, at 10:49 AM, Sean Owen <sr...@gmail.com> wrote:
> 
> This is sample variance, not population (i.e. divide by n-1, not n). I think that's justified as the data are notionally a sample from a population.
> 
> On Thu, Sep 29, 2022 at 9:21 PM 姜鑫 <jiangxin369@gmail.com> wrote:
> Hi folks,
> 
> Has anyone used VarianceThresholdSelector, as described at https://spark.apache.org/docs/latest/ml-features.html#variancethresholdselector ? The doc gives an example and says `The variance for the 6 features are 16.67, 0.67, 8.17, 10.17, 5.07, and 11.47 respectively`, but when I calculated it myself I got 13.89, 0.56, 6.81, 8.47, 4.22, and 9.56, so only 3 columns should be selected. Am I doing something wrong, or is this a bug?
> 
> 
> Regards,
> Xin

Re: Spark ML VarianceThresholdSelector Unexpected Results

Posted by Sean Owen <sr...@gmail.com>.

This is sample variance, not population (i.e. divide by n-1, not n). I
think that's justified as the data are notionally a sample from a
population.
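
A minimal sketch of the difference, using only the Python standard
library rather than Spark. The feature rows and the 8.0 threshold are
recalled from the linked docs example and should be treated as
assumptions to check against that page; the rule "keep a feature when its
variance is greater than the threshold" is likewise assumed here.

```python
from statistics import pvariance, variance

# Feature rows as recalled from the Spark docs example (assumption:
# verify against the linked page before relying on them).
rows = [
    [6.0, 7.0, 0.0, 7.0, 6.0, 0.0],
    [0.0, 9.0, 6.0, 0.0, 5.0, 9.0],
    [0.0, 9.0, 3.0, 0.0, 5.0, 5.0],
    [0.0, 9.0, 8.0, 5.0, 6.0, 4.0],
    [8.0, 9.0, 6.0, 5.0, 4.0, 4.0],
    [8.0, 9.0, 6.0, 0.0, 0.0, 0.0],
]
THRESHOLD = 8.0  # assumed threshold from the same docs example

# Transpose so that each tuple holds one feature's values across rows.
columns = list(zip(*rows))

# Sample variance divides by n-1; population variance divides by n.
sample_var = [round(variance(col), 2) for col in columns]
population_var = [round(pvariance(col), 2) for col in columns]

# Indices of features whose variance exceeds the threshold.
sample_selected = [i for i, col in enumerate(columns)
                   if variance(col) > THRESHOLD]
population_selected = [i for i, col in enumerate(columns)
                       if pvariance(col) > THRESHOLD]

print("sample variance:    ", sample_var, "->", sample_selected)
print("population variance:", population_var, "->", population_selected)
```

With these rows, the sample variances round to the figures quoted from
the docs and four features pass the threshold, while the population
variances round to the figures in the original question and only three
pass.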

On Thu, Sep 29, 2022 at 9:21 PM 姜鑫 <ji...@gmail.com> wrote:

> Hi folks,
>
> Has anyone used VarianceThresholdSelector, as described at
> https://spark.apache.org/docs/latest/ml-features.html#variancethresholdselector ?
> The doc gives an example and says `The variance for the 6 features are
> 16.67, 0.67, 8.17, 10.17, 5.07, and 11.47 respectively`, but when I
> calculated it myself I got 13.89, 0.56, 6.81, 8.47, 4.22, and 9.56, so
> only 3 columns should be selected. Am I doing something wrong, or is this
> a bug?
>
>
> Regards,
> Xin
>