Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2010/10/23 21:08:41 UTC

[jira] Commented: (MAHOUT-533) Clustering Standard Deviation Calculations Are Inaccurate

    [ https://issues.apache.org/jira/browse/MAHOUT-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924226#action_12924226 ] 

Ted Dunning commented on MAHOUT-533:
------------------------------------

{quote}
Moving the AbstractCluster implementation to use OnlineGaussianAccumulator is in my plans for 0.5. However, Fuzzy K-Means requires that weighted observations be handled correctly, and both K-Means algorithms require that observation statistics (see ClusterObservations) be passed from mapper to combiner to reducer; I have not yet been able to figure out how to do this with Online's state variables, as opposed to RunningSums. There is also a performance difference between the two algorithms: Online does a complete computation on each observe() and none in compute(), whereas RunningSums has minimal per-observation math and does all the heavy lifting in compute().
{quote}

This is pretty straightforward.  Instead of counting the number of samples, increment the number of samples by the weight.  Likewise, where you divide by the number of samples, divide by the total weight so far.
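
Ted's suggestion, tracking a running total weight in place of a sample count, amounts to the standard weighted-increment form of Welford's update. A minimal sketch (class and method names here are illustrative, not Mahout's actual OnlineGaussianAccumulator API):

```java
// Hedged sketch of a weight-aware online (Welford-style) Gaussian accumulator.
// Where the unweighted algorithm would increment a count, we add the weight;
// where it would divide by the count, we divide by the total weight so far.
public class WeightedOnlineGaussianAccumulator {

  private double totalWeight = 0.0; // replaces the sample count
  private double mean = 0.0;        // running weighted mean
  private double m2 = 0.0;          // weighted sum of squared deviations from the mean

  public void observe(double x, double weight) {
    totalWeight += weight;
    double delta = x - mean;
    mean += (weight / totalWeight) * delta;
    m2 += weight * delta * (x - mean); // uses the updated mean, as in Welford's method
  }

  public double getMean() {
    return mean;
  }

  public double getVariance() {
    return totalWeight > 0.0 ? m2 / totalWeight : 0.0;
  }

  public double getStd() {
    return Math.sqrt(getVariance());
  }
}
```

With this update rule, observing a point once with weight 2 leaves exactly the same state as observing it twice with weight 1, which is the behavior Fuzzy K-Means needs.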



> Clustering Standard Deviation Calculations Are Inaccurate
> ---------------------------------------------------------
>
>                 Key: MAHOUT-533
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-533
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Jeff Eastman
>             Fix For: 0.5
>
>
> Mahout has two classes that compute Gaussian statistics: RunningSumsGaussianAccumulator and OnlineGaussianAccumulator. The first uses the sum-of-squares method and the second Welford's method. There is also a unit test (TestGaussianAccumulators) that compares their results over a sample dataset and illustrates the large differences in the standard deviations they produce. The Online accumulator is used by CDbwEvaluator to compute its metrics; the RunningSums accumulator is used only by the unit test, for comparison purposes.
> Today, the sum-of-squares method is used in AbstractCluster to compute mean and stdDev statistics for all Clusters. The stdDev values are not used by most of the clustering algorithms except for graphical displays, so this does not cause an accuracy problem with the clustering results themselves. For Dirichlet process clustering, however, stdDev is relevant in computing pdf(), and so it needs to be changed in those models. Even with this numerical error, Dirichlet performs pretty well, probably because its sampling behavior does not require precise standard deviations.
> Moving the AbstractCluster implementation to use OnlineGaussianAccumulator is in my plans for 0.5. However, Fuzzy K-Means requires that weighted observations be handled correctly, and both K-Means algorithms require that observation statistics (see ClusterObservations) be passed from mapper to combiner to reducer; I have not yet been able to figure out how to do this with Online's state variables, as opposed to RunningSums. There is also a performance difference between the two algorithms: Online does a complete computation on each observe() and none in compute(), whereas RunningSums has minimal per-observation math and does all the heavy lifting in compute().
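
On the mapper-to-combiner-to-reducer question raised in the description: the Welford-style state triple (total weight, mean, M2) can in fact be merged pairwise using the parallel variance update usually attributed to Chan et al., so partial accumulators could be serialized and combined much like RunningSums. A hedged sketch, with hypothetical names (this is not Mahout's actual ClusterObservations API):

```java
// Hypothetical sketch: an online accumulator whose (weight, mean, M2) state
// can be merged pairwise, e.g. mapper -> combiner -> reducer. The merge()
// formula is the Chan et al. parallel variance update, generalized to weights.
public class MergeableGaussianAccumulator {

  private double weight = 0.0; // total weight observed so far
  private double mean = 0.0;   // running weighted mean
  private double m2 = 0.0;     // weighted sum of squared deviations

  public void observe(double x, double w) {
    weight += w;
    double delta = x - mean;
    mean += (w / weight) * delta;
    m2 += w * delta * (x - mean);
  }

  // Combine another partial accumulator's state into this one.
  public void merge(MergeableGaussianAccumulator other) {
    if (other.weight == 0.0) {
      return; // nothing to merge
    }
    double delta = other.mean - mean;
    double combined = weight + other.weight;
    mean += delta * (other.weight / combined);
    m2 += other.m2 + delta * delta * (weight * other.weight / combined);
    weight = combined;
  }

  public double getMean() {
    return mean;
  }

  public double getVariance() {
    return weight > 0.0 ? m2 / weight : 0.0;
  }
}
```

Merging two accumulators built over disjoint partitions yields the same state as a single accumulator over the union, which is exactly the property a combiner needs.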

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.