Posted to dev@mahout.apache.org by "Suneel Marthi (JIRA)" <ji...@apache.org> on 2013/12/04 07:07:36 UTC

[jira] [Comment Edited] (MAHOUT-1368) Convert OnlineSummarizer to use the new TDigest

    [ https://issues.apache.org/jira/browse/MAHOUT-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838638#comment-13838638 ] 

Suneel Marthi edited comment on MAHOUT-1368 at 12/4/13 6:07 AM:
----------------------------------------------------------------

Ted, we need to hold off on committing this patch until we fix ClusterQualitySummarizer, which breaks once the patch is applied.  I'll look at it tomorrow; it's too late at night now to wrap my head around it.

Running ClusterQualitySummarizer (after applying this patch) on the output of StreamingKMeans throws the following exception:

{Code}
Average distance in cluster 0 [4]: 18723.469424
Average distance in cluster 1 [1169]: 13974.466645
Average distance in cluster 2 [1932]: 1273.335898
Exception in thread "main" java.lang.IllegalArgumentException
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
	at org.apache.mahout.math.stats.TDigest.quantile(TDigest.java:268)
	at org.apache.mahout.math.stats.OnlineSummarizer.getQuartile(OnlineSummarizer.java:83)
	at org.apache.mahout.math.stats.OnlineSummarizer.getMax(OnlineSummarizer.java:79)
	at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.printSummaries(ClusterQualitySummarizer.java:74)
	at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.printSummaries(ClusterQualitySummarizer.java:66)
	at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.run(ClusterQualitySummarizer.java:141)
	at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.main(ClusterQualitySummarizer.java:281)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

{Code}
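The trace shows only the call path: getMax() delegates to getQuartile(), which calls TDigest.quantile(), where a Preconditions.checkArgument() guard fails. The condition checked at TDigest.java:268 is not visible from the trace, so the following is just a minimal sketch of that failure shape, not the actual Mahout code. ToyDigest, ToySummarizer and the "more than one sample" guard are hypothetical stand-ins assumed purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class QuantilePreconditionSketch {
    // Stand-in for Guava's Preconditions.checkArgument(boolean), so the sketch has no dependencies.
    static void checkArgument(boolean condition) {
        if (!condition) {
            throw new IllegalArgumentException();
        }
    }

    // Hypothetical digest that keeps every sample; NOT the real TDigest.
    static class ToyDigest {
        private final List<Double> samples = new ArrayList<>();

        void add(double x) {
            samples.add(x);
        }

        double quantile(double q) {
            checkArgument(q >= 0 && q <= 1);
            // ASSUMPTION: the guard requires at least two samples.
            // The real condition in TDigest.quantile() may be different.
            checkArgument(samples.size() > 1);
            List<Double> sorted = new ArrayList<>(samples);
            Collections.sort(sorted);
            int index = (int) Math.round(q * (sorted.size() - 1));
            return sorted.get(index);
        }
    }

    // Hypothetical summarizer mirroring the getMax() -> getQuartile() -> quantile() chain in the trace.
    static class ToySummarizer {
        private final ToyDigest digest = new ToyDigest();

        void add(double x) {
            digest.add(x);
        }

        double getQuartile(int i) {
            return digest.quantile(0.25 * i);
        }

        double getMax() {
            return getQuartile(4);
        }
    }

    public static void main(String[] args) {
        ToySummarizer summarizer = new ToySummarizer();
        summarizer.add(42.0);
        try {
            summarizer.getMax();
        } catch (IllegalArgumentException e) {
            // Same exception type as in the trace, raised from the guard inside quantile().
            System.out.println("getMax() rejected by precondition");
        }
        summarizer.add(7.0);
        System.out.println(summarizer.getMax()); // prints 42.0
    }
}
```

If the real guard turns out to be something along these lines, the fix would belong either in the summarizer (answering min/max without going through quantile()) or in the caller (skipping summaries with too few points). That is a guess until line 268 has been checked.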


was (Author: smarthi):
Ted, we need to hold off on committing this patch until we fix the issue with ClusterQualitySummarizer which is broken after applying this patch.  I'll look at it tomorrow, too late in the night to wrap my head around the issue.

> Convert OnlineSummarizer to use the new TDigest
> -----------------------------------------------
>
>                 Key: MAHOUT-1368
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1368
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Ted Dunning
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1368.patch
>
>
> The new TDigest provides better accuracy for quartile estimation as well as producing any other quantile you might like.  The current quartile estimation of the OnlineSummarizer fails for highly skewed distributions and can't really be extended to provide other quantiles.  The TDigest handles all of this.



--
This message was sent by Atlassian JIRA
(v6.1#6144)