Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2012/06/09 20:57:42 UTC

[jira] [Updated] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

     [ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-1030:
---------------------------------

    Attachment: MAHOUT-1030.patch

Here's a patch that modifies the cluster classification code to emit WeightedPropertyVectorWritables instead of WeightedVectorWritables. It does not compute a distance property, however, because distance is not generally available for all Cluster types (really only DistanceMeasureClusters).

The more I think about the distance property calculation the more I am comfortable with the current (trunk) implementation. Consider that, for DistanceMeasureClusters at least, the pdf is:

pdf = 1/(1+distance)

It is trivial to back-calculate the distance from the pdf value that is already written in the WeightedVectorWritable. This is because, in the new implementation, the pdfs are calculated by the classify() policy method and it is the pdfs that are written.
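
For illustration, the back-calculation could be sketched as below. This is only a sketch: PdfDistance and its method names are hypothetical and not part of Mahout, and it assumes the weight read from the WeightedVectorWritable is the raw pdf = 1/(1+distance), not a value that has been normalized across clusters.

```java
// Hypothetical helper, not part of Mahout. It inverts the relation
// pdf = 1 / (1 + distance) stated for DistanceMeasureClusters.
final class PdfDistance {

  private PdfDistance() {}

  /** Forward relation: pdf = 1 / (1 + distance), for distance >= 0. */
  static double pdfFromDistance(double distance) {
    if (distance < 0.0) {
      throw new IllegalArgumentException("distance must be >= 0: " + distance);
    }
    return 1.0 / (1.0 + distance);
  }

  /** Inverse relation: distance = 1 / pdf - 1, valid for 0 < pdf <= 1. */
  static double distanceFromPdf(double pdf) {
    // The negated form also rejects NaN.
    if (!(pdf > 0.0 && pdf <= 1.0)) {
      throw new IllegalArgumentException("pdf must be in (0, 1]: " + pdf);
    }
    return 1.0 / pdf - 1.0;
  }

  public static void main(String[] args) {
    // Round trip: distance -> pdf -> distance.
    double pdf = pdfFromDistance(3.0);
    System.out.println("pdf=" + pdf + " distance=" + distanceFromPdf(pdf));
  }
}
```

Since the pdf decreases monotonically as distance grows, sorting points by descending pdf should give the same order as sorting by ascending distance, so the inversion is only needed when the distance value itself is wanted.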

Sooo, I'm not sure this is a must-fix or even a want-fix. Thoughts?

This patch also fixes a bug in the ClusterDumperWriter so that it correctly handles the null properties field that is read back when the written properties map contains 0 entries. This part of the patch should go into trunk regardless of the above.
                
> Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1030
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Integration
>    Affects Versions: 0.7
>            Reporter: Jeff Eastman
>         Attachments: MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on code and tests and I don't know which properties were implemented in the old version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new ClusterClassificationDriver was introduced. It should be a pretty easy fix and I will see if I can make the change before Paritosh cuts the release bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as WeightedVectorWritable where in mahout 0.6 they were WeightedPropertyVectorWritable? This means that the distance from the centroid is no longer stored here? Why? I hope I'm wrong because that is not a welcome change. How is one to order clustered docs by distance from cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the centroid for the cluster id given in the above WeightedVectorWritable, which means iterating through all the clusters for each clustered doc. In my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >
