Posted to dev@mahout.apache.org by "Andrew Musselman (JIRA)" <ji...@apache.org> on 2013/10/30 23:07:26 UTC

[jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

    [ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809686#comment-13809686 ] 

Andrew Musselman commented on MAHOUT-1030:
------------------------------------------

Grant, who should I talk to to get up to speed on this?

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1030
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Integration
>    Affects Versions: 0.7
>            Reporter: Jeff Eastman
>            Assignee: Andrew Musselman
>             Fix For: 1.0, 0.9
>
>         Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on code and tests and I don't know which properties were implemented in the old version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new ClusterClassificationDriver was introduced. It should be a pretty easy fix and I will see if I can make the change before Paritosh cuts the release bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as WeightedVectorWritable where in mahout 0.6 they were WeightedPropertyVectorWritable? This means that the distance from the centroid is no longer stored here? Why? I hope I'm wrong because that is not a welcome change. How is one to order clustered docs by distance from cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the centroid for the cluster id given in the above WeightedVectorWritable, which means iterating through all the clusters for each clustered doc. In my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Re: [jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

Posted by Pat Ferrel <pa...@gmail.com>.

As I recall, this goes back to 0.6, where the clustered points stored the distance to the centroid in the properties of a WeightedPropertyVectorWritable. Then 0.7 changed the output to WeightedVectorWritable, which has nowhere to store the distance to centroid.

I modified my code to loop through every vector and calculate its distance to the centroid. Doing this in mapreduce adds a pain point to consuming the output, especially since it recomputes information that was thrown away. Jeff tried a quick fix, but it introduced more problems and so was dropped.
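
The workaround above can be sketched roughly as follows. This is plain Java, not the Mahout API: double[] arrays stand in for Mahout vectors, Euclidean distance is assumed as the measure, and the class and method names (CentroidDistanceSketch, distancesToCentroids) are hypothetical. Keeping the centroids in a map keyed by cluster id avoids scanning every cluster for each clustered point.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain double[] arrays stand in for Mahout vectors.
class CentroidDistanceSketch {

    // Euclidean distance is an assumption; any distance measure could be substituted.
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // For each clustered point, look up its centroid by cluster id
    // (one hash lookup per point) and recompute the distance.
    static double[] distancesToCentroids(Map<Integer, double[]> centroids,
                                         int[] clusterIds,
                                         double[][] points) {
        double[] out = new double[points.length];
        for (int i = 0; i < points.length; i++) {
            double[] centroid = centroids.get(clusterIds[i]);
            if (centroid == null) {
                throw new IllegalStateException("no centroid for cluster " + clusterIds[i]);
            }
            out[i] = euclidean(points[i], centroid);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, double[]> centroids = new HashMap<>();
        centroids.put(0, new double[] {0.0, 0.0});
        centroids.put(1, new double[] {10.0, 10.0});
        int[] clusterIds = {0, 1};
        double[][] points = {{3.0, 4.0}, {10.0, 13.0}};
        for (double d : distancesToCentroids(centroids, clusterIds, points)) {
            System.out.println(d);
        }
    }
}
```

The sketch assumes the centroids fit in memory; with a very large number of clusters that assumption may not hold.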

To change the subject slightly:

I’d be happy if this were closed and turned into a feature request to let vectors in general carry metadata such as strings. This bigger issue is a major pain because it forces users to keep giant indexes of Mahout id->name so that vectors can be reattached to their external ids. Creating the index and reverse index for Mahout ids is a scaling problem: doing it in mapreduce is non-trivial, so many (myself included) use in-memory hashmaps for the id lookups. This kind of lookup happens every time you use Mahout with real-world data.
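
For illustration, the in-memory index and reverse index mentioned above might look something like this. It is a plain-Java sketch, not Mahout code; the class name IdDictionarySketch and its methods are hypothetical, and it assumes sequential integer ids assigned in first-seen order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an in-memory id dictionary: external name <-> internal int id.
class IdDictionarySketch {
    // Forward index: external name -> internal id.
    private final Map<String, Integer> toInternal = new HashMap<>();
    // Reverse index: the internal id is the position in this list.
    private final List<String> toExternal = new ArrayList<>();

    // Returns the existing id for a name, or assigns the next free one.
    int idFor(String externalId) {
        Integer existing = toInternal.get(externalId);
        if (existing != null) {
            return existing;
        }
        int id = toExternal.size();
        toInternal.put(externalId, id);
        toExternal.add(externalId);
        return id;
    }

    // Reattaches an internal id to its external name.
    String nameFor(int internalId) {
        return toExternal.get(internalId);
    }

    int size() {
        return toExternal.size();
    }

    public static void main(String[] args) {
        IdDictionarySketch dict = new IdDictionarySketch();
        int a = dict.idFor("doc-a");
        int b = dict.idFor("doc-b");
        System.out.println(a + " -> " + dict.nameFor(a));
        System.out.println(b + " -> " + dict.nameFor(b));
    }
}
```

Both structures grow with the number of distinct ids and live on a single machine, which is where the scaling problem comes from.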

If all of Mahout guaranteed support for optionally attaching properties to vectors, it would be much easier to use. It would do away with the need for id translation and enable things like keeping the distance to centroid with clustered vectors.
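
To make the request concrete, here is one possible shape for a vector that carries its own properties. Everything in it is hypothetical: PropertyVectorSketch, withProperty, and the "externalId" and "distance" keys are invented for illustration and are not an existing or proposed Mahout API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a vector that keeps string metadata alongside its values.
class PropertyVectorSketch {
    final double[] values;
    final Map<String, String> properties = new HashMap<>();

    PropertyVectorSketch(double[] values) {
        this.values = values;
    }

    // Attaches one property and returns this vector so calls can be chained.
    PropertyVectorSketch withProperty(String key, String value) {
        properties.put(key, value);
        return this;
    }

    // Orders points by their stored "distance" property and returns their
    // "externalId" properties; no id dictionary or centroid lookup is needed.
    static List<String> externalIdsByDistance(List<PropertyVectorSketch> points) {
        List<PropertyVectorSketch> sorted = new ArrayList<>(points);
        sorted.sort(Comparator.comparingDouble(
            (PropertyVectorSketch p) -> Double.parseDouble(p.properties.get("distance"))));
        List<String> ids = new ArrayList<>();
        for (PropertyVectorSketch p : sorted) {
            ids.add(p.properties.get("externalId"));
        }
        return ids;
    }

    public static void main(String[] args) {
        List<PropertyVectorSketch> points = new ArrayList<>();
        points.add(new PropertyVectorSketch(new double[] {1.0, 2.0})
            .withProperty("externalId", "doc-far").withProperty("distance", "4.5"));
        points.add(new PropertyVectorSketch(new double[] {0.5, 0.1})
            .withProperty("externalId", "doc-near").withProperty("distance", "0.7"));
        System.out.println(externalIdsByDistance(points));
    }
}
```

With the metadata travelling with the vector, ordering clustered docs by distance from the centroid becomes a sort over the output rather than a separate lookup pass.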
