Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2008/03/30 21:02:24 UTC

[jira] Created: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Migrate Canopy and KMeans Implementations to Vectors
----------------------------------------------------

                 Key: MAHOUT-20
                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
             Project: Mahout
          Issue Type: Task
          Components: Clustering
    Affects Versions: 0.1
            Reporter: Jeff Eastman
            Assignee: Jeff Eastman


Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by Grant Ingersoll <gs...@apache.org>.
Not everyone subscribes to issues, so the dev list is the typical way  
of telling people about changes in JIRA.  Thus, you should only see  
duplicates for issues you created or are somehow subscribed to.



On Apr 11, 2008, at 2:31 PM, Samee Zahur wrote:

>> Is everybody getting duplicate copies of these posts or is it just  
>> me?
>
> Me too, one directly from JIRA membership, and the other via
> mahout-dev. Maybe jira updates do not have to be sent to mahout-dev? I
> mean, if we wanted to remain updated about changes there, we could
> simply register there, right?
>
> Samee


Re: [jira] Updated: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by Samee Zahur <sa...@gmail.com>.
>Is everybody getting duplicate copies of these posts or is it just me?

Me too, one directly from JIRA membership, and the other via
mahout-dev. Maybe jira updates do not have to be sent to mahout-dev? I
mean, if we wanted to remain updated about changes there, we could
simply register there, right?

Samee

[jira] Resolved: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman resolved MAHOUT-20.
--------------------------------

    Resolution: Fixed
      Assignee: Jeff Eastman  (was: Isabel Drost)

r649728 committed the latest patch. All unit tests run, so I am closing this issue.

> Migrate Canopy and KMeans Implementations to Vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-20
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>             Project: Mahout
>          Issue Type: Task
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: jeastman.vcf, vectorClustering.txt
>
>
> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.



[jira] Commented: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583481#action_12583481 ] 

Isabel Drost commented on MAHOUT-20:
------------------------------------

I have already done some migration for the distance metrics, so I guess I should have a first, at least rudimentary, patch available by the end of this week.

> Migrate Canopy and KMeans Implementations to Vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-20
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>             Project: Mahout
>          Issue Type: Task
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Isabel Drost
>
> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.



Re: [jira] Commented: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by Ted Dunning <td...@veoh.com>.
I should add that the continuation style is susceptible to special casing by
the matrix code itself while the iterator is not.
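
A minimal Java sketch of what special casing under the continuation style could look like. All names here (NonZeroVisitor, VisitableVector, DenseArrayVector, SparseMapVector) are hypothetical and are not taken from Mahout or Colt. The assumption is that each container implements the callback-driven loop in whatever way suits its own storage, while the calling code stays the same:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical callback type that plays the role of the continuation.
interface NonZeroVisitor {
    void visit(int index, double value);
}

// Hypothetical vector abstraction; the container owns the loop.
interface VisitableVector {
    void forEachNonZero(NonZeroVisitor visitor);
}

// Dense storage: a plain array scan that skips zeros.
final class DenseArrayVector implements VisitableVector {
    private final double[] values;

    DenseArrayVector(double... values) {
        this.values = values.clone();
    }

    @Override
    public void forEachNonZero(NonZeroVisitor visitor) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0.0) {
                visitor.visit(i, values[i]);
            }
        }
    }
}

// Sparse storage: only the stored cells are walked.
final class SparseMapVector implements VisitableVector {
    private final TreeMap<Integer, Double> cells = new TreeMap<>();

    SparseMapVector set(int index, double value) {
        if (value == 0.0) {
            cells.remove(index);
        } else {
            cells.put(index, value);
        }
        return this;
    }

    @Override
    public void forEachNonZero(NonZeroVisitor visitor) {
        for (Map.Entry<Integer, Double> e : cells.entrySet()) {
            visitor.visit(e.getKey(), e.getValue());
        }
    }
}

final class SpecialCasingSketch {
    // The caller is identical for both storage layouts.
    static double squaredNorm(VisitableVector v) {
        double[] acc = {0.0};
        v.forEachNonZero((i, x) -> acc[0] += x * x);
        return acc[0];
    }
}
```

With an external iterator the caller drives the loop, so the container has less room to pick a storage-specific traversal on the caller's behalf.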


On 4/9/08 9:21 AM, "Isabel Drost (JIRA)" <ji...@apache.org> wrote:

> 
>     [ 
> https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin
> .system.issuetabpanels:comment-tabpanel&focusedCommentId=12587253#action_12587
> 253 ] 
> 
> Isabel Drost commented on MAHOUT-20:
> ------------------------------------
> 
> 
> I guess I will correct the first point Jeff mentioned on list.
> 
> 
>> One possible alternative might be to add a sort of iterator mechanism in the
>> Vector interface. That would
>> only visit non-null elements.
> 
> +1 to that. Maybe we could add that functionality after the patch is committed,
> enhance the vector implementation after that? There were a few other points
> that Jeff mentioned. I would rather like to keep this patch focussed on the
> k-Means and Canopy classes and rather not touch the matrix stuff in it.
> 
> The alternative would be to wait with the patch until the proposed
> functionality is available in the matrix stuff.
> 
> 
> 
>> Migrate Canopy and KMeans Implementations to Vectors
>> ----------------------------------------------------
>> 
>>                 Key: MAHOUT-20
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>>             Project: Mahout
>>          Issue Type: Task
>>          Components: Clustering
>>    Affects Versions: 0.1
>>            Reporter: Jeff Eastman
>>            Assignee: Isabel Drost
>>         Attachments: vectorClustering.txt
>> 
>> 
>> Canopy and KMeans clustering implementations use Float[] representations
>> instead of the new Vector package. They need to be migrated and the Vector
>> package may need some enhancement to support the notion of payloads. This
>> would be a good project for somebody new to the project who wants to get
>> involved. If somebody wants to implement this, just assign the issue to
>> yourself and I will hold off doing it myself.


Re: [jira] Commented: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by Ted Dunning <td...@veoh.com>.

Colt uses a foreachNonZero method that accepts what is essentially a
closure.  This works reasonably well.  A real iterator might be better, but
I can't say, having used only the Colt method.

I can see that having a nonZeroIndexIterator might be a very nice way to do
this.  Ultimately, one approach or the other might play better with the JIT
and inlining, but I definitely wouldn't worry about that yet.
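
A minimal Java sketch contrasting the two styles discussed above. The names (IndexValueFunction, SparseCells, forEachNonZero, nonZeroIndexIterator) are hypothetical stand-ins chosen for illustration; this is not Colt's or Mahout's actual API, only one way the two approaches might be laid out side by side:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical callback type, standing in for the "closure".
interface IndexValueFunction {
    void apply(int index, double value);
}

// Hypothetical sparse container that stores only non-zero cells.
final class SparseCells {
    private final TreeMap<Integer, Double> cells = new TreeMap<>();

    void set(int index, double value) {
        if (value == 0.0) {
            cells.remove(index);
        } else {
            cells.put(index, value);
        }
    }

    double get(int index) {
        return cells.getOrDefault(index, 0.0);
    }

    // Closure style: the container drives the loop and calls back.
    void forEachNonZero(IndexValueFunction f) {
        for (Map.Entry<Integer, Double> e : cells.entrySet()) {
            f.apply(e.getKey(), e.getValue());
        }
    }

    // Iterator style: the caller drives the loop over non-zero indices.
    Iterator<Integer> nonZeroIndexIterator() {
        return cells.keySet().iterator();
    }
}

final class NonZeroStyles {
    static double sumWithCallback(SparseCells v) {
        double[] sum = {0.0};
        v.forEachNonZero((i, x) -> sum[0] += x);
        return sum[0];
    }

    static double sumWithIterator(SparseCells v) {
        double sum = 0.0;
        for (Iterator<Integer> it = v.nonZeroIndexIterator(); it.hasNext(); ) {
            sum += v.get(it.next());
        }
        return sum;
    }
}
```

Both methods visit only the stored cells. Which one performs better under the JIT is left open here, as it is in the message above.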


On 4/9/08 9:21 AM, "Isabel Drost (JIRA)" <ji...@apache.org> wrote:

> 
>     [ 
> https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin
> .system.issuetabpanels:comment-tabpanel&focusedCommentId=12587253#action_12587
> 253 ] 
> 
> Isabel Drost commented on MAHOUT-20:
> ------------------------------------
> 
> 
> I guess I will correct the first point Jeff mentioned on list.
> 
> 
>> One possible alternative might be to add a sort of iterator mechanism in the
>> Vector interface. That would
>> only visit non-null elements.
> 
> +1 to that. Maybe we could add that functionality after the patch is committed,
> enhance the vector implementation after that? There were a few other points
> that Jeff mentioned. I would rather like to keep this patch focussed on the
> k-Means and Canopy classes and rather not touch the matrix stuff in it.
> 
> The alternative would be to wait with the patch until the proposed
> functionality is available in the matrix stuff.
> 
> 
> 
>> Migrate Canopy and KMeans Implementations to Vectors
>> ----------------------------------------------------
>> 
>>                 Key: MAHOUT-20
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>>             Project: Mahout
>>          Issue Type: Task
>>          Components: Clustering
>>    Affects Versions: 0.1
>>            Reporter: Jeff Eastman
>>            Assignee: Isabel Drost
>>         Attachments: vectorClustering.txt
>> 
>> 
>> Canopy and KMeans clustering implementations use Float[] representations
>> instead of the new Vector package. They need to be migrated and the Vector
>> package may need some enhancement to support the notion of payloads. This
>> would be a good project for somebody new to the project who wants to get
>> involved. If somebody wants to implement this, just assign the issue to
>> yourself and I will hold off doing it myself.


[jira] Commented: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587253#action_12587253 ] 

Isabel Drost commented on MAHOUT-20:
------------------------------------


I guess I will correct the first point Jeff mentioned on list.


> One possible alternative might be to add a sort of iterator mechanism in the Vector interface. That would
> only visit non-null elements.

+1 to that. Maybe we could add that functionality after the patch is committed, and enhance the vector implementation after that? There were a few other points that Jeff mentioned. I would rather keep this patch focused on the k-Means and Canopy classes and not touch the matrix stuff in it.

The alternative would be to wait with the patch until the proposed functionality is available in the matrix stuff.



> Migrate Canopy and KMeans Implementations to Vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-20
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>             Project: Mahout
>          Issue Type: Task
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Isabel Drost
>         Attachments: vectorClustering.txt
>
>
> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.



[jira] Assigned: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Isabel Drost reassigned MAHOUT-20:
----------------------------------

    Assignee: Isabel Drost  (was: Jeff Eastman)

> Migrate Canopy and KMeans Implementations to Vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-20
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>             Project: Mahout
>          Issue Type: Task
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Isabel Drost
>
> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.



Re: [jira] Updated: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by Jeff Eastman <je...@windwardsolutions.com>.
Isabel Drost (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Isabel Drost updated MAHOUT-20:
> -------------------------------
>
>     Attachment: vectorClustering.txt
>
> I have moved the code from the use of Float[] to using Vector instead. Unit tests are all running again - would be nice if someone could have a quick look at the patch and point me to the hideous mistakes I made or point out suggestions for improvement.
>
>   
>> Migrate Canopy and KMeans Implementations to Vectors
>> ----------------------------------------------------
>>
>>                 Key: MAHOUT-20
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>>             Project: Mahout
>>          Issue Type: Task
>>          Components: Clustering
>>    Affects Versions: 0.1
>>            Reporter: Jeff Eastman
>>            Assignee: Isabel Drost
>>         Attachments: vectorClustering.txt
>>
>>
>> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.
>>     
>
>   

Hi Isabel,

- You might consider using the Vector.divide(double) operation in the 
computeCentroid() methods, but your version is the same as that 
method's implementations.
- I think Point is completely obsolete now and should be removed. There 
are still some dangling dependencies on its formatting and decoding 
operations that require it, however. If those operations were moved 
somewhere else (AbstractVector?) and the test also removed then Point 
could be eliminated.
- It would be good to make your patches from the Mahout directory so the 
paths are relative to that. Your patch applied cleanly with -p7 and all 
the unit tests ran.
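
A minimal Java sketch of the first point, assuming a centroid is the element-wise sum of the points divided by their count. TinyVector and CentroidSketch are hypothetical stand-ins written for this illustration; only the divide(double) name mirrors the Vector operation mentioned above, and none of this is the code from the patch:

```java
import java.util.List;

// Hypothetical minimal dense vector, not Mahout's Vector interface.
final class TinyVector {
    private final double[] values;

    TinyVector(double... values) {
        this.values = values.clone();
    }

    int cardinality() {
        return values.length;
    }

    double get(int index) {
        return values[index];
    }

    // Element-wise sum, returned as a new vector.
    TinyVector plus(TinyVector other) {
        if (other.values.length != values.length) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        double[] sum = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            sum[i] = values[i] + other.values[i];
        }
        return new TinyVector(sum);
    }

    // Divides every element by x, returned as a new vector.
    TinyVector divide(double x) {
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = values[i] / x;
        }
        return new TinyVector(result);
    }
}

final class CentroidSketch {
    // Sums the points, then delegates the scaling to divide(double)
    // instead of repeating the per-element loop here.
    static TinyVector computeCentroid(List<TinyVector> points) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("no points");
        }
        TinyVector total = points.get(0);
        for (int i = 1; i < points.size(); i++) {
            total = total.plus(points.get(i));
        }
        return total.divide(points.size());
    }
}
```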

+1 If you commit this patch you can clean up the other odds n ends 
another day.

+2 For staying in the game with Ted on the EM thread <grin>. I found the 
exchanges to be most beneficial to my learning process.

Jeff

[jira] Updated: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Isabel Drost updated MAHOUT-20:
-------------------------------

    Attachment: vectorClustering.txt

I have moved the code from the use of Float[] to using Vector instead. Unit tests are all running again - would be nice if someone could have a quick look at the patch and point me to the hideous mistakes I made or point out suggestions for improvement.

> Migrate Canopy and KMeans Implementations to Vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-20
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>             Project: Mahout
>          Issue Type: Task
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Isabel Drost
>         Attachments: vectorClustering.txt
>
>
> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.



[jira] Updated: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-20:
-------------------------------

    Attachment: jeastman.vcf

+1 There are many, many opportunities to improve on the Vector and 
Matrix implementations and it's nice to see so much enthusiasm for doing 
that. I agree an efficient mechanism to iterate over sparse entities is 
needed and both of Ted's suggestions make sense. If we make the commits 
in little steps based upon empirical evidence we will be ahead. We were 
fooled a little with the surprising performance of the initial 
simple-minded SparseVector implementation, which proved to be better than 
expected. Also, making significant changes to more than one component at 
a time seems unnecessary and even imprudent.

Is everybody getting duplicate copies of these posts or is it just me?

Jeff


> Migrate Canopy and KMeans Implementations to Vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-20
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>             Project: Mahout
>          Issue Type: Task
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Isabel Drost
>         Attachments: jeastman.vcf, vectorClustering.txt
>
>
> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.



[jira] Updated: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-20:
------------------------------

    Fix Version/s: 0.1

> Migrate Canopy and KMeans Implementations to Vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-20
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>             Project: Mahout
>          Issue Type: Task
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.1
>
>         Attachments: jeastman.vcf, vectorClustering.txt
>
>
> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.



[jira] Commented: (MAHOUT-20) Migrate Canopy and KMeans Implementations to Vectors

Posted by "Samee Zahur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587223#action_12587223 ] 

Samee Zahur commented on MAHOUT-20:
-----------------------------------

Some of the functions, like add or distance, seem to iterate through each dimension of the point in a conventional loop: 
for(int i=0;i<z.cardinality();i++) ......
or something like this. With high-dimensional input, this seems to cancel out most of the advantages gained by using SparseVector: we are not taking advantage of the sparseness of the input data, and we loop through all the elements in all cases. One possible alternative might be to add a sort of iterator mechanism in the Vector interface. That would only visit non-null elements. 

Samee
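
A minimal Java sketch of the cost difference described above. SparseVec is a hypothetical stand-in, not Mahout's SparseVector, and its method names are assumptions made for illustration. The conventional loop does work proportional to cardinality(), while the sparse loop does work proportional to the number of stored entries:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sparse vector that stores only non-zero entries.
final class SparseVec {
    private final int cardinality;
    private final Map<Integer, Double> values = new TreeMap<>();

    SparseVec(int cardinality) {
        this.cardinality = cardinality;
    }

    int cardinality() {
        return cardinality;
    }

    double get(int index) {
        Double v = values.get(index);
        return v == null ? 0.0 : v;
    }

    void set(int index, double value) {
        if (value == 0.0) {
            values.remove(index);
        } else {
            values.put(index, value);
        }
    }

    // Conventional loop: visits every dimension, stored or not.
    double dotDense(SparseVec other) {
        double sum = 0.0;
        for (int i = 0; i < cardinality; i++) {
            sum += get(i) * other.get(i);
        }
        return sum;
    }

    // Sparse loop: visits only the entries this vector actually stores.
    double dotSparse(SparseVec other) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            sum += e.getValue() * other.get(e.getKey());
        }
        return sum;
    }
}
```

Both methods return the same value; the difference is only in how many elements are touched when most dimensions are empty.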

> Migrate Canopy and KMeans Implementations to Vectors
> ----------------------------------------------------
>
>                 Key: MAHOUT-20
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-20
>             Project: Mahout
>          Issue Type: Task
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Isabel Drost
>         Attachments: vectorClustering.txt
>
>
> Canopy and KMeans clustering implementations use Float[] representations instead of the new Vector package. They need to be migrated and the Vector package may need some enhancement to support the notion of payloads. This would be a good project for somebody new to the project who wants to get involved. If somebody wants to implement this, just assign the issue to yourself and I will hold off doing it myself.
