Posted to dev@mahout.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2009/06/20 15:28:07 UTC

[jira] Created: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Convert Clustering Algs to use Vector Writable
----------------------------------------------

                 Key: MAHOUT-137
                 URL: https://issues.apache.org/jira/browse/MAHOUT-137
             Project: Mahout
          Issue Type: Improvement
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll
             Fix For: 0.2


All M/R jobs should use Vector writable instead of encoding and decoding strings.  We can have a separate utility that converts serialized GSON, Strings, whatever into the appropriate vectors.  See MAHOUT-136 and http://www.lucidimagination.com/search/document/6a55f260826fd77f/jira_commented_mahout_136_change_canopy_mr_implementation_to_use_vector_writable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722754#action_12722754 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

http://hadoop.markmail.org/message/jr4cbem46erlhgzu



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722272#action_12722272 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

Yes, this is the plan.  The problem I'm having right now is between the CanopyDriver and the ClusterDriver for Canopy.  I've made Canopy Writable (similar to the formatString approach where it just stores the centroid and the canopy id) but this isn't fully working yet.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723193#action_12723193 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

Short term: have the examples job just convert them before running Hadoop

Long term: factor the canopy- and kmeans-specific stuff out of Canopy and Cluster. Replace Canopy with a simplified Cluster.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722360#action_12722360 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

You got bitten by the fact that the reader does not create a distinct instance on each call; it reuses the same Canopy value. Every entry added to the list is then the same object, which makes all of the canopies identical and messes up the test. Here I'm making a copy of the canopy before adding it to the canopies list. The unit test now passes. Before committing these changes, you really ought to fix the code in examples too.

{noformat}
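      try {
        Text key = new Text();
        Canopy value = new Canopy();
        // reader.next() refills the same key/value instances on every call,
        // so copy the canopy before storing it.
        while (reader.next(key, value)) {
          canopies.add(new Canopy(value.getCenter(), value.getCanopyId()));
        }
      } finally {
        reader.close();
      }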
{noformat}
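
The reuse problem described above can be reproduced without Hadoop. The following is only a minimal sketch: `Point`, `ReusingReader` and `ReuseDemo` are hypothetical stand-ins for Canopy and the sequence-file reader, not Mahout or Hadoop classes. The only assumption carried over is that `next()` refills the instance the caller passes in rather than allocating a new one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Writable value such as Canopy.
final class Point {
  int id;
  double[] center;

  Point() {}

  Point(int id, double[] center) {
    this.id = id;
    this.center = center.clone(); // copy the array so the new object is independent
  }
}

// Hypothetical reader that fills the caller-supplied instance on each call
// instead of creating a new object.
final class ReusingReader {
  private final double[][] rows;
  private int pos = 0;

  ReusingReader(double[][] rows) {
    this.rows = rows;
  }

  boolean next(Point value) {
    if (pos >= rows.length) {
      return false;
    }
    value.id = pos;
    value.center = rows[pos];
    pos++;
    return true;
  }
}

final class ReuseDemo {
  // Buggy: the same Point object is added to the list on every iteration.
  static List<Point> readWithoutCopy(double[][] rows) {
    ReusingReader reader = new ReusingReader(rows);
    List<Point> out = new ArrayList<>();
    Point value = new Point();
    while (reader.next(value)) {
      out.add(value);
    }
    return out;
  }

  // Fixed: a fresh copy is stored for every record.
  static List<Point> readWithCopy(double[][] rows) {
    ReusingReader reader = new ReusingReader(rows);
    List<Point> out = new ArrayList<>();
    Point value = new Point();
    while (reader.next(value)) {
      out.add(new Point(value.id, value.center));
    }
    return out;
  }

  public static void main(String[] args) {
    double[][] rows = {{1.0, 1.0}, {5.0, 5.0}};
    System.out.println(readWithoutCopy(rows).get(0).id); // prints 1: entry 0 aliases the last record
    System.out.println(readWithCopy(rows).get(0).id);    // prints 0
  }
}
```

In the buggy variant every list element is the same object, so the list appears to hold only the last record read.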



[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment: MAHOUT-137.patch

Canopy tests pass.  Ran the Synthetic control in local mode and it works, but haven't validated the output, as we need to write up the OutputDriver that takes in a sequence file and outputs GSON.

The concrete Vector implementation now needs to be passed in.  Also changed computeCentroid to return Vector (the actual implementation is still Sparse, but we should preserve the flexibility).



[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment: MAHOUT-137.patch

Here's a start, but not all the tests pass yet.



Re: [jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1

Grant Ingersoll (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Grant Ingersoll updated MAHOUT-137:
> -----------------------------------
>
>     Attachment: MAHOUT-137.patch
>
> Updates LuceneIterable to have options for output VectorIterable.  Also makes it easy for others to plug in their output mechanism.
>
> I'd like to commit this as an interim today so that Jeff can sync up and work on his side.
>


[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment: MAHOUT-137.patch

Updates LuceneIterable to have options for output VectorIterable.  Also makes it easy for others to plug in their output mechanism.

I'd like to commit this as an interim today so that Jeff can sync up and work on his side.



[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment: MAHOUT-137.patch

Found the issues with *KMeans.  Tests pass.   Need to update the examples, then I will commit.



[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment: MAHOUT-137.patch

Tests pass, fixed the issue w/ Canopy -> Cluster mapping.  Will commit shortly.

Jeff, can you hook in your AbstractVector stuff to replace my string workaround (for serialization of Canopy/Cluster)?





[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment: MAHOUT-137.patch

The right patch.



[jira] Issue Comment Edited: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722754#action_12722754 ] 

Grant Ingersoll edited comment on MAHOUT-137 at 6/22/09 11:41 AM:
------------------------------------------------------------------

I asked the question on http://hadoop.markmail.org/message/jr4cbem46erlhgzu

      was (Author: gsingers):
    http://hadoop.markmail.org/message/jr4cbem46erlhgzu
  


[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723688#action_12723688 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

FYI, I've refactored Canopy/Cluster slightly to have a base class in common.  I also have been putting together some output tools that live in the utils module (similar to the LuceneIterable, etc.)



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722361#action_12722361 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

Ah, cool.  Thanks!  I've got a ways to go on this one, but will start with small steps.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722690#action_12722690 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

Also, did you look at what I did in the patch I posted to handle it?  Basically, it pushes the question off to the user.

Of course, that is slightly less than ideal.  It seems like people shouldn't have to care about the underlying implementation.  Furthermore, I don't know how likely it is that one would need to mix dense w/ sparse.  Intuition suggests that if one vector needs to be dense, then most vectors are likely to be dense, and likewise that if one vector is sparse, the nature of the problem is such that all vectors are sparse (thinking of text).  This isn't based on any personal experience, though; it's just a guess.



Re: [jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
If we really want to be vector-type agnostic, perhaps caching the class 
found in readVector would be a reasonable improvement.
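
One way such caching could look, as a sketch only: a map from class name to resolved `Class` object, so that the reflective lookup runs once per distinct name rather than once per vector read. `ClassCache` and its `LOOKUPS` counter are hypothetical names introduced for illustration; they are not part of Mahout's AbstractVector.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: memoizes Class.forName results by class name.
final class ClassCache {
  private static final ConcurrentMap<String, Class<?>> CACHE = new ConcurrentHashMap<>();

  // Counts real Class.forName calls; present only so the effect can be observed.
  static final AtomicInteger LOOKUPS = new AtomicInteger();

  static Class<?> forName(String className) {
    return CACHE.computeIfAbsent(className, name -> {
      LOOKUPS.incrementAndGet();
      try {
        return Class.forName(name);
      } catch (ClassNotFoundException e) {
        // Unchecked so it can propagate out of the lambda; nothing is cached on failure.
        throw new IllegalArgumentException("Unknown class: " + name, e);
      }
    });
  }

  private ClassCache() {}
}
```

A reader would then call `ClassCache.forName(name)` in place of `Class.forName(name)`; repeated reads of the same vector type hit the map instead of the class loader.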


Grant Ingersoll (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722666#action_12722666 ] 
>
> Grant Ingersoll commented on MAHOUT-137:
> ----------------------------------------
>
> The only thing I worry about w/ this approach is that forName() call is pretty time consuming.
>


[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722666#action_12722666 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

The only thing I worry about w/ this approach is that the forName() call is pretty time-consuming.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723187#action_12723187 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

The KMeans examples seem a bit trickier, b/c they seem to be abusing the fact that the output of Canopy looks very much like a Cluster as well when viewed as Text.  Unfortunately, the KMeansMapper is looking for a Cluster object, but is getting a Canopy.

Any thoughts on how to remedy?



[jira] Resolved: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-137.
------------------------------------

    Resolution: Fixed

I believe they are all converted.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722663#action_12722663 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

Here's some code (which depends upon the AbstractVector methods becoming public) which encodes the class name in addition to the elements and is vector-type agnostic.

{noformat}
  @Override
  public void readFields(DataInput in) throws IOException {
    this.canopyId = in.readInt();
    this.center = AbstractVector.readVector(in);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(canopyId);
    AbstractVector.writeVector(out, computeCentroid());
  }
{noformat}
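
To make the "class name plus elements" idea concrete, here is a self-contained sketch of how such an encoding might work. It does not reproduce the actual AbstractVector methods: `SimpleVector`, `DenseVec`, `SparseVec` and `VectorIO` are hypothetical types invented for this illustration, and the wire format (UTF class name followed by the vector's own payload) is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal vector contract; NOT Mahout's Vector interface.
interface SimpleVector {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
  double get(int index);
  int size();
}

// Dense layout: every element is stored.
final class DenseVec implements SimpleVector {
  private double[] values = new double[0];
  DenseVec() {}
  DenseVec(double[] values) { this.values = values.clone(); }
  public void write(DataOutput out) throws IOException {
    out.writeInt(values.length);
    for (double v : values) out.writeDouble(v);
  }
  public void readFields(DataInput in) throws IOException {
    values = new double[in.readInt()];
    for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
  }
  public double get(int index) { return values[index]; }
  public int size() { return values.length; }
}

// Sparse layout: only explicitly set elements are stored.
final class SparseVec implements SimpleVector {
  private int cardinality;
  private final Map<Integer, Double> entries = new TreeMap<>();
  SparseVec() {}
  SparseVec(int cardinality) { this.cardinality = cardinality; }
  void set(int index, double value) { entries.put(index, value); }
  public void write(DataOutput out) throws IOException {
    out.writeInt(cardinality);
    out.writeInt(entries.size());
    for (Map.Entry<Integer, Double> e : entries.entrySet()) {
      out.writeInt(e.getKey());
      out.writeDouble(e.getValue());
    }
  }
  public void readFields(DataInput in) throws IOException {
    cardinality = in.readInt();
    int count = in.readInt();
    entries.clear();
    for (int i = 0; i < count; i++) {
      int index = in.readInt();
      entries.put(index, in.readDouble());
    }
  }
  public double get(int index) { return entries.getOrDefault(index, 0.0); }
  public int size() { return cardinality; }
}

final class VectorIO {
  // Writes the concrete class name first, then the vector's own payload.
  static void writeVector(DataOutput out, SimpleVector v) throws IOException {
    out.writeUTF(v.getClass().getName());
    v.write(out);
  }

  // Reads the class name, instantiates that class reflectively, then lets it read its payload.
  static SimpleVector readVector(DataInput in) throws IOException {
    String className = in.readUTF();
    try {
      SimpleVector v = Class.forName(className)
          .asSubclass(SimpleVector.class)
          .getDeclaredConstructor()
          .newInstance();
      v.readFields(in);
      return v;
    } catch (ReflectiveOperationException e) {
      throw new IOException("Cannot instantiate " + className, e);
    }
  }

  // Demonstration helper: serialize to a byte array and read it back.
  static SimpleVector roundTrip(SimpleVector v) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      writeVector(new DataOutputStream(bytes), v);
      return readVector(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the class name travels with the data, the reading side never has to be told whether a vector is dense or sparse. The cost is the per-record class name and the reflective lookup on every read.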



Re: [jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by Grant Ingersoll <gs...@apache.org>.
Yeah, I was debating doing that.  I don't like committing tests that  
fail.  Might make sense to branch.

On Jun 22, 2009, at 11:20 PM, Jeff Eastman wrote:

> Hi Grant,
>
> For me it would be easier if you commit what you have now and all of  
> us commit to work through the remaining issues. I think we  
> understand the migration gotcha patterns, we just haven't found them  
> all yet. Having to install/reinstall big patch wads doesn't help IMO.
>
>
>
> Grant Ingersoll (JIRA) wrote:
>>     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel 
>>  ]
>>
>> Grant Ingersoll updated MAHOUT-137:
>> -----------------------------------
>>
>>    Attachment: MAHOUT-137.patch
>>
>> Fuzzy kMeans conversion, but tests fail.  Some doubt in my mind  
>> about the validity of some of the tests, but still working through  
>> those.  Could use some extra eyes from the authors of these pieces.
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/



Re: [jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Grant,

For me it would be easier if you commit what you have now and all of us 
commit to work through the remaining issues. I think we understand the 
migration gotcha patterns, we just haven't found them all yet. Having to 
install/reinstall big patch wads doesn't help IMO.



Grant Ingersoll (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Grant Ingersoll updated MAHOUT-137:
> -----------------------------------
>
>     Attachment: MAHOUT-137.patch
>
> Fuzzy kMeans conversion, but tests fail.  Some doubt in my mind about the validity of some of the tests, but still working through those.  Could use some extra eyes from the authors of these pieces.
>


[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment: MAHOUT-137.patch

Fuzzy kMeans conversion, but tests fail.  Some doubt in my mind about the validity of some of the tests, but still working through those.  Could use some extra eyes from the authors of these pieces.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722382#action_12722382 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

Evidently, Hadoop needs to know the concrete class so it does not have to marshal the class name with every instance. It makes sense and is more efficient, but it will require us to be more clever about using DenseVectors. A job argument would do the trick, and we might want to add another to specify the Binary/Json output encoding so we don't always have to do an output driver step to get something human-readable.
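
As a rough illustration of the two job arguments suggested above, the sketch below resolves a vector type and an output encoding once, at configuration time. java.util.Properties stands in for the Hadoop job configuration, and the key names, the enum names and the defaults are all hypothetical.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical holder for the two job arguments discussed above.
final class ClusteringJobArgs {
  enum VectorType { DENSE, SPARSE }
  enum OutputEncoding { BINARY, JSON }

  // Assumed key names; the real job may use different ones.
  static final String VECTOR_TYPE_KEY = "sketch.vector.type";
  static final String OUTPUT_ENCODING_KEY = "sketch.output.encoding";

  final VectorType vectorType;
  final OutputEncoding outputEncoding;

  private ClusteringJobArgs(VectorType vectorType, OutputEncoding outputEncoding) {
    this.vectorType = vectorType;
    this.outputEncoding = outputEncoding;
  }

  // Parses both arguments once, so no per-record type information is needed.
  static ClusteringJobArgs fromConf(Properties conf) {
    VectorType type = VectorType.valueOf(
        conf.getProperty(VECTOR_TYPE_KEY, "sparse").trim().toUpperCase(Locale.ROOT));
    OutputEncoding encoding = OutputEncoding.valueOf(
        conf.getProperty(OUTPUT_ENCODING_KEY, "binary").trim().toUpperCase(Locale.ROOT));
    return new ClusteringJobArgs(type, encoding);
  }
}
```

An unrecognized value makes valueOf throw IllegalArgumentException, which fails the job at setup rather than partway through a run.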



Re: [jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 23, 2009, at 11:18 AM, Jeff Eastman wrote:

> That makes sense, though I don't understand why the reducer is not  
> doing its job in the test you cite. I've had to do manual things  
> (like calling close() in the unit tests) to get all of the  
> functionality to exercise.
> All of the clustering algorithms behave similarly: each cluster has  
> a center (prior) which is used to observe some of the data  
> (observations) based upon a distance function (pdf), which is used  
> to compute its new centroid (posterior). I think it is possible to  
> abstract them into a common framework using this model.
>

It makes sense b/c the M/R pieces rely on the fact that everything  
round trips through the serialization/deserialization phase, whereas  
that particular test does not do that.  The centroid from one  
iteration thus becomes the center for the next iteration, AFAICT.

Re: [jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
That makes sense, though I don't understand why the reducer is not doing 
its job in the test you cite. I've had to do manual things (like calling 
close() in the unit tests) to get all of the functionality to exercise.

All of the clustering algorithms behave similarly: each cluster has a 
center (prior) which is used to observe some of the data (observations) 
based upon a distance function (pdf), which is used to compute its new 
centroid (posterior). I think it is possible to abstract them into a 
common framework using this model.
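
A minimal sketch of what such a common framework could look like, assuming one-dimensional points and hard (nearest-model) assignment. SketchModel, MeanModel and SketchClusterer are hypothetical names; fuzzy or Dirichlet-style algorithms would weight observations by the pdf instead of picking a single nearest model.

```java
import java.util.List;

// Hypothetical common abstraction: prior center, observations, posterior centroid.
interface SketchModel {
  double distance(double point);   // plays the role of the distance function (pdf)
  void observe(double point);      // accumulate an observation
  void computePosterior();         // move the center to the new centroid
  double center();
}

final class MeanModel implements SketchModel {
  private double center;  // prior for the current iteration
  private double total;
  private int count;

  MeanModel(double prior) { this.center = prior; }

  public double distance(double point) { return Math.abs(point - center); }

  public void observe(double point) {
    total += point;
    count++;
  }

  public void computePosterior() {
    if (count > 0) {
      center = total / count; // the centroid becomes the next iteration's center
    }
    total = 0.0;
    count = 0;
  }

  public double center() { return center; }
}

final class SketchClusterer {
  private SketchClusterer() {}

  // One iteration: each point is observed by its nearest model, then all models update.
  // Assumes a non-empty list of models.
  static void iterate(List<? extends SketchModel> models, double[] data) {
    for (double point : data) {
      SketchModel nearest = models.get(0);
      for (SketchModel model : models) {
        if (model.distance(point) < nearest.distance(point)) {
          nearest = model;
        }
      }
      nearest.observe(point);
    }
    for (SketchModel model : models) {
      model.computePosterior();
    }
  }
}
```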


Grant Ingersoll (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723067#action_12723067 ] 
>
> Grant Ingersoll commented on MAHOUT-137:
> ----------------------------------------
>
> I see the problem now with KMeans (and likely Fuzzy KMeans), and it is a source of confusion.  Namely, it's the whole relationship between Cluster.center and Cluster.centroid.  It seems that as the Cluster goes from formatCluster through decodeCluster, the centroid (computed in formatCluster) becomes the center for the next time around.  In testKMeansReducer, this never happens since we aren't serializing through the string layer.
>
> Obviously, I can correct this in the test, but it seems a bit strange.  AIUI, the center holds the current iteration center and it seems like the centroid is the result of where the center is being moved to, right?  This does indeed happen in my implementation of Writable, but since that isn't being called in the test, it doesn't occur.
>


[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723067#action_12723067 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

I see the problem now with KMeans (and likely Fuzzy KMeans), and it is a source of confusion.  Namely, it's the whole relationship between Cluster.center and Cluster.centroid.  It seems that as the Cluster goes from formatCluster through decodeCluster, the centroid (computed in formatCluster) becomes the center for the next time around.  In testKMeansReducer, this never happens since we aren't serializing through the string layer.

Obviously, I can correct this in the test, but it seems a bit strange.  AIUI, the center holds the current iteration center and it seems like the centroid is the result of where the center is being moved to, right?  This does indeed happen in my implementation of Writable, but since that isn't being called in the test, it doesn't occur.
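
To make the center/centroid hand-off concrete, here is a small sketch under stated assumptions: SketchCluster is a hypothetical class, not the real Cluster, and it assumes write() emits the computed centroid while readFields() stores whatever it reads as the center.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical cluster; only illustrates the serialization round trip.
final class SketchCluster {
  private int id;
  private double[] center;      // center used during the current iteration
  private double[] pointTotal;  // running sum of observed points
  private int numPoints;

  SketchCluster() {}            // no-arg constructor for deserialization

  SketchCluster(int id, double[] center) {
    this.id = id;
    this.center = center.clone();
    this.pointTotal = new double[center.length];
  }

  void addPoint(double[] point) {
    for (int i = 0; i < point.length; i++) {
      pointTotal[i] += point[i];
    }
    numPoints++;
  }

  double[] getCenter() { return center.clone(); }

  // Mean of the observed points, or the old center if nothing was observed.
  double[] computeCentroid() {
    if (numPoints == 0) {
      return center.clone();
    }
    double[] centroid = new double[pointTotal.length];
    for (int i = 0; i < centroid.length; i++) {
      centroid[i] = pointTotal[i] / numPoints;
    }
    return centroid;
  }

  // Serializes the centroid, not the current center.
  void write(DataOutput out) throws IOException {
    double[] centroid = computeCentroid();
    out.writeInt(id);
    out.writeInt(centroid.length);
    for (double value : centroid) {
      out.writeDouble(value);
    }
  }

  // The value written as the centroid is read back as the new center.
  void readFields(DataInput in) throws IOException {
    id = in.readInt();
    int length = in.readInt();
    center = new double[length];
    for (int i = 0; i < length; i++) {
      center[i] = in.readDouble();
    }
    pointTotal = new double[length];
    numPoints = 0;
  }

  static SketchCluster roundTrip(SketchCluster cluster) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      cluster.write(new DataOutputStream(bytes));
      SketchCluster copy = new SketchCluster();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test that compares getCenter() without going through roundTrip() still sees the old center, which matches the mismatch described above.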



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722383#action_12722383 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

This is a slightly more efficient patch for the ClusterMapper:

{noformat}
        Canopy value = new Canopy();
        while (reader.next(key, value)) {
          canopies.add(value);
          value = new Canopy();
        }
{noformat}
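
The loop above allocates a fresh Canopy only after the previous one has been stored. The sketch below shows why the fresh instance matters when a reader fills the object passed to it; Record and FillingReader are hypothetical stand-ins, not Hadoop or Mahout classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ReuseSketch {
  private ReuseSketch() {}

  // Hypothetical record type standing in for a Writable value.
  static final class Record {
    int id;
  }

  // Stand-in reader that, like a Writable-style reader, fills the caller's object.
  static final class FillingReader {
    private final Iterator<Integer> ids;

    FillingReader(List<Integer> ids) { this.ids = ids.iterator(); }

    boolean next(Record value) {
      if (!ids.hasNext()) {
        return false;
      }
      value.id = ids.next();
      return true;
    }
  }

  // Same shape as the loop above: store the filled instance, then allocate a new one.
  static List<Record> readAll(FillingReader reader) {
    List<Record> records = new ArrayList<>();
    Record value = new Record();
    while (reader.next(value)) {
      records.add(value);
      value = new Record();
    }
    return records;
  }

  // Buggy variant: every list entry aliases the same, repeatedly overwritten object.
  static List<Record> readAllAliased(FillingReader reader) {
    List<Record> records = new ArrayList<>();
    Record value = new Record();
    while (reader.next(value)) {
      records.add(value);
    }
    return records;
  }
}
```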



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722708#action_12722708 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

Yes, I saw that and that was my original approach too. I do like the ability to have the clustering jobs be vector-type agnostic, and pushing it into an argument does work. On output, we still need it as a job argument since we need to know the type at config-time. This also allows us to use the same internal form between mapper and reducer steps in a clustering. I agree users would not like to have to worry about specifying it if we could avoid it; maybe that's the real question for core-user.

I also think it is unlikely that a given application of clustering would mix sparse and dense vectors though it would allow us to make the particular encoding be automatic on a per-instance basis. Using the optimized AbstractVector methods on input would add a little storage overhead to the input data but would allow this flexibility. 





[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722373#action_12722373 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

bq. I find it a bit troubling that SparseVector.class must be specified explicitly as the map and job output types instead of just Vector.class

Agreed, but I was following your lead.

Sounds good on MS and Dirichlet



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723235#action_12723235 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

Committed revision 787776.  This contains the patch I just submitted.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722246#action_12722246 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

MAHOUT-136 changed Canopy to use Writable between map and reduce steps, but input and output formats are still Text. In the interests of consistency and efficiency, it makes sense to convert all of the clustering jobs to use Writables for I/O too. We can have a separate utility job to convert from Writable form to Json or other textual representations if that is needed. Since most clustering jobs will have an input step to prepare the points for clustering anyway, having this output Writables vs Text would be a small change.



Re: [jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Looks like you missed Sean's am commit to Vector Cloneable but otherwise 
the patch applied cleanly.

The stuff after BufferedReader looks to be comparing expected reducer 
output with actual. Not very readable tho.

From my performance test, the optimization I added to AbstractVector to 
cache the class and the subset optimization you added in 
vectorNameToVector are not justified. I ran 100k iterations of 
serializing/deserializing small vectors with and without my optimization 
and the performance was indistinguishable. I conclude it is being cached 
already by the JDK.

I'd suspect Writable identity issues in your test code but I can't find 
it. It's plaguing me big time with MeanShift.

I'm going to let my brain unwind for a while and try again.


Grant Ingersoll (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Grant Ingersoll updated MAHOUT-137:
> -----------------------------------
>
>     Attachment: MAHOUT-137.patch
>
> Draft of KMeans conversion.  Most tests pass except testKMeansReducer and testKMeansMRJob.  
>
> In reading the testKMeansMRJob() it is not clear to me what that last part of the test is doing (after the BufferedReader)
>
> As for the Reducer test, I'm not sure why the Centers aren't matching up.
>
> Some extra eyes would be appreciated.
>


[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment: MAHOUT-137.patch

Draft of KMeans conversion.  Most tests pass except testKMeansReducer and testKMeansMRJob.  

In reading the testKMeansMRJob() it is not clear to me what that last part of the test is doing (after the BufferedReader)

As for the Reducer test, I'm not sure why the Centers aren't matching up.

Some extra eyes would be appreciated.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723102#action_12723102 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

OK, I've got KMeans passing.  It was a transposition error in the test on my part.  Now trying to get Fuzzy KMeans working.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722295#action_12722295 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

I should mention, the patch is only for Canopy, and only two tests in TestCanopyCreation fail.  All the pieces pass except for the main one that tests the full M/R job.



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723221#action_12723221 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

I think I found a better way: SequenceFile.Reader.getValueClass() returns the type.  I should be able to detect whether it is Canopy or Cluster and deal appropriately.  I'll post a patch if it works.
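
A minimal sketch of the dispatch step, assuming the value class has already been obtained from SequenceFile.Reader.getValueClass(). SketchCanopy, SketchKMeansCluster and InputKind are hypothetical stand-ins so the example runs without Hadoop on the classpath.

```java
// Sketch only: decides how to treat an input file based on its value class.
final class ValueClassDispatch {
  private ValueClassDispatch() {}

  // Hypothetical stand-ins for the Canopy and Cluster Writable types.
  static class SketchCanopy {}
  static class SketchKMeansCluster {}

  enum InputKind { CANOPY, CLUSTER }

  // In a real job, valueClass would come from the SequenceFile reader.
  static InputKind detect(Class<?> valueClass) {
    if (SketchCanopy.class.isAssignableFrom(valueClass)) {
      return InputKind.CANOPY;
    }
    if (SketchKMeansCluster.class.isAssignableFrom(valueClass)) {
      return InputKind.CLUSTER;
    }
    throw new IllegalArgumentException("Unsupported value class: " + valueClass.getName());
  }
}
```

Using isAssignableFrom rather than equality means a subclass of either type is handled the same way as its parent; that is a design choice of this sketch, not something the comment above specifies.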



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722688#action_12722688 ] 

Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

BTW, this seems like a good question for core-user@hadoop.a.o.





[jira] Updated: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-137:
-----------------------------------

    Attachment:     (was: MAHOUT-137.patch)

> Convert Clustering Algs to use Vector Writable
> ----------------------------------------------
>
>                 Key: MAHOUT-137
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-137
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-137.patch, MAHOUT-137.patch, MAHOUT-137.patch
>
>
> All M/R jobs should use Vector writable instead of encoding and decoding strings.  We can have a separate utility that converts serialized GSON, Strings, whatever into the appropriate vectors.  See MAHOUT-136 and http://www.lucidimagination.com/search/document/6a55f260826fd77f/jira_commented_mahout_136_change_canopy_mr_implementation_to_use_vector_writable



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722372#action_12722372 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

How about we add a job argument to set whether to use DenseVector or SparseVector?

It looks like we will now need an OutputDriver step in Synthetic Control to convert the output back to human-readable form. I have a patch for the rest of it if you want it; let me know.

How about I work on Mean Shift and Dirichlet later today while you finish Canopy and do k-means?
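
A minimal sketch of the job-argument idea above, using only the JDK. The names here (Vec, DenseVec, SparseVec, and the "dense"/"sparse" argument values) are hypothetical stand-ins for illustration, not Mahout's actual classes or options. The point is only that a string argument can be resolved once, in the driver, to the concrete class that the job is then configured with.

```java
import java.util.Locale;

public class VectorTypeArgSketch {
    /** Hypothetical stand-in for the Vector interface. */
    public interface Vec {}

    /** Hypothetical stand-in for a dense implementation. */
    public static class DenseVec implements Vec {}

    /** Hypothetical stand-in for a sparse implementation. */
    public static class SparseVec implements Vec {}

    /** Maps a job argument ("dense" or "sparse", case-insensitive) to a concrete class. */
    public static Class<? extends Vec> resolve(String arg) {
        String key = arg.trim().toLowerCase(Locale.ROOT);
        if (key.equals("dense")) {
            return DenseVec.class;
        }
        if (key.equals("sparse")) {
            return SparseVec.class;
        }
        throw new IllegalArgumentException("Unknown vector type: " + arg);
    }

    public static void main(String[] args) {
        // Default to sparse when no argument is given (an arbitrary choice for this sketch).
        String arg = args.length > 0 ? args[0] : "sparse";
        System.out.println(resolve(arg).getSimpleName());
    }
}
```

In a real driver, the resolved class would presumably be what gets passed as the map and job output value type.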

> Convert Clustering Algs to use Vector Writable
> ----------------------------------------------
>
>                 Key: MAHOUT-137
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-137
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-137.patch
>
>
> All M/R jobs should use Vector writable instead of encoding and decoding strings.  We can have a separate utility that converts serialized GSON, Strings, whatever into the appropriate vectors.  See MAHOUT-136 and http://www.lucidimagination.com/search/document/6a55f260826fd77f/jira_commented_mahout_136_change_canopy_mr_implementation_to_use_vector_writable



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723678#action_12723678 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

Revisions 788071 and 788116 implement the Writable changes to MeanShift and Dirichlet clustering. MeanShift no longer has the bogus combiner, but it still holds all clustered points, so it really won't scale well. Dirichlet needs some more fixing, but that is another issue.

The directory structures also need some cleanup to make naming more uniform. I will do that under this issue since it is minor.

> Convert Clustering Algs to use Vector Writable
> ----------------------------------------------
>
>                 Key: MAHOUT-137
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-137
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-137.patch, MAHOUT-137.patch, MAHOUT-137.patch, MAHOUT-137.patch, MAHOUT-137.patch, MAHOUT-137.patch, MAHOUT-137.patch
>
>
> All M/R jobs should use Vector writable instead of encoding and decoding strings.  We can have a separate utility that converts serialized GSON, Strings, whatever into the appropriate vectors.  See MAHOUT-136 and http://www.lucidimagination.com/search/document/6a55f260826fd77f/jira_commented_mahout_136_change_canopy_mr_implementation_to_use_vector_writable



[jira] Commented: (MAHOUT-137) Convert Clustering Algs to use Vector Writable

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722367#action_12722367 ] 

Jeff Eastman commented on MAHOUT-137:
-------------------------------------

I find it a bit troubling that SparseVector.class must be specified explicitly as the map and job output types instead of just Vector.class.
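
One likely reason, sketched here with JDK-only stand-ins: as far as I understand it, Hadoop's Writable deserialization creates a fresh instance of the declared class by reflection and then fills it in, and an interface has no constructor to call. Vec and SparseVec below are hypothetical placeholders, not the Mahout types, and this illustrates only the reflection constraint, not the actual Hadoop code path.

```java
import java.lang.reflect.Constructor;

public class ConcreteTypeSketch {
    /** Hypothetical stand-in for the Vector interface. */
    public interface Vec {
        int size();
    }

    /** Hypothetical stand-in for a concrete class such as SparseVector. */
    public static class SparseVec implements Vec {
        public SparseVec() {}

        @Override
        public int size() {
            return 0;
        }
    }

    /** Returns true if an instance can be created through a no-arg constructor. */
    public static boolean canInstantiate(Class<?> type) {
        try {
            Constructor<?> ctor = type.getDeclaredConstructor();
            ctor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // Interfaces have no constructors, so the lookup fails for them.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Vec instantiable: " + canInstantiate(Vec.class));
        System.out.println("SparseVec instantiable: " + canInstantiate(SparseVec.class));
    }
}
```

A framework that only knows the interface type would therefore need the concrete class name carried some other way, for example written alongside the data.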

> Convert Clustering Algs to use Vector Writable
> ----------------------------------------------
>
>                 Key: MAHOUT-137
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-137
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: MAHOUT-137.patch
>
>
> All M/R jobs should use Vector writable instead of encoding and decoding strings.  We can have a separate utility that converts serialized GSON, Strings, whatever into the appropriate vectors.  See MAHOUT-136 and http://www.lucidimagination.com/search/document/6a55f260826fd77f/jira_commented_mahout_136_change_canopy_mr_implementation_to_use_vector_writable
