Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2009/06/20 02:54:07 UTC

[jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable

    [ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722105#action_12722105 ] 

Jeff Eastman commented on MAHOUT-136:
-------------------------------------

r786738 committed the following changes.
- Modified CanopyMapper and CanopyReducer to produce and consume Canopy centroids as Writable values vs. previous formatStrings
- Modified CanopyMapper to specify SparseVector output from mapper
- Fixed null name hash() bug in SparseVector
- Modified Canopy.emitPointToExistingCanopies to emit only canopy id and not full serialized canopy. 
- This eliminates the need for the OutputDriver and OutputMapper in synthetic control example so they are deleted.
- Updated unit tests; all tests run
- Synthetic control example runs
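
As a rough illustration of the first change above (canopy centroids as binary Writable-style values instead of format strings), here is a minimal sketch. It does not use Hadoop or Mahout: BinaryCanopy, its fields, and toText() are hypothetical stand-ins, and only the write/readFields method shape is borrowed from Hadoop's Writable interface. The text form shown is illustrative and is not the real asFormatString() output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical canopy record (assumption, not Mahout's Canopy class).
// It mirrors the write/readFields shape of Hadoop's Writable without
// depending on Hadoop.
final class BinaryCanopy {
    int id;
    double[] center = new double[0];

    // Binary layout: int id, int length, then one 8-byte double per dimension.
    void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeInt(center.length);
        for (double v : center) {
            out.writeDouble(v);
        }
    }

    // Overwrites this instance in place from the binary layout above.
    void readFields(DataInput in) throws IOException {
        id = in.readInt();
        center = new double[in.readInt()];
        for (int i = 0; i < center.length; i++) {
            center[i] = in.readDouble();
        }
    }

    // Illustrative text form only; not the actual asFormatString() format.
    String toText() {
        return "C" + id + ": " + Arrays.toString(center);
    }

    static byte[] toBytes(BinaryCanopy canopy) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            canopy.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static BinaryCanopy fromBytes(byte[] bytes) {
        try {
            BinaryCanopy canopy = new BinaryCanopy();
            canopy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return canopy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BinaryCanopy canopy = new BinaryCanopy();
        canopy.id = 7;
        canopy.center = new double[] {0.123456789, 98.7654321, 3.14159265};
        byte[] bytes = toBytes(canopy);
        System.out.println("binary bytes: " + bytes.length);
        System.out.println("text chars:   " + canopy.toText().length());
        System.out.println("round trip:   " + fromBytes(bytes).toText());
    }
}
```

In this sketch the binary size is fixed per dimension, and no parsing is needed on the reduce side, which is the point of dropping the string round trip.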

NOTE: When passing Vectors between Map and Reduce steps using Writable format, Hadoop uses the *same instance* to do all of the deserializations. I had to change the Canopy constructors to clone() their center arguments so that the same instance would not be reused for multiple canopies.
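
The aliasing problem in the note can be reproduced without Hadoop. The sketch below is an assumption-laden stand-in: ReusedVector, SketchCanopy, and ReuseDemo are hypothetical names, and set() only imitates a framework refilling one object for every record. It shows why a constructor that stores the reference ends up with every canopy sharing the last center, and why clone() avoids that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a vector that the framework refills in place.
final class ReusedVector implements Cloneable {
    double[] values = new double[0];

    // Imitates a readFields()-style call: same object, new contents.
    void set(double... newValues) {
        values = newValues.clone();
    }

    @Override
    public ReusedVector clone() {
        ReusedVector copy = new ReusedVector();
        copy.values = values.clone();
        return copy;
    }
}

// Hypothetical canopy; the flag exists only to show both behaviours.
final class SketchCanopy {
    final ReusedVector center;

    SketchCanopy(ReusedVector center, boolean defensiveCopy) {
        // Without the copy, this canopy keeps a reference to the shared object.
        this.center = defensiveCopy ? center.clone() : center;
    }
}

final class ReuseDemo {
    // Creates one canopy per record while reusing a single vector instance,
    // the way the note describes the deserialization behaviour.
    static List<SketchCanopy> buildCanopies(double[][] records, boolean defensiveCopy) {
        ReusedVector shared = new ReusedVector();
        List<SketchCanopy> canopies = new ArrayList<>();
        for (double[] record : records) {
            shared.set(record);
            canopies.add(new SketchCanopy(shared, defensiveCopy));
        }
        return canopies;
    }

    public static void main(String[] args) {
        double[][] records = {{1.0, 2.0}, {3.0, 4.0}};
        for (boolean copy : new boolean[] {false, true}) {
            List<SketchCanopy> canopies = buildCanopies(records, copy);
            System.out.println("defensiveCopy=" + copy + " first center="
                + Arrays.toString(canopies.get(0).center.values));
        }
    }
}
```

Without the defensive copy the first canopy reports the second record's values; with clone() each canopy keeps the center it was created with.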

> Change Canopy MR Implementation to use Vector Writable
> ------------------------------------------------------
>
>                 Key: MAHOUT-136
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-136
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.1
>
>
> Internal serialization of Canopy currently uses asFormatString rather than just making the Canopy writable. This is storage inefficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable

Posted by Grant Ingersoll <gs...@apache.org>.

So, are we to make these changes on all the Mappers/Reducers?



On Jun 19, 2009, at 8:54 PM, Jeff Eastman (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722105#action_12722105 ]
>
> Jeff Eastman commented on MAHOUT-136:
> -------------------------------------
>
> r786738 committed the following changes.
> - Modified CanopyMapper and CanopyReducer to produce and consume  
> Canopy centroids as Writable values vs. previous formatStrings
> - Modified CanopyMapper to specify SparseVector output from mapper
> - Fixed null name hash() bug in SparseVector
> - Modified Canopy.emitPointToExistingCanopies to emit only canopy id  
> and not full serialized canopy.
> - This eliminates the need for the OutputDriver and OutputMapper in  
> synthetic control example so they are deleted.
> - Updated unit tests; all tests run
> - Synthetic control example runs
>
> NOTE: When passing Vectors between Map and Reduce steps using  
> Writable format, Hadoop uses the *same instance* to do all of the  
> deserializations. I had to change the Canopy constructors to clone()  
> their center arguments so that the same instance would not be reused  
> for multiple canopies.
>
>> Change Canopy MR Implementation to use Vector Writable
>> ------------------------------------------------------
>>
>>                Key: MAHOUT-136
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-136
>>            Project: Mahout
>>         Issue Type: Improvement
>>         Components: Clustering
>>   Affects Versions: 0.1
>>           Reporter: Jeff Eastman
>>           Assignee: Jeff Eastman
>>            Fix For: 0.1
>>
>>
>> Internal serialization of Canopy currently uses asFormatString  
>> rather than just making the Canopy writable. This is storage  
>> inefficient.


Re: [jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable

Posted by Grant Ingersoll <gs...@apache.org>.

Seems like Canopy would need to be made Writable too, no?

On Jun 20, 2009, at 7:04 AM, Grant Ingersoll wrote:

>
> On Jun 20, 2009, at 3:01 AM, Robert Burrell Donkin wrote:
>
>>
>> Perhaps it would be better to move the conversion code (e.g. tabular ->
>> Vectors) from examples into either core or a new module so it can be
>> more easily maintained and reused
>
> +1.  Or utils, as I picture utils being the place where we keep  
> things that aren't core, but are still useful.  Of course, we also  
> have, in core, o.a.m.utils, I believe.  The difference, in my mind,  
> is that the utils module is dependent on core, not the other way  
> around, which is why I put the Lucene extraction stuff in there.
>
> You up for a patch for this?  I think I have some time to work on  
> converting the M/R if someone else can take on the I/O to/from the  
> user.
>
> -Grant

Re: [jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable

Posted by Grant Ingersoll <gs...@apache.org>.

On Jun 20, 2009, at 3:01 AM, Robert Burrell Donkin wrote:

>
> Perhaps it would be better to move the conversion code (e.g. tabular ->
> Vectors) from examples into either core or a new module so it can be
> more easily maintained and reused

+1.  Or utils, as I picture utils being the place where we keep things  
that aren't core, but are still useful.  Of course, we also have, in  
core, o.a.m.utils, I believe.  The difference, in my mind, is that the  
utils module is dependent on core, not the other way around, which is  
why I put the Lucene extraction stuff in there.

You up for a patch for this?  I think I have some time to work on  
converting the M/R if someone else can take on the I/O to/from the user.

-Grant

Re: [jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On Saturday, June 20, 2009, Ted Dunning <te...@gmail.com> wrote:
> Sounds pretty good to me.

+1

> On Fri, Jun 19, 2009 at 6:38 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> So, should we just go to having everything be binary and then have
>> Input/Output utilities that can take the binary format and output GSON?
>>  Seems like w/ Canopy, since it's used for feeding into other algorithms
>> that it should output Writable as well, otherwise we're still going to be
>> round tripping through Text.

+1

Perhaps it would be better to move the conversion code (e.g. tabular ->
Vectors) from examples into either core or a new module so it can be
more easily maintained and reused.

- Robert

>>
>> Then, it would be pretty easy to write a M/R job that takes Vectors and
>> outputs asFormatString(), right?
>>
>

Re: [jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable

Posted by Ted Dunning <te...@gmail.com>.

Sounds pretty good to me.

On Fri, Jun 19, 2009 at 6:38 PM, Grant Ingersoll <gs...@apache.org> wrote:

> So, should we just go to having everything be binary and then have
> Input/Output utilities that can take the binary format and output GSON?
>  Seems like w/ Canopy, since it's used for feeding into other algorithms
> that it should output Writable as well, otherwise we're still going to be
> round tripping through Text.
>
> Then, it would be pretty easy to write a M/R job that takes Vectors and
> outputs asFormatString(), right?
>

Re: [jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable

Posted by Grant Ingersoll <gs...@apache.org>.

So, should we just go to having everything be binary and then have  
Input/Output utilities that can take the binary format and output  
GSON?  Seems like w/ Canopy, since it's used for feeding into other  
algorithms that it should output Writable as well, otherwise we're  
still going to be round tripping through Text.

Then, it would be pretty easy to write a M/R job that takes Vectors  
and outputs asFormatString(), right?
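
A minimal sketch of that last idea, assuming a simple length-prefixed binary layout rather than Mahout's real vector encoding: the pipeline stays binary, and a separate dump step turns each vector into one line of text. VectorDump, encode(), format(), and dump() are hypothetical names; format() stands in for asFormatString() and is not its real output, and a real version would be a map-only Hadoop job rather than a plain loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical utility; the binary layout here is an assumption.
final class VectorDump {
    // Encodes each vector as: int length, then that many doubles.
    static byte[] encode(double[][] vectors) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (double[] vector : vectors) {
                out.writeInt(vector.length);
                for (double v : vector) {
                    out.writeDouble(v);
                }
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The per-record "map" step: binary vector in, one text line out.
    static String format(double[] vector) {
        return Arrays.toString(vector);
    }

    // Reads vectors until the buffer is exhausted and formats each one.
    static List<String> dump(byte[] binary) {
        List<String> lines = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(binary))) {
            while (in.available() > 0) {
                double[] vector = new double[in.readInt()];
                for (int i = 0; i < vector.length; i++) {
                    vector[i] = in.readDouble();
                }
                lines.add(format(vector));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] binary = encode(new double[][] {{1.0, 2.5}, {3.0}});
        dump(binary).forEach(System.out::println);
    }
}
```

Keeping the text conversion in one place like this means the clustering jobs themselves never need to parse or produce strings.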

On Jun 19, 2009, at 8:54 PM, Jeff Eastman (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722105#action_12722105 ]
>
> Jeff Eastman commented on MAHOUT-136:
> -------------------------------------
>
> r786738 committed the following changes.
> - Modified CanopyMapper and CanopyReducer to produce and consume  
> Canopy centroids as Writable values vs. previous formatStrings
> - Modified CanopyMapper to specify SparseVector output from mapper
> - Fixed null name hash() bug in SparseVector
> - Modified Canopy.emitPointToExistingCanopies to emit only canopy id  
> and not full serialized canopy.
> - This eliminates the need for the OutputDriver and OutputMapper in  
> synthetic control example so they are deleted.
> - Updated unit tests; all tests run
> - Synthetic control example runs
>
> NOTE: When passing Vectors between Map and Reduce steps using  
> Writable format, Hadoop uses the *same instance* to do all of the  
> deserializations. I had to change the Canopy constructors to clone()  
> their center arguments so that the same instance would not be reused  
> for multiple canopies.
>
>> Change Canopy MR Implementation to use Vector Writable
>> ------------------------------------------------------
>>
>>                Key: MAHOUT-136
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-136
>>            Project: Mahout
>>         Issue Type: Improvement
>>         Components: Clustering
>>   Affects Versions: 0.1
>>           Reporter: Jeff Eastman
>>           Assignee: Jeff Eastman
>>            Fix For: 0.1
>>
>>
>> Internal serialization of Canopy currently uses asFormatString  
>> rather than just making the Canopy writable. This is storage  
>> inefficient.