Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2010/12/01 08:49:11 UTC

[jira] Issue Comment Edited: (MAHOUT-376) Implement Map-reduce version of stochastic SVD

    [ https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965597#action_12965597 ] 

Dmitriy Lyubimov edited comment on MAHOUT-376 at 12/1/10 2:48 AM:
------------------------------------------------------------------

Final trunk patch for the CDH3 or 0.21 API.
This includes code cleanup, javadoc updates, and a Mahout CLI class (not tested, though).

All existing tests and this test are passing. I tested a 100Kx100 matrix in local mode only; the S values coincide to within 1e-10 or better.
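A check of the kind described above might look like the following minimal sketch. It assumes the S values are the singular values, that they are compared against a reference spectrum from some other solver, and that the 1e-10 figure is an absolute tolerance; the comment states none of this explicitly. The class and method names are hypothetical and are not part of the patch.

```java
/** Hypothetical helper, not part of the patch: compares two singular-value spectra. */
final class SingularValueCheck {

    /**
     * Returns the largest absolute difference between corresponding entries.
     * Both arrays are assumed to be sorted in the same order.
     */
    static double maxAbsDiff(double[] expected, double[] actual) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("spectra differ in length");
        }
        double worst = 0.0;
        for (int i = 0; i < expected.length; i++) {
            worst = Math.max(worst, Math.abs(expected[i] - actual[i]));
        }
        return worst;
    }

    /** True when every pair of values differs by at most tol (absolute tolerance). */
    static boolean agreeWithin(double[] expected, double[] actual, double tol) {
        return maxAbsDiff(expected, actual) <= tol;
    }

    public static void main(String[] args) {
        // Made-up values for illustration only; not results from the patch.
        double[] reference = {5.0, 3.0, 1.0};
        double[] computed = {5.0, 3.0 + 5e-11, 1.0};
        System.out.println(agreeWithin(reference, computed, 1e-10));
    }
}
```

For spectra that span many orders of magnitude, a relative tolerance may be the more meaningful choice; the absolute form is used here only because it is the simplest reading of the figure quoted above.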

Changes to dependencies I had to make:
* Hadoop 0.21 or CDH3, to support multiple outputs.
* Local MR mode depends on the commons-http client, so I included it in test scope only, so that the test works.
* Changed the commons-math dependency from 1.2 (?) to 2.1. The Mahout math module seems to depend on 2.1 too; it is not clear why it was not transitive for this one.
* commons-math 1.2 seems to have depended on commons-cli, and 2.1 no longer brings it in transitively, but one of the classes in core requires it. So I added commons-cli to fix the build.

*Ted*, sorry, I kind of polluted your issue here. Thank you for your encouragement and help. I probably should have opened another issue once it was clear this had diverged far enough, instead of continuing to put stuff here.

This should be compatible with DistributedRowMatrix. I have not run a real distributed test yet, as I don't have a suitable data set, but perhaps somebody in the user community with an interest in the method could do it faster than I can get to it. I will run tests at moderate scale at some point, but I don't want to do that on my company's machine cluster yet, and I don't exactly own a good one myself.

I did make rather mixed use of Mahout vector math and plain dense arrays, partly because I did not have enough time to study all the capabilities of the math module, and partly because I wanted explicit access to memory, to control its more efficient reuse in mass iterations. This may or may not need to be rectified over time, but it seems to work pretty well as is.
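To illustrate the kind of memory reuse meant here, the following is a minimal sketch and not code from the patch. It multiplies each input row by a small dense matrix and writes the result into one scratch array that is allocated once and reused on every iteration, instead of creating a new vector object per row. The class name, method names, and the choice of a row-times-matrix product are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch, not from the patch: reuse one dense buffer across many rows. */
final class BufferReuseSketch {

    /** Computes row * omega into the caller-supplied buffer 'out' without allocating. */
    static void project(double[] row, double[][] omega, double[] out) {
        Arrays.fill(out, 0.0);                 // reset the reused buffer
        for (int i = 0; i < row.length; i++) {
            double a = row[i];
            if (a == 0.0) {
                continue;                      // skip zero entries
            }
            double[] omegaRow = omega[i];
            for (int j = 0; j < out.length; j++) {
                out[j] += a * omegaRow[j];
            }
        }
    }

    /** Sums the squared norms of all projected rows using a single scratch array. */
    static double sumSquaredNorms(double[][] rows, double[][] omega, int k) {
        double[] scratch = new double[k];      // allocated once, reused for every row
        double total = 0.0;
        for (double[] row : rows) {
            project(row, omega, scratch);
            for (double v : scratch) {
                total += v * v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 2}, {3, 4}};
        double[][] identity = {{1, 0}, {0, 1}};
        System.out.println(sumSquaredNorms(rows, identity, 2));
    }
}
```

The trade-off is the one the comment describes: plain arrays make allocation explicit and easy to control in tight loops, at the cost of bypassing the vector abstractions in the math module.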

The patch is a git patch (so one needs to use patch -p1 instead of -p0). I know the standard is to use svn patches, but I had already used git to pull the trunk (it so happens that I prefer git in general too, so I can have my own commit tree and branching for this work).
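For readers wondering why the strip level differs: git diff output by default prefixes file names with a/ and b/, while svn diff output uses paths relative to the working copy with no such prefix. The -pN option of patch strips the leading N slash-separated components from each file name, so -p1 removes the a/ or b/ prefix. The sketch below only mimics that path handling in simplified form (real patch also collapses adjacent slashes); the class name and the example path are hypothetical.

```java
/** Hypothetical sketch of how patch -pN treats file names; not the real implementation. */
final class PatchStripSketch {

    /** Removes the first n slash-separated components from a path taken from a patch header. */
    static String strip(String path, int n) {
        int start = 0;
        for (int i = 0; i < n; i++) {
            int slash = path.indexOf('/', start);
            if (slash < 0) {
                throw new IllegalArgumentException("path has fewer than " + n + " components to strip");
            }
            start = slash + 1;                 // continue after this slash
        }
        return path.substring(start);
    }

    public static void main(String[] args) {
        // A git-style header path (made up for illustration) needs one component stripped.
        System.out.println(strip("a/core/pom.xml", 1));
        // An svn-style header path is already relative, so nothing is stripped.
        System.out.println(strip("core/pom.xml", 0));
    }
}
```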

If there is enough interest from the project in this contribution, I will support it, and, if requested, I can port it to 0.20 if that is the target platform for 0.5, as well as make other Mahout-specific architectural tweaks. Please kindly let me know.


Thank you.

> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
>                 Key: MAHOUT-376
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-376
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>         Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf, sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz, ssvd-m1.patch.gz, ssvd-m2.patch.gz, ssvd-m3.patch.gz, Stochastic SVD using eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.
