Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2011/03/01 03:01:36 UTC

[jira] Commented: (MAHOUT-593) Backport of Stochastic SVD patch (Mahout-376) to hadoop 0.20 to ensure compatibility with current Mahout dependencies.

    [ https://issues.apache.org/jira/browse/MAHOUT-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000665#comment-13000665 ] 

Dmitriy Lyubimov commented on MAHOUT-593:
-----------------------------------------

{quote}My only question then is why those intermediate stages of Mappers/Reducers need to be exposed as stand-alone units ("Jobs" in your patch)? I agree they're not command-line "Jobs" that would be invoked independently, but they seem exposed that way.{quote}
I don't think they are exposed. Here is the class diagram.

There's only one CLI entity (SSVDCLI), which is a Tool as well as an AbstractJob. SSVDCLI is basically a CLI adapter to the SSVDSolver API. The SSVDSolver API can be used inline in a program just like a regular solver (the distinction is that the DRM input is specified by a Hadoop glob expression).

SSVDSolver encapsulates the overarching functionality of SSVD by driving the map-reduce jobs as well as a small front-end computation (the latter by means of instantiating an EigenSolverWrapper, which solves BB'=UΛU'). All of this is completely isolated from both the CLI and the Solver API. The function of SSVDCli is to parse and establish job-specific parameters as well as Hadoop's Configuration. (The solver may override some of them when passing them on to the jobs.)
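
For readers less familiar with the method, the role of the small eigenproblem can be sketched as follows. This assumes the usual stochastic SVD formulation, in which B = Q'A for an orthonormal Q; that notation, and the hat on the eigenvector matrix (written U in the comment above), are assumptions and are not taken from the patch itself.

```latex
\begin{align*}
B &= Q^{\top} A, \\
B B^{\top} &= \hat{U} \Lambda \hat{U}^{\top}, \\
\sigma_i &= \sqrt{\lambda_i}, \qquad \Sigma = \Lambda^{1/2}, \\
U &\approx Q \hat{U}, \qquad V \approx B^{\top} \hat{U} \Sigma^{-1}.
\end{align*}
```

Under these assumptions BB' is a small dense matrix, which is consistent with solving it in the front end rather than as a map-reduce job.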

The idea here is that one might use it as an embedded solver via SSVDSolver, _or_ one might use the command-line interface. Everything else is encapsulated and may change.

The overarching sequence enforced by the solver is QtJob -> BBtJob -> BtJob -> front-end eigen solution -> (optional VJob and optional UJob, in parallel).

QtJob, VJob and UJob are map-only.
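
The sequencing described above could be sketched roughly as below. This is only an illustration of the ordering, not the code in the patch: the class SsvdPipelineSketch, its stage() and run() methods, and the "EigenSolve" label are hypothetical, and each stage is a placeholder that records its name instead of submitting a Hadoop job.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the stage ordering only; not the SSVDSolver from the patch.
public class SsvdPipelineSketch {

    // Records stages as they finish. Synchronized because the optional
    // V and U stages run on separate threads.
    private final List<String> log =
        Collections.synchronizedList(new ArrayList<String>());

    // Placeholder for "submit a job and wait for it to complete".
    private void stage(String name) {
        log.add(name);
    }

    // Runs the mandatory stages in order, then the optional ones in parallel.
    public List<String> run(boolean computeU, boolean computeV) {
        stage("QtJob");       // map-only
        stage("BBtJob");
        stage("BtJob");
        stage("EigenSolve");  // small front-end eigen solution, no Hadoop job

        List<CompletableFuture<Void>> optional =
            new ArrayList<CompletableFuture<Void>>();
        if (computeV) {
            optional.add(CompletableFuture.runAsync(() -> stage("VJob")));
        }
        if (computeU) {
            optional.add(CompletableFuture.runAsync(() -> stage("UJob")));
        }
        // Wait for whichever optional stages were requested.
        CompletableFuture.allOf(optional.toArray(new CompletableFuture[0])).join();

        return new ArrayList<String>(log);
    }

    public static void main(String[] args) {
        System.out.println(new SsvdPipelineSketch().run(true, true));
    }
}
```

The first four entries always appear in the fixed order; the relative order of VJob and UJob is not deterministic, since they are independent of each other.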



!ssvdclassdiag.png|height=700!

> Backport of Stochastic SVD patch (Mahout-376) to hadoop 0.20 to ensure compatibility with current Mahout dependencies.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-593
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-593
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.4
>            Reporter: Dmitriy Lyubimov
>             Fix For: 0.5
>
>         Attachments: MAHOUT-593.patch.gz, MAHOUT-593.patch.gz, MAHOUT-593.patch.gz, SSVD-givens-CLI.pdf, ssvdclassdiag.png
>
>
> The current Mahout-376 patch requires the 'new' Hadoop API. Certain elements of that API (namely, multiple outputs) are not available in the standard Hadoop 0.20.2 release. As such, it may work only with either the CDH or 0.21 distributions.
> In order to bring it into sync with current Mahout dependencies, a backport of the patch to the 'old' API is needed.
> Also, some work is needed to resolve math dependencies. The existing patch relies on Apache commons-math 2.1 for eigen decomposition of small matrices. This dependency is not currently set up in the Mahout core. So certain snippets of code are required either to go to mahout-math or to use the Colt eigen decomposition (last time I tried, my results were mixed with that one; it seems to produce results inconsistent with those from the mahout-math eigensolver, and at the very least it doesn't produce singular values in sorted order).
> So this patch is mainly moving some Mahout-376 code around.
