Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2011/03/07 11:55:59 UTC

[jira] Resolved: (MAHOUT-309) Implement Stochastic Decomposition

     [ https://issues.apache.org/jira/browse/MAHOUT-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-309.
------------------------------

    Resolution: Duplicate
      Assignee: Ted Dunning  (was: Jake Mannix)

And likewise I think this issue, being the same as MAHOUT-376, is duplicated by MAHOUT-593 for all practical purposes?

> Implement Stochastic Decomposition
> ----------------------------------
>
>                 Key: MAHOUT-309
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-309
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 0.4
>            Reporter: Jake Mannix
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>
> Techniques reviewed in Halko, Martinsson, and Tropp (http://arxiv.org/abs/0909.4061).
> The basic idea of the implementation is as follows. The input matrix is represented as a
> DistributedSparseRowMatrix, backed by a sequence file of <Writable,VectorWritable> whose values
> should be SequentialAccessSparseVector instances for best performance. Optionally, a kernel
> function f(v) maps sparse numColumns-dimensional vectors (numColumns unconstrained in size) to
> sparse numKernelizedFeatures-dimensional vectors (also unconstrained in size), as when doing
> kernel-PCA for a kernel k(u,v) = f(u).dot(f(v)). MurmurHash (from MAHOUT-228) then projects each
> numKernelizedFeatures-dimensional vector down to a reasonably sized numHashedFeatures-dimensional
> space (no more than 10^2 to 10^4 dimensions); a sketch of this hashing step appears after the
> quoted description below.
> All of this happens in the Mapper, which emits two outputs. The first is the
> numHashedFeatures-dimensional hashed vector itself (needed only if the left-singular vectors
> are ever desired); it requires no Reduce step. The second is the outer product of that vector
> with itself; the Combiner/Reducer simply sums these partial matrices, eventually producing the
> kernel / gram matrix of the hashed features. That matrix can be run through a simple
> eigen-decomposition, and its ((1/eigenvalue)-scaled) eigenvectors can then be applied to the
> saved hashed vectors to recover the left-singular vectors / reduced projections, which can in
> turn be run through clustering, etc. (see the gram/projection sketch below).
> Good fun will be had by all.
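
For concreteness, here is a minimal, self-contained Java sketch of the mapper-side hashing step
described above. The class and method names are illustrative, and the integer hash is only a
stand-in for Mahout's MurmurHash from MAHOUT-228; a real implementation would also operate on
Mahout's Vector types rather than raw arrays.

    public class HashedProjection {

      // Stand-in for MurmurHash (MAHOUT-228): any well-mixed integer hash works here.
      static int hash(int index, int seed) {
        int h = index * 0x9E3779B9 + seed;
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return h;
      }

      // Projects a sparse (possibly kernelized) feature vector, given as parallel
      // index/value arrays, down to numHashedFeatures dimensions. Each input index
      // hashes to an output bucket, and a second hash supplies a +/-1 sign so that
      // collisions cancel in expectation.
      static double[] project(int[] indices, double[] values, int numHashedFeatures) {
        double[] out = new double[numHashedFeatures];
        for (int i = 0; i < indices.length; i++) {
          int bucket = (hash(indices[i], 1) & 0x7FFFFFFF) % numHashedFeatures;
          double sign = (hash(indices[i], 2) & 1) == 0 ? 1.0 : -1.0;
          out[bucket] += sign * values[i];
        }
        return out;
      }
    }

Feeding each (kernelized) row through project(...) in the Mapper yields the dense
numHashedFeatures-dimensional output the description refers to.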
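
And a companion sketch, under the same assumptions, of the reduce-side outer-product accumulation
and the final projection. The eigen-decomposition itself is left abstract: since numHashedFeatures
is at most around 10^4, the gram matrix fits in memory and any dense symmetric eigensolver will do.

    public class GramAndProject {

      // Combiner/Reducer side: sum outer products v * v^T into the
      // numHashedFeatures x numHashedFeatures gram matrix.
      static void accumulateOuterProduct(double[][] gram, double[] v) {
        for (int i = 0; i < v.length; i++) {
          for (int j = 0; j < v.length; j++) {
            gram[i][j] += v[i] * v[j];
          }
        }
      }

      // Given the eigen-decomposition of the gram matrix (eigenvectors as the
      // columns of eigVecs, eigenvalues in eigVals), project a saved hashed vector
      // onto the scaled eigenvectors. The issue describes scaling by 1/eigenvalue;
      // scaling by 1/sqrt(eigenvalue) instead would give unit-norm left-singular
      // vectors.
      static double[] projectOntoEigenbasis(double[] hashed, double[][] eigVecs,
                                            double[] eigVals) {
        double[] out = new double[eigVals.length];
        for (int k = 0; k < eigVals.length; k++) {
          double dot = 0.0;
          for (int i = 0; i < hashed.length; i++) {
            dot += hashed[i] * eigVecs[i][k];
          }
          out[k] = dot / eigVals[k];
        }
        return out;
      }
    }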

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira