Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2010/06/17 13:18:25 UTC

[jira] Updated: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

     [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-418:
--------------------------------------

    Attachment: MAHOUT-418.patch

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418.patch
>
>
> In response to the request in MAHOUT-362 and the recent mailing-list discussion started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we already use to compute item-item similarities for collaborative filtering.
> The job in the patch computes the pairwise similarities of the rows of a matrix in a distributed manner. It takes a SequenceFile<IntWritable,VectorWritable> as input and writes its output in the same format. Custom similarity implementations can be supplied; I have already implemented Tanimoto and cosine similarity for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it on a reasonably large input. I'm also worried that it might buffer too much data in certain steps.
> If you decide to include it in Mahout, some more effort and decisions (such as more tests, more similarity measures, and integration with DistributedRowMatrix) would be needed, I guess.
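
For readers unfamiliar with the approach, here is a minimal single-machine sketch of the general idea behind inverted-index pairwise similarity: group the matrix entries by column, accumulate partial dot products for every pair of rows that share a column, then pass each dot product and the two squared row norms to a pluggable similarity function. This is not the code from the patch. The function names, the dict-of-dicts matrix layout, and the similarity signature are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cosine(dot, sq_norm_a, sq_norm_b):
    """Cosine similarity from a dot product and the two squared norms."""
    return dot / (sqrt(sq_norm_a) * sqrt(sq_norm_b))


def tanimoto(dot, sq_norm_a, sq_norm_b):
    """Tanimoto coefficient from a dot product and the two squared norms."""
    return dot / (sq_norm_a + sq_norm_b - dot)


def pairwise_row_similarities(rows, similarity=cosine):
    """Compute similarities for all row pairs that share at least one column.

    rows: {row_id: {column_id: value}}, a sparse matrix (hypothetical layout).
    Returns {(row_a, row_b): similarity} with row_a < row_b.
    """
    # Step 1: squared Euclidean norm of every row.
    sq_norms = {r: sum(v * v for v in cols.values()) for r, cols in rows.items()}

    # Step 2: "transpose" the matrix: column -> [(row, value), ...].
    by_column = defaultdict(list)
    for r, cols in rows.items():
        for c, v in cols.items():
            if v != 0:
                by_column[c].append((r, v))

    # Step 3: each column contributes one partial product per co-occurring
    # row pair; summing the partial products yields the dot products.
    dots = defaultdict(float)
    for entries in by_column.values():
        for (ra, va), (rb, vb) in combinations(sorted(entries), 2):
            dots[(ra, rb)] += va * vb

    # Step 4: turn each dot product into a similarity value.
    return {
        (ra, rb): similarity(dot, sq_norms[ra], sq_norms[rb])
        for (ra, rb), dot in dots.items()
    }


if __name__ == "__main__":
    matrix = {0: {0: 1.0, 1: 1.0}, 1: {0: 1.0}, 2: {2: 3.0}}
    print(pairwise_row_similarities(matrix, cosine))
    print(pairwise_row_similarities(matrix, tanimoto))
```

In this sketch, step 3 enumerates every pair of rows within a column, so a column shared by n rows produces n(n-1)/2 partial products. A distributed version would presumably face the same growth for very dense columns, which may be related to the buffering concern raised in the description.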

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.