You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2010/06/17 13:14:25 UTC

[jira] Created: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Computing the pairwise similarities of the rows of a matrix
-----------------------------------------------------------

                 Key: MAHOUT-418
                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
             Project: Mahout
          Issue Type: New Feature
          Components: Math
            Reporter: Sebastian Schelter


In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.

The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.

If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881191#action_12881191 ] 

Sean Owen commented on MAHOUT-418:
----------------------------------

Yes, sounds exactly right to me. I think it'd be great if you could proceed along these lines.

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-418:
--------------------------------------

    Attachment: MAHOUT-418-3.patch

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418-3.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882993#action_12882993 ] 

Sebastian Schelter commented on MAHOUT-418:
-------------------------------------------

You're right, the latest patch is actually a big code move and refactoring, all original functionality is still there. The methods in TasteHadoopUtils can be used in other classes too. I'm looking forward to starting work on MAHOUT-420 when this patch is commited :)

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418-3.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880691#action_12880691 ] 

Sean Owen commented on MAHOUT-418:
----------------------------------

Code looks OK to me. I think WeightedRowPair needs to implement equals() and hashCode()?

Overall, are you moving code out of org.apache.mahout.cf.taste.hadoop.item.similarity into this more common package, and generalizing the code? Should we then expect that that other code will be removed? This would be better.

yes would be good to get comments from Kris about whether this solves his problem before proceeding.

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880902#action_12880902 ] 

Sean Owen commented on MAHOUT-418:
----------------------------------

Let's see how far apart these two implementations are. It would be great to spend some time unifying them a bit, now, if there is no hurry to get a second implementation in.

Yes, the recommender-specific job needs an additional phase at the start and end, to map from longs to ints and back. It does do this. This can remain. But once data is converted into vectors, the general code you are creating should be able to take over?

Both implementations can write the whole matrix, and take the same approach to self-similarity. That is I think you are welcome to make them both assume the same thing. Just compute and store everything for good measure.

If those are the only differences, it really seems like they are doing the same thing and this can be a move of code rather than copy. I think you should feel free to go this way, even if it requires change in other code. I can help adjust other code if it means some assumptions have changed.

That way you are not burdened with maintaining two implementations. I think that makes MAHOUT-420 easier.

What do you think, are you keen to commit this, or open to pushing towards one implementation?


> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-418:
--------------------------------------

    Attachment: MAHOUT-418.patch

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-418:
--------------------------------------

    Attachment: MAHOUT-418-2.patch

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882789#action_12882789 ] 

Sebastian Schelter commented on MAHOUT-418:
-------------------------------------------

Attached a patch that moves the mathematical operation (computation of all pairwise row similarities) to o.a.m.math and also modifies the item-item-similarity computation in the cf part in the way it was proposed.

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418-3.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881141#action_12881141 ] 

Sebastian Schelter commented on MAHOUT-418:
-------------------------------------------

I think we should take the time and try to find a unified solution. You are right that maintaining two implementations of the same algorithm is not desirable and from my side there's no hurry to commit this.

I propose the following decisions/assumptions for o.a.m.math.hadoop.similarity.RowSimilarityJob

 * similarity functions used must be symmetric 
 * the whole matrix is written to disk
 * the similarity of a row to itself is always computed (making this job a usecase-agnostic mathematical operation)

Then o.a.m.cf.taste.hadoop.similarity.SimilarityJob could be reduced to do the following things:

 * map the IDs to ints
 * create an item-user-matrix from the preferences
 * run o.a.m.math.hadoop.similarity.RowSimilarityJob
 * pick out the pairs of similar items from the resulting matrix (ignoring the similarity of an item to itself)
 * map the ints back to the IDs

Do you think this is a viable way to go here?

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880757#action_12880757 ] 

Sebastian Schelter commented on MAHOUT-418:
-------------------------------------------

Attached a new patch including equals() and hashCode() for WeightedRowPair, thank you for pointing me to that.

I'm not sure whether the code in o.a.m.cf.taste.hadoop.similarity should be removed by now. Although it is an implementation of the same algorithm as this patch here, there are some differences in the details. By merging them we would lose some optimizations in the cf-specific implementation but I agree with Sean that it is desirable to have the cf code use standard matrix operations.

Differences between the two implementations:

 * vectors use ints as indices, preferences use longs as IDs, so those IDs would need to be mapped to ints and back (I think the distributed recommender job is already doing that, so that shouldn't be a big problem)

 * o.a.m.math.hadoop.similarity.RowSimilarityJob writes the whole result matrix to disk (although it should be symmetric and no information would be lost if only half of it was written) because we need the whole matrix to be available for following operations and integration into DistributedRowMatrix

 * o.a.m.cf.taste.hadoop.similarity.SimilarityJob automatically assumes the similarity of an item to itself as NaN (and doesn't compute it) whereas a similarity matrix created by RowSimilarityJob actively computes and includes these values (because it's a mathematical operation and should be agnostic of the fact that it's main use case is collaborative filtering)

A possible solution for the cf usecase (that would allow merging the implementations) would be to have the RowSimilarityJob do the computation and after that pick out only the matrix entries we're interested in in another M/R run.

If you want it that way, I can implement that.

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-418:
-----------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418-3.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883274#action_12883274 ] 

Hudson commented on MAHOUT-418:
-------------------------------

Integrated in Mahout-Quality #104 (See [http://hudson.zones.apache.org/hudson/job/Mahout-Quality/104/])
    

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418-3.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882963#action_12882963 ] 

Sean Owen commented on MAHOUT-418:
----------------------------------

I looked at it briefly and it seems OK. It's mostly moving code and generalizing. Am I right that all the original functionality you had is still there, just refactored? Then I'm OK to commit. My last question is whether the methods in TasteHadoopUtils can be used elsewhere. there are several other classes that do similar things. We don't need a new patch to address that, can do it later. I will run tests and commit if it passes as it seems this is well-enough considered to go in.

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418-3.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-418:
--------------------------------------

    Status: Patch Available  (was: Open)

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Kris,

Glad to hear that the code works and is useful to you!

-sebastian

Am 22.06.2010 13:33, schrieb Kris Jack (JIRA):
>     [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881174#action_12881174 ] 
>
> Kris Jack commented on MAHOUT-418:
> ----------------------------------
>
> Hi Sebastian,
>
> I ran your latest patch on a set of 10,000,000+ documents and managed to get results being produced in less than 24 hours using 10 mapper and 10 reducers.  My documents have been made very sparse, eliminating any terms with a global frequency > 0.001% of all term frequencies.  I haven't been able to analyse the results systematically yet but they look good at a glimpse.  I think that this is a very valuable contribution to mahout, well done!
>
> Thanks,
> Kris
>
>   
>> Computing the pairwise similarities of the rows of a matrix
>> -----------------------------------------------------------
>>
>>                 Key: MAHOUT-418
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>>             Project: Mahout
>>          Issue Type: New Feature
>>          Components: Math
>>            Reporter: Sebastian Schelter
>>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>>
>>
>> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
>> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
>> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
>> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.
>>     
>   


[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Posted by "Kris Jack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881174#action_12881174 ] 

Kris Jack commented on MAHOUT-418:
----------------------------------

Hi Sebastian,

I ran your latest patch on a set of 10,000,000+ documents and managed to get results being produced in less than 24 hours using 10 mapper and 10 reducers.  My documents have been made very sparse, eliminating any terms with a global frequency > 0.001% of all term frequencies.  I haven't been able to analyse the results systematically yet but they look good at a glimpse.  I think that this is a very valuable contribution to mahout, well done!

Thanks,
Kris

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list started by Kris Jack about computing a document similarity matrix, I tried to generalize the approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed manner, is uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such a file too. Custom similarity implementations can be supplied, I've already implemented tanimoto and cosine for demo and testing purposes. The algorithm is based on the one presented here: http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it with a reasonably large input, I'm also worried that it might buffer to much data in certain steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests, more similarity measures, integration with DistributedRowMatrix) would need to be made, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.