Posted to user@mahout.apache.org by Jonathan Seale <jo...@samegrain.com> on 2015/04/08 21:03:56 UTC

RowSimilarity

Hi all,
 
I'm new to the community and Mahout. Happy to be here. :-)
 
I have a problem that I'm having difficulty with. I've set up an instance on Amazon with Mahout and can run some basic machine learning tasks (just testing). Now I'm trying to do a specific task and am unsure how to proceed.
 
Imagine I have a data file containing the following columns: user_id, item_id, and rating, where rating is how each user rated the item on a scale of -1 to 1 (the necessity of negative ratings will become apparent in a minute). Ultimately, what I'm trying to do is create a similarity matrix that measures the similarity between all pairs of USERS. To do this, I would like to transform the users' ratings into a matrix (rows are users, columns are items) and then run RowSimilarity to find the dot product / cosine between all rows.
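
For illustration, the computation described above (pivot the ratings into one row per user, then take the cosine between every pair of rows) can be sketched in a few lines of plain Python. This is only a small in-memory sketch, not Mahout code: the sample data, the user and item names, and the function names are all made up for the example, and an item a user has not rated is assumed to count as 0.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: (user_id, item_id, rating), ratings in [-1, 1].
ratings = [
    ("u1", "a", 1.0), ("u1", "b", -1.0),
    ("u2", "a", 1.0), ("u2", "b", -1.0),
    ("u3", "a", -1.0), ("u3", "b", 1.0),
    ("u4", "c", 0.5),
]

def build_rows(triples):
    """Group triples into one sparse row (item -> rating) per user."""
    rows = defaultdict(dict)
    for user, item, rating in triples:
        rows[user][item] = rating
    return dict(rows)

def cosine(a, b):
    """Cosine of two sparse vectors stored as dicts; 0.0 if either is all zeros."""
    dot = sum(value * b[item] for item, value in a.items() if item in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def user_similarities(triples):
    """Return {(user_a, user_b): cosine} for every unordered pair of users."""
    rows = build_rows(triples)
    return {
        (u, v): cosine(rows[u], rows[v])
        for u, v in combinations(sorted(rows), 2)
    }

similarities = user_similarities(ratings)
```

With negative ratings allowed, users who rate the same items in opposite directions (u1 and u3 here) get a negative cosine, and users with no items in common get 0.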
 
I feel like my problem is simple and has probably been done 1000 times, but I can't seem to find any documentation directly on the subject. The best I've been able to do so far is use the similaritem function (where I've swapped item for user). While it works and gives decent results, it's mathematically not quite what I want. Help!
 
Thanks!
Jonathan

Re: RowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.

The Mahout MapReduce version of this job takes a sequence file as input, SequenceFile<IntWritable,VectorWritable>, better known as a Distributed Row Matrix. See the unit tests for how to create one (RowSimilarityJobTest). You will need to turn your user and item ids into non-negative integers, corresponding to row and column numbers in the input matrix. This translation into and out of Mahout IDs is the user’s responsibility.
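
The ID translation step can be sketched as follows. This is a plain-Python illustration under stated assumptions, not Mahout code: the function names are hypothetical, IDs are numbered in first-seen order (any stable numbering would do), and the sketch stops at in-memory rows rather than writing a SequenceFile.

```python
def index_ids(ids):
    """Assign each distinct ID a non-negative integer in first-seen order."""
    mapping = {}
    for value in ids:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

def to_int_rows(triples):
    """Turn (user, item, rating) triples into {row: {column: rating}}.

    `triples` must be a list (it is scanned more than once).
    Also returns both ID mappings so results can be translated back.
    """
    user_index = index_ids(user for user, _, _ in triples)
    item_index = index_ids(item for _, item, _ in triples)
    rows = {}
    for user, item, rating in triples:
        rows.setdefault(user_index[user], {})[item_index[item]] = rating
    return rows, user_index, item_index

# Hypothetical input with application-level string IDs.
triples = [("alice", "x", 1.0), ("bob", "y", -0.5), ("alice", "y", 0.25)]
rows, user_index, item_index = to_int_rows(triples)

# Invert the mapping to translate row numbers in the output back to user IDs.
row_to_user = {row: user for user, row in user_index.items()}
```

Keeping `user_index` and `item_index` around (for example, saved alongside the job output) is what makes the translation back out of integer IDs possible.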

The Spark version that I referenced takes text files using your _application’s_ user and item ids (treated as strings) with vectors in rows but only allows LLR similarity. I think I mentioned that LLR works better for collaborative filtering type user similarity—similarity based on common preferences. There are helper classes in the new Spark version that will read in the data by element if you’d rather input tuples (user-id,item-id). This reader class can read your file directly.

Re: RowSimilarity

Posted by Jonathan Seale <jo...@samegrain.com>.

Thanks for the help. I’m not able to get this to run. Maybe you can help.

I’ll attach the data file I’m using. This is for a simple case of 4 users (1st column), a number of “items” (2nd column), and “ratings” (3rd column). What I want is the cosine similarity between each pair of users. I don’t want to use LLR, because I need more precise control over how things are weighted, and I’m not using this for recommendations (and these aren’t actually items and ratings). I’ll skip the details and just say that cosine similarity is what I need to use.

I have created an Amazon instance and have Mahout installed. I can get some of the other examples to run, so I know things are installed properly. Here is what I tried:

mahout rowsimilarity -i data.csv -o <output directory> --similarityClassname SIMILARITY_COSINE

I think the problem is with my data file, which I think needs to be vectors? But I’m unsure how to do that.

spark-rowsimilarity isn’t recognized at all, and Hadoop complains.

Hope you can help,
Jonathan




Re: RowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Well, first I’d ignore ratings. There are too many problems trying to normalize or understand the meaning of a rating. If you follow the rest of this advice it will ignore them anyway. Ratings were used in older recommenders but have become meaningless with recent thinking. Netflix made the idea popular with the Netflix Prize, but since then even they do not use ratings to recommend, since ranking the best recs is far more important than predicting your rating. We can handle negative preferences in a different way, but that will come later.

Use the Mahout driver 'spark-rowsimilarity'. It will read CSV-style text data and create the matrix, compare rows (users in your case) and output one user per line (user-id,list of similar users). The IDs will be your input ids, so unlike the older Hadoop MapReduce version of this in Mahout, the Spark version will maintain your ids.

This will use LLR to find non-coincidental similarities in the things users prefer. LLR has been shown to be much better at detecting similarities in preference data. Cosine may be good for text similarity but you’d want to use LLR to downsample out the noise terms first anyway.
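
For readers unfamiliar with LLR, here is a minimal Python sketch of the log-likelihood ratio score for a 2x2 table of counts, following the entropy formulation in the blog post linked below. The function names and the example counts are made up for illustration, and this is not the code the Spark driver runs. The counts are assumed to mean: k11 = both events occurred together, k12 = the first without the second, k21 = the second without the first, k22 = neither.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR score for a 2x2 contingency table of co-occurrence counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb small negative values from rounding error.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))

# Hypothetical counts: the two events look independent, so the score is ~0.
independent = log_likelihood_ratio(10, 10, 10, 10)

# Hypothetical counts: the two events always occur together, so the score is large.
associated = log_likelihood_ratio(100, 0, 0, 100)
```

The score depends only on how surprising the co-occurrence counts are, not on any rating value, which is why ratings are ignored when following this approach.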

See some docs here: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
and search for "spark-rowsimilarity".

LLR is discussed here: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html and inside this free ebook: https://www.mapr.com/practical-machine-learning
