Posted to user@mahout.apache.org by Remy <ar...@gmail.com> on 2016/02/12 19:26:32 UTC

Mahout RowSimilarity

Hello,


I am trying to run the RowSimilarity algorithm on Mahout 0.11.0 (on a 
single node cluster), but I cannot seem to find a way to transform my 
data into the appropriate SequenceFile format. 

My data is in a CSV file, where each row is user_id, tag_id, rating:
4, 1233, 0.3
4, 98, 0.7
12, 654, 0.1
12, 98, 0.9

The data is sparse (the users do not have a rating for most of the 
items), and I would like to find and group users based on similarity 
(using a cosine similarity measure). Basically, I can run ItemSimilarity 
with this format as an input without any problem.
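
To make the goal concrete, here is a minimal pure-Python sketch of the 
computation I am after. It does not use Mahout and does not produce a 
SequenceFile; it only groups the triples into one sparse vector per user 
and computes the cosine similarity between two users. The names 
load_rows and cosine are made up for illustration, and the sample data 
is the four rows shown above.

```python
import csv
import io
import math
from collections import defaultdict

# The four sample rows from above: user_id, tag_id, rating.
SAMPLE = """4, 1233, 0.3
4, 98, 0.7
12, 654, 0.1
12, 98, 0.9
"""


def load_rows(handle):
    """Group (user_id, tag_id, rating) triples into one sparse row per user.

    Returns a dict of the form {user_id: {tag_id: rating}}.
    """
    rows = defaultdict(dict)
    for record in csv.reader(handle, skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        user_id, tag_id, rating = record
        rows[int(user_id)][int(tag_id)] = float(rating)
    return dict(rows)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


rows = load_rows(io.StringIO(SAMPLE))
similarity = cosine(rows[4], rows[12])
print(rows)
print(similarity)
```

Users 4 and 12 only share tag 98, so the dot product has a single term. 
This works for a toy file, but comparing every pair of users this way 
will not scale to the real data, which is why I would like to use 
RowSimilarity.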

I stumbled upon these two threads, which have been quite helpful:
http://comments.gmane.org/gmane.comp.apache.mahout.user/21263
http://comments.gmane.org/gmane.comp.apache.mahout.user/17873

The issue is that I still cannot work out how to convert my data into a 
SequenceFile using the CSVIterator from the CLI (I am not very familiar 
with Java...). If there is a Pythonic approach to this problem, I'd also 
be glad to hear about it. 

Most of the examples using RowSimilarity on Mahout focus on text-based 
content, so they are not very helpful when it comes to adapting an 
example to my data.

Is there a way to do this from the CLI?

Thanks in advance,

Remy