Posted to user@mahout.apache.org by Remy <ar...@gmail.com> on 2016/02/12 19:26:32 UTC
Mahout RowSimilarity
Hello,
I am trying to run the RowSimilarity algorithm on Mahout 0.11.0 (on a
single-node cluster), but I cannot seem to find a way to transform my
data into the appropriate SequenceFile format.
My data is in a CSV file, with each row being user_id, tag_id, rating:
4, 1233, 0.3
4, 98, 0.7
12, 654, 0.1
12, 98, 0.9
The data is sparse (the users do not have a rating for most of the
items), and I would like to find and group users based on similarity
(using a cosine similarity measure). Basically, I can run ItemSimilarity
with this format as an input without any problem.
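To make the goal concrete, here is a minimal pure-Python sketch of the computation described above: it groups the (user_id, tag_id, rating) triples into one sparse row per user and computes the cosine similarity between two users. It is only an illustration on the four sample rows. It is not Mahout's RowSimilarity implementation and it does not produce a SequenceFile. The names SAMPLE, load_rows and cosine are hypothetical.

```python
import csv
import io
import math
from collections import defaultdict

# The four sample rows from the post: user_id, tag_id, rating.
SAMPLE = """4, 1233, 0.3
4, 98, 0.7
12, 654, 0.1
12, 98, 0.9
"""


def load_rows(text):
    """Group (user_id, tag_id, rating) triples into one sparse row per user.

    Returns {user_id: {tag_id: rating}}; missing tags are simply absent.
    """
    rows = defaultdict(dict)
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:  # skip blank lines
            continue
        user_id, tag_id, rating = record
        rows[int(user_id)][int(tag_id)] = float(rating)
    return dict(rows)


def cosine(a, b):
    """Cosine similarity between two sparse rows stored as dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


rows = load_rows(SAMPLE)
sim = cosine(rows[4], rows[12])  # users 4 and 12 share only tag 98
print(f"cosine(user 4, user 12) = {sim:.4f}")
```

Only the tags two users have in common contribute to the dot product, which is why a sparse dict per user is enough here; this pairwise approach is meant for checking small samples, not for a full data set.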
I stumbled upon these two threads, which have been quite helpful:
http://comments.gmane.org/gmane.comp.apache.mahout.user/21263
and
http://comments.gmane.org/gmane.comp.apache.mahout.user/17873
The issue is that I still cannot figure out how to convert my data into
a SequenceFile using the CSVIterator from the CLI (I am not very familiar
with Java...). However, if there is a Pythonic approach to this problem,
I'd be glad to hear about it.
Most of the examples using RowSimilarity on Mahout focus on text-based
content, so they are not very helpful when it comes to adapting an
example to rating data like mine.
Is there a CLI way to do this?
Thanks in advance,
Remy