Posted to user@mahout.apache.org by Gruszowska Natalia <Na...@grupaonet.pl> on 2014/12/12 15:18:26 UTC

itemsimilarity - maxPrefs parameter

Hi All, 

In the itemsimilarity method there is a parameter like:

--maxPrefs (-mppu) maxPrefs                               max number of
                                                          preferences to
                                                          consider per user or
                                                          item, users or items
                                                          with more preferences
                                                          will be sampled down
                                                          (default: 500)

How does it work exactly?
If I have 5 million users and 5000 items and I run itemsimilarity with the default maxPrefs, does it consider only 500 ranks from those 5 million, or what? Is it sampling? What can I do to force the calculation over all input data?

			M1   M2   M3 .... M5000
U_1
U_2
...
U_5000000

What does "or" mean in the definition:
"max number of preferences to consider per user or item"


Thx in advance
Natalia

Re: itemsimilarity - maxPrefs parameter

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Increase the number to Integer.MAX_VALUE or to the larger of your number of users and number of items. The "or" means that the rows and columns are both downsampled to that number or less.
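
As a rough illustration of what "sampled down" means, here is a minimal Python sketch. It assumes uniform random sampling of any preference list that is longer than the cap. The function name downsample_prefs and the sampling strategy are assumptions made for illustration; the sampling Mahout actually performs may differ.

```python
import random

def downsample_prefs(prefs_by_user, max_prefs=500, seed=42):
    """Cap each user's preference list at max_prefs entries.

    Assumption: uniform random sampling without replacement. This is an
    illustration only, not Mahout's actual sampling code.
    """
    rng = random.Random(seed)
    sampled = {}
    for user, items in prefs_by_user.items():
        if len(items) > max_prefs:
            # Too many preferences: keep a random subset of size max_prefs.
            sampled[user] = sorted(rng.sample(items, max_prefs))
        else:
            # At or under the cap: keep everything.
            sampled[user] = list(items)
    return sampled

# One heavy user who touched all 5000 items, one light user.
prefs = {"U_1": list(range(5000)), "U_2": [3, 17, 42]}
capped = downsample_prefs(prefs, max_prefs=500)
```

In this sketch only U_1 is affected; users under the cap keep all of their preferences. The same idea applied to the transposed data would cap the number of users kept per item.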

To use all data you will also have to increase --maxSimilaritiesPerItem.

There are two matrices in the Hadoop itemsimilarity. The input is A, with one row per user containing each item the user has interacted with. From this, AtA is calculated as the output using LLR (log-likelihood ratio) instead of actual matrix multiplication. This yields an AtA with values weighted by LLR strength. --maxSimilaritiesPerItem will further limit the values here to no more than that number per item. There is also a quality threshold, which is pretty difficult to use.
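
The idea of an LLR-weighted AtA can be sketched in a few lines of Python. This is a simplified in-memory illustration, not the Hadoop job: the function names and the default of 100 kept similarities are assumptions chosen for the example, and the LLR formula is the standard entropy-based form for a 2x2 contingency table.

```python
import math
from collections import defaultdict
from itertools import combinations

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy used by the LLR formula.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of cooccurrence counts.

    k11: users with both items, k12: only item a,
    k21: only item b,           k22: neither item.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def item_similarities(prefs_by_user, max_similarities_per_item=100):
    """Score every cooccurring item pair with LLR, keep the top few per item."""
    n_users = len(prefs_by_user)
    item_count = defaultdict(int)   # users per item
    pair_count = defaultdict(int)   # users per item pair (the raw AtA cell)
    for items in prefs_by_user.values():
        unique = sorted(set(items))
        for item in unique:
            item_count[item] += 1
        for a, b in combinations(unique, 2):
            pair_count[(a, b)] += 1

    sims = defaultdict(list)
    for (a, b), k11 in pair_count.items():
        k12 = item_count[a] - k11
        k21 = item_count[b] - k11
        k22 = n_users - k11 - k12 - k21
        score = llr(k11, k12, k21, k22)
        sims[a].append((b, score))
        sims[b].append((a, score))

    # Keep only the strongest similarities for each item.
    return {
        item: sorted(pairs, key=lambda p: p[1], reverse=True)[:max_similarities_per_item]
        for item, pairs in sims.items()
    }

prefs = {
    "U_1": ["M1", "M2"],
    "U_2": ["M1", "M2"],
    "U_3": ["M3"],
    "U_4": ["M1", "M2", "M3"],
}
top = item_similarities(prefs, max_similarities_per_item=1)
```

With max_similarities_per_item=1 each item keeps only its single strongest neighbour, which is the kind of truncation the per-item limit applies to the output.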

If you remove all of these downsampling params you will approach O(n^2) runtime; if you use them you will have O(n). You will also get rapidly diminishing returns by removing downsampling.
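
One way to see why the cap matters is to count the item pairs a single user contributes to the cooccurrence counts. The sketch below assumes the cost is dominated by those pairs, which is a simplification for illustration and not a statement about Mahout's internals; the function names are hypothetical.

```python
def pairs_per_user(num_prefs):
    # A user with k preferences yields k * (k - 1) / 2 unordered item pairs.
    return num_prefs * (num_prefs - 1) // 2

def capped_pairs(num_prefs, max_prefs=500):
    # With downsampling, a user never contributes more than the capped amount.
    return pairs_per_user(min(num_prefs, max_prefs))

uncapped = pairs_per_user(5000)  # 12497500 pairs for a user who touched all 5000 items
capped = capped_pairs(5000)      # 124750 pairs once the list is cut to 500
```

Because the capped figure is a constant, the total number of pairs grows at most linearly with the number of users, whereas the uncapped figure grows quadratically with the length of each user's preference list.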

Without downsampling, the indicator matrix will have arbitrarily many similar items of diminishing strength, some of which could be nearly useless. This potentially large vector may be unwieldy in your other calculations and has not had low-value similar items filtered out.

Bottom line is that the downsampling can be tweaked, but removing it altogether is not likely to be a good thing.
