Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2010/08/05 08:52:18 UTC
[jira] Commented: (MAHOUT-455) NearestNUserNeighborhood problems with large Ns

    [ https://issues.apache.org/jira/browse/MAHOUT-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895569#action_12895569 ] 

Ted Dunning commented on MAHOUT-455:
------------------------------------

At the risk of being rude, I think that this would not fix a thing since the current behavior is correct and follows directly from first principles.

The algorithm is called "nearest-neighbors" because that is what it does.  The intent is to use only the nearest few neighbors.  This method has a long lineage in data-mining and the crux of it all is that "nearest few" part.  That is what allows it to express very complex relations by means of locally linear relations.


If you really want to include all users, just compute the single average of all users off-line and be done with it.  This works because if you include any weighting by proximity, then you can use a moderate number of neighbors and get essentially the same result (i.e. the current behavior).  If your weighting does not strongly depend on distance, as must be the case if all of the users are included in the sample, then you have a measure that does not depend on the user you are recommending for; that is, you have reinvented a "most-popular" recommendation.  If that is what you want, then you should use that and not a nearest-neighbor recommender.
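
To make that concrete, here is a minimal sketch in plain Java.  It is not Mahout code: the class name, the method names, the ratings and the similarity values are all made up for illustration.  It estimates one item's rating as a similarity-weighted average over the n most similar users.

```java
import java.util.Arrays;
import java.util.Comparator;

public class NeighborhoodSketch {

    /**
     * Similarity-weighted average of one item's ratings over the n users
     * most similar to the target user. Hypothetical helper, not Mahout's API.
     * Assumes every user has rated the item and all similarities are positive.
     */
    static double estimate(double[] itemRatings, double[] similarity, int n) {
        Integer[] order = new Integer[itemRatings.length];
        for (int u = 0; u < order.length; u++) {
            order[u] = u;
        }
        // Sort user indices by descending similarity to the target user.
        Arrays.sort(order, Comparator.comparingDouble((Integer u) -> -similarity[u]));

        int k = Math.min(n, order.length); // never use more neighbors than exist
        double weighted = 0.0;
        double total = 0.0;
        for (int j = 0; j < k; j++) {
            int u = order[j];
            weighted += similarity[u] * itemRatings[u];
            total += similarity[u];
        }
        return weighted / total;
    }

    public static void main(String[] args) {
        // Made-up ratings of a single item by four other users.
        double[] itemRatings = {5.0, 4.0, 1.0, 2.0};
        // Made-up similarities of those users to two different target users.
        double[] simToA = {0.9, 0.8, 0.1, 0.05};
        double[] simToB = {0.05, 0.1, 0.9, 0.8};
        // Weights that ignore distance entirely.
        double[] uniform = {1.0, 1.0, 1.0, 1.0};

        System.out.println("A, n=2, weighted:   " + estimate(itemRatings, simToA, 2));
        System.out.println("A, n=all, weighted: " + estimate(itemRatings, simToA, 4));
        System.out.println("B, n=2, weighted:   " + estimate(itemRatings, simToB, 2));
        System.out.println("A, n=all, uniform:  " + estimate(itemRatings, uniform, 4));
        System.out.println("B, n=all, uniform:  " + estimate(itemRatings, uniform, 4));
    }
}
```

With proximity weighting, the estimate for user A moves only slightly when going from two neighbors to all four, because the distant users carry little weight.  With uniform weights over all users, the similarity values never enter the computation, so A and B get the identical estimate, which is just the item's overall average.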

Either way, I think that the current code isn't broken.



> NearestNUserNeighborhood problems with large Ns
> -----------------------------------------------
>
>                 Key: MAHOUT-455
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-455
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.3
>         Environment: Linux
>            Reporter: Yanir Seroussi
>            Priority: Minor
>
> I set a large n for NearestNUserNeighborhood, with the intention of including all users in the neighbourhood. However, I encountered the following problems:
> (1) If n is set to Integer.MAX_VALUE, the program crashes with the following stack trace:
> Exception in thread "main" java.lang.IllegalArgumentException
> 	at java.util.PriorityQueue.<init>(PriorityQueue.java:152)
> 	at org.apache.mahout.cf.taste.impl.recommender.TopItems.getTopUsers(TopItems.java:93)
> 	at org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood.getUserNeighborhood(NearestNUserNeighborhood.java:111)
> This is because TopItems.getTopUsers() tries to create a PriorityQueue with a capacity of Integer.MAX_VALUE + 1.
> (2) If n is set to a large integer value (e.g., 1 billion), it crashes with the following stack trace:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> 	at java.util.PriorityQueue.<init>(PriorityQueue.java:153)
> 	at org.apache.mahout.cf.taste.impl.recommender.TopItems.getTopUsers(TopItems.java:93)
> 	at org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood.getUserNeighborhood(NearestNUserNeighborhood.java:111)
> This is due to the same reason - trying to create a PriorityQueue with size n + 1.
> In my opinion, this should be fixed by changing n to the number of users in the DataModel when NearestNUserNeighborhood is created, or by letting users specify n = -1 (or a similar value) when they want the user neighbourhood to include all users.
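
For reference, the mechanics behind crash (1) in the quoted report can be reproduced in isolation, independently of whether a very large n is a sensible choice.  The sketch below is not the Mahout source: the class and the safeCapacity helper are hypothetical, and it assumes n >= 1 and a user count below Integer.MAX_VALUE.  It shows the n + 1 overflow and one way of clamping the capacity along the lines the reporter suggests.

```java
import java.util.PriorityQueue;

public class QueueCapacitySketch {

    /**
     * Hypothetical helper: capacity for a top-n queue that never exceeds
     * the number of users. Assumes n >= 1 and numUsers < Integer.MAX_VALUE.
     */
    static int safeCapacity(int n, int numUsers) {
        // Clamp first, then add 1, so Integer.MAX_VALUE cannot overflow.
        return Math.min(n, numUsers) + 1;
    }

    public static void main(String[] args) {
        int n = Integer.MAX_VALUE;

        // n + 1 wraps around to Integer.MIN_VALUE, a negative capacity.
        System.out.println("n + 1 = " + (n + 1));

        try {
            // PriorityQueue rejects an initial capacity below 1.
            new PriorityQueue<Double>(n + 1);
        } catch (IllegalArgumentException e) {
            System.out.println("PriorityQueue rejected the capacity");
        }

        // With the capacity clamped to a (made-up) user count, construction succeeds.
        PriorityQueue<Double> queue = new PriorityQueue<>(safeCapacity(n, 1000));
        queue.add(0.5);
        System.out.println("queue size = " + queue.size());
    }
}
```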
