You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2011/11/09 01:46:52 UTC

TopItems.getTopUsers()

I've been reading code and am wondering about TopItems.getTopUsers

Here's my pseudocode of it (lines 96-134 in TopItems)
Get a Priority Queue (PQ)
for all users
	estimate the similarity of user i with our current user (the one we are generating a rec for)
	put a SimilarUser object on the PQ.

//HERE
Create a list of SimilarUser objects
Populate that list with the PQ objects, which also contains SimilarUser objects 
Sort that list
Iterate over that list once more and grab the ids and put it into an array
//HERE
return the array

Why all that in between moving stuff marked by //HERE?  Isn't that traversing the same list 3 times?  Can't we just pop from the priority queue in reverse order and fill the array from back to front?  Am I missing something?

Here's the same code in Lucene:
<snip>
protected void populateResults(ScoreDoc[] results, int howMany) {
    for (int i = howMany - 1; i >= 0; i--) { 
      results[i] = pq.pop();
    }
  }
</snip>

Also, FWIW, Lucene's PQ implementation is faster than Java's, from what I understand (which is why we wrote our own).

-Grant

Re: TopItems.getTopUsers()

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Tue, Nov 8, 2011 at 7:46 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Also, FWIW, Lucene's PQ implementation is faster than Java's, from what I understand (which is why we wrote our own).

That's why we *kept* our own (it predates the standard JDK implementation).
The existence of adjustTop() (and our use of that) is why it's faster.

-Yonik
http://www.lucidimagination.com

Re: TopItems.getTopUsers()

Posted by Sean Owen <sr...@gmail.com>.

On Wed, Nov 9, 2011 at 1:52 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Notice the code in Lucene.  It is solving the exact same problem.  Return the top X things by score.  Our ScoreDoc is made of a score and an id.

I'm sure it does -- but I thought the question was why this code was
the way it is, and why you couldn't just use the queue as-is? I am
sure the Lucene equivalent must do something similar. You can't solve
this with a max-heap, if that's what you mean.

> Yeah, I know.  But why all the shuffling around?  We have all we need on the PQ already.

Yes, but not in order! What are you suggesting would work, that is simpler?

> Multiplied by millions or requests, it adds up.  I think Recommenders, especially, are to the point where it's time to start tightening the screws on performance.   I've been reviewing this stuff quite a bit lately and am struck by how similar this stuff is to Lucene's core scoring, albeit from a few years ago (pre Lucene 2.4, by my guess).  For instance, our "IDRescorer", is modeled via a Collector class and bit set filters and it's integrated directly into the scoring, whereas Mahout sorts after scoring.

This is nowhere near the bottleneck, I assure you, but invite you to
profile it to have a look. This piece of code has been profiled and
scrubbed many times over.

I don't understand the comment about sorting after scoring -- you
can't, in this context.

Re: TopItems.getTopUsers()

Posted by Grant Ingersoll <gs...@apache.org>.

On Nov 8, 2011, at 11:24 PM, Sean Owen wrote:

> The PriorityQueue there is a min heap. It's used to keep finding the
> smallest among a collection of large values. So, no it can't be used
> directly to create a list of items from large to small as it has a
> "get smallest" method, not "get largest".

Notice the code in Lucene.  It is solving the exact same problem.  Return the top X things by score.  Our ScoreDoc is made of a score and an id.

> 
> The result of the method is a list of IDs, not a list of item-score
> pairs. This is the reason for the last step where it's converted into
> just the IDs.

Yeah, I know.  But why all the shuffling around?  We have all we need on the PQ already.

> 
> The size of queue here is typically really small, like 10 items. The
> queue implementation probably makes no measurable difference at that
> size.

Multiplied by millions or requests, it adds up.  I think Recommenders, especially, are to the point where it's time to start tightening the screws on performance.   I've been reviewing this stuff quite a bit lately and am struck by how similar this stuff is to Lucene's core scoring, albeit from a few years ago (pre Lucene 2.4, by my guess).  For instance, our "IDRescorer", is modeled via a Collector class and bit set filters and it's integrated directly into the scoring, whereas Mahout sorts after scoring.


> 
> On Wed, Nov 9, 2011 at 12:46 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I've been reading code and am wondering about TopItems.getTopUsers
>> 
>> Here's my pseudocode of it (lines 96-134 in TopItems)
>> Get a Priority Queue (PQ)
>> for all users
>>        estimate the similarity of user i with our current user (the one we are generating a rec for)
>>        put a SimilarUser object on the PQ.
>> 
>> //HERE
>> Create a list of SimilarUser objects
>> Populate that list with the PQ objects, which also contains SimilarUser objects
>> Sort that list
>> Iterate over that list once more and grab the ids and put it into an array
>> //HERE
>> return the array
>> 
>> Why all that in between moving stuff marked by //HERE?  Isn't that traversing the same list 3 times?  Can't we just pop from the priority queue in reverse order and fill the array from back to front?  Am I missing something?
>> 
>> Here's the same code in Lucene:
>> <snip>
>> protected void populateResults(ScoreDoc[] results, int howMany) {
>>    for (int i = howMany - 1; i >= 0; i--) {
>>      results[i] = pq.pop();
>>    }
>>  }
>> </snip>
>> 
>> Also, FWIW, Lucene's PQ implementation is faster than Java's, from what I understand (which is why we wrote our own).
>> 
>> -Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: TopItems.getTopUsers()

Posted by Sean Owen <sr...@gmail.com>.

The PriorityQueue there is a min heap. It's used to keep finding the
smallest among a collection of large values. So, no it can't be used
directly to create a list of items from large to small as it has a
"get smallest" method, not "get largest".

The result of the method is a list of IDs, not a list of item-score
pairs. This is the reason for the last step where it's converted into
just the IDs.

The size of queue here is typically really small, like 10 items. The
queue implementation probably makes no measurable difference at that
size.

On Wed, Nov 9, 2011 at 12:46 AM, Grant Ingersoll <gs...@apache.org> wrote:
> I've been reading code and am wondering about TopItems.getTopUsers
>
> Here's my pseudocode of it (lines 96-134 in TopItems)
> Get a Priority Queue (PQ)
> for all users
>        estimate the similarity of user i with our current user (the one we are generating a rec for)
>        put a SimilarUser object on the PQ.
>
> //HERE
> Create a list of SimilarUser objects
> Populate that list with the PQ objects, which also contains SimilarUser objects
> Sort that list
> Iterate over that list once more and grab the ids and put it into an array
> //HERE
> return the array
>
> Why all that in between moving stuff marked by //HERE?  Isn't that traversing the same list 3 times?  Can't we just pop from the priority queue in reverse order and fill the array from back to front?  Am I missing something?
>
> Here's the same code in Lucene:
> <snip>
> protected void populateResults(ScoreDoc[] results, int howMany) {
>    for (int i = howMany - 1; i >= 0; i--) {
>      results[i] = pq.pop();
>    }
>  }
> </snip>
>
> Also, FWIW, Lucene's PQ implementation is faster than Java's, from what I understand (which is why we wrote our own).
>
> -Grant