You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Sean Owen (Commented) (JIRA)" <ji...@apache.org> on 2011/11/12 11:06:51 UTC

[jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

    [ https://issues.apache.org/jira/browse/MAHOUT-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149034#comment-13149034 ] 

Sean Owen commented on MAHOUT-881:
----------------------------------

-1 Grant I thought we discussed this on the mailing list? I don't see that this achieves anything. You are still doing an implicit n log n heap sort to attain an ordered result. You still allocate a container object for the result. This seems like swapping some code for more complex (but equally working) code, and adding a Lucene dependency.

Do you have a load test that shows it would be notably faster? the patch does have such a thing.

The comment about SimilarUser is right though. It would just affect tie breaking, which doesn't really matter, but should be fixed.
                
> Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting
> ----------------------------------------------------------------------------
>
>                 Key: MAHOUT-881
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-881
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.6
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: MAHOUT-881.patch
>
>
> TopItems.getTop*() all do a fair number of excessive operations that can be replaced by switching to using Lucene's PriorityQueue implementation, which is more efficient and faster than Java's built in PQ implementation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: Vetoes and [jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

Posted by Grant Ingersoll <gs...@apache.org>.

On Nov 12, 2011, at 12:58 PM, Sean Owen wrote:

> On Sat, Nov 12, 2011 at 7:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> OK, that's reasonable.  I did see the reply and my response was to put up a patch so we can discuss it concretely instead of have a theoretical discussion.
> 
> Your JIRA just repeated the same point, and appeared to disregard my
> follow-up. It looks like the stated motivation for the patch was in
> fact not correct. (Yonik provided a different, decent rationalization
> for it.) I think it was right to -1 the patch. and I don't think you
> should have suggested this was an invalid or extreme thing to do.
> Opening a JIRA, which suggests you think there's consensus to make a
> change, seems like the out-of-place thing.

Yes, you are right.  I overreacted and should have explained myself better.  My apologies.


At this point, I've seen a variety of results in benchmarking on my laptop, some suggesting a difference of nothing (favoring your point of view) to about 9% (which I think we would like to have!)   Since I don't have faith in these tests on my laptop, I am going to re-run when I get home so I have access to my benchmark server.  I'll put up the patch containing the benchmarking code so others can scrutinize it.  Longer term, I'd like to bring over Lucene's benchmark framework, modified for our needs.

> 
> 
>> These 40 lines could probably be a lot less if we didn't have 2 different classes to track the same two things:  id and score (SimilarUser and RecommendedItem).  If that were gone, then we could just have one extension of the PQ.  Likewise for ItemItemSimilarity and UserUserSimilarity container.  Since that's all hidden from the user, there really is no need for the distinction.
> 
> This is an orthogonal point -- what's it have to do with your proposed
> patch? I think the patch is fine, but don't think this bolsters the
> argument that it's a big simplification.
> 

I was just responding to the suggestion that the patch was adding a lot more lines of code, but I don't think it would if it weren't for the redundancy in storage classes.

> You can overload, say, SimilarUser and GenericRecommendedItem. They're
> both an ID and a floating-point value. What do you call it --
> EntityScore? It probably harms readability ever so slightly,
> especially as it also has to implement the RecommendedItem interface.
> toString() would change. I don't object, I don't really think it
> helps.

Perhaps, but they are internal objects, right?


Re: Vetoes and [jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

Posted by Ted Dunning <te...@gmail.com>.
It is also plausible to push a template into the PQ implementation or into
the insertion code that is being duplicated.  Then the same code can type
check and handle both cases.

If the lucene PQ doesn't handle templating, then the code can just be
lifted and fixed.

On Sat, Nov 12, 2011 at 12:58 PM, Sean Owen <sr...@gmail.com> wrote:

> You can overload, say, SimilarUser and GenericRecommendedItem. They're
> both an ID and a floating-point value. What do you call it --
> EntityScore? It probably harms readability ever so slightly,
> especially as it also has to implement the RecommendedItem interface.
> toString() would change. I don't object, I don't really think it
> helps.
>

Re: Vetoes and [jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

Posted by Sean Owen <sr...@gmail.com>.
On Sat, Nov 12, 2011 at 7:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> OK, that's reasonable.  I did see the reply and my response was to put up a patch so we can discuss it concretely instead of have a theoretical discussion.

Your JIRA just repeated the same point, and appeared to disregard my
follow-up. It looks like the stated motivation for the patch was in
fact not correct. (Yonik provided a different, decent rationalization
for it.) I think it was right to -1 the patch. and I don't think you
should have suggested this was an invalid or extreme thing to do.
Opening a JIRA, which suggests you think there's consensus to make a
change, seems like the out-of-place thing.


> These 40 lines could probably be a lot less if we didn't have 2 different classes to track the same two things:  id and score (SimilarUser and RecommendedItem).  If that were gone, then we could just have one extension of the PQ.  Likewise for ItemItemSimilarity and UserUserSimilarity container.  Since that's all hidden from the user, there really is no need for the distinction.

This is an orthogonal point -- what's it have to do with your proposed
patch? I think the patch is fine, but don't think this bolsters the
argument that it's a big simplification.

You can overload, say, SimilarUser and GenericRecommendedItem. They're
both an ID and a floating-point value. What do you call it --
EntityScore? It probably harms readability ever so slightly,
especially as it also has to implement the RecommendedItem interface.
toString() would change. I don't object, I don't really think it
helps.

Re: Vetoes and [jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 12, 2011, at 8:28 AM, Sean Owen wrote:

> I was surprised to see a JIRA pop up when we had an open thread on the
> topic on the mailing list. Did you see my reply? I believe it had
> answered your questions... then you seemed to be about to make a
> change anyway. I don't think anything is "serious" here, but I suppose
> that's what I was flagging with a -1. Indulge me below first please.

OK, that's reasonable.  I did see the reply and my response was to put up a patch so we can discuss it concretely instead of have a theoretical discussion.

> 
> I am sure this can easily be agreed on. I detest classic Apache-style
> JIRA fights, I'm not picking any such thing!
> 
> 
>> Just b/c some piece of code isn't considered a bottleneck at the moment means it can't be improved?  This new code removes a fair bit of complexity (the code was passing over the same data at least 3 times in some cases, when it only needed to pass over it once) and puts in place an approach that is far fewer operations AND has a faster underlying data structure.  It's a no-brainer.  I would think we would be happy that we are finally to the point where doing these kinds of optimizations is a worthwhile exercise.
> 
> No disagreement about optimizing and improving stuff even for its own
> sake. The question is whether this is an improvement and I'm not sure
> I agree with that reasoning (see next). This adds a net 40 lines of
> code -- I am not sure that's a simplification? It's no big deal
> either, but I suppose I am not seeing that win.

These 40 lines could probably be a lot less if we didn't have 2 different classes to track the same two things:  id and score (SimilarUser and RecommendedItem).  If that were gone, then we could just have one extension of the PQ.  Likewise for ItemItemSimilarity and UserUserSimilarity container.  Since that's all hidden from the user, there really is no need for the distinction.


Re: Vetoes and [jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

Posted by Sean Owen <sr...@gmail.com>.
I was surprised to see a JIRA pop up when we had an open thread on the
topic on the mailing list. Did you see my reply? I believe it had
answered your questions... then you seemed to be about to make a
change anyway. I don't think anything is "serious" here, but I suppose
that's what I was flagging with a -1. Indulge me below first please.

I am sure this can easily be agreed on. I detest classic Apache-style
JIRA fights, I'm not picking any such thing!


> Just b/c some piece of code isn't considered a bottleneck at the moment means it can't be improved?  This new code removes a fair bit of complexity (the code was passing over the same data at least 3 times in some cases, when it only needed to pass over it once) and puts in place an approach that is far fewer operations AND has a faster underlying data structure.  It's a no-brainer.  I would think we would be happy that we are finally to the point where doing these kinds of optimizations is a worthwhile exercise.

No disagreement about optimizing and improving stuff even for its own
sake. The question is whether this is an improvement and I'm not sure
I agree with that reasoning (see next). This adds a net 40 lines of
code -- I am not sure that's a simplification? It's no big deal
either, but I suppose I am not seeing that win.


> As for n log n, yeah, I know, but really the current code is nlogn (PQ operations) + n(copying arrays) + nlogn(sort).  Not too mention several extra memory allocations and branch conditions.  We all know that the constant factor and the additive factors hidden by Big O notation can still make a difference in real systems.   Besides, I have some ideas on other optimizations we can make too, like precomputing more stuff, but that is a separate discussion.

I don't think this analysis is correct, and I think if we delve into
it, maybe what I'm saying will be clearer.

The current code does not perform n log n heap operations -- what are
you referring to? It doesn't even do n inserts. That's the benefit of
keeping around the smallest-big-value value to compare to. This is
exactly the strategy that the Lucene class uses too, it seems:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/3.4.0/org/apache/lucene/util/PriorityQueue.java

That's not what I meant. Constructing the sorted list of results from
a heap necessarily involves a sort. Your code does a heap sort,
implicitly, by calling pop() to clear the PriorityQueue. There is no
difference there. The new method also spends the same amount of time
sorting.

There's also no difference in allocating a container for the result in
getTopItems(). Both fill out a List. What's the win there?

There *is* a difference in getTopUsers()! You are going straight to
the long[] result bypassing the intermediate array of results. That's
a small win but a win.


> So, do you really think this is worth vetoing?   This is nowhere near a veto-able piece of code, IMO.  Not even in the same solar system.  Vetoes are for stuff that moves something in a completely opposite direction from the project's goal.  Us getting rid of Watchmaker is a veto-able offense if someone feels near and dear to that code.  This is a micro optimization in a single class file.  Sure, there are other, bigger optimizations we can make, but that doesn't mean we can't fix things.

I was concerned about a perceived process problem above and just
wanted to put up a stop-sign in Apache-speak. Is it so serious? To
rationalize it: I thought you were indicating you were moving ahead
with a change that I had open questions about. That seems like
something we shouldn't do and against mom and apple pie and the Apache
way. It seems OK to flag that with a -1. Right? I don't know. That's
my meaning, anyway.


> I'm working on it, which is why I haven't committed yet, b/c I know it bears some scrutiny.  I was putting it up to show how much simpler it is.  But it doesn't take a benchmarking tool to see that removing n inserts, a sort of an array of size n and then another n inserts in favor of a single set of n inserts is going to be faster.  Not too mention it also kills several extraneous array allocations and branch condition checks.  This isn't the kind of stuff the JVM can optimize for you.

Since you've already done some good work to put that in place it would
really be good to see the result, indeed. In practice it sure could be
faster, theoretical arguments aside! And that speaks for itself. But
even if it's slower by a bit, and you've heard all the thinking here
and would still prefer to make the change, I'm OK with it. Then at
least it will have been after useful discussion and even measurement.

Vetoes and [jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

Posted by Grant Ingersoll <gs...@apache.org>.
Moving this discussion to the mailing list, as a -1 is a fairly serious thing when it comes to Apache and it merits a discussion.

On Nov 12, 2011, at 2:06 AM, Sean Owen (Commented) (JIRA) wrote:

> 
>    [ https://issues.apache.org/jira/browse/MAHOUT-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149034#comment-13149034 ] 
> 
> Sean Owen commented on MAHOUT-881:
> ----------------------------------
> 
> -1 Grant I thought we discussed this on the mailing list? I don't see that this achieves anything. You are still doing an implicit n log n heap sort to attain an ordered result. You still allocate a container object for the result. This seems like swapping some code for more complex (but equally working) code, and adding a Lucene dependency.

Just b/c some piece of code isn't considered a bottleneck at the moment means it can't be improved?  This new code removes a fair bit of complexity (the code was passing over the same data at least 3 times in some cases, when it only needed to pass over it once) and puts in place an approach that is far fewer operations AND has a faster underlying data structure.  It's a no-brainer.  I would think we would be happy that we are finally to the point where doing these kinds of optimizations is a worthwhile exercise.

As for the Lucene dependency, we already have it and it isn't going anywhere anytime soon.  

As for n log n, yeah, I know, but really the current code is nlogn (PQ operations) + n(copying arrays) + nlogn(sort).  Not too mention several extra memory allocations and branch conditions.  We all know that the constant factor and the additive factors hidden by Big O notation can still make a difference in real systems.   Besides, I have some ideas on other optimizations we can make too, like precomputing more stuff, but that is a separate discussion.

So, do you really think this is worth vetoing?   This is nowhere near a veto-able piece of code, IMO.  Not even in the same solar system.  Vetoes are for stuff that moves something in a completely opposite direction from the project's goal.  Us getting rid of Watchmaker is a veto-able offense if someone feels near and dear to that code.  This is a micro optimization in a single class file.  Sure, there are other, bigger optimizations we can make, but that doesn't mean we can't fix things.

> 
> Do you have a load test that shows it would be notably faster? the patch does have such a thing.

I'm working on it, which is why I haven't committed yet, b/c I know it bears some scrutiny.  I was putting it up to show how much simpler it is.  But it doesn't take a benchmarking tool to see that removing n inserts, a sort of an array of size n and then another n inserts in favor of a single set of n inserts is going to be faster.  Not too mention it also kills several extraneous array allocations and branch condition checks.  This isn't the kind of stuff the JVM can optimize for you.

-Grant