You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2009/05/02 12:18:55 UTC

Re: Search result ordering

Lucene's field sorting will be faster in 2.9.

First, the warming time of a reopened reader (after a writer has
committed some changes) will typically be very much faster.  Second,
you'll be able to optionally turn off score computation when sorting
by field.  Finally, the actual cost of sorting per hit should go down
(we try hard to specialize the various combinations of sort fields).

That said, I'm surprised even on 2.4.x you're seeing such a difference
in query times when sorting by score (default) vs say a String field.
In 2.4.x, String sorting has a very costly "warmup", but once warmed
the searching ought to be quite fast.  Were your times measured after
warming had been done on the reader?

Mike

On Wed, Apr 29, 2009 at 5:23 PM,  <Bi...@sungard.com> wrote:
> Unfortunately we do periodically add Documents to our index.  However, I wasn't aware of the Lucene-assigned doc ID or Sort.INDEXORDER.  This is good information to know.  Who knows, we might be able to refactor things to use this method.
>
> Regarding performance, yes I have actually seen some differences.  With no sort at all the query is lightning fast, like a tenth of a second.  With the String sort it varies from about 1.5 - 3 seconds.  I also tried sorting on an integer field.  This consistently takes about 1 second.  These numbers are not terrible.  Again, I'm pretty impressed how fast it completes with a sort given how many hits we get sometimes.  Just trying to see if we could squeeze some more cycles out of it.
>
> Thanks for the help.
>
> regards,
>
> Bill Chesky
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Wednesday, April 29, 2009 4:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Search result ordering
>
> I really doubt boosting at index time will help. All that expresses is
> that "this document's title (say) is more important *when calculating score*
> than other documents with a smaller title boost".
>
> But since you're not searching on your key (I assume), boosting
> at index time would be irrelevant to score calculations that didn't
> search on your key. I doubt you want to go there...
>
> So I think you're really back to sorting on your key, I don't know
> of any other way to get what you want.
>
> Unless you want to be really adventurous and your index is pretty
> static (or at least you can regenerate the whole thing "often enough").
> NOTE: this scheme *will not work* if you expect to add to and/or
> update your index without rebuilding it from scratch.
>
> In a nutshell, pre-sort your corpus and add the docs in the order
> of your key. Now for your index, the following holds true:
>
> if (key1 < key2) then (id1 < id2).
>
> where id is the Lucene-assigned doc ID.
>
> You can then sort the documents by INDEXORDER (see the Sort
> API in the javadocs), which should, at least, be very fast and use
> a minimal amount of memory. Whether this is enough of a gain
> over just sorting on your key to justify the complexity I leave
> as an exercise for the reader <G>.
>
> I should stress, though, that it won't work if you alter your index
> *at all* after you build it. Updating a document is really a delete/add
> so after updating, that document would be at the end every time.
> Similarly if you add new documents, each and every one will
> be after the last document in your original index.
>
> I suppose I should also ask whether you're really seeing
> performance issues or whether this is theoretical. I'm a big
> advocate of avoiding optimizations until you have a reasonable
> certainty that you need it. Premature optimization and all that.
>
> Best
> Erick
>
> On Wed, Apr 29, 2009 at 3:19 PM, <Bi...@sungard.com> wrote:
>
>> Thanks Erick,
>>
>> Basically, the ideal ordering is an alphabetical one based on a String
>> value that is known at index creation.  I was just wondering if there was
>> anything I could do at index creation time that might help me enforce that
>> ordering at query time (without using a Sort).  To be honest, I haven't had
>> to deal much w/ scoring in my work w/ Lucene.  Our app just searches based
>> on some set of criteria and the results returned are all pretty much equal.
>>  However, we want to list them alphabetically so they don't appear in a
>> jumbled, seemingly random order.
>>
>> Your comment about boost and scoring prompted me to read up a bit on it and
>> it got me wondering if maybe boost could be used somehow.  E.g. once all the
>> docs have been added to the index and I know the order I want, I could go
>> back and set the boost for each document accordingly.  But maybe this is a
>> naïve or innapriate use of boost.
>>
>> Thanks for the tip on Hits.  We're using 2.2.0 and should probably upgrade.
>>  We're at a point in development, though, where we want to keep new
>> variables to a minimum.  Interestingly I don't see much of a difference when
>> paging through hits 1-10 vs. hits 300-310.  They all seem to take about the
>> same time to evaluate.  I'll try using one of the HitCollectors as you
>> suggest to see if it makes a difference.
>>
>> regards,
>>
>> --
>> Bill Chesky
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Wednesday, April 29, 2009 1:46 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Search result ordering
>>
>> People (including me) use Lucene to page through results all the time,
>> so I'm pretty sure you're OK.
>>
>> so here's my answers...
>> (1) yes.
>> (2) Well, the default sort is by score so if you want some other
>>     ordering you have to sort.
>> (3) You can boost things at index time, but I don't think that's at all
>>     relevant. What order are you trying to enforce that you know
>>     enough about at index time to specify?
>>
>> Do note, though, that Hits is deprecated. The problem is that
>> Hits was intended to be reasonable *only* when accessing the first
>> few documents. If you're paging far into the result set, be aware that
>> using Hits will re-execute the entire query every 100 (200?) results you
>> throw away. Think about one of the HitCollectors
>> (perhaps TopDocCollector) instead.
>>
>> Best
>> Erick
>>
>>
>> On Wed, Apr 29, 2009 at 1:01 PM, <Bi...@sungard.com> wrote:
>>
>> > Hello,
>> >
>> > I have a few questions about the ordering of search results:
>> >
>> > 1) Given a query, are the Documents contained in the Hits object that is
>> > returned by IndexSearcher.search(Query query) guaranteed to be in the
>> > same order from one call to the next (assuming the index has not been
>> > updated in the meantime)?
>> > 2) Assuming I don't use the IndexSearcher.search(Query query, Sort sort)
>> > method, is the ordering of Documents in the Hits object predictable at
>> > all?
>> > 3) Short of using the IndexSearcher.search(Query query, Sort sort), is
>> > there any way to influence the ordering of the Documents in the Hits
>> > object?  E.g. is there anything that I can do when creating and/or
>> > updating the index that will guarantee a certain ordering of results at
>> > query time?
>> >
>> > For what it's worth, we're using Lucene in conjuction w/ a relational
>> > database.  Our index has an 'id' field that maps to a row in a
>> > relational database table.  Since Lucene queries are so quick we've
>> > found performance to be much better if we do a pure Lucene query to find
>> > the docs we need then do a simple SQL query with a "where id in (...)"
>> > clause.   We've also wrapped this in an interface that implements a
>> > relational-like limit/offset functionality.  So it's important to us
>> > that at the very least the query returns results in the same order each
>> > time.  Ideally, we'd like to order the results on a String field we have
>> > on each document.  This actually works, however it slows things down a
>> > bit.  This is understandable, course -- and I'm actually very impressed
>> > how quickly it does perform -- however, we're trying to squeeze as many
>> > cycles as possible out it to give the best user experience.   Just
>> > wondering if there is anything else we might try.
>> >
>> > thanks,
>> >
>> > Bill
>> >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org