Posted to java-user@lucene.apache.org by Morus Walter <mo...@googlemail.com> on 2009/02/17 11:10:19 UTC

distinct queries for search and scoring

Hello,

I'm currently thinking about what the best solution would be for the
following request:

- a lucene index should be queried for a number of search criteria
- the score for each result should not be the normal query score, but an
  indicator of the similarity between the matched document and some
  other conditions that can be expressed as a query as well.

The use case is something like a search for jobs (defined by arbitrary
user input) with scoring based on similarity to a user's profile
(basically his CV).

This can certainly be done in various ways:
- get the scores from a score query first, then do the main search and
  attach the scores to the results
- do the main search first and then run the score query using the
  results of the main search as a filter (the score query might need a
  small modification so that it matches all documents)
- combine the searches into one and make the score contribution of the
  main query negligible
- see if it's possible to run two scorers at a time and combine the
  results; of course one scorer would have to score documents in an
  order defined by the other (that's just a vague idea; I haven't
  checked the low-level APIs thoroughly yet, so maybe this does not
  work at all)
but I don't have a clear idea what the performance expectations for the
different ways might be.
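
To make the first option concrete, here is a minimal sketch in plain
Java with no Lucene calls. It assumes the score query's results have
already been collected into a map from doc ID to score, and that the
main search yields an array of doc IDs. ScoreJoin, Hit and attachScores
are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration; ScoreJoin and Hit are hypothetical, not Lucene classes.
final class ScoreJoin {
    /** One main-search hit carrying the score taken from the score query. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /**
     * Attaches score-query scores to the main-search hits.
     * Docs the score query did not match get a score of 0.
     */
    static List<Hit> attachScores(int[] mainDocs, Map<Integer, Float> profileScores) {
        List<Hit> hits = new ArrayList<Hit>();
        for (int doc : mainDocs) {
            Float s = profileScores.get(doc);
            hits.add(new Hit(doc, s == null ? 0f : s));
        }
        // Highest score first; ties broken by ascending doc ID.
        hits.sort((x, y) -> {
            int byScore = Float.compare(y.score, x.score);
            return byScore != 0 ? byScore : Integer.compare(x.doc, y.doc);
        });
        return hits;
    }
}
```

The cost of this variant is roughly one lookup per main-search hit plus
the sort, on top of running both queries in full.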

So before I start experimenting I'd like to ask if anyone on the list
has ever done something like this (or thought about it) or has other
insights that might be helpful.

The indices in question are medium-sized (50k/200k documents, but that
might increase to a few million). The main query might match a large
part of the index (up to all documents), as we do an incremental search
where each user input triggers a search even if the complete search
criteria haven't been provided yet. The number of documents with a
score larger than 0 (that is, documents matching the score query) is
usually smaller but might reach a few thousand.

regards
	Morus

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: distinct queries for search and scoring

Posted by Michael McCandless <lu...@mikemccandless.com>.

Is your scoring query also doing some filtering?  If so, you could
drive the search with your scoring query, and then pass in as a filter
your second query wrapped with QueryWrapperFilter.  I think that's
effectively your last option, which should be the most efficient one.

Or, if the scoring query does not do any filtering, or you expect the
main query to be more restrictive, you could try a BooleanQuery with
the two sub-queries as AND'd clauses, where the non-scoring query has
boost 0.0 (I'm not certain that works but it seems like it should).
The downside is that Lucene still does all the scoring work and then
multiplies by 0.0, I think.  If performance is good enough, I'd just go
with this.
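
As a rough illustration of the zero-boost idea, here is a simplified
plain-Java model. It is not Lucene code: it assumes the combined score
of a two-clause AND is just the boost-weighted sum of the clause scores,
and it leaves out normalization factors such as coord and queryNorm.
ZeroBoostConjunction and its method are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified model of an AND of two clauses; hypothetical, not a Lucene class.
final class ZeroBoostConjunction {
    /**
     * Returns the docs matched by BOTH clauses, keyed by doc ID.
     * The scoring clause contributes its full score; the main clause's
     * score is still looked up and multiplied by mainBoost.
     */
    static Map<Integer, Float> search(Map<Integer, Float> scoringClause,
                                      Map<Integer, Float> mainClause,
                                      float mainBoost) {
        Map<Integer, Float> out = new TreeMap<Integer, Float>();
        for (Map.Entry<Integer, Float> e : scoringClause.entrySet()) {
            Float mainScore = mainClause.get(e.getKey());
            if (mainScore != null) { // required clause: the doc must match both
                out.put(e.getKey(), e.getValue() + mainBoost * mainScore);
            }
        }
        return out;
    }
}
```

With mainBoost set to 0.0f, each surviving doc keeps exactly the score
of the scoring clause, while the main clause only decides which docs
survive. The main clause's score is still fetched for every match, which
is the wasted work mentioned above.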

To avoid computing that sub-score (only to throw it away) you could
make your own custom iteration through the matched docs, AND'ing
together the docIDs but then calling only your scoring query to do
the scoring.  But that's quite a bit more work.
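
The shape of that custom iteration can be sketched in plain Java. This
is only an outline under assumptions: DocIdCursor is a hypothetical
stand-in for a sorted doc ID stream (in Lucene that role would be played
by a scorer or doc ID iterator), and DocScorer stands in for the scoring
query. Neither is a Lucene class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cursor over ascending doc IDs; not a Lucene class. */
final class DocIdCursor {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs; // must be sorted ascending
    private int pos = 0;

    DocIdCursor(int... sortedDocs) { this.docs = sortedDocs; }

    /** Moves to the first doc ID >= target and returns it, or NO_MORE_DOCS. */
    int advance(int target) {
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

final class ScoreOnlyIntersection {
    /** Hypothetical hook for the scoring query's per-document score. */
    interface DocScorer { float score(int doc); }

    /**
     * Leapfrogs the two cursors and calls the scorer only for doc IDs
     * present in both streams; the main query is never scored.
     */
    static Map<Integer, Float> intersect(DocIdCursor main, DocIdCursor scoring,
                                         DocScorer scorer) {
        Map<Integer, Float> hits = new LinkedHashMap<Integer, Float>();
        int a = main.advance(0);
        int b = scoring.advance(0);
        while (a != DocIdCursor.NO_MORE_DOCS && b != DocIdCursor.NO_MORE_DOCS) {
            if (a < b) {
                a = main.advance(b);      // main is behind: skip ahead
            } else if (b < a) {
                b = scoring.advance(a);   // scoring is behind: skip ahead
            } else {
                hits.put(a, scorer.score(a)); // match in both streams
                a = main.advance(a + 1);
                b = scoring.advance(b + 1);
            }
        }
        return hits;
    }
}
```

For example, intersecting main docs 1, 3, 5, 7, 9 with scoring docs
3, 4, 9, 12 visits only docs 3 and 9, so the scorer runs twice and the
main query's scores are never computed.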

Mike


