Posted to java-user@lucene.apache.org by Morus Walter <mo...@googlemail.com> on 2009/02/17 11:10:19 UTC
distinct queries for search and scoring
Hello,
I'm currently thinking about what the best solution would be for the
following request:
- a lucene index should be queried for a number of search criteria
- the score for each result should not be the normal query score, but an
indicator of the similarity between the matched document and some
other conditions that can be expressed as a query as well.
The use case is something like a search for jobs (defined by arbitrary
user input) and a scoring based on similarity to a user's profile
(basically his CV).
This can certainly be done in various ways:
- get the scores from a score query; then do the main search and attach
the scores to the results
- do the main search first and then the score query using the results
of the main as a filter (the score query might need a small
modification to match for all documents)
- combine the searches into one and make the scoring part for the
main query negligible
- see if it's possible to run two scorers at a time and combine the
results; of course one scorer would have to score documents in an
order defined by the other (that's just a vague idea; I didn't check
the low level APIs thoroughly yet; so maybe this does not work at all)
but I don't have a clear idea what the performance expectations for the
different ways might be.
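
As an illustration of the first approach in the list (collect the scores from the score query, then attach them to the main results), here is a minimal library-free sketch in plain Java. It does not use the Lucene API: the class `AttachScores`, the `Hit` type and the `rank` method are hypothetical names, and the inputs stand in for whatever the two searches return. Documents not matched by the score query are assumed to get a score of 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AttachScores {
    /** One ranked result: a doc ID plus the score taken from the score query. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /**
     * mainHits: doc IDs matched by the main (user input) query.
     * profileScores: doc ID -> score produced by the score (profile) query.
     * Main hits that the score query did not match get score 0 (an assumption).
     */
    static List<Hit> rank(int[] mainHits, Map<Integer, Float> profileScores) {
        List<Hit> ranked = new ArrayList<>();
        for (int doc : mainHits) {
            Float s = profileScores.get(doc);
            ranked.add(new Hit(doc, s == null ? 0f : s));
        }
        // Highest profile score first; ties are broken by ascending doc ID.
        ranked.sort((a, b) -> {
            int c = Float.compare(b.score, a.score);
            return c != 0 ? c : Integer.compare(a.doc, b.doc);
        });
        return ranked;
    }
}
```

The lookup map only needs an entry per document matched by the score query (a few thousand here), while the loop runs once per main hit, so the cost of this variant is driven mostly by the size of the main result set.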
So before I start experimenting I'd like to ask if anyone on the list
has ever done something like this (or thought about it) or has other
insights that might be helpful.
The indices in question are medium size (50k/200k documents; but that
might increase up to a few million). The main query might match a large
part of that index (up to all documents), as we do an incremental search
where each user input results in a search even if the complete search
criteria aren't provided yet. The number of documents having a score
larger than 0 (that is, documents matching the score query) is usually
smaller but might reach a few thousand.
regards
Morus
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: distinct queries for search and scoring
Posted by Michael McCandless <lu...@mikemccandless.com>.
Is your scoring query also doing some filtering? If so, you could
drive the search with your scoring query, and then pass in as a filter
your second query wrapped with QueryWrapperFilter. I think that's
effectively your last option, which should be the most efficient one.
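
To make the effect of that filter concrete, here is a rough library-free sketch in plain Java, not Lucene code. It assumes the main query's matches are available as a bit set with one bit per doc ID, and that the scoring query yields doc ID/score pairs; `FilteredScoring` and `applyFilter` are hypothetical names.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class FilteredScoring {
    /**
     * scoredHits: doc ID -> score as produced by the scoring query.
     * allowed: one bit per doc ID, set for documents matched by the main query.
     * Returns the scored hits that pass the filter, scores unchanged.
     */
    static Map<Integer, Float> applyFilter(Map<Integer, Float> scoredHits, BitSet allowed) {
        Map<Integer, Float> kept = new LinkedHashMap<>();
        for (Map.Entry<Integer, Float> e : scoredHits.entrySet()) {
            if (allowed.get(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Note that a document matched by the main query but not by the scoring query never shows up in the result, which is why this variant only fits when the scoring query is allowed to filter as well.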
Or, if the scoring query does not do any filtering, or you expect the
main query to be more restrictive, you could try a BooleanQuery with
the two sub-queries as AND'd clauses, where the non-scoring query has
boost 0.0 (I'm not certain that works but it seems like it should).
The downside is Lucene still does all the scoring work, and then
multiplies by 0.0, I think. If performance is good enough I'd just go
with this?
To avoid computing that sub-score (only to throw it away) you could
make your own custom iteration through the matched docs, AND'ing
together the docIDs but then calling only on your scoring query to do
the scoring. But that's quite a bit more work.
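
The idea of that custom iteration can be sketched without Lucene as a merge over two sorted doc ID lists, where the score is computed only for IDs present in both. This is a simplified model under stated assumptions: real scorers expose iterators rather than arrays, and `ConjunctionScoring`, `DocScorer` and `intersectAndScore` are hypothetical names, not Lucene classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConjunctionScoring {
    /** Hypothetical stand-in for the scoring query's per-document score. */
    interface DocScorer {
        float score(int doc);
    }

    /**
     * mainDocs and scoreDocs must be sorted ascending.
     * Walks both lists in step and calls the scorer only for doc IDs that
     * appear in both, so no score is computed for the main query at all.
     */
    static Map<Integer, Float> intersectAndScore(int[] mainDocs, int[] scoreDocs, DocScorer scorer) {
        Map<Integer, Float> out = new LinkedHashMap<>();
        int i = 0;
        int j = 0;
        while (i < mainDocs.length && j < scoreDocs.length) {
            if (mainDocs[i] < scoreDocs[j]) {
                i++; // main side is behind, advance it
            } else if (mainDocs[i] > scoreDocs[j]) {
                j++; // scoring side is behind, advance it
            } else {
                out.put(mainDocs[i], scorer.score(mainDocs[i]));
                i++;
                j++;
            }
        }
        return out;
    }
}
```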
Mike