Posted to java-user@lucene.apache.org by Kyle Maxwell <fi...@gmail.com> on 2007/10/04 02:17:04 UTC

Generalized proximity query performance

Hi again,

As the subject suggests, I'm trying to implement a layer of
proximity weighting over Lucene.  This has greatly improved search
relevance, but at the same time has reduced performance by a
substantial amount (see footer).

I am using a hand rolled query of the following form (implemented with
SpanNearQuery, not a sloppy PhraseQuery):
a b c => +(a AND b AND c) OR "a b"~5 OR "b c"~5

The obvious solution, "a b c"~5, is not applicable for my issues, because I
would like to allow for the possibility that a and b are near each other in
one field, while c is in another field.
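
For illustration, here is a minimal sketch of the rewrite above as plain Java string manipulation. It only produces the query-parser form of the expression; it makes no Lucene calls, and the class and method names are hypothetical. The real implementation builds SpanNearQuery objects instead of strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns [a, b, c] into
//   +(a AND b AND c) OR "a b"~5 OR "b c"~5
class PairwiseProximityQuery {

    static String build(List<String> terms, int slop) {
        if (terms.isEmpty()) {
            return "";
        }
        // Required part: every term must match somewhere.
        StringBuilder sb = new StringBuilder("+(");
        sb.append(String.join(" AND ", terms)).append(")");
        // Optional part: one proximity clause per adjacent pair of terms.
        for (int i = 0; i + 1 < terms.size(); i++) {
            sb.append(" OR \"")
              .append(terms.get(i)).append(' ').append(terms.get(i + 1))
              .append("\"~").append(slop);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("a", "b", "c"), 5));
    }
}
```

A query of n terms yields n - 1 proximity clauses per field in this form, so "paris hilton goes to jail" adds four of them where "paris hilton" adds one.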

So, is there something I'm missing to make this performant?  Would
reordering the clauses or rewriting the query help?  If there's no solution in
existing Lucene, would anyone be interested in investigating options with
me?

-Kyle


Somewhat arbitrary benchmarks.
--------------
Before:
$ ./bench.rb "paris hilton"
0.022000   0.000000   0.022000 (  0.021000)
$ ./bench.rb "paris hilton goes to jail"
0.024000   0.000000   0.024000 (  0.024000)

After:
$> ./bench.rb "paris hilton"
0.103000   0.000000   0.103000 (  0.103000)
$> ./bench.rb "paris hilton goes to jail"
1.514000   0.000000   1.514000 (  1.513000)

Re: Generalized proximity query performance

Posted by Chris Hostetter <ho...@fucit.org>.
: If I could intelligently rewrite queries, this would be better formulated
: as:
: title:"harry potter"~5 genre:books
: 
: Instead, since I don't have that knowledge, I should perhaps rewrite several
: guesses, and take the dismax.  These guesses are equivalent to passing the

right.  okay.  the brute force approach of trying all possible 
permutations is really the only thing you can do unless you can think 
of ways to translate the "intelligence" that you would use to rewrite 
the query into code.  One start: test each "word" against each field and 
see if the idf is unusually high, if it is then maybe it's a good idea to 
pull that word out of the phrase and use it to query that specific field 
... maybe you only do this on words at the beginning and end of the input?

the problem becomes a lot simpler when you write code specific to your 
domain .. if you know you are dealing with "products" and you have a 
"type" field that only ever contains 1 of 50 values which frequently 
appear in search input (ie: books, couch, dvd) then testing that field
first makes a lot of sense ... the problem becomes much harder when you 
want it to work on any generic index under the sun without knowing 
anything about the user behavior.
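
a rough sketch of the domain-specific case, assuming a small closed vocabulary for the "type" field and a "title" field for everything else. the field names, the vocabulary, and the class itself are made up for illustration, and nothing here calls Lucene; it only checks the first and last word of the input against the known type values and emits query-parser syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical rewriter for an index with "title" and "type" fields.
class TypeFieldRewriter {

    // Assumed closed vocabulary of the "type" field (illustrative values only).
    static final Set<String> KNOWN_TYPES =
            new HashSet<>(Arrays.asList("books", "couch", "dvd"));

    static String rewrite(String input, int slop) {
        String trimmed = input.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return "";
        }
        List<String> words = new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        String type = null;
        // Only look at the ends of the input, and never consume the only word.
        if (words.size() > 1) {
            if (KNOWN_TYPES.contains(words.get(words.size() - 1))) {
                type = words.remove(words.size() - 1);
            } else if (KNOWN_TYPES.contains(words.get(0))) {
                type = words.remove(0);
            }
        }
        // Remaining words become a single term or a sloppy phrase on "title".
        String title = words.size() == 1
                ? "title:" + words.get(0)
                : "title:\"" + String.join(" ", words) + "\"~" + slop;
        return type == null ? title : title + " type:" + type;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("harry potter books", 5));
    }
}
```

with this vocabulary, "harry potter books" becomes title:"harry potter"~5 type:books, and input with no recognizable type word falls back to a plain sloppy phrase on the title field.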

: This is rather slow.  In the before/after, the numbers are in seconds,
: for one query, before and after this transformation has been made.

oh, well yeah ... no surprise there.  you can't compare benchmarks between 
two queries that do completely different things -- a "simple" query is 
probably always going to be faster than a more complex query that matches a 
different set of documents, or does a "better" job of scoring the same 
set as the simple query.  it's an apples and oranges thing.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Generalized proximity query performance

Posted by Kyle Maxwell <fi...@gmail.com>.
>
> Hmmm.. can you give some more concrete examples of what you mean by this?
> both in terms of the use case you are trying to satisfy, and in terms of
> how your current code works ... you don't have to post code or give away
> trade secrets, just describe it as a black box (ie: what is the input?,
> how do you know when to use fieldA vs fieldC, how do you decide when to
> make a span query vs an OR query?)


I have a title field, and a genre field.  A user enters the query:
harry potter books

If I could intelligently rewrite queries, this would be better formulated
as:
title:"harry potter"~5 genre:books


Instead, since I don't have that knowledge, I should perhaps rewrite several
guesses, and take the dismax.  These guesses are equivalent to passing the
following query through the MultiFieldQueryParser:

("harry potter"~5 AND books) OR (harry AND "potter books"~5)
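
As a sketch of how those guesses could be generated, assuming that a "guess" means cutting the input into two contiguous parts and turning any multi-word part into a sloppy phrase: the class and method names below are hypothetical, and the code only builds the query string that would then be handed to the parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical generator for the two-way split guesses.
class SplitGuesses {

    // One word stays a bare term; several words become a sloppy phrase.
    static String clause(List<String> words, int slop) {
        return words.size() == 1
                ? words.get(0)
                : "\"" + String.join(" ", words) + "\"~" + slop;
    }

    // One guess per cut point, longest leading phrase first.
    static List<String> guesses(List<String> terms, int slop) {
        List<String> out = new ArrayList<>();
        for (int cut = terms.size() - 1; cut >= 1; cut--) {
            String left = clause(terms.subList(0, cut), slop);
            String right = clause(terms.subList(cut, terms.size()), slop);
            out.add("(" + left + " AND " + right + ")");
        }
        return out;
    }

    static String build(List<String> terms, int slop) {
        return String.join(" OR ", guesses(terms, slop));
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("harry", "potter", "books"), 5));
    }
}
```

For the three-word input this reproduces the query above. A five-word input gives four guesses, and each one is expanded again across every field by the parser.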

This is rather slow.  In the before/after, the numbers are in seconds,
for one query, before and after this transformation has been made.


Hope that clears things up

-Kyle

Re: Generalized proximity query performance

Posted by Mike Klaas <mi...@gmail.com>.
On 5-Oct-07, at 11:27 AM, Chris Hostetter wrote:

>
> that's what i thought first too, and it is a problem i'd eventually like
> to tackle ... it was the part about "c" being in a different field from
> "a" and "b" that confused me ... i don't know exactly what is being
> suggested here.

I'm thinking of the dismax model: you still want each keyword to
match (though possibly in different fields).  I don't really think
that is appropriate to throw into a single query class.
Having separate match/boost clauses is better.
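
A sketch of what I mean by separate clauses, written as a query-parser string built in plain Java. The field names, the class, and the choice of adjacent-pair phrases for the boost part are assumptions for illustration. Each term gets one required clause that may match in any field, and the proximity clauses are optional, so they only affect the score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical builder: required per-term match clauses plus
// optional per-field proximity clauses used only for scoring.
class MatchBoostQuery {

    static String build(List<String> terms, List<String> fields, int slop) {
        List<String> clauses = new ArrayList<>();
        // Match part: each term is required, in at least one of the fields.
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            clauses.add("+(" + String.join(" ", perField) + ")");
        }
        // Boost part: optional sloppy phrases over adjacent term pairs.
        for (String field : fields) {
            for (int i = 0; i + 1 < terms.size(); i++) {
                clauses.add(field + ":\"" + terms.get(i) + " "
                        + terms.get(i + 1) + "\"~" + slop);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build(
                Arrays.asList("harry", "potter", "books"),
                Arrays.asList("title", "genre"), 5));
    }
}
```

Because the optional clauses cannot change which documents match, they can be tuned or dropped without affecting recall.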

-Mike



Re: Generalized proximity query performance

Posted by Chris Hostetter <ho...@fucit.org>.
: > : would like to allow for the possibility that a and b are near each other in
: > : one field, while c is in another field.

: I understand the OP to want a PhraseQuery that has an intention (rather than
: side-effect) of doing proximity-based scoring.
: 
: "phrase query here"~1000 is the current hack that performs fine for N < 3
: query terms, but fails currently for N >= 3 since it requires that all the
: terms be present.  For larger queries, this effectively nullifies the
: usefulness of the phrase query approach.

that's what i thought first too, and it is a problem i'd eventually like 
to tackle ... it was the part about "c" being in a different field from 
"a" and "b" that confused me ... i don't know exactly what is being 
suggested here.




-Hoss




Re: Generalized proximity query performance

Posted by Mike Klaas <mi...@gmail.com>.
On 5-Oct-07, at 10:54 AM, Chris Hostetter wrote:

> : I am using a hand rolled query of the following form (implemented with
> : SpanNearQuery, not a sloppy PhraseQuery):
> : a b c => +(a AND b AND c) OR "a b"~5 OR "b c"~5
> :
> : The obvious solution, "a b c"~5, is not applicable for my issues, because I
> : would like to allow for the possibility that a and b are near each other in
> : one field, while c is in another field.
>
> Hmmm.. can you give some more concrete examples of what you mean by this?
> both in terms of the use case you are trying to satisfy, and in terms of
> how your current code works ... you don't have to post code or give away
> trade secrets, just describe it as a black box (ie: what is the input?,
> how do you know when to use fieldA vs fieldC, how do you decide when to
> make a span query vs an OR query?)
>
> based on what you've described so far, it's hard to understand what it is
> you are doing -- which is important to understand how to help you make it
> better/faster.

I understand the OP to want a PhraseQuery that has an intention  
(rather than side-effect) of doing proximity-based scoring.

"phrase query here"~1000 is the current hack that performs fine for N  
< 3 query terms, but fails currently for N >= 3 since it requires  
that all the terms be present.  For larger queries, this effectively  
nullifies the usefulness of the phrase query approach.

It doesn't seem to me that writing a variant of PhraseQuery that has  
the desired functionality would be _too_ hard, but I haven't looked  
into it in depth.

-Mike





Re: Generalized proximity query performance

Posted by Chris Hostetter <ho...@fucit.org>.
: I am using a hand rolled query of the following form (implemented with
: SpanNearQuery, not a sloppy PhraseQuery):
: a b c => +(a AND b AND c) OR "a b"~5 OR "b c"~5
: 
: The obvious solution, "a b c"~5, is not applicable for my issues, because I
: would like to allow for the possibility that a and b are near each other in
: one field, while c is in another field.

Hmmm.. can you give some more concrete examples of what you mean by this?  
both in terms of the use case you are trying to satisfy, and in terms of 
how your current code works ... you don't have to post code or give away 
trade secrets, just describe it as a black box (ie: what is the input?, 
how do you know when to use fieldA vs fieldC, how do you decide when to 
make a span query vs an OR query?)

based on what you've described so far, it's hard to understand what it is 
you are doing -- which is important to understand how to help you make it 
better/faster.

: Somewhat arbitrary benchmarks.

they do seem fairly arbitrary, especially since there are no units on the 
numbers, and no indication as to what "before" and "after" refer to...


: --------------
: Before:
: $ ./bench.rb "paris hilton"
: 0.022000   0.000000   0.022000 (  0.021000)
: $ ./bench.rb "paris hilton goes to jail"
: 0.024000   0.000000   0.024000 (  0.024000)
: 
: After:
: $> ./bench.rb "paris hilton"
: 0.103000   0.000000   0.103000 (  0.103000)
: $> ./bench.rb "paris hilton goes to jail"
: 1.514000   0.000000   1.514000 (  1.513000)




-Hoss

