Posted to java-user@lucene.apache.org by Eric Jain <Er...@isb-sib.ch> on 2006/01/11 00:09:31 UTC

Generating phrase queries from term queries

Is there an efficient way to determine if two or more terms frequently
appear next to each other in sequence? For a query like:

a b c

one or more of the following suggestions could be generated:

"a b c"
"a b" c
a "b c"

I could of course just run a search with all possible combinations, but 
perhaps there is a better way?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Generating phrase queries from term queries

Posted by Chris Hostetter <ho...@fucit.org>.

: > (Assuming *I* understand it) what he's talking about is the ability for
: > his search GUI to display suggested phrase searches you may want to try
: > which consist of the words you just typed in grouped into phrases.
:
: Yes, that's precisely what I am talking about. Sorry for being unclear.

I would start with the most straightforward approach: try executing
queries where each permutation of the input is grouped into phrases.
Start with the phrases that include the fewest words; that way, if
"A B" isn't a valid phrase, you know you can skip "A B C".  Design
your API so that you can specify a "max time to look", and once that time
has passed, give up on looking for more phrases.

Once you've got that, you might be able to optimize the order in which you
test phrases by looking at the TermFreq for each of the individual words,
and trying the phrases that include the most common term first (and the
second most common term second, etc...)
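
A minimal Java sketch of the approach described above, under stated assumptions. `phraseExists` and `docFreq` are hypothetical hooks, not Lucene APIs; in practice they would be backed by a phrase query and a document-frequency lookup against the index. The class and method names are invented for illustration. Candidates are grown one word at a time, so a longer phrase is tested only if its leading part matched, and testing stops once the time budget is used up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.ToIntFunction;

/** Sketch only: grows candidate phrases word by word, extending only phrases that matched. */
class PhraseCandidates {

    /**
     * @param words        the query words as typed
     * @param phraseExists hypothetical hook: true if the phrase matches at least one document
     * @param docFreq      hypothetical hook: number of documents containing a word
     * @param maxMillis    time budget; the search stops once it is exceeded
     */
    static List<String> findPhrases(List<String> words, Predicate<String> phraseExists,
                                    ToIntFunction<String> docFreq, long maxMillis) {
        long deadline = System.currentTimeMillis() + maxMillis;

        // Word positions, most common word first, so phrases starting with it are tested early.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            order.add(i);
        }
        order.sort((x, y) -> Integer.compare(
                docFreq.applyAsInt(words.get(y)), docFreq.applyAsInt(words.get(x))));

        // Level 1: every single word position; single words are not tested as phrases.
        List<List<Integer>> level = new ArrayList<>();
        for (int i : order) {
            level.add(Collections.singletonList(i));
        }

        List<String> found = new ArrayList<>();
        Set<String> tested = new HashSet<>();
        while (!level.isEmpty()) {
            List<List<Integer>> next = new ArrayList<>();
            for (List<Integer> base : level) {
                for (int i : order) {
                    if (base.contains(i)) {
                        continue; // each typed word is used at most once per phrase
                    }
                    if (System.currentTimeMillis() > deadline) {
                        return found; // out of time: return what has been found so far
                    }
                    List<Integer> candidate = new ArrayList<>(base);
                    candidate.add(i);
                    String phrase = render(words, candidate);
                    if (!tested.add(phrase)) {
                        continue; // same text already tested (repeated query word)
                    }
                    if (phraseExists.test(phrase)) {
                        found.add(phrase);
                        next.add(candidate); // only matching phrases are extended further
                    }
                }
            }
            level = next;
        }
        return found;
    }

    private static String render(List<String> words, List<Integer> positions) {
        StringBuilder sb = new StringBuilder();
        for (int p : positions) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(words.get(p));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> indexed = Collections.singleton("Erik Hatcher"); // stand-in for the index
        List<String> words = java.util.Arrays.asList("Lucene", "Erik", "Otis", "Hatcher");
        System.out.println(findPhrases(words, indexed::contains, w -> 1, 1000L));
    }
}
```

With the stand-in index in `main`, only "Erik Hatcher" is reported, and no three-word candidate starting with a failed pair is ever tested. The frequency ordering is a rough reading of the "most common term first" idea: it controls which word starts the first candidates, nothing more.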


-Hoss



Re: Generating phrase queries from term queries

Posted by Eric Jain <Er...@isb-sib.ch>.

Chris Hostetter wrote:
> (Assuming *I* understand it) what he's talking about is the ability for
> his search GUI to display suggested phrase searches you may want to try
> which consist of the words you just typed in grouped into phrases.

Yes, that's precisely what I am talking about. Sorry for being unclear.

> Presumably, if multiple phrases in the source data can be found in the
> permutations of the search words, the least common are the ones you'd want
> to suggest -- which makes the problem a sort of SIP problem (i.e., given an
> extremely limited set of words, find the statistically improbable phrases
> in the corpus made using only subsets of those words)

I'd already be happy to get *any* phrases :-)

If the phrases could be ranked, I might prefer to pick the *most frequent* 
phrases. For example:

   anopheles anopheles malaria

("anopheles anopheles" is the latin name for the common mosquito)

I'd like to be able to suggest quoting this name to eliminate all the other 
mosquito species that also contain "anopheles" in their name.

There are lots of documents with "anopheles anopheles". There may also be a 
document or two where "anopheles" happens to appear next to "malaria", but 
these are less interesting here.
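
A small Java sketch of the ranking idea above, under stated assumptions: the hit counts are invented for illustration, and `rank` and `render` are hypothetical helper names, not Lucene calls. It sorts the candidate phrases by descending frequency and renders the top one as a suggestion, with the phrase quoted and the remaining query words left unquoted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: pick the most frequent phrase and render it as a suggested query. */
class SuggestionRanker {

    /** Returns the phrases sorted by descending hit count. */
    static List<String> rank(Map<String, Integer> phraseCounts) {
        List<String> phrases = new ArrayList<>(phraseCounts.keySet());
        phrases.sort((p, q) -> Integer.compare(phraseCounts.get(q), phraseCounts.get(p)));
        return phrases;
    }

    /** Quotes the phrase and appends the query words it does not use. */
    static String render(List<String> words, String phrase) {
        List<String> rest = new ArrayList<>(words);
        for (String w : phrase.split(" ")) {
            rest.remove(w); // removes one occurrence only, so repeated words are handled
        }
        StringBuilder sb = new StringBuilder("\"").append(phrase).append('"');
        for (String w : rest) {
            sb.append(' ').append(w);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("anopheles anopheles", 500); // invented count
        counts.put("anopheles malaria", 2);     // invented count
        List<String> words = Arrays.asList("anopheles", "anopheles", "malaria");
        System.out.println(render(words, rank(counts).get(0)));
    }
}
```

With these made-up counts, the suggestion rendered is `"anopheles anopheles" malaria`.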


Re: Generating phrase queries from term queries

Posted by Chris Hostetter <ho...@fucit.org>.

: If you can express each phrase as a SpanNearQuery, the occurrences
: of the phrases can be easily obtained by iterating over the result of
: getSpans() on SpanNearQuery.
: It's not as efficient as a specialized PhraseQuery, though.

I think you are misunderstanding his goal.

(Assuming *I* understand it) what he's talking about is the ability for
his search GUI to display suggested phrase searches you may want to try
which consist of the words you just typed in grouped into phrases.

For example, if a user types in...
	Lucene Erik Otis

...he will do a Boolean OR search on those words, and not suggest any
phrases. If the user types in...
	Lucene Erik Otis Hatcher

...he will again do a Boolean search on the individual words, but he
wants to be able to suggest the more restrictive search consisting of a
phrase he found with the words "Erik" and "Hatcher"
	"Erik Hatcher" Otis Lucene

...but he doesn't want to suggest searching on phrases like "Erik Otis" or
"Otis Lucene" unless those phrases actually appear in some documents.

Presumably, if multiple phrases in the source data can be found in the
permutations of the search words, the least common are the ones you'd want
to suggest -- which makes the problem a sort of SIP problem (i.e., given an
extremely limited set of words, find the statistically improbable phrases
in the corpus made using only subsets of those words)




-Hoss



Re: Generating phrase queries from term queries

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 11 January 2006 11:33, Eric Jain wrote:
> Paul Elschot wrote:
> > One way that might be better is to provide your own Scorer
> > that works on the term positions of the three or more terms.
> > This would be better for performance because it only uses one
> > term positions object per query term (a, b, and c here).
> 
> I'm trying to extract the actual phrases, rather than scoring documents 
> with terms that appear in the same order higher (though that would seem 
> like a good idea, too).
> 
> The idea is that once I have the phrases, I can suggest something like 
> "show only matches where a and b appear next to each other". Not terribly 
> important, but if there was a simple and efficient way to accomplish this...

If you can express each phrase as a SpanNearQuery, the occurrences
of the phrases can be easily obtained by iterating over the result of
getSpans() on SpanNearQuery.
It's not as efficient as a specialized PhraseQuery, though.

Regards,
Paul Elschot.


Re: Generating phrase queries from term queries

Posted by Eric Jain <Er...@isb-sib.ch>.

Paul Elschot wrote:
> One way that might be better is to provide your own Scorer
> that works on the term positions of the three or more terms.
> This would be better for performance because it only uses one
> term positions object per query term (a, b, and c here).

I'm trying to extract the actual phrases, rather than scoring documents 
with terms that appear in the same order higher (though that would seem 
like a good idea, too).

The idea is that once I have the phrases, I can suggest something like 
"show only matches where a and b appear next to each other". Not terribly 
important, but if there was a simple and efficient way to accomplish this...


Re: Generating phrase queries from term queries

Posted by Andrzej Bialecki <ab...@getopt.org>.

Paul Elschot wrote:

>On Wednesday 11 January 2006 00:09, Eric Jain wrote:
>
>>Is there an efficient way to determine if two or more terms frequently
>>appear next to each other in sequence? For a query like:
>>
>>a b c
>>
>>one or more of the following suggestions could be generated:
>>
>>"a b c"
>>"a b" c
>>a "b c"
>>
>>I could of course just run a search with all possible combinations, but
>>perhaps there is a better way?
>
>One way that might be better is to provide your own Scorer
>that works on the term positions of the three or more terms.
>This would be better for performance because it only uses one
>term positions object per query term (a, b, and c here).
>
>For two terms, Nutch has something very similar that works
>over multiple fields. Have a look in the archives for the thread
>"Lucene performance bottlenecks" that started on 2 Dec 2005.
>I don't know how Nutch handles more than two query terms.

Not there yet... ;-) Currently Nutch handles this as simple
BooleanQueries (using the default scorer) and optionally PhraseQueries.
We haven't worked on a custom scorer yet.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Generating phrase queries from term queries

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 11 January 2006 00:09, Eric Jain wrote:
> Is there an efficient way to determine if two or more terms frequently
> appear next to each other in sequence? For a query like:
> 
> a b c
> 
> one or more of the following suggestions could be generated:
> 
> "a b c"
> "a b" c
> a "b c"
> 
> I could of course just run a search with all possible combinations, but 
> perhaps there is a better way?

One way that might be better is to provide your own Scorer
that works on the term positions of the three or more terms.
This would be better for performance because it only uses one
term positions object per query term (a, b, and c here).
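
To illustrate working directly on term positions, here is a minimal plain-Java sketch. It is not a Lucene Scorer and uses no Lucene classes; the position arrays and the docId-to-positions maps are assumed stand-ins for what an index's term positions would supply, and all names are invented. Each term's sorted positions are walked once to count the places where the second term directly follows the first.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: counts adjacent occurrences of two terms from their sorted position lists. */
class AdjacencyCounter {

    /** Counts positions p where the first term is at p and the second term is at p + 1. */
    static int countAdjacent(int[] firstPositions, int[] secondPositions) {
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < firstPositions.length && j < secondPositions.length) {
            int wanted = firstPositions[i] + 1;
            if (secondPositions[j] < wanted) {
                j++; // second term is too early to follow this occurrence
            } else {
                if (secondPositions[j] == wanted) {
                    count++;
                }
                i++; // move on to the next occurrence of the first term
            }
        }
        return count;
    }

    /** Number of documents in which the pair is adjacent at least once (docId -> positions). */
    static int docsWithAdjacentPair(Map<Integer, int[]> first, Map<Integer, int[]> second) {
        int docs = 0;
        for (Map.Entry<Integer, int[]> e : first.entrySet()) {
            int[] other = second.get(e.getKey());
            if (other != null && countAdjacent(e.getValue(), other) > 0) {
                docs++;
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> a = new HashMap<>();
        Map<Integer, int[]> b = new HashMap<>();
        a.put(1, new int[] {0, 5}); // invented positions of term a in document 1
        a.put(2, new int[] {3});
        b.put(1, new int[] {1, 9}); // invented positions of term b in document 1
        b.put(3, new int[] {0});
        System.out.println(docsWithAdjacentPair(a, b));
    }
}
```

For a query a b c, the same position lists could be reused for the pairs (a, b) and (b, c), which is the point of reading each term's positions only once.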

For two terms, Nutch has something very similar that works
over multiple fields. Have a look in the archives for the thread
"Lucene performance bottlenecks" that started on 2 Dec 2005.
I don't know how Nutch handles more than two query terms.

Regards,
Paul Elschot


Re: Generating phrase queries from term queries

Posted by Yonik Seeley <ys...@gmail.com>.

A phrase query with slop scores matching documents higher when the
terms are closer together.

"a b c"~10000

-Yonik

On 1/10/06, Eric Jain <Er...@isb-sib.ch> wrote:
> Is there an efficient way to determine if two or more terms frequently
> appear next to each other in sequence? For a query like:
>
> a b c
>
> one or more of the following suggestions could be generated:
>
> "a b c"
> "a b" c
> a "b c"
>
> I could of course just run a search with all possible combinations, but
> perhaps there is a better way?
