Posted to java-user@lucene.apache.org by Abhi <ab...@gmail.com> on 2009/06/05 13:31:15 UTC

Phrase search

Say I have indexed the following strings:

1. "cool gaming laptop"
2. "cool gaming lappy"
3. "gaming laptop cool"

Now when I search with a query such as "cool gaming computer", I want strings
1 and 2 to appear on top (where the search terms are closer to each other),
followed by 3.

I can use a Term query for each word, but the problem is that word proximity
does not come into the picture: all 3 documents get the same score. The
behaviour I want is that documents in which "cool", "gaming" and "computer"
(any of which may or may not be present in the indexed document) appear as
close to each other as possible get a higher score.

I can use a Phrase query so that the proximity of the search terms affects
scoring, but then I do not get any results, because the term "computer" is
not present in any of the indexed documents.

Is there a way to achieve the above?
-- 
Cheers,
Abhi

Re: Phrase search

Posted by Savvas-Andreas Moysidis <sa...@googlemail.com>.
Hello,

You could use a PhraseQuery with the terms "cool", "gaming" and "computer",
and set the slop factor to whatever you reckon is right. You could then assign
a boost to this query only, which will make its matches bubble up the list.
I don't think you can get away without specifying a slop factor, though (as
in the proximity scenario you mention).

Regards,
Savvas


2009/6/11 Daniel Noll <da...@nuix.com>

> I would rewrite it to this:
>
> cool gaming computer "cool gaming" "gaming computer" "cool gaming computer"
>
> Naively assuming a score of 1.0 for each hit, you would get something
> like...
>  1. "cool gaming laptop"    => 3 (cool, gaming, "cool gaming")
>  2. "cool gaming lappy"     => 3 (cool, gaming, "cool gaming")
>  3. "gaming laptop cool"    => 2 (cool, gaming)
>
> And of course if it actually finds "cool gaming computer" it would get 6.
>
> Daniel

Re: Phrase search

Posted by Daniel Noll <da...@nuix.com>.
I would rewrite it to this:

cool gaming computer "cool gaming" "gaming computer" "cool gaming computer"
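
For what it's worth, the rewrite above can be generated mechanically: emit
every single term, then a quoted phrase for every contiguous run of two or
more terms. Below is a minimal sketch in plain Java. The class and method
names (PhraseExpansion, expand) are made up for illustration and are not part
of Lucene. It assumes the terms are separated by whitespace, that nothing in
the input needs query-syntax escaping, and that the resulting string is parsed
with optional (OR) clauses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: expands a whitespace-separated
// query into its single terms plus a quoted phrase for every contiguous
// run of two or more terms.
class PhraseExpansion {

    static String expand(String query) {
        String[] terms = query.trim().split("\\s+");
        List<String> clauses = new ArrayList<>();

        // Single-term clauses first, so a document can match on any one term.
        clauses.addAll(Arrays.asList(terms));

        // Then quoted phrases, shortest first: all runs of 2 terms, then 3, ...
        for (int len = 2; len <= terms.length; len++) {
            for (int start = 0; start + len <= terms.length; start++) {
                String[] run = Arrays.copyOfRange(terms, start, start + len);
                clauses.add("\"" + String.join(" ", run) + "\"");
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // Prints the expanded query string for the example in this thread.
        System.out.println(expand("cool gaming computer"));
    }
}
```

The number of phrase clauses grows quadratically with the number of terms, so
for long queries it may be worth capping the phrase length.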

Naively assuming a score of 1.0 for each hit, you would get something like...
 1. "cool gaming laptop"    => 3 (cool, gaming, "cool gaming")
 2. "cool gaming lappy"     => 3 (cool, gaming, "cool gaming")
 3. "gaming laptop cool"    => 2 (cool, gaming)

And of course if it actually finds "cool gaming computer" it would get 6.
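
The arithmetic above can be checked with a toy model that adds 1 for every
clause found in a document. This is only a sketch of the naive assumption,
not of Lucene's real scoring, which weighs clauses differently. The names
(NaiveClauseScore, containsPhrase, score) are hypothetical, and documents are
assumed to be tokenized on whitespace only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: every matching clause counts 1, as in the naive assumption
// in this thread. Real Lucene scores are not computed this way.
class NaiveClauseScore {

    // True if the words of `clause` occur consecutively and in order in `doc`.
    // A single term is treated as a phrase of length one.
    static boolean containsPhrase(String doc, String clause) {
        List<String> docWords = Arrays.asList(doc.trim().split("\\s+"));
        List<String> clauseWords = Arrays.asList(clause.trim().split("\\s+"));
        return Collections.indexOfSubList(docWords, clauseWords) >= 0;
    }

    // Number of clauses that match the document.
    static int score(String doc, List<String> clauses) {
        int total = 0;
        for (String clause : clauses) {
            if (containsPhrase(doc, clause)) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // The six clauses of the rewritten query, without the quote marks.
        List<String> clauses = Arrays.asList(
                "cool", "gaming", "computer",
                "cool gaming", "gaming computer", "cool gaming computer");

        List<String> docs = Arrays.asList(
                "cool gaming laptop", "cool gaming lappy", "gaming laptop cool");

        for (String doc : docs) {
            System.out.println("\"" + doc + "\" => " + score(doc, clauses));
        }
    }
}
```

Under this model documents 1 and 2 score 3 and document 3 scores 2, which
gives the ordering asked for in the original question.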

Daniel


-- 
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software
