Posted to java-user@lucene.apache.org by Joel Halbert <jo...@su3analytics.com> on 2009/10/30 10:49:01 UTC

scoring adjacent terms without proximity search

Hi,

Without using a proximity search (e.g. "cheese sandwich"~5), what's the
best way of up-scoring results in which the search terms are closer to
each other?

E.g. so if I search for:
content:cheese  content:sandwich

How do you ensure that a document with content:
"Toasted Cheese Sandwich"
scores higher than:
"Cheese and Potato, Tuna sandwich"

Joel


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: scoring adjacent terms without proximity search

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Joel,

You could index every possible word combination in your document text, with one field for each possible distance.  (You would have to write an index-time analyzer to do this, since AFAIK nothing like this exists currently.  Shingles wouldn't work, since you want to ignore intervening terms when you query.)

E.g. doc#1: For "Toasted Cheese Sandwich", you would index "Toasted Cheese" and "Cheese Sandwich" in the "d0" field, and "Toasted Sandwich" in the "d1" field.

E.g. doc#2: For "Cheese and Potato, Tuna sandwich", you would index d0:"Cheese and", d1:"Cheese Potato", d2:"Cheese Tuna", d3:"Cheese sandwich", d0:"and Potato", d1:"and Tuna", d2:"and sandwich", d0:"Potato Tuna", d1:"Potato sandwich", and d0:"Tuna sandwich".
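
A minimal sketch of the pair generation described above, written as plain Java rather than as a real Lucene analyzer. The class name DistancePairs, the lowercasing, and the split on non-alphanumeric characters are all assumptions made for illustration; the distance is the number of intervening tokens, as in the two examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: enumerates ordered word pairs per distance.
class DistancePairs {
    /** Lowercases and splits on runs of non-letter/digit characters (an assumed, simplistic analysis). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Maps a field name "d<distance>" to the word pairs with that many intervening tokens. */
    static Map<String, List<String>> pairsByDistance(String text) {
        List<String> tokens = tokenize(text);
        Map<String, List<String>> fields = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            for (int j = i + 1; j < tokens.size(); j++) {
                int distance = j - i - 1; // adjacent words have distance 0
                fields.computeIfAbsent("d" + distance, k -> new ArrayList<>())
                      .add(tokens.get(i) + " " + tokens.get(j));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(pairsByDistance("Toasted Cheese Sandwich"));
        System.out.println(pairsByDistance("Cheese and Potato, Tuna sandwich"));
    }
}
```

In a real index each pair would presumably be emitted as a single token into its distance field; that wiring is left out here.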

To query, search against all possible distance fields, up-boosting those fields that are closer to zero distance.  E.g. to search for "cheese sandwich", knowing that the maximum distance is 3, you would search for:
 
   d0:"cheese sandwich"^4 d1:"cheese sandwich"^3
   d2:"cheese sandwich"^2 d3:"cheese sandwich"^1

(To disregard order, you would want to either index the reverse of everything or query for the reverse term.)

Doc #1 would get a hit in the "d0" field, while doc #2 would get a hit in the "d3" field, and since you boosted the "d0" field higher in your query than the "d3" field, doc #1's score would be higher.  (If you don't want document length to be a factor in the score, you should turn off norms on these fields.)
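
The boosted query above can be generated mechanically. The sketch below only builds the query string; the class name DistanceQuery is hypothetical, and the linear boost (maximum distance + 1 - distance) is simply read off the ^4 ... ^1 values in the example. The phrase is assumed to be already analyzed and free of characters that need escaping.

```java
// Hypothetical helper, not part of Lucene: builds the per-distance boosted query string.
class DistanceQuery {
    /** Returns d0:"phrase"^(max+1) d1:"phrase"^max ... d<max>:"phrase"^1. */
    static String build(String phrase, int maxDistance) {
        StringBuilder sb = new StringBuilder();
        for (int d = 0; d <= maxDistance; d++) {
            if (d > 0) sb.append(' ');
            int boost = maxDistance + 1 - d; // closer distances get larger boosts
            sb.append('d').append(d)
              .append(":\"").append(phrase).append("\"^").append(boost);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("cheese sandwich", 3));
    }
}
```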

If you want to extend this scheme to more than two words, you could use the sum of the distances between terms to name the fields.  But: holy exploding index size, Batman.  Especially if you index all possible term permutations.

If you only care about terms within X distance, your analyzer could limit the terms in that way.  This would also reduce index size.
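
To put rough numbers on the index growth: a document of n tokens yields n(n-1)/2 ordered pairs when every distance is indexed, but only about (X+1)*n pairs when the distance is capped at X. The following sketch counts the pairs directly; the class name PairCount is hypothetical, and reversed pairs are not counted (indexing them would double the totals).

```java
// Hypothetical helper: counts the word pairs the distance-field scheme would index.
class PairCount {
    /** Number of ordered pairs (i < j) among n tokens with at most maxDistance intervening tokens. */
    static long count(int n, int maxDistance) {
        long pairs = 0;
        for (int i = 0; i < n; i++) {
            // Token i pairs with the next (maxDistance + 1) tokens, or fewer near the end.
            pairs += Math.min(maxDistance + 1L, n - 1L - i);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(count(1000, 999)); // every distance indexed
        System.out.println(count(1000, 3));   // distances 0..3 only
    }
}
```

For the five-token doc #2 above, count(5, 3) gives the ten pairs listed in the example.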

To avoid using a whole bunch of fields, one per distance, you could add the distance to the text of each indexed and queried term, e.g. instead of indexing d0:"Toasted Cheese", you could index "Toasted Cheese d0", etc.  Then your queries would look like:

   "cheese sandwich d0"^4 "cheese sandwich d1"^3
   "cheese sandwich d2"^2 "cheese sandwich d3"^1

Steve



Re: scoring adjacent terms without proximity search

Posted by liat oren <or...@gmail.com>.
Hi Joel,

I've run into the same problem.
Could you please elaborate a bit on this?

Many thanks,
Liat

2009/11/2 Joel Halbert <jo...@su3analytics.com>

> I opted to use the following query to solve this problem, since it meets
> my requirements, for the time being.
>
> +(cheese sandwich) "cheese sandwich"~slop
>
> This includes documents with one or more of the terms, but prefers those
> with an edit distance <= the slop.

Re: scoring adjacent terms without proximity search

Posted by Joel Halbert <jo...@su3analytics.com>.
I opted to use the following query to solve this problem, since it meets
my requirements, for the time being.

+(cheese sandwich) "cheese sandwich"~slop

This includes documents with one or more of the terms, but prefers those
with an edit distance <= the slop.
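
A small sketch of how that query string could be assembled for arbitrary terms. The class name PreferNearQuery is hypothetical; the terms are assumed to be already analyzed and escaped, and the query parser's default operator is assumed to be OR, so the required clause matches documents containing any of the terms. The sloppyFreq method is not Lucene code; it only mirrors the 1 / (distance + 1) factor that, as far as I know, the classic DefaultSimilarity applies to sloppy phrase matches, which is why closer matches score higher within the slop.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene.
class PreferNearQuery {
    /** Builds +(t1 t2 ...) "t1 t2 ..."~slop: any term required, sloppy phrase optional. */
    static String build(List<String> terms, int slop) {
        String joined = String.join(" ", terms);
        return "+(" + joined + ") \"" + joined + "\"~" + slop;
    }

    /** Illustration of the assumed sloppy-phrase factor: smaller distance, larger contribution. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("cheese", "sandwich"), 5));
        System.out.println(sloppyFreq(0) + " vs " + sloppyFreq(3));
    }
}
```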




Re: scoring adjacent terms without proximity search

Posted by Joel Halbert <jo...@su3analytics.com>.
Thank you all for your suggestions. I shall have a little think about
the best way forward, and report back if I do anything interesting that
works well.

In answer to Grant's question (why not use PhraseQuery?): we do not want
to have an artificial upper limit on the slop; that is, we do want to
include documents that might only have a subset of the words from the
phrase (e.g. just cheese, or just sandwich, but not both).




Re: scoring adjacent terms without proximity search

Posted by Robert Muir <rc...@gmail.com>.
> I suppose you could precompute the proximity associations by indexing
> n-grams (in this case, Lucene calls them shingles), such that there
> is a single token in your index containing cheese_sandwich (effectively)
>
>
doh, I see Grant already led you in this direction. (sorry for the
duplicate mail)
on average it's worked for me for some things like this.

although, I'll try to contribute something actually useful, and mention that
if you use things like shingles, it's good to consider modifying
DefaultSimilarity: look at the setDiscountOverlaps param.
otherwise, I've measured cases where injecting additional tokens will cause
more harm than good, because it has an adverse effect on lengthNorm.
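
To illustrate the lengthNorm effect, here is a sketch that is not Lucene code. It assumes the classic length normalization of 1 / sqrt(number of terms in the field), and it assumes that injected tokens such as shingles are "overlap" tokens (position increment 0), which is what discounting overlaps would exclude from the count. The class and parameter names are hypothetical.

```java
// Hypothetical illustration of length normalization with and without overlap discounting.
class LengthNormSketch {
    /** Assumed classic formula: 1 / sqrt(counted terms). */
    static double lengthNorm(int totalTokens, int overlapTokens, boolean discountOverlaps) {
        int counted = discountOverlaps ? totalTokens - overlapTokens : totalTokens;
        if (counted <= 0) throw new IllegalArgumentException("no tokens to count");
        return 1.0 / Math.sqrt(counted);
    }

    public static void main(String[] args) {
        // "toasted cheese sandwich": 3 unigrams plus 2 injected bigram shingles.
        System.out.println(lengthNorm(5, 2, false)); // shingles lengthen the field
        System.out.println(lengthNorm(5, 2, true));  // shingles ignored for the norm
    }
}
```

Under these assumptions the same three-word text gets a lower norm when the shingles are counted, so it looks like a longer document than it is.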

-- 
Robert Muir
rcmuir@gmail.com

Re: scoring adjacent terms without proximity search

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 30, 2009, at 5:49 AM, Joel Halbert wrote:

> Hi,
>
> Without using a proximity search i.e. "cheese sandwich"~5
>
> What's the best way of up-scoring results in which the search terms  
> are
> closer to each other?

I'm not aware of any query technique to score based on proximity that  
doesn't, itself, use proximity information.

I suppose you could precompute the proximity associations by indexing
n-grams (in this case, Lucene calls them shingles), such that
there is a single token in your index containing cheese_sandwich
(effectively).

BTW, what's your concern about using a Phrase Query?  What requirement
do you have that would prevent that particular query?  Or is there
something in the way it is implemented that doesn't work for your
needs (assuming your example here is for discussion purposes)?


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/



Re: scoring adjacent terms without proximity search

Posted by Robert Muir <rc...@gmail.com>.
yet another thing to look into that might improve things a bit is using
ShingleFilter in contrib.

this way cheese sandwich would form a shingle of "cheese sandwich" and would
get a higher score for the "Toasted Cheese Sandwich" document.

it wouldn't solve the proximity problem in general, but maybe it would help,
depending on your requirements.
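
A sketch of what two-word shingling produces for the two example documents, written as plain Java rather than with ShingleFilter itself. The class name BigramShingles and the single-space separator are assumptions, and the input is assumed to be already tokenized and lowercased.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the Lucene ShingleFilter: joins each pair of adjacent tokens.
class BigramShingles {
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(List.of("toasted", "cheese", "sandwich")));
        System.out.println(shingles(List.of("cheese", "and", "potato", "tuna", "sandwich")));
    }
}
```

Only the first document yields the shingle "cheese sandwich", so only it gets the extra match. Terms that are near each other but not adjacent produce no shared shingle, which is why this does not solve the proximity problem in general.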
