Posted to dev@lucene.apache.org by Joaquin Delgado <jo...@triplehop.com> on 2005/02/01 00:33:32 UTC

SPANQUERY for phrase proximity search

Is there any proposal to add a proper NEAR (proximity) operator to the
default query language that can handle phrase proximity, implemented as
SpanNearQuery?

With all the conversations about density queries and searching for
"concepts" that appear in different fields, it just seems logical to
treat exact phrases as single terms when users explicitly decide to
use quotes along with unquoted terms.
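
To make the idea concrete, here is a minimal sketch of a NEAR check in
which a quoted phrase counts as a single unit. It is not Lucene's
SpanNearQuery; the function names (phrase_spans, near) and the slop
definition (positions inside the window that no unit covers) are
assumptions chosen for illustration.

```python
from itertools import product


def phrase_spans(tokens, phrase):
    """Return (start, end) spans where the words of `phrase` occur consecutively."""
    n = len(phrase)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]


def near(tokens, units, slop):
    """True if one occurrence of every unit fits in a window that has at most
    `slop` positions not covered by the units (order does not matter)."""
    span_lists = [phrase_spans(tokens, unit) for unit in units]
    for combo in product(*span_lists):
        ordered = sorted(combo)
        # Skip combinations whose spans overlap each other.
        if any(a[1] > b[0] for a, b in zip(ordered, ordered[1:])):
            continue
        width = ordered[-1][1] - ordered[0][0]
        covered = sum(end - start for start, end in combo)
        if width - covered <= slop:
            return True
    return False


doc = "the apache lucene project ships a fast full text search engine".split()

# "apache lucene" NEAR "search engine": each quoted phrase is one unit.
query = [["apache", "lucene"], ["search", "engine"]]
print(near(doc, query, slop=6))
print(near(doc, query, slop=5))
```

A single unquoted term is just a one-word unit, so quoted and unquoted
terms can be mixed in the same NEAR clause.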

J.D.

-----Original Message-----
From: Chuck Williams [mailto:chuck@manawiz.com] 
Sent: Monday, January 31, 2005 6:20 PM
To: Lucene Developers List
Subject: RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
problems with Similarity.docFreq() ?

Doug Cutting wrote:
  > What did you think of my DensityPhraseQuery proposal?

It is a step in the direction of what I have in mind, but I'd like to go
further.  How about a query class with these properties:
  1.  Inputs are:
      a.  F = list of fields
      b.  B = list of field boosts (1:1 correspondence with F)
      c.  T = list of terms or phrases, each either optional or required
      d.  P = proximity-sloping window
  2.  Generate matches that contain every required T in some F, and if
there are no required T's, then at least one optional T in some F.
  3.  Score matches based on these considerations:
      a.  Normal TermQuery and PhraseQuery scores for individual matches
in individual fields.
      b.  Boost scores for proximity of TermQuery and PhraseQuery
matches in individual fields, based on some function of P (term
proximity).
      c.  Boost scores based on number of optional T's matched in at
least one F (term diversity).
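
A rough sketch of the properties listed above, assuming documents are
plain dicts of field name to token list. The names (positions, matches,
score), the occurrence-count stand-in for the TermQuery/PhraseQuery
score, and the particular proximity and diversity bonuses are all
hypothetical; the proposal only says "some function of P".

```python
def positions(tokens, unit):
    """Start positions where `unit` (a tuple of words) occurs in `tokens`."""
    n = len(unit)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == unit]


def found(doc, fields, unit):
    """True if `unit` occurs in at least one of the fields."""
    return any(positions(doc.get(f, []), unit) for f in fields)


def matches(doc, fields, required, optional):
    """Property 2: every required T in some F; if there are no required
    T's, at least one optional T in some F."""
    if required:
        return all(found(doc, fields, u) for u in required)
    return any(found(doc, fields, u) for u in optional)


def score(doc, fields, boosts, required, optional, window):
    if not matches(doc, fields, required, optional):
        return 0.0
    units = list(required) + list(optional)
    total = 0.0
    for field, boost in zip(fields, boosts):
        tokens = doc.get(field, [])
        hits = {u: positions(tokens, u) for u in units}
        # 3a (stand-in): boosted occurrence count per field.
        base = boost * sum(len(p) for p in hits.values())
        # 3b (assumed): one bonus per pair of distinct units within `window`.
        present = [u for u in units if hits[u]]
        close = sum(
            1
            for i, a in enumerate(present)
            for b in present[i + 1:]
            if min(abs(x - y) for x in hits[a] for y in hits[b]) <= window
        )
        total += base + boost * close
    # 3c (assumed): one bonus per optional unit matched in at least one field.
    diversity = sum(1 for u in optional if found(doc, fields, u))
    return total + diversity


doc = {
    "title": "lucene scoring".split(),
    "body": "scoring in lucene uses idf and field boosts".split(),
}
print(score(doc, ["title", "body"], [2.0, 1.0],
            required=[("lucene",)], optional=[("scoring",), ("idf",)],
            window=2))
```

Phrases are handled by passing multi-word tuples as T entries, so terms
and phrases go through the same matching and proximity code.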

I think that meets all the objectives of my earlier posts.  I'd like to
have it, and would be happy to contribute it if it sounds like the right
thing.

Is there a better way?

  > If field boosting needs to then trump idf, we should be able to deal
  > with that when we subsequently tune field boosting, no?  We can, e.g.,
  > square the field boosts if we need.

Perhaps, but that seems to me to be a hack on top of a hack.  Current
literature seems to consistently not square idf -- I found one reference
that specifically says even Salton removed the squaring after he first
proposed it a long time ago.  The simpler solution is just to remove the
squaring.
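
A small numeric sketch of why the squaring matters for field boosts. The
idf form is the one I understand Lucene's DefaultSimilarity to use,
ln(numDocs / (docFreq + 1)) + 1; the weight function is a simplification
that leaves out tf, norms and queryNorm, and the collection size, boosts
and document frequencies are made-up values.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: ln(num_docs / (doc_freq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def weight(field_boost, doc_freq, num_docs, idf_power):
    """Simplified match weight: field boost times idf raised to `idf_power`."""
    return field_boost * idf(doc_freq, num_docs) ** idf_power


N = 1_000_000                                       # hypothetical collection size
title_common = dict(field_boost=4.0, doc_freq=100_000)  # common term in title
body_rare = dict(field_boost=1.0, doc_freq=100)         # rare term in body

for power in (1, 2):
    t = weight(num_docs=N, idf_power=power, **title_common)
    b = weight(num_docs=N, idf_power=power, **body_rare)
    winner = "title" if t > b else "body"
    print(f"idf^{power}: title={t:.2f} body={b:.2f} -> {winner} match ranks higher")
```

With these numbers the boosted title match wins when idf is used once
and loses when idf is squared, which is the effect described above.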

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:cutting@apache.org]
  > Sent: Monday, January 31, 2005 3:04 PM
  > To: Lucene Developers List
  > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
  > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
  > problems with Similarity.docFreq() ?
  > 
  > Chuck Williams wrote:
  > > That expansion is scalable, but it only accounts for proximity of all
  > > query terms together.  E.g., it does not favor a match where t1 and t2
  > > are close together while t3 is distant over a match where all 3 terms
  > > are distant.  Worse, it would not favor a match with t1 and t2 in a
  > > short title, and t2 and t3 proximal in the content (with no occurrence
  > > of t1 in the content) vs. a match with t1 and t2 in the title and t2
  > > and t3 distant in the content.
  > 
  > Right.  I just mentioned this same weakness in a message replying to
  > David.
  > 
  > >   > Is that distinct from my goal to develop an improved
  > >   > MultiFieldQueryParser for Lucene 2.0?
  > >
  > > Not distinct, but I think the first step is to decide on the expansion
  > > we want.  Unless somebody has a better idea, I think the best solution
  > > is a new Query class that simultaneously supports multiple fields,
  > > term diversity and term proximity.  It would be similar to SpansQuery,
  > > but generalized.  It would be like BooleanQuery in the sense that
  > > individual query clauses could be required or not.  Then, default AND
  > > could be achieved by expanding queries to all-required.
  > >
  > > With this new Query class, revised versions of QueryParser and
  > > MultiFieldQueryParser would generate it.
  > >
  > > Am I way off-base somewhere and/or is there a simpler approach to the
  > > same end?
  > 
  > It just sounds like a lot to bite off at once.
  > 
  > What did you think of my DensityPhraseQuery proposal?  We could use this
  > in place of a PhraseQuery w/ slop=infinity.  We'd need just one per
  > field.
  > 
  > The straight boolean clauses are required for two reasons:
  >    1. To make sure that every query term appears in some field; and
  >    2. To reward a term that occurs frequently in a field, but near no
  > other query terms.
  > 
  > > Sure, idf is important enough to evaluate independently as a factor.
  > > However, I do not think these considerations are orthogonal.  For
  > > example, I'm putting a lot of weight in field boosting and don't want
  > > the preference of title matches over body matches to be overwhelmed by
  > > the idf's.
  > 
  > If field boosting needs to then trump idf, we should be able to deal
  > with that when we subsequently tune field boosting, no?  We can, e.g.,
  > square the field boosts if we need.
  > 
  > Doug
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org




Re: SPANQUERY for phrase proximity search

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Joaquin,

Check this:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg05504.html

Otis


