You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@lucenenet.apache.org by Alex Davidson <al...@bluewire-technologies.com> on 2010/03/03 13:08:59 UTC

How to do prefix/phrase matching with term-length-sensitive scoring?

Hi,

We are using Lucene.NET 2.9.1 to index names/addresses/identification
numbers for various classes of person entity. The index is used to
populate a 'live search' poplist control in a web page, as the user
types. The total number of entries to index exceeds a million and the
poplist only displays the top 8, which makes match ranking difficult
since the user expects something useful from as few as three characters
(eg. two from surname, plus one initial).

The index is updated in real-time by adding update requests to a task
queue, to prevent indexing errors from rolling back database
transactions, and a transaction log, to ensure that pending index
updates are not forgotten if the app pool is killed/recycled. There is
only one IndexReader and IndexWriter per index. The Reader is reopened
periodically via the writer.

The Question:

Given a list of prefixes, what is the simplest way to match them against
a text field, giving preference to shorter term matches?
 * It is required that short prefixes are included usefully in the
query, so simply generating TermQuerys is not viable (see below).
 * Term frequency within the field must be ignored when scoring.
 * Documents and fields are sometimes boosted at index time; norms are
present.
 * The query generator returns only a Query object; it cannot affect the
Filters passed to the Searcher.
 * It would be preferable, though not required, if matching/scoring
based on term position (a la PhraseQuery) was available as an option.



Previous Implementation:

In a previous iteration (in Lucene.NET 2.0), with a single index
directly available to the query generator, we did the following:
 * For each prefix p with length pL:
   * Look up all relevant terms via an IndexReader,
   * Create a list of TermQuerys, weighting each term t (length tL)
according to:
			pL/(tL + tL*(tL - pL))
   * Combine TermQuery list with Occur.SHOULD
 * Combine BooleanQuery list with Occur.MUST

PrefixQuery seemed to do similar stuff, but our implementation also had
some logic to prefer filters for very short prefixes with lots of terms.

Current Attempt:

Architectural changes mean that this query generation code no longer has
access to an IndexReader and can only use a Searcher (to allow use of
MultiSearchers, etc). Since the Searcher does not expose a Reader, we've
gone back to using PrefixQuery and applied SCORING_BOOLEAN_QUERY_REWRITE
to get the term length weightings.
This is as good as the old implementation accuracy-wise, except that it
doesn't cope well with huge numbers of matching terms for a short
prefix, which is quite a common case.



Is there a way of achieving this within Lucene's existing Query and
scoring implementations, or will I have to roll my own? If the latter,
should I be looking at deriving from Query implementations or scorers?


Thanks!



RE: How to do prefix/phrase matching with term-length-sensitive scoring?

Posted by Alex Davidson <al...@bluewire-technologies.com>.
On Wed, 2010-03-03 at 20:32 +0200, Digy wrote: 
> > Architectural changes mean that this query generation code no longer has
> > access to an IndexReader and can only use a Searcher
> 
> Why? IndexSearcher has a method "GetIndexReader" (or possibly, I couldn't
> understand the problem)
> 
> DIGY

IndexSearcher does. Searcher does not, since some implementations (eg.
MultiSearcher) may use multiple IndexReaders.




RE: How to do prefix/phrase matching with term-length-sensitive scoring?

Posted by Digy <di...@gmail.com>.
> Architectural changes mean that this query generation code no longer has
access to an IndexReader and can only use a Searcher
Why? IndexSearcher has a method "GetIndexReader" (or possibly, I couldn't
understand the problem)

DIGY

-----Original Message-----
From: Alex Davidson [mailto:alex.davidson@bluewire-technologies.com] 
Sent: Wednesday, March 03, 2010 2:09 PM
To: lucene-net-user@lucene.apache.org
Subject: How to do prefix/phrase matching with term-length-sensitive
scoring?

Hi,

We are using Lucene.NET 2.9.1 to index names/addresses/identification
numbers for various classes of person entity. The index is used to
populate a 'live search' poplist control in a web page, as the user
types. The total number of entries to index exceeds a million and the
poplist only displays the top 8, which makes match ranking difficult
since the user expects something useful from as few as three characters
(eg. two from surname, plus one initial).

The index is updated in real-time by adding update requests to a task
queue, to prevent indexing errors from rolling back database
transactions, and a transaction log, to ensure that pending index
updates are not forgotten if the app pool is killed/recycled. There is
only one IndexReader and IndexWriter per index. The Reader is reopened
periodically via the writer.

The Question:

Given a list of prefixes, what is the simplest way to match them against
a text field, giving preference to shorter term matches?
 * It is required that short prefixes are included usefully in the
query, so simply generating TermQuerys is not viable (see below).
 * Term frequency within the field must be ignored when scoring.
 * Documents and fields are sometimes boosted at index time; norms are
present.
 * The query generator returns only a Query object; it cannot affect the
Filters passed to the Searcher.
 * It would be preferable, though not required, if matching/scoring
based on term position (a la PhraseQuery) was available as an option.



Previous Implementation:

In a previous iteration (in Lucene.NET 2.0), with a single index
directly available to the query generator, we did the following:
 * For each prefix p with length pL:
   * Look up all relevant terms via an IndexReader,
   * Create a list of TermQuerys, weighting each term t (length tL)
according to:
			pL/(tL + tL*(tL - pL))
   * Combine TermQuery list with Occur.SHOULD
 * Combine BooleanQuery list with Occur.MUST

PrefixQuery seemed to do similar stuff, but our implementation also had
some logic to prefer filters for very short prefixes with lots of terms.

Current Attempt:

Architectural changes mean that this query generation code no longer has
access to an IndexReader and can only use a Searcher (to allow use of
MultiSearchers, etc). Since the Searcher does not expose a Reader, we've
gone back to using PrefixQuery and applied SCORING_BOOLEAN_QUERY_REWRITE
to get the term length weightings.
This is as good as the old implementation accuracy-wise, except that it
doesn't cope well with huge numbers of matching terms for a short
prefix, which is quite a common case.



Is there a way of achieving this within Lucene's existing Query and
scoring implementations, or will I have to roll my own? If the latter,
should I be looking at deriving from Query implementations or scorers?


Thanks!



RE: How to do prefix/phrase matching with term-length-sensitive scoring?

Posted by Alex Davidson <al...@bluewire-technologies.com>.
Mike,

Ah, excellent! I've been experimenting with that approach and it's good
to know that it will work. My implementation stores the precomputed
weighting rather than term length, but I suppose it'd be much easier to
change the weighting formula in future if it was handled in the scorer
at query time.

It looks like this generalises nicely to other fields in our indexes.
Searching for whole words is rare in this application; prefixes are
usually all the user wants to provide. Since our revamped architecture
allows declaration of analysers on a per-field basis, I shall see about
applying this elsewhere too.

Thanks!


On Thu, 2010-03-04 at 11:19 -0800, Michael Garski wrote:
> Alex - 
> 
> I've done similar things - here's some feedback that I hope is of use.
> 
> Prefixes: a PrefixQuery is essentially a wildcard query which can explode the number of clauses in the query.  For the purposes of a type-ahead that can add up fast.  You may want to try analyzing the text a little differently to limit the number of matching terms.
> 	
> 	Example : "John Adams Doe" tokenizes to 'j', 'jo', 'joh', 'john', 'john a', etc
> 
> This will reduce the number of term clauses in your query to one, which will speed it up, and you can combine the above tokenization approach with payloads that contain data such as the original term character count to allow you prefer matches that are exact so that for a query for 'john' scores higher on matches documents that contain 'john' over 'johnny'.  Another approach would be to use CustomScoreQuery where you have access to the reader when the documents are being scored.
> 
> The payload or custom score approach will remove your need to directly access the reader, which cannot be accessed through Searcher or Searchable.
> 
> Hope that helps!
> 
> Michael
> 
> -----Original Message-----
> From: Alex Davidson [mailto:alex.davidson@bluewire-technologies.com] 
> Sent: Wednesday, March 03, 2010 4:09 AM
> To: lucene-net-user@lucene.apache.org
> Subject: How to do prefix/phrase matching with term-length-sensitive scoring?
> 
> Hi,
> 
> We are using Lucene.NET 2.9.1 to index names/addresses/identification
> numbers for various classes of person entity. The index is used to
> populate a 'live search' poplist control in a web page, as the user
> types. The total number of entries to index exceeds a million and the
> poplist only displays the top 8, which makes match ranking difficult
> since the user expects something useful from as few as three characters
> (eg. two from surname, plus one initial).
> 
> The index is updated in real-time by adding update requests to a task
> queue, to prevent indexing errors from rolling back database
> transactions, and a transaction log, to ensure that pending index
> updates are not forgotten if the app pool is killed/recycled. There is
> only one IndexReader and IndexWriter per index. The Reader is reopened
> periodically via the writer.
> 
> The Question:
> 
> Given a list of prefixes, what is the simplest way to match them against
> a text field, giving preference to shorter term matches?
>  * It is required that short prefixes are included usefully in the
> query, so simply generating TermQuerys is not viable (see below).
>  * Term frequency within the field must be ignored when scoring.
>  * Documents and fields are sometimes boosted at index time; norms are
> present.
>  * The query generator returns only a Query object; it cannot affect the
> Filters passed to the Searcher.
>  * It would be preferable, though not required, if matching/scoring
> based on term position (a la PhraseQuery) was available as an option.
> 
> 
> 
> Previous Implementation:
> 
> In a previous iteration (in Lucene.NET 2.0), with a single index
> directly available to the query generator, we did the following:
>  * For each prefix p with length pL:
>    * Look up all relevant terms via an IndexReader,
>    * Create a list of TermQuerys, weighting each term t (length tL)
> according to:
> 			pL/(tL + tL*(tL - pL))
>    * Combine TermQuery list with Occur.SHOULD
>  * Combine BooleanQuery list with Occur.MUST
> 
> PrefixQuery seemed to do similar stuff, but our implementation also had
> some logic to prefer filters for very short prefixes with lots of terms.
> 
> Current Attempt:
> 
> Architectural changes mean that this query generation code no longer has
> access to an IndexReader and can only use a Searcher (to allow use of
> MultiSearchers, etc). Since the Searcher does not expose a Reader, we've
> gone back to using PrefixQuery and applied SCORING_BOOLEAN_QUERY_REWRITE
> to get the term length weightings.
> This is as good as the old implementation accuracy-wise, except that it
> doesn't cope well with huge numbers of matching terms for a short
> prefix, which is quite a common case.
> 
> 
> 
> Is there a way of achieving this within Lucene's existing Query and
> scoring implementations, or will I have to roll my own? If the latter,
> should I be looking at deriving from Query implementations or scorers?
> 
> 
> Thanks!
> 
> 
> 


RE: How to do prefix/phrase matching with term-length-sensitive scoring?

Posted by Michael Garski <mg...@myspace-inc.com>.
Alex - 

I've done similar things - here's some feedback that I hope is of use.

Prefixes: a PrefixQuery is essentially a wildcard query which can explode the number of clauses in the query.  For the purposes of a type-ahead that can add up fast.  You may want to try analyzing the text a little differently to limit the number of matching terms.
	
	Example : "John Adams Doe" tokenizes to 'j', 'jo', 'joh', 'john', 'john a', etc

This will reduce the number of term clauses in your query to one, which will speed it up, and you can combine the above tokenization approach with payloads that contain data such as the original term character count to allow you prefer matches that are exact so that for a query for 'john' scores higher on matches documents that contain 'john' over 'johnny'.  Another approach would be to use CustomScoreQuery where you have access to the reader when the documents are being scored.

The payload or custom score approach will remove your need to directly access the reader, which cannot be accessed through Searcher or Searchable.

Hope that helps!

Michael

-----Original Message-----
From: Alex Davidson [mailto:alex.davidson@bluewire-technologies.com] 
Sent: Wednesday, March 03, 2010 4:09 AM
To: lucene-net-user@lucene.apache.org
Subject: How to do prefix/phrase matching with term-length-sensitive scoring?

Hi,

We are using Lucene.NET 2.9.1 to index names/addresses/identification
numbers for various classes of person entity. The index is used to
populate a 'live search' poplist control in a web page, as the user
types. The total number of entries to index exceeds a million and the
poplist only displays the top 8, which makes match ranking difficult
since the user expects something useful from as few as three characters
(eg. two from surname, plus one initial).

The index is updated in real-time by adding update requests to a task
queue, to prevent indexing errors from rolling back database
transactions, and a transaction log, to ensure that pending index
updates are not forgotten if the app pool is killed/recycled. There is
only one IndexReader and IndexWriter per index. The Reader is reopened
periodically via the writer.

The Question:

Given a list of prefixes, what is the simplest way to match them against
a text field, giving preference to shorter term matches?
 * It is required that short prefixes are included usefully in the
query, so simply generating TermQuerys is not viable (see below).
 * Term frequency within the field must be ignored when scoring.
 * Documents and fields are sometimes boosted at index time; norms are
present.
 * The query generator returns only a Query object; it cannot affect the
Filters passed to the Searcher.
 * It would be preferable, though not required, if matching/scoring
based on term position (a la PhraseQuery) was available as an option.



Previous Implementation:

In a previous iteration (in Lucene.NET 2.0), with a single index
directly available to the query generator, we did the following:
 * For each prefix p with length pL:
   * Look up all relevant terms via an IndexReader,
   * Create a list of TermQuerys, weighting each term t (length tL)
according to:
			pL/(tL + tL*(tL - pL))
   * Combine TermQuery list with Occur.SHOULD
 * Combine BooleanQuery list with Occur.MUST

PrefixQuery seemed to do similar stuff, but our implementation also had
some logic to prefer filters for very short prefixes with lots of terms.

Current Attempt:

Architectural changes mean that this query generation code no longer has
access to an IndexReader and can only use a Searcher (to allow use of
MultiSearchers, etc). Since the Searcher does not expose a Reader, we've
gone back to using PrefixQuery and applied SCORING_BOOLEAN_QUERY_REWRITE
to get the term length weightings.
This is as good as the old implementation accuracy-wise, except that it
doesn't cope well with huge numbers of matching terms for a short
prefix, which is quite a common case.



Is there a way of achieving this within Lucene's existing Query and
scoring implementations, or will I have to roll my own? If the latter,
should I be looking at deriving from Query implementations or scorers?


Thanks!