Posted to java-user@lucene.apache.org by Avi Rosenschein <ar...@gmail.com> on 2010/02/24 22:38:35 UTC

Re: boosts for unstemmed matches (was Re: If you could have one feature in Lucene...)

On Wed, Feb 24, 2010 at 11:20 PM, Aaron Lav <as...@pobox.com> wrote:

> On Wed, Feb 24, 2010 at 10:18:27PM +0200, Avi Rosenschein wrote:
> > On Wed, Feb 24, 2010 at 3:42 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> > > What would it be?
> > >
> >
> > For scoring to take into account the non-analyzed token stream.
> >
> > That is, if a field is analyzed (stemmed, lowercased, maybe even stop words
> > removed), that is fine for indexing. But tokens in the query matching the
> > original form could still get a higher score than those that only match when
> > analyzed.
>
> You can get some of that effect by indexing stemmed and unstemmed
> forms, and letting IDF boost unstemmed results.  (I picked this
> idea up from http://lingpipe-blog.com/2007/03/21/to-stem-or-not-to-stem/)
>

This is not quite the same (either in relevance or efficiency). I would like
the infrastructure for this to be built into Lucene, so that queries and
scorers could take advantage of it.
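
For anyone following along, here is a rough sketch of the workaround quoted
above (index both forms and let IDF favor the unstemmed one). It is plain
Python rather than Lucene code, and every name in it (toy_stem,
DualFieldIndex, the "exact" and "stemmed" field names) is made up for
illustration. The stemmer is a crude suffix stripper and the scoring is just
a sum of IDF weights, so treat it as a picture of the idea, not of what
Lucene actually computes.

```python
import math
from collections import defaultdict


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


class DualFieldIndex:
    """Toy inverted index keeping an exact and a stemmed field per document."""

    def __init__(self):
        # postings[field][term] -> set of document ids
        self.postings = {"exact": defaultdict(set), "stemmed": defaultdict(set)}
        self.num_docs = 0

    def add(self, text):
        doc_id = self.num_docs
        self.num_docs += 1
        for token in text.lower().split():
            self.postings["exact"][token].add(doc_id)
            self.postings["stemmed"][toy_stem(token)].add(doc_id)
        return doc_id

    def idf(self, field, term):
        # Assumed IDF shape: rarer terms get a larger weight.
        df = len(self.postings[field].get(term, ()))
        return 1.0 + math.log(self.num_docs / (df + 1.0))

    def search(self, query):
        scores = defaultdict(float)
        for token in query.lower().split():
            # Each query token is looked up in both fields.
            for field, term in (("exact", token), ("stemmed", toy_stem(token))):
                weight = self.idf(field, term)
                for doc_id in self.postings[field].get(term, ()):
                    scores[doc_id] += weight
        # Highest score first; ties broken by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = DualFieldIndex()
for text in ("cat sat", "cats sat", "dog sat"):
    index.add(text)
ranking = [doc_id for doc_id, _ in index.search("cats")]
```

The exact form is usually rarer than its stem, so its IDF is higher and the
document containing the literal query token tends to rank first. Note that
the boost is a side effect of term statistics rather than something the
scorer knows about, and the index carries every token twice.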


> > Also, this would maybe allow a flexible, run-time, decision of what
> > analyzers to include. For example, I might want stemming turned on for
> > normal search, but not for a PhraseQuery.
>
> That's harder - different field names for the different analyses might
> work, but not for run-time decisions.  I think the way Sun's Minion does
> it is morphologically-based query expansion (see
> http://blogs.sun.com/searchguy/entry/lightweight_morphology_vs_stemming),
> and you might be able to implement that via query rewriting.
>

Again, rather than forcing me to store a separate field for every possible
type of query I might want to build, Lucene should be able to efficiently
store the original information in a form conducive to using at query time.
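
To make the query-rewriting suggestion quoted above concrete, here is a
small sketch in plain Python. It assumes only the original (unstemmed)
tokens are indexed and that a query term is expanded into its variants at
query time, with the form the user typed weighted higher. The VARIANTS table
and the rewrite and score functions are hypothetical, and I am not claiming
this is how Minion does it.

```python
# Hypothetical variant table; a real system would derive this from
# morphological rules rather than a hand-written dictionary.
VARIANTS = {
    "run": ["runs", "running", "ran"],
    "runs": ["run", "running", "ran"],
}


def rewrite(term, expand=True, exact_boost=2.0, variant_boost=1.0):
    """Rewrite one query term into a dict of {term: weight}."""
    weighted = {term: exact_boost}
    if expand:
        for variant in VARIANTS.get(term, []):
            # setdefault keeps the higher weight of the form the user typed.
            weighted.setdefault(variant, variant_boost)
    return weighted


def score(doc_text, weighted_terms):
    """Sum the weights of all document tokens that match the rewritten query."""
    tokens = doc_text.lower().split()
    return sum(weighted_terms.get(token, 0.0) for token in tokens)


normal = rewrite("run")                # expansion on, e.g. for normal search
strict = rewrite("run", expand=False)  # expansion off, e.g. inside a phrase

exact_hit = score("a morning run", normal)
variant_hit = score("he runs daily", normal)
strict_miss = score("he runs daily", strict)
```

Here the choice to expand is made per query rather than per field, which is
the run-time flexibility I was asking about. The cost moves to query time,
and it still depends on having a good source of variants.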