Posted to java-user@lucene.apache.org by Ivan Provalov <ip...@yahoo.com> on 2010/05/20 22:16:44 UTC

Stemming and Wildcard Queries

Is there a good way to combine wildcard queries and stemming?

As is, a field that is stemmed at index time won't work with some wildcard queries.

We were thinking of creating two separate index fields, one stemmed and one non-stemmed, but we are having issues with our SpanNear queries (they require all clauses to be on the same field).

We also considered combining the stemmed and non-stemmed terms in the same field, but we are concerned that this would skew the statistics (especially the TermVector stats).  Can overloading the non-stemmed field with stemmed terms cause any issues with the TermVector?

Any suggestions?

Ivan Provalov


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Stemming and Wildcard Queries

Posted by Erick Erickson <er...@gmail.com>.

Another approach to stemming at index time while still providing exact matches
when requested is to index the stemmed version AND the original version at
the same position (think synonyms). But here's the trick: index the original
token with a special character appended. For instance, indexing "running" would
produce the tokens "run" and "running$". Now, whenever you want the exact match,
just add the "$" to the end of the query term.

With this approach, you have to watch that your analyzers don't strip the
'$'...
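
A minimal sketch of the token stacking described above, in plain Java with no
Lucene dependency. Everything here is an assumption made for illustration: the
names ExactMarkerSketch, Tok, toyStem and expand are hypothetical, and toyStem
is a crude stand-in for a real stemmer. In an actual analyzer the same effect
would presumably come from a token filter that emits the marked original with a
position increment of 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExactMarkerSketch {
    /** A term plus its position increment (0 = same position as the previous term). */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "(+" + posInc + ")"; }
    }

    /** Toy stemmer for illustration only; NOT a real stemming algorithm. */
    static String toyStem(String w) {
        if (w.endsWith("ning") && w.length() > 5) return w.substring(0, w.length() - 4);
        if (w.endsWith("ing") && w.length() > 4) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    /** For each word, emit the stem, then the original + "$" stacked at the same position. */
    static List<Tok> expand(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (String w : words) {
            out.add(new Tok(toyStem(w), 1)); // stemmed form advances the position
            out.add(new Tok(w + "$", 0));    // marked original shares that position
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("running", "shoes")));
    }
}
```

Because both forms sit at the same position in one field, position-based
queries should see the same word offsets whichever form they match. A
trailing-wildcard query such as runn* would still match the marked original,
since "running$" starts with "runn".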

Of course, each approach has its trade-offs, and the characteristics of your
particular problem may determine which is preferable...

FWIW
Erick

On Thu, May 20, 2010 at 4:48 PM, Herbert Roitblat <he...@orcatec.com> wrote:

> At a general level, we have found that stemming during indexing is not
> advisable.  Sometimes users want the exact form and if you have removed the
> exact form during indexing, obviously, you cannot provide that.  Rather, we
> have found that stemming during search is more useful, or maybe it should be
> called anti-stemming.  For any given input for which the user wants to stem,
> we could derive the variations during the query processing.  E.g., plan can
> be expanded to include plans, planning, planned, etc.
>
> In our application we provide a feature that is sometimes called a word
> wheel.  When someone enters plan in this tool, we show all of the words in
> the index that start with plan. Here are some of the related words:
> plan
> plane
> planes
> planet
> planificaci
> planned
> plannedoutages.xls
> planner
> planners
>
> Just a thought.
> Herb

Re: Stemming and Wildcard Queries

Posted by Ivan Provalov <ip...@yahoo.com>.

Thanks, everyone!

Re: Stemming and Wildcard Queries

Posted by Herbert Roitblat <he...@orcatec.com>.

At a general level, we have found that stemming during indexing is not
advisable.  Sometimes users want the exact form, and if you have removed the
exact form during indexing, obviously, you cannot provide that.  Rather, we
have found that stemming during search is more useful, or maybe it should be
called anti-stemming.  For any input term that the user wants stemmed, we can
derive the variations during query processing.  E.g., "plan" can be expanded
to include "plans", "planning", "planned", etc.
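
The query-time expansion described above could be sketched roughly as follows,
in plain Java with no Lucene dependency. The suffix list, the sample index
terms, and the names QueryTimeExpansionSketch and expand are all assumptions
made for illustration; a real system would need a proper morphological
generator rather than a fixed suffix table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class QueryTimeExpansionSketch {
    // Hypothetical suffix table; "ned"/"ning" cover consonant doubling (plan -> planned).
    static final String[] SUFFIXES = {"", "s", "ed", "ing", "ned", "ning"};

    /** Return the variants of 'word' that actually occur among the index terms. */
    static List<String> expand(String word, Set<String> indexTerms) {
        List<String> variants = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            String candidate = word + suffix;
            if (indexTerms.contains(candidate)) {
                variants.add(candidate);
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        // Assumed sample of unstemmed index terms.
        Set<String> indexTerms = new HashSet<>(Arrays.asList(
                "plan", "plans", "planned", "planning", "plane", "planet"));
        System.out.println(expand("plan", indexTerms));
    }
}
```

The resulting variants would then be OR-ed together in the query, while the
index itself keeps the exact forms. Unrelated terms such as "plane" and
"planet" are not picked up because no suffix in the table produces them.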

In our application we provide a feature that is sometimes called a word
wheel.  When someone enters "plan" in this tool, we show all of the words in
the index that start with "plan". Here are some of the related words:
plan
plane
planes
planet
planificaci
planned
plannedoutages.xls
planner
planners
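
A word wheel like the one above amounts to a prefix scan over a sorted term
list. Here is a minimal sketch in plain Java, using a TreeSet as a stand-in for
the index's sorted term dictionary; WordWheelSketch and startingWith are
hypothetical names, and the two non-matching terms ("place", "play") are added
only to show that they are excluded.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class WordWheelSketch {
    /**
     * All terms that start with 'prefix'. Relies on sorted order: such terms
     * fall in the range [prefix, prefix + '\uffff'), assuming no term has the
     * character U+FFFF immediately after the prefix.
     */
    static SortedSet<String> startingWith(TreeSet<String> terms, String prefix) {
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "place", "plan", "plane", "planes", "planet", "planificaci",
                "planned", "plannedoutages.xls", "planner", "planners", "play"));
        for (String t : startingWith(terms, "plan")) {
            System.out.println(t);
        }
    }
}
```

A range view over the sorted set avoids scanning every term, which is the same
reason a sorted term dictionary makes this kind of lookup cheap.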

Just a thought.
Herb

Re: Stemming and Wildcard Queries

Posted by Ahmet Arslan <io...@yahoo.com>.

> Is there a good way to combine the
> wildcard queries and stemming?  
> 
> As is, the field which is stemmed at index time, won't work
> with some wildcard queries.

org.apache.lucene.queryParser.analyzing.AnalyzingQueryParser may help?
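
The general idea of running the literal parts of a wildcard term through the
same analysis used at index time can be illustrated in plain Java. This is a
toy illustration only, not the implementation of the class named above:
WildcardAnalysisSketch, toyStem and analyzeWildcard are hypothetical names, and
toyStem is a crude stand-in for a real stemmer.

```java
class WildcardAnalysisSketch {
    /** Toy stemmer for illustration only; NOT a real stemming algorithm. */
    static String toyStem(String w) {
        if (w.endsWith("ning") && w.length() > 5) return w.substring(0, w.length() - 4);
        if (w.endsWith("ing") && w.length() > 4) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    /** Lowercase and stem each literal chunk, keeping '*' and '?' in place. */
    static String analyzeWildcard(String term) {
        StringBuilder out = new StringBuilder();
        StringBuilder chunk = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (c == '*' || c == '?') {
                out.append(toyStem(chunk.toString().toLowerCase())).append(c);
                chunk.setLength(0); // start collecting the next literal chunk
            } else {
                chunk.append(c);
            }
        }
        out.append(toyStem(chunk.toString().toLowerCase()));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(analyzeWildcard("Running*"));
        System.out.println(analyzeWildcard("runn*"));
    }
}
```

In this sketch a full word before the wildcard is reduced to its stem, but a
partial word such as "runn" passes through unchanged and would still miss the
stemmed term "run", so analysis of wildcard terms helps with only some queries.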


      
