Posted to general@lucene.apache.org by JBTech <jb...@gmail.com> on 2008/07/25 00:39:23 UTC

issues with wildcard search and snowball english analyzer

I am using SnowballAnalyzer (English).
I just created one document with one field with content as "elephant is a
big animal".
I searched for e*t using QueryParser.
This did not return any results.
I indexed with "lion is a big animal".
Then searched for l*t. This returned one result as expected.
I looked at the index using Luke and figured out that elephant has been
stemmed to eleph by the analyzer.
I reindexed "elephant is a big animal" and tried with e*p, this time I got
one hit.
I like stemming because it reduces tests, tested, testing, etc. to test.
Is there a way to avoid stemming in certain cases?
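
The mismatch described above can be reproduced without Lucene. The following is only a sketch, assuming that a wildcard pattern is compared against the terms as they are stored in the index (here the stem "eleph" reported by Luke) and not against the original word. The class and method names are hypothetical, and the pattern e*h is my own choice for illustration.

```java
import java.util.regex.Pattern;

// Illustration only: mimics comparing a wildcard pattern with an index term.
// This does not call Lucene; WildcardVsStem and its methods are hypothetical.
class WildcardVsStem {

    // Translate a wildcard pattern (* = any run of characters, ? = exactly one
    // character) into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True when the whole index term matches the wildcard pattern.
    static boolean matches(String wildcard, String indexedTerm) {
        return toRegex(wildcard).matcher(indexedTerm).matches();
    }

    public static void main(String[] args) {
        // The original word would match, but it is not what the index holds.
        System.out.println(matches("e*t", "elephant"));
        // The stemmed term that is actually in the index does not end in "t".
        System.out.println(matches("e*t", "eleph"));
        // A pattern that fits the stemmed form does match.
        System.out.println(matches("e*h", "eleph"));
    }
}
```

If this assumption holds, the wildcard query never sees the word "elephant" at all, only whatever the analyzer wrote to the index.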
-- 
View this message in context: http://www.nabble.com/issues-with-wildcard-search-and-snowball-english-analyzer-tp18641947p18641947.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: issues with wildcard search and snowball english analyzer

Posted by Andrew Gilmartin <an...@yahoo.com>.

--- On Fri, 7/25/08, JBTech <jb...@gmail.com> wrote:

> I tried with e*t and that did not return any results.

Hmm. Example code would be helpful now.

-- Andrew

Re: issues with wildcard search and snowball english analyzer

Posted by JBTech <jb...@gmail.com>.

Hi Andrew,
Thanks for your quick reply.
I tried with e*t and that did not return any results.
I am using Lucene 2.2.
The full word elephant returned one hit, as I am using the same analyzer for
indexing and searching.
I uploaded the java class I used for testing this.
Thanks
JB

Andrew Gilmartin-2 wrote:
> 
> --- On Thu, 7/24/08, JBTech <jb...@gmail.com> wrote:
> 
>> Is there a way to avoid stemming in certain cases?
> 
> As a general rule, make the query intelligent and not the index.
> Therefore, index your text verbatim. Small changes like changing terms to
> lowercase and removing possessives are fine. You now have an index upon
> which you can make intelligent queries.
> 
> An intelligent query requires keeping track of several collections of
> term-to-term(s) mappings. For example, stemmed-term to verbatim-term(s).
> Now, convert the user's search for "elephant is a big animal" into
> something akin to 
> 
> ( (elephant^10) OR (A) OR (B) ) AND
> ( (big^10) OR (C) ) AND
> ( (animal^10) OR (D) )
> 
> Where A and B are other terms with the same stemming as elephant, C is
> another term with the same stemming as big, and D is another term with
> the same stemming as animal. Adding the boost ensures that a verbatim
> match pushes the document's rank higher and so ensures that what the user
> asked for is closer to the top.
> 
> This basic idea of making the queries more intelligent by broadening them
> and boosting term weights gives you a lot of control over the query and
> how results are ranked. The same control is not possible by making the
> index more intelligent.
> 
> Don't worry about Lucene's performance with complex queries. My experience
> is that it is very fast.
> 
> And to answer your specific question: once the text is indexed verbatim, a search for "e*t" will work as is.
> 
> -- Andrew
> 
http://www.nabble.com/file/p18652365/Testing.java Testing.java 
-- 
View this message in context: http://www.nabble.com/issues-with-wildcard-search-and-snowball-english-analyzer-tp18641947p18652365.html

Re: issues with wildcard search and snowball english analyzer

Posted by Andrew Gilmartin <an...@yahoo.com>.

--- On Thu, 7/24/08, JBTech <jb...@gmail.com> wrote:

> Is there a way to avoid stemming in certain cases?

As a general rule, make the query intelligent and not the index. Therefore, index your text verbatim. Small changes like changing terms to lowercase and removing possessives are fine. You now have an index upon which you can make intelligent queries.
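
To make "small changes" concrete, here is a rough sketch of that kind of light normalization in plain Java: lowercase each token, trim surrounding punctuation, and drop a trailing possessive, with no stemming. The class name LightNormalizer and its splitting rules are assumptions for illustration, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a hypothetical, stemming-free normalizer.
class LightNormalizer {

    // Split on whitespace, lowercase, trim punctuation at both ends of each
    // token, and remove a trailing possessive "'s". Words are otherwise kept
    // verbatim, so "elephant" stays "elephant".
    static List<String> normalize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            String t = token.toLowerCase(Locale.ROOT);
            t = t.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (t.endsWith("'s")) {
                t = t.substring(0, t.length() - 2);
            }
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(normalize("The Elephant's trunk, is BIG."));
    }
}
```

Because the indexed terms stay close to what was written, a wildcard pattern is compared against whole words and not stems.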

An intelligent query requires keeping track of several collections of term-to-term(s) mappings. For example, stemmed-term to verbatim-term(s). Now, convert the user's search for "elephant is a big animal" into something akin to

( (elephant^10) OR (A) OR (B) ) AND
( (big^10) OR (C) ) AND
( (animal^10) OR (D) )

Where A and B are other terms with the same stemming as elephant, C is another term with the same stemming as big, and D is another term with the same stemming as animal. Adding the boost ensures that a verbatim match pushes the document's rank higher and so ensures that what the user asked for is closer to the top.
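
A minimal sketch of building such a query string from a term-to-variants map follows. The class BoostedQueryBuilder, the map contents ("elephants", "animals"), and the boost value are made up for illustration; in practice the map would come from your own stemmed-term to verbatim-term(s) collection, and the resulting string is only meant to resemble query parser syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a hypothetical helper, not a Lucene class.
class BoostedQueryBuilder {

    // For each user term, emit "(term^boost OR variant1 OR variant2 ...)",
    // then join the per-term clauses with AND. Only the verbatim term is boosted.
    static String expand(List<String> userTerms,
                         Map<String, List<String>> sameStem,
                         int boost) {
        List<String> clauses = new ArrayList<>();
        for (String term : userTerms) {
            StringBuilder clause = new StringBuilder("(");
            clause.append(term).append('^').append(boost);
            List<String> variants = sameStem.getOrDefault(term, Collections.<String>emptyList());
            for (String variant : variants) {
                if (!variant.equals(term)) {
                    clause.append(" OR ").append(variant);
                }
            }
            clauses.add(clause.append(')').toString());
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        // Hypothetical mapping from a verbatim term to other verbatim terms
        // that share its stem.
        Map<String, List<String>> sameStem = new HashMap<>();
        sameStem.put("elephant", Arrays.asList("elephants"));
        sameStem.put("animal", Arrays.asList("animals"));

        System.out.println(expand(Arrays.asList("elephant", "big", "animal"), sameStem, 10));
    }
}
```

Keeping the expansion in the query leaves the index verbatim, so the same index can serve both exact and broadened searches.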

This basic idea of making the queries more intelligent by broadening them and boosting term weights gives you a lot of control over the query and how results are ranked. The same control is not possible by making the index more intelligent.

Don't worry about Lucene's performance with complex queries. My experience is that it is very fast.

And to answer your specific question: once the text is indexed verbatim, a search for "e*t" will work as is.

-- Andrew