
Posted to java-user@lucene.apache.org by Bayer Dennis <De...@cursor.de> on 2012/12/11 10:49:44 UTC

Stemming and Wildcard - or fire and water

Hello there,
my colleague and I ran into a case where a search did not return the number of results we expected. We discovered a mismatch between how terms are handled at index time and at search time. As we found out later, this issue has already been discussed several times on the internet, but from our point of view it is buggy behavior, at least when using a German stemmer.

TL;DR: a JUnit test case is available (http://pastebin.com/AdeFdW1k)

Setup:
* Lucene 4.0.0
* Use the GermanAnalyzer which internally uses a GermanStemmer

Issue:
* Index "Hersener", which has a common German ending -> the string is shortened to "hers"
* Search for "Hers" -> a result is found
* Search for "Hersen" -> a result is found because the input token is also stemmed to "hers"
* Search for "Hers*" -> a result is found
* Search for "Hersen*" -> nothing is found because the analyzer does not run on wildcard terms
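
The four cases above can be reproduced with a small toy model. This is only a sketch and not Lucene code: `toyStem` is a hypothetical stand-in that strips trailing "er"/"en" and is far simpler than the real GermanStemmer, and the "index" is a plain set of stemmed terms. It assumes that term queries are stemmed like indexed text, while prefix queries are only lowercased.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of the index/query mismatch; not Lucene code. */
final class StemWildcardMismatch {

    // Hypothetical stand-in for a stemmer: lowercases, then strips trailing
    // "er"/"en" as long as at least four characters remain afterwards.
    static String toyStem(String word) {
        String s = word.toLowerCase(Locale.ROOT);
        while (s.length() >= 6 && (s.endsWith("er") || s.endsWith("en"))) {
            s = s.substring(0, s.length() - 2);
        }
        return s;
    }

    // Term query: the query text is stemmed the same way as the indexed text.
    static boolean termQuery(Set<String> index, String text) {
        return index.contains(toyStem(text));
    }

    // Prefix query: the prefix is only lowercased, never stemmed.
    static boolean prefixQuery(Set<String> index, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : index) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>();
        index.add(toyStem("Hersener")); // stored as "hers"

        System.out.println("Hers    -> " + termQuery(index, "Hers"));     // true
        System.out.println("Hersen  -> " + termQuery(index, "Hersen"));   // true
        System.out.println("Hers*   -> " + prefixQuery(index, "Hers"));   // true
        System.out.println("Hersen* -> " + prefixQuery(index, "Hersen")); // false
    }
}
```

The last query fails because the unstemmed prefix "hersen" is longer than the stored term "hers", so no indexed term can start with it.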

Similar examples can easily be constructed when umlauts are involved.

Conclusion:
A search query that contains a wildcard should also be run through the analyzer, because otherwise many such queries return nothing. The Lucene FAQ already has a topic related to this issue: http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F

The FAQ example with "dog" and "dogs" works only as long as a single character is stemmed, which may be true for the majority of English words. But if more characters are stripped, Lucene returns nothing at all instead of returning a few additional items. Just consider "families", which is stemmed to "famili": searching for "familie*" would not return any item.

To find an ending for this initial post ;) :
Could this behavior be made configurable in the standard? If not:
a) Why are stemmers used by default if they can lead to wrong results?
b) What can be done manually to stem queries containing wildcards, e.g. by overriding some parser?

Best regards
Dennis

RE: Stemming and Wildcard - or fire and water

Posted by Lars-Erik Aabech <LE...@markedspartner.no>.

A possible workaround is to rewrite search terms that contain wildcards by stemming them manually and building a new search string.
A search for hersen* would be rewritten to hers* and return what you expect.
The downside, of course, is that you search for more than you specified.
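
A minimal sketch of this rewrite, assuming the query is a single term with one trailing `*`. The class name, the `rewrite` method and the toy stemmer in `main` are hypothetical; in a real setup the stemmer function would have to delegate to the same analyzer that was used at index time.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Sketch of the workaround: stem the part in front of a trailing '*'. */
final class WildcardStemRewriter {

    // Rewrites "prefix*" to "stem(prefix)*". Queries without a trailing '*',
    // or with further wildcard characters inside the prefix, are returned
    // unchanged because stemming them is not well defined.
    static String rewrite(String query, UnaryOperator<String> stemmer) {
        if (query.length() > 1 && query.endsWith("*")) {
            String prefix = query.substring(0, query.length() - 1);
            if (prefix.indexOf('*') < 0 && prefix.indexOf('?') < 0) {
                return stemmer.apply(prefix) + "*";
            }
        }
        return query;
    }

    public static void main(String[] args) {
        // Hypothetical toy stemmer (NOT the real GermanStemmer): lowercases and
        // strips trailing "er"/"en" as long as four characters remain.
        UnaryOperator<String> toyStem = word -> {
            String s = word.toLowerCase(Locale.ROOT);
            while (s.length() >= 6 && (s.endsWith("er") || s.endsWith("en"))) {
                s = s.substring(0, s.length() - 2);
            }
            return s;
        };

        System.out.println(rewrite("Hersen*", toyStem)); // hers*
        System.out.println(rewrite("Hers*", toyStem));   // hers*
        System.out.println(rewrite("Hersen", toyStem));  // Hersen (no wildcard, left alone)
    }
}
```

Only a trailing `*` is handled here; terms with wildcards in the middle are passed through untouched, since it is unclear what stemming a pattern like `her?en*` should mean.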

Lars-Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Stemming and Wildcard - or fire and water

Posted by Uwe Schindler <uw...@thetaphi.de>.

This is a well-known problem: wildcard terms cannot be analyzed by the query parser, because the analysis would destroy the wildcard characters; in addition, stemming parts of terms will never work. Solr has a workaround (the MultiTermAware component), but it is very limited and only works when all analysis components are MultiTermAware, which stemmers are not.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

