Posted to solr-user@lucene.apache.org by Alexey Shakov <al...@menta.de> on 2008/01/15 10:57:19 UTC

wildcards and German umlauts

Hi all,

Searching the index works if I type a complete word (such as "übersicht").
But there are no hits if I use wildcards (such as "über*").
Searching with wildcards and without umlauts works as well.

Can someone help me? Thanks in advance!

Here is my field definition:


        <fieldtype name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords.txt" />
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.SnowballPorterFilterFactory"
                    protected="protwords.txt" language="German2" />
                <!-- filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt" /-->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="solr.SynonymFilterFactory"
                    synonyms="synonyms.txt" ignoreCase="true" expand="true" />
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                    words="stopwords.txt" />
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="0" catenateNumbers="0" catenateAll="0" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.SnowballPorterFilterFactory"
                    protected="protwords.txt" language="German2" />
                <!-- filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt" /-->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldtype>

Re: wildcards and German umlauts

Posted by solenweg <da...@hotmail.com>.

Hi,

I've got the same problem: searching with wildcards and umlauts returns no
results.
Just as you described it:

"Searching the index works if I type a complete word (such as "übersicht").
But there are no hits if I use wildcards (such as "über*").
Searching with wildcards and without umlauts works as well."

Anyone found the solution to this problem or have any new ideas?
-- 
View this message in context: http://www.nabble.com/wildcards-and-German-umlauts-tp14836043p24517583.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: wildcards and German umlauts

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

I agree that this is annoying for foreign languages. I get the idea behind the original behaviour, but there could be more elegant ways of handling it. It would make sense to always run the CharFilters. Perhaps a mechanism where TokenFilters can be tagged for exclusion from wildcard terms would be an idea. That way we could skip stemming, synonym, and phonetic filters for wildcard terms, but still do lowercasing and character normalization.
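
For what it is worth, Solr 3.6 and later accept a separate analyzer of type "multiterm" in a field type, which comes close to the mechanism described above: wildcard and prefix terms go through that chain only. The following is a minimal sketch under stated assumptions, not a tested configuration: the field type name "text_de_wildcard" and the mapping file "mapping-umlauts.txt" are hypothetical, and stemming is left out on purpose so that indexed terms and wildcard prefixes are normalized in exactly the same way.

```xml
<!-- Hypothetical field type; the name and the mapping file are assumptions. -->
<fieldType name="text_de_wildcard" class="solr.TextField"
    positionIncrementGap="100">
    <analyzer type="index">
        <!-- Character normalization happens before tokenizing. -->
        <charFilter class="solr.MappingCharFilterFactory"
            mapping="mapping-umlauts.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory"
            mapping="mapping-umlauts.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    <!-- Used for wildcard and prefix terms only: normalization and
         lowercasing, but no stemming, synonyms, or phonetic filters. -->
    <analyzer type="multiterm">
        <charFilter class="solr.MappingCharFilterFactory"
            mapping="mapping-umlauts.txt" />
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>
```

With a mapping such as "ü" => "ue" in the assumed file, "übersicht" would be indexed as "uebersicht" and the query "über*" would be rewritten to "ueber*" before matching.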

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 29 May 2011, at 19:24, mdz-munich wrote:

> Ah, NOW I got it. It's not a bug, it's a feature.
> 
> But that would mean that every character manipulation (e.g.
> char-mapping/replacement, Porter stemmer in some cases ...) would cause a
> wildcard query to fail. That's too bad.
> 
> But why? What's the problem with passing the prefix through the
> analyzer/filter chain?
> 
> Greetz,
> 
> Sebastian
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/wildcards-and-German-umlauts-tp499972p2999237.html

Re: wildcards and German umlauts

Posted by mdz-munich <se...@bsb-muenchen.de>.

Ah, NOW I got it. It's not a bug, it's a feature.

But that would mean that every character manipulation (e.g.
char-mapping/replacement, Porter stemmer in some cases ...) would cause a
wildcard query to fail. That's too bad.

But why? What's the problem with passing the prefix through the
analyzer/filter chain?

Greetz,

Sebastian

--
View this message in context: http://lucene.472066.n3.nabble.com/wildcards-and-German-umlauts-tp499972p2999237.html

Re: wildcards and German umlauts

Posted by mdz-munich <se...@bsb-muenchen.de>.

I don't follow. Did I write anything about an analyzer? Actually, no.

--
View this message in context: http://lucene.472066.n3.nabble.com/wildcards-and-German-umlauts-tp499972p2999074.html

Re: wildcards and German umlauts

Posted by Markus Jelsma <ma...@openindex.io>.

Wildcard queries are not passed through an analyzer.

> Ah, BTW,
> 
> since the problem seems to be a query parser issue, a simple workaround
> is to replace all umlauts with ASCII characters (ä = ae, ö = oe, ü = ue,
> for example) before sending the query to Solr, and to use a
> solr.MappingCharFilterFactory with the same replacements (ä = ae, ö = oe,
> ü = ue) while indexing.
> 
> It's inflexible in some cases, but it works so far.
> 
> Greetz,
> 
> Sebastian
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/wildcards-and-German-umlauts-tp499972p2998449.html

Re: wildcards and German umlauts

Posted by mdz-munich <se...@bsb-muenchen.de>.

Ah, BTW,

since the problem seems to be a query parser issue, a simple workaround
is to replace all umlauts with ASCII characters (ä = ae, ö = oe, ü = ue,
for example) before sending the query to Solr, and to use a
solr.MappingCharFilterFactory with the same replacements (ä = ae, ö = oe,
ü = ue) while indexing.

It's inflexible in some cases, but it works so far.
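
The index-side half of this workaround could look roughly like the sketch below. It is only an illustration under assumptions: the field type name "text_umlaut" and the mapping file name "mapping-umlauts.txt" are made up, and the rest of the filter chain is reduced to the bare minimum. The mapping rules are shown inside a comment because they belong in the separate mapping file, not in the schema.

```xml
<!-- Hypothetical field type; the name and the mapping file are assumptions. -->
<fieldType name="text_umlaut" class="solr.TextField"
    positionIncrementGap="100">
    <analyzer type="index">
        <!-- Rewrites umlauts to ASCII before the text is tokenized. -->
        <charFilter class="solr.MappingCharFilterFactory"
            mapping="mapping-umlauts.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    <!--
        Assumed contents of mapping-umlauts.txt, one rule per line:

        "ä" => "ae"
        "ö" => "oe"
        "ü" => "ue"
        "Ä" => "Ae"
        "Ö" => "Oe"
        "Ü" => "Ue"
    -->
</fieldType>
```

The client then has to apply the same replacements to the query string, so that "über*" is sent to Solr as "ueber*"; the wildcard term itself is not analyzed.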

Greetz,

Sebastian 

--
View this message in context: http://lucene.472066.n3.nabble.com/wildcards-and-German-umlauts-tp499972p2998449.html

Re: wildcards and German umlauts

Posted by Daniel Naber <lu...@danielnaber.de>.

On Tuesday, 15 January 2008, Alexey Shakov wrote:

> Searching the index works if I type a complete word (such as "übersicht").
> But there are no hits if I use wildcards (such as "über*").
> Searching with wildcards and without umlauts works as well.

Maybe this describes your problem on the Lucene level?
http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a

If that doesn't help, try Luke to see how your queries are parsed.

Regards
 Daniel

-- 
http://www.danielnaber.de

Re: wildcards and German umlauts

Posted by mdz-munich <se...@bsb-muenchen.de>.

Hi,

"Searching the index works if I type a complete word (such as "übersicht").
But there are no hits if I use wildcards (such as "über*").
Searching with wildcards and without umlauts works as well."

I can confirm that. 

Greetz,

Sebastian

--
View this message in context: http://lucene.472066.n3.nabble.com/wildcards-and-German-umlauts-tp499972p2998425.html