Posted to solr-user@lucene.apache.org by Marc Bechler <m....@computer.org> on 2007/09/13 22:14:20 UTC

Query for German "Special Characters" (i.e., ä, ö, ß)

Hi SOLR kings,

I'm just playing around with queries, but I am not able to query for
special characters such as the German umlauts (e.g., ä, ö, ü). Maybe
others have seen the same effect and already found a solution ;-)

Here is my example: I have one field called "sometext" of type "text" 
(the one delivered with the SOLR example). I indexed a few words similar to

<field name="sometext">
<![CDATA[
This is really fünny
]]></field>

Works fine, and searching for "really" shows the result and fünny will 
be displayed correctly. However, the query for "fünny" using the 
/solr/admin page is resolved (correctly) to the URL ...q=f%C3%BCnny... 
but does not find the document.

And now the question: Any ideas? ;-)

Cheers,

  marc

Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Marc Bechler <m....@computer.org>.
Hi Walter,

good advice -- but you need to know the language of your material ...
could be hard for automated processing ;-)

I also stumbled on the "same words in different languages" problem. The
only solution might be the dream of a world documented in English only ;-)

Regards from good old Umlaute-Germany ;-)

  marc


Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Walter Underwood <wu...@netflix.com>.
You could index into multiple fields with different analyzers
and search all of them.

text_en: uses English stemmer
text_de: uses German stemmer
text_exact: no stemming
text_strip: uses ISOLatin1AccentFilter

You can search all of these and put different boosts on them,
with higher boosts for the more exact matches.

I don't know if any of these handle "typewriter umlauts", like
"ueber" for "über".
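
To make the difference between accent stripping and "typewriter umlauts" concrete, here is a small Python sketch. It only approximates accent stripping through Unicode decomposition and is not the Lucene ISOLatin1AccentFilter itself; the TYPEWRITER mapping and both function names are made up for illustration.

```python
import unicodedata

# Illustrative mapping for "typewriter umlauts"; not taken from any Solr filter.
TYPEWRITER = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (ü -> u).

    "ß" has no decomposition, so it passes through unchanged here.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_typewriter(text: str) -> str:
    """Spell out umlauts and ß as they are typed on a keyboard without them."""
    return "".join(TYPEWRITER.get(ch, ch) for ch in text)


# The two normalizations produce different terms for the same word,
# so a field that strips accents will not match a query typed as "ueber".
print(strip_accents("über"))
print(to_typewriter("über"))
```

If typewriter spellings matter, it may make sense to index one more field that applies such a mapping, next to the accent-stripped one.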

The German Porter stemmer probably does not break up compound words,
e.g. split "Feuerwehrmannschaft" into "Feuerwehr" and "Mannschaft"
(but no further). That can cause missed matches.

You can put these in synonyms.txt, but that could be a lot
of work.
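
One alternative to maintaining such entries by hand is dictionary-based splitting. That is a different technique from the stemmer and from synonyms.txt, and it is not something described in this thread. The following is a minimal Python sketch under that assumption; the word list and the function name are hypothetical.

```python
def split_compound(word, dictionary):
    """Split a word into two dictionary words, or return it unsplit.

    Only a single split point is tried, so the parts are not broken
    down any further.
    """
    lowered = word.lower()
    for i in range(1, len(lowered)):
        head, tail = lowered[:i], lowered[i:]
        if head in dictionary and tail in dictionary:
            return [head, tail]
    return [lowered]


# Hypothetical word list; a real one would have to come from a German dictionary.
WORDS = {"feuer", "wehr", "feuerwehr", "mann", "mannschaft"}

print(split_compound("Feuerwehrmannschaft", WORDS))
print(split_compound("Mannschaft", WORDS))
```

The quality of the result depends entirely on the word list, which is the same maintenance burden in a different place.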

One problem that I have seen in cross-language searching is
strings that appear in both languages. For example, "die" is
common in German but rare in English, so it will have a higher
IDF when matched against English and the English hits will
score higher. Same for "mit". In English, that is the Massachusetts
Institute of Technology.
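
A rough illustration of the IDF effect, as a Python sketch. The formula is the classic Lucene-style idf, 1 + ln(N / (df + 1)); the document counts are invented for the example and do not come from any real index.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency in the form 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts: 1000 documents per language, with "die" appearing
# in 900 of the German documents but in only 20 of the English ones.
idf_german = idf(doc_freq=900, num_docs=1000)
idf_english = idf(doc_freq=20, num_docs=1000)

# The rarer the term in a field, the larger its idf, so a hit on "die"
# in the English field contributes more to the score.
print(round(idf_german, 2), round(idf_english, 2))
```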

wunder
==
Walter Underwood
Search Guy, Netflix



Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Marc Bechler <m....@computer.org>.
Hi Tom,

thanks for your professional response -- works fine and looks good :-). 
Since I am playing around with mixed texts (English and German), I do 
not have any idea whether or not an EnglishPorter will be useful for 
German texts. But I will find it out by playing around ;-)

Regards from Germany,

  marc




Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Tom Hill <so...@zvents.com>.
Hi Marc,

The searches are going to look for an exact match of the query (after
analysis) in the index (after analysis).

So, realli will not match really.

So you want to have the same stemmer (probably not the English one, given
your examples) in both the index analyzer and the query analyzer. I've
appended the section from the Solr 1.2 example schema.xml; note that
EnglishPorterFilterFactory is in both sections. That is what you want
to do, with the appropriate stemmer for your application.

Or, you could use no stemmer for BOTH, but I think most people go with
stemming. At least, I do. :-)
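
The matching rule can be sketched in a few lines of Python. This is a toy model, not Solr code: toy_stem is a hypothetical stand-in that only imitates the "really" to "realli" rewrite reported from the analysis page, and analyze and matches are made-up helper names.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a stemmer: rewrite a trailing "y" as "i"."""
    return token[:-1] + "i" if token.endswith("y") else token


def analyze(text: str, stem: bool) -> list:
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens] if stem else tokens


def matches(doc: str, query: str, stem_index: bool, stem_query: bool) -> bool:
    """A query term hits only if it equals an indexed term exactly."""
    indexed = set(analyze(doc, stem_index))
    return any(term in indexed for term in analyze(query, stem_query))


doc = "This is really fünny kraßen"

# Stemmer only at query time: "really" is stored, but "realli" is searched.
print(matches(doc, "really", stem_index=False, stem_query=True))
# Same stemmer on both sides: "realli" is stored and "realli" is searched.
print(matches(doc, "really", stem_index=True, stem_query=True))
# Terms the toy stemmer leaves alone match either way.
print(matches(doc, "kraßen", stem_index=False, stem_query=True))
```

This mirrors the symptom in the thread: words the stemmer rewrites stop matching when only one side is stemmed, while words it leaves untouched still match.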

Tom

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Marc Bechler <m....@computer.org>.
Hi Tom,

thanks for your response -- and sorry for the newbie question, it may sound
somewhat silly ;-) . Here is the quick result from the analysis UI:

Index for "really": 5* really. Query for "really": 5* really, 2* realli
(from: EnglishPorterFilterFactory {protected=protwords.txt},
RemoveDuplicatesTokenFilterFactory {})

For "this" everything is completely fine.

Does the query have to match the index completely, or is a partial
match also okay?

Thanks for helping me

  marc





Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Tom Hill <so...@zvents.com>.
Hi Marc,

Are you using the same stemmer on your queries that you use when indexing?

Try the analysis function in the admin UI, to see how things are stemmed for
indexing vs. querying. If they don't match for really and fünny, and do
match for kraßen, then that's your problem.

Tom



Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Marc Bechler <m....@computer.org>.
Hi,

oops, the URIEncoding was lost during the update to tomcat 6.0.14.
Thanks for the advice.

But now I am really curious. After indexing the document from scratch,
I have the effect that queries for "this" and "is" work fine, whereas
queries for "really" and "fünny" do not return the result. Fünnily ;-) ,
after extending my sometext to "This is really fünny kraßen.", queries
for "really" and "fünny" still do not work, but "kraßen" is found.
Now I am somewhat confused -- hopefully someone has a good explanation ;-)

Regards,

  marc

> Tom Hill schrieb:
>> If you are using Tomcat, try adding URIEncoding="UTF-8" to your
>> Tomcat connector.
>> 
>> <Connector port="8080" maxHttpHeaderSize="8192" maxThreads="150"
>> minSpareThreads="25" maxSpareThreads="75" enableLookups="false"
>> redirectPort="8443" acceptCount="100" connectionTimeout="20000"
>> disableUploadTimeout="true" URIEncoding="UTF-8" />
>> 
>> use the analysis page of the admin interface to check to see what's
>>  happening to your queries, too.
>> 
>> http://localhost:8080/solr/admin/analysis.jsp?highlight=on  (your
>> port # may vary)
>> 
>> Tom
>> 
>> On 9/13/07, Marc Bechler <m....@computer.org> wrote:
>>> Hi SOLR kings,
>>> 
>>> I'm just playing around with queries, but I was not able to query
>>> for any special characters like the German "Umlaute" (i.e., ä, ö,
>>> ü). Maybe others might have the same effects and already found a
>>> solution ;-)
>>> 
>>> Here is my example: I have one field called "sometext" of type
>>> "text" (the one delivered with the SOLR example). I indexed a few
>>> words similar to
>>> 
>>> <field name="sometext"> <![CDATA[ This is really fünny 
>>> ]]></field>
>>> 
>>> Works fine, and searching for "really" shows the result and fünny
>>> will be displayed correctly. However, the query for "fünny" using
>>> the /solr/admin page is resolved (correctly) to the URL
>>> ...q=f%C3%BCnny... but does not find the document.
>>> 
>>> And now the question: Any ideas? ;-)
>>> 
>>> Cheers,
>>> 
>>> marc
>>> 
>> 
> 

Re: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Tom Hill <so...@zvents.com>.
If you are using Tomcat, try adding URIEncoding="UTF-8" to your Tomcat
connector:

    <Connector port="8080" maxHttpHeaderSize="8192"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               connectionTimeout="20000" disableUploadTimeout="true"
               URIEncoding="UTF-8" />
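
A small sketch of why the attribute matters, assuming the connector falls
back to ISO-8859-1 when URIEncoding is not set (the default in Tomcat
versions of this era). It uses only java.net.URLDecoder; the class name
UriDecodingDemo is made up for illustration and this is not Tomcat's own
decoding code.

```java
import java.net.URLDecoder;

public class UriDecodingDemo {
    // Decode a percent-encoded query value using the named charset.
    static String decode(String encoded, String charset) throws Exception {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) throws Exception {
        // The query value the admin page sends for "fünny".
        String encoded = "f%C3%BCnny";

        // With UTF-8, the two bytes C3 BC become the single character U+00FC.
        String asUtf8 = decode(encoded, "UTF-8");
        // With ISO-8859-1, each byte becomes its own character (U+00C3, U+00BC).
        String asLatin1 = decode(encoded, "ISO-8859-1");

        System.out.println(asUtf8.length());                        // 5
        System.out.println(asLatin1.length());                      // 6
        System.out.println(asUtf8.equals("f\u00FCnny"));            // true
        System.out.println(asLatin1.equals("f\u00C3\u00BCnny"));    // true
    }
}
```

Decoded the wrong way, the query term is a six-character string that was
never indexed, so the search comes back empty even though the URL itself
is correct.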

Also use the analysis page of the admin interface to check what is
happening to your queries.

http://localhost:8080/solr/admin/analysis.jsp?highlight=on  (your port # may
vary)

Tom

On 9/13/07, Marc Bechler <m....@computer.org> wrote:
>
> Hi SOLR kings,
>
> I'm just playing around with queries, but I was not able to query for
> any special characters like the German "Umlaute" (i.e., ä, ö, ü). Maybe
> others might have the same effects and already found a solution ;-)
>
> Here is my example: I have one field called "sometext" of type "text"
> (the one delivered with the SOLR example). I indexed a few words similar
> to
>
> <field name="sometext">
> <![CDATA[
> This is really fünny
> ]]></field>
>
> Works fine, and searching for "really" shows the result and fünny will
> be displayed correctly. However, the query for "fünny" using the
> /solr/admin page is resolved (correctly) to the URL ...q=f%C3%BCnny...
> but does not find the document.
>
> And now the question: Any ideas? ;-)
>
> Cheers,
>
>   marc
>

RE: Query for German "Special Characters" (i.e., ä, ö, ß)

Posted by Aaron Hammond <aa...@sirsidynix.com>.
Are you using Tomcat with Solr? If so you need to add the URIEncoding attribute to your Connector. See this url -

http://tomcat.apache.org/tomcat-6.0-doc/config/http.html

I hope this helps. If you are using Jetty then ..... :) 

Aaron

-----Original Message-----
From: Marc Bechler [mailto:m.bechler@computer.org] 
Sent: Thursday, September 13, 2007 3:14 PM
To: solr-user@lucene.apache.org
Subject: Query for German "Special Characters" (i.e., ä, ö, ß)

Hi SOLR kings,

I'm just playing around with queries, but I was not able to query for 
any special characters like the German "Umlaute" (i.e., ä, ö, ü). Maybe 
others might have the same effects and already found a solution ;-)

Here is my example: I have one field called "sometext" of type "text" 
(the one delivered with the SOLR example). I indexed a few words similar to

<field name="sometext">
<![CDATA[
This is really fünny
]]></field>

Works fine, and searching for "really" shows the result and fünny will 
be displayed correctly. However, the query for "fünny" using the 
/solr/admin page is resolved (correctly) to the URL ...q=f%C3%BCnny... 
but does not find the document.

And now the question: Any ideas? ;-)

Cheers,

  marc