Posted to solr-user@lucene.apache.org by Doris Peter <Do...@bsb-muenchen.de> on 2019/07/18 09:01:16 UTC

Correct order of MappingCharFilter, Tokenizer and GermanStemFilter

Hi, 

another problem with the stemming:

Most of our texts are in German, so we use the GermanStemFilterFactory. We also use the MappingCharFilterFactory to map, for example, ä->ae.

Of course, we still want the stemming to turn, for example, 'häuser' into 'haus', which the GermanStemFilterFactory should do according to the documentation.

At the moment, my configuration looks like this:

    <fieldtype name="text_ocr" class="solr.TextField" termPositions="true" termVectors="true" termPayloads="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.GermanStemFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
          encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder" />
        <filter class="solr.WordDelimiterGraphFilterFactory" protected="protectedword.txt"
             preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
             catenateWords="1" catenateNumbers="1" catenateAll="1"
             generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
             types="wdfftypes.txt" />
      </analyzer>
    </fieldtype>

So the stemming filter is listed before the CharFilter.

But the Solr Analyzer says:

MCF 0 h a e u s e r

    Stage  text     raw_bytes               start  end  positionLength  type  termFrequency  position  keyword  payload
    WT     haeuser  [68 61 65 75 73 65 72]  0      6    1               word  1              1
    LCF    haeuser  [68 61 65 75 73 65 72]  0      6    1               word  1              1
    GSF    haeu     [68 61 65 75]           0      6    1               word  1              1         false
    DPTF   haeu     [68 61 65 75]           0      6    1               word  1              1         false
    WDGF   haeu     [68 61 65 75]           0      6    1               word  1              1         false

So the MappingCharFilter seems to be executed first, no matter where it appears in the configuration?

The Solr documentation also says it should be placed before the Tokenizer:
https://lucene.apache.org/solr/guide/7_6/charfilterfactories.html
"CharFilters can be chained like Token Filters and placed in front of a Tokenizer."

But once the word häuser has been changed to haeuser, the stemmer no longer reduces it to 'haus' :-/

Is there a way to solve this problem?

Thanks a lot, Doris

AW: Antw: Re: Correct order of MappingCharFilter, Tokenizer and GermanStemFilter

Posted by Tobias Ibounig <t....@netconomy.net>.

Well, the stemmer only applies generalized rules, and these rules sometimes (or surprisingly often) do not produce the actual word stem. The question is how much that matters in your case. When you search for something, the same transformations are applied to the query. So whether you search for "München", "Muenchen" or "Munchen", you will always get the result.

If you want less aggressive stemming, you can use the Light or Minimal variants, or you can keep a list of words that should not be stemmed and put the KeywordMarkerFilter before the stemmer.
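
As a minimal sketch of both options (not a tested configuration): the analyzer below uses solr.GermanLightStemFilterFactory in place of the GermanStemFilterFactory and puts a solr.KeywordMarkerFilterFactory in front of it. The field type name text_de_light and the file name protwords.txt are assumptions chosen for illustration; the file would list, one per line, the words that should not be stemmed.

    <fieldType name="text_de_light" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Terms listed in protwords.txt (assumed file name) are marked as keywords,
             so the stemmer that follows leaves them untouched -->
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Less aggressive than GermanStemFilterFactory;
             solr.GermanMinimalStemFilterFactory would be lighter still -->
        <filter class="solr.GermanLightStemFilterFactory"/>
      </analyzer>
    </fieldType>

The KeywordMarkerFilter has to come before the stemmer, because the stemmer only checks the keyword flag that the marker has already set on a token.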

All the Best
Tobias

RE: Antw: Re: Correct order of MappingCharFilter, Tokenizer and GermanStemFilter

Posted by Doris Peter <Do...@bsb-muenchen.de>.

Yes, you are right, we should discuss this once more ....
But we have texts that contain, for example, "Muenchen", and we would like to retrieve these documents too when searching for "München". We would lose them if we mapped 'München' to 'Munchen'.
On the other hand, we get into trouble with the wildcard '?' when we map ü to ue :-(

Anyway, I tried it without any mapping, and the GermanStemFilterFactory still doesn't work as expected: it turns 'häuser' into 'hau', not into 'haus' :-/

RE: Antw: Re: Correct order of MappingCharFilter, Tokenizer and GermanStemFilter

Posted by Tobias Ibounig <t....@netconomy.net>.

Hi Doris,

Are you sure you want 'ä' --> 'ae'?
If you check, the German stemmers usually substitute ä --> a (to "reduce over stemming" [1]), so you would be working against the stemmer's logic here.

If you take a look at the GermanNormalizationFilter, it even substitutes 'ae' with 'a' [2].

I would recommend using the default available tools if you don't have a specific requirement against it.
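
For illustration, here is a sketch of an analyzer built only from stock components, with solr.GermanNormalizationFilterFactory handling the umlauts ahead of the stemmer instead of a custom ä->ae mapping. The field type name text_de_norm is made up for this example, and the choice of the light stemmer is an assumption, not something tested against your data.

    <fieldType name="text_de_norm" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Folds ä/ö/ü to a/o/u and ß to ss; also reduces ae/oe to a/o,
             and ue to u in most positions -->
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.GermanLightStemFilterFactory"/>
      </analyzer>
    </fieldType>

As far as I can tell from the filter source [2], 'häuser' and 'haeuser' should both reach the stemmer as 'hauser' with this chain, so both spellings are stemmed the same way.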

All the Best
Tobias

[1] https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java#L164

[2] https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanNormalizationFilter.java#L31

Antw: Re: Correct order of MappingCharFilter, Tokenizer and GermanStemFilter

Posted by Doris Peter <Do...@bsb-muenchen.de>.

Thanks for the answer. I examined the ICUFoldingFilterFactory, but it seems to me that it can't be customized the way I would need.
We have some special foldings, e.g. ä->ae. With the CharFilter, I can add them to the mapping file (mapping="mapping-FoldToASCII.txt").
There seems to be nothing like this mapping file for the ICUFoldingFilter? Exclusion is not enough ....

Re: Correct order of MappingCharFilter, Tokenizer and GermanStemFilter

Posted by Shawn Heisey <ap...@elyograg.org>.

On 7/18/2019 3:01 AM, Doris Peter wrote:
> So, the mappingCharFilter seems to be executed at first, no matter which position it has in the configuration?

CharFilters are always executed first.  Then one Tokenizer, then 
Filters.  This will always be the case, even if you order the config so 
that the Tokenizer and one or more Filters are listed before CharFilter 
entries.  It's one of the quirks of analysis definitions.
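
To illustrate with the field type from the original message: the definition below only moves the charFilter element to the top, which is where it effectively runs anyway. It is a sketch of the effective order rather than a fix, and it should analyze text the same way as the original definition.

    <fieldtype name="text_ocr" class="solr.TextField" termPositions="true" termVectors="true" termPayloads="true">
      <analyzer>
        <!-- 1. CharFilters run on the raw character stream -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <!-- 2. Exactly one Tokenizer -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- 3. Token filters, in the order listed -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.GermanStemFilterFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
          encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder" />
        <filter class="solr.WordDelimiterGraphFilterFactory" protected="protectedword.txt"
             preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
             catenateWords="1" catenateNumbers="1" catenateAll="1"
             generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
             types="wdfftypes.txt" />
      </analyzer>
    </fieldtype>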

The fix for this would be to see if there is a regular Filter that does 
what the CharFilter you're using does and use that filter instead.

If it were me, I would likely use ICUFoldingFilterFactory rather than 
MappingCharFilterFactory.  The ICU analysis components do require 
installing contrib jars into Solr.

https://lucene.apache.org/solr/guide/8_1/filter-descriptions.html#icu-folding-filter

Thanks,
Shawn