Posted to solr-user@lucene.apache.org by Doris Peter <Do...@bsb-muenchen.de> on 2019/07/18 08:48:05 UTC

Problems with StemFilter and Wildcards

Hi, we have some problems with stemming in our OCR texts:

We use the following configuration for our full-text-ocr field:

<fieldtype name="text_ocr" class="solr.TextField" termPositions="true" termVectors="true" termPayloads="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GermanStemFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
            encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder" />
    <filter class="solr.WordDelimiterGraphFilterFactory" protected="protectedword.txt"
            preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
            types="wdfftypes.txt" />
  </analyzer>
</fieldtype>


It seems that the StemFilter and wildcard queries don't work together. A search for

    Weltkriegs   returns 6 documents,
    Weltkrie?s   returns only 1 document,
    wel?kriegs   also returns only 1 document.


This happens only with terms that the stemming filter changes. Is there a way to fix this?
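
A likely explanation, stated here as an assumption and not a confirmed diagnosis: Solr does not run wildcard terms through the full analysis chain (only multi-term-aware components such as lowercasing are applied), so a pattern like Weltkrie?s is never stemmed, while the index holds only the stemmed form. One common workaround is an unstemmed sibling field that is used for wildcard queries. The sketch below is untested against this setup; the names text_ocr_exact, ocr_text and ocr_text_exact are hypothetical, and the WordDelimiterGraphFilterFactory step is left out for brevity.

```xml
<!-- Hypothetical unstemmed variant of text_ocr, intended only for wildcard queries. -->
<fieldType name="text_ocr_exact" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No GermanStemFilterFactory here: indexed terms keep their inflected form. -->
    <!-- Payload filter copied from text_ocr so the delimiter suffix is still stripped. -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
            encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder" />
  </analyzer>
</fieldType>

<!-- Hypothetical field names; ocr_text stands in for the existing stemmed field. -->
<field name="ocr_text_exact" type="text_ocr_exact" indexed="true" stored="false"/>
<copyField source="ocr_text" dest="ocr_text_exact"/>
```

With such a setup, plain terms would still be searched in the stemmed field, and only queries containing ? or * would be sent to the unstemmed one.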


Thanks a lot, Doris