Posted to solr-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2009/04/17 22:01:11 UTC
WordDelimiterFilterFactory removes words when options set to 0
In trying to understand the various options for WordDelimiterFilterFactory, I tried setting all options to 0.
This seems to prevent a number of words from being output at all. In particular, "can't" and "99dxl" don't get output, nor do any words containing hyphens. Is this correct behavior?
Here is the output from the Solr analysis page:
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position  1      2        3      4          5         6        7      8      9
term text      ca-55  99_3_a9  55-67  powerShot  ca999x15  foo-bar  can't  joe's  99dxl
org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0}
term position     1          5
term text         powerShot  joe
term type         word       word
source start,end  20,29      53,56
Here is the schema:
<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
Tom
Re: WordDelimiterFilterFactory removes words when options set to 0
Posted by Chris Hostetter <ho...@fucit.org>.
: In trying to understand the various options for
: WordDelimiterFilterFactory, I tried setting all options to 0. This seems
: to prevent a number of words from being output at all. In particular
: "can't" and "99dxl" don't get output, nor do any words containing hyphens.
: Is this correct behavior?
For the record: there are other options you haven't set. splitOnNumerics
defaults to "1" and preserveOriginal defaults to "0". I'm guessing that if you
set splitOnNumerics="0" you'd see a lot more tokens come through, and if
you set preserveOriginal="1" you'd definitely see a lot more tokens come
through.
: <fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
: <analyzer>
: <tokenizer class="solr.WhitespaceTokenizerFactory"/>
: <filter class="solr.WordDelimiterFilterFactory"
: splitOnCaseChange="0"
: generateWordParts="0"
: generateNumberParts="0"
: catenateWords="0"
: catenateNumbers="0"
: catenateAll="0"
: />
: <filter class="solr.LowerCaseFilterFactory"/>
: </analyzer>
: </fieldtype>
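As a rough, untested sketch of that suggestion, the same fieldtype with the two remaining options set explicitly might look like the following. The attribute names splitOnNumerics and preserveOriginal are the ones mentioned above; whether both are available depends on your Solr version, so treat that as an assumption to check against your release.

```xml
<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same six options as in the original post, plus the two that were left at their defaults. -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnNumerics="0"
            preserveOriginal="1"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```

Running the same nine test terms through the analysis page with this fieldtype should show whether the missing tokens come back.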
-Hoss