Posted to solr-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2009/04/17 22:01:11 UTC

WordDelimiterFilterFactory removes words when options set to 0

In trying to understand the various options for WordDelimiterFilterFactory, I tried setting all options to 0.
This seems to prevent a number of words from being output at all. In particular, "can't" and "99dxl" don't get output, nor do any words containing hyphens. Is this correct behavior?


Here is what the Solr Analyzer output:

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 	1	2	3	4	5	6	7	8	9
term text 	ca-55	99_3_a9	55-67	powerShot	ca999x15	foo-bar	can't	joe's	99dxl

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0}

term position 	1	5
term text 	powerShot	joe
term type 	word	word
source start,end 	20,29	53,56

Here is the schema:
<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="0"
                generateWordParts="0"
                generateNumberParts="0"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                />
          <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

Tom

Re: WordDelimiterFilterFactory removes words when options set to 0

Posted by Chris Hostetter <ho...@fucit.org>.

: In trying to understand the various options for 
: WordDelimiterFilterFactory, I tried setting all options to 0. This seems 
: to prevent a number of words from being output at all. In particular 
: "can't" and "99dxl" don't get output, nor do any words containing hyphens. 
: Is this correct behavior?

For the record: there are other options you haven't set. splitOnNumerics 
defaults to "1" and preserveOriginal defaults to "0". I'm guessing that if you 
set splitOnNumerics="0" you'd see a lot more tokens come through, and if 
you set preserveOriginal="1" you'd definitely see a lot more tokens come 
through.
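
As a rough sketch of that suggestion (untested here, and assuming a Solr version whose WordDelimiterFilterFactory accepts both splitOnNumerics and preserveOriginal), the filter from the posted schema could be adjusted as follows. Only the last two attributes are new; everything else is copied from the original field type.

```xml
<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- Same six options as in the original post, all still "0".
               Added (assumption, per the guess above):
                 splitOnNumerics="0"  : do not split at letter/digit boundaries
                 preserveOriginal="1" : also emit the unmodified input token -->
          <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="0"
                generateWordParts="0"
                generateNumberParts="0"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                splitOnNumerics="0"
                preserveOriginal="1"
                />
          <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
```

Running the same nine test terms through the analysis page with this field type should show whether the dropped tokens come back. Trying the two new attributes one at a time would show which of them is responsible.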

: <fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
:       <analyzer>
:           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
:           <filter class="solr.WordDelimiterFilterFactory"
:                 splitOnCaseChange="0"
:                 generateWordParts="0"
:                 generateNumberParts="0"
:                 catenateWords="0"
:                 catenateNumbers="0"
:                 catenateAll="0"
:                 />
:           <filter class="solr.LowerCaseFilterFactory"/>
:       </analyzer>
:     </fieldtype>

-Hoss