Posted to solr-user@lucene.apache.org by Jeff Newburn <jn...@zappos.com> on 2008/11/19 19:51:58 UTC

Multi word Synonym

I am trying to figure out how the synonym filter processes multi-word
inputs.  I have checked the analyzer in the GUI with some confusing results.
The indexed field has "The North Face" as a value. The synonym file has

morthface, morth face, noethface, noeth face, norhtface, norht face,
nortface, nort face, northfac, north fac, northfac3e, north fac3e,
northface, north face, northfae, north fae, northfaqce, north faqce,
northfave, north fave, northhace, north hace, nothface, noth face,
thenorhface, the norh face, thenorth, the north, thenorthandface, the north
and face, thenortheface, the northe face, thenorthfac, the north fac,
thenorthface, thenorthfacee, the north facee, thenothface, the noth face,
thenotrhface, the notrh face, thenrothface, the nroth face, tnf => The North
Face

I have the field type using the WhitespaceTokenizer before the synonym
filter runs.  My confusion is that when the term "morth fac" is run, the
system somehow knows to map it to the correct term even though that term is
not present in the file.

How is this happening?  Is the synonym process tokenizing as well?

The datatype schema is as follows:
       <fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
           <analyzer>
               <tokenizer class="solr.WhitespaceTokenizerFactory"/>
               <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
               <filter class="solr.LowerCaseFilterFactory"/>
               <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
               <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

               <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
               <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
           </analyzer>
       </fieldType>


-Jeff

Re: Multi word Synonym

Posted by gurudev <su...@yahoo.com>.

Just use the query analysis link with appropriate values. It will show how
each analyzer and filter factory breaks up the terms at each stage of
analysis. In particular, check the EnglishPorterFilterFactory stage.
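
When reading the synonym stage on the analysis page, it can help to have a baseline for what exact multi-token matching would produce. The sketch below is a simplified model only, not Solr's SynonymFilter implementation: it assumes each left-hand entry of the rule is split on whitespace and matched, longest first, against the incoming token stream. The names `build_rules` and `apply_synonyms` are hypothetical, and only a few entries from the synonym line are used.

```python
from typing import Dict, List, Tuple

Rules = Dict[Tuple[str, ...], List[str]]


def build_rules(lhs_entries: List[str], rhs: str, ignore_case: bool = True) -> Rules:
    """Map each whitespace-split left-hand entry to the right-hand token list."""
    rules: Rules = {}
    for entry in lhs_entries:
        tokens = entry.split()
        if ignore_case:
            tokens = [t.lower() for t in tokens]
        rules[tuple(tokens)] = rhs.split()
    return rules


def apply_synonyms(tokens: List[str], rules: Rules, ignore_case: bool = True) -> List[str]:
    """Replace the longest matching token sequence at each position."""
    max_len = max((len(key) for key in rules), default=0)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tokens[i:i + n]
            key = tuple(t.lower() for t in window) if ignore_case else tuple(window)
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            # No rule starts at this position; pass the token through unchanged.
            out.append(tokens[i])
            i += 1
    return out


# A small subset of the synonym line from the original message.
rules = build_rules(["morth face", "north fac", "northface", "tnf"], "The North Face")

print(apply_synonyms("morth face".split(), rules))  # ['The', 'North', 'Face']
print(apply_synonyms("morth fac".split(), rules))   # ['morth', 'fac']
```

In this model "morth fac" passes through unchanged, because no entry consists of exactly those two tokens. If the analysis page shows it being mapped anyway, the cause lies somewhere this sketch does not cover, so comparing the token output before and after each filter is the way to find out where the match comes from.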




