Posted to solr-user@lucene.apache.org by "G.S.J. Lobbestael" <g....@amc.uva.nl> on 2009/09/26 14:03:56 UTC

Punctuation marks in documents prevent recognition of synonyms at indexing?

Hi,

The wiki uses the example:

    <fieldtype name="syn" class="solr.TextField">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldtype>

With "dog, canine" in syn.txt and a document with "I have a dog, Bob.", "dog" is not seen as a synonym. With a document "I have a dog Bob" it is.

We could replace the WhitespaceTokenizerFactory with a PatternTokenizerFactory (in this case with a pattern="\s,"), but this may cause trouble further down the line, e.g. with the WordDelimiterFilterFactory if "-" is part of the pattern (suppose we have a document with "MRI-scan" and a synonym for "MRI").
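
For illustration, a minimal sketch of that PatternTokenizerFactory alternative. The character class "[\s,]+" and the explicit group="-1" (split on the pattern) are assumptions chosen for this example, not a tested configuration; other punctuation, such as the trailing period in "Bob.", would still stay attached to its token unless it is added to the pattern.

    <fieldtype name="syn" class="solr.TextField">
      <analyzer>
        <!-- Assumption: split on runs of whitespace and commas only.
             "-" is deliberately left out so "MRI-scan" stays one token here. -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,]+" group="-1"/>
        <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldtype>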

Or we could try to change the order of the filters (SynonymFilterFactory, StopFilterFactory, WordDelimiterFilterFactory, LowerCaseFilterFactory, SnowballPorterFilterFactory, RemoveDuplicatesTokenFilterFactory). The analysis tool shows that the comma is only removed at the WordDelimiterFilter stage.
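
A sketch of what such a reordering might look like, with the WordDelimiterFilterFactory moved ahead of the SynonymFilterFactory so that punctuation is stripped before synonyms are looked up. All attribute values shown (the word-delimiter options, stopwords.txt, language="English") are assumptions for illustration; the result should be checked in the analysis tool, since the extra tokens the word delimiter can emit may affect synonym matching.

    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- Assumption: moved first so "dog," becomes "dog" and "MRI-scan"
           is split into "MRI" and "scan" before the synonym lookup. -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
      <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
      <!-- Assumption: stopwords.txt is a placeholder file name. -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>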

What's the best course?

Geert Lobbestael


Re: Punctuation marks in documents prevent recognition of synonyms at indexing?

Posted by AHMET ARSLAN <io...@yahoo.com>.
> Hi,
> 
> The wiki uses the example:
> 
>     <fieldtype name="syn" class="solr.TextField">
>       <analyzer>
>           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>           <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
>       </analyzer>
>     </fieldtype>
> 
> With "dog, canine" in syn.txt and a document with "I have a
> dog, Bob.", "dog" is not seen as a synonym. With a document
> "I have a dog Bob" it is.

Why not use StandardTokenizerFactory, which removes punctuation?
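
A minimal sketch of that suggestion, assuming the rest of the field type from the original example is kept unchanged. How the standard tokenizer treats hyphenated terms such as "MRI-scan" is not covered here and would need to be verified in the analysis tool.

    <fieldtype name="syn" class="solr.TextField">
      <analyzer>
        <!-- Only the tokenizer differs from the wiki example: punctuation such as
             the comma in "dog," is dropped before the synonym filter runs. -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldtype>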