Posted to solr-user@lucene.apache.org by Will Milspec <wi...@gmail.com> on 2011/08/18 04:02:12 UTC

Synonym and Whitespaces and optional TokenizerFactory

Hi all,

This may be obvious. My question pertains to the use of tokenizerFactory
together with SynonymFilterFactory. Which tokenizerFactory does one use to
treat "synonyms with spaces" as one token?

Example: these two entries are synonyms: "lms", "learning management system"

Index-time expansion would expand "lms" to these terms:
           "lms"
           "learning management system"

i.e. not like this:
           "lms"
           "learning"
           "management"
           "system"

Excerpt from the wiki article:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
<quote>
The optional *tokenizerFactory* parameter names a tokenizer factory class to
analyze synonyms (see https://issues.apache.org/jira/browse/SOLR-319), which
can help with the synonym+stemming problem described in
http://search-lucene.com/m/hg9ri2mDvGk1 .
</quote>

thanks,

will

RE: Synonym and Whitespaces and optional TokenizerFactory

Posted by "Jaeger, Jay - DOT" <Ja...@dot.wi.gov>.

You could presumably do it with solr.PatternTokenizerFactory, with the pattern set to .*, as your <tokenizer>.

Or, maybe, if Solr allows it, you don't use any tokenizer at all?

Or maybe you could use solr.WhitespaceTokenizerFactory, allowing it to split up the words, along with solr.WordDelimiterFilterFactory with catenateWords="1" (and the other parameters set to 0) to put them back together. My guess is that this will not work: once the tokenizer has split up the words, a filter doesn't see them all together after that.

You can use the "analyze" capability on the /solr/admin page to see what will happen under various test scenarios without having to actually load up a bunch of documents.

Then you could use solr.SynonymFilterFactory as a <filter> to do your synonym processing.
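
The first suggestion in this message (PatternTokenizerFactory with the pattern .*) might be sketched as the analyzer chain below. It is an untested sketch, not part of the original message. The field type name text_one_token is hypothetical. group="0" is an added assumption: without a group, PatternTokenizerFactory is understood to treat the pattern as a delimiter to split on, so the sketch asks for the whole match instead. The tokenizerFactory attribute on the synonym filter is also an assumption, added so that multi-word entries in synonyms.txt are read as single tokens too.

      <fieldType name="text_one_token" class="solr.TextField">
        <analyzer>
          <!-- group="0" (assumed): emit the whole match of .* as one token -->
          <tokenizer class="solr.PatternTokenizerFactory" pattern=".*" group="0"/>
          <!-- tokenizerFactory (assumed): keep each synonyms.txt entry whole -->
          <filter class="solr.SynonymFilterFactory"
                  tokenizerFactory="solr.KeywordTokenizerFactory"
                  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        </analyzer>
      </fieldType>

Pasting a sample value into the analysis page mentioned above should show whether the tokenizer really emits a single token before the synonym filter runs.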




Re: Synonym and Whitespaces and optional TokenizerFactory

Posted by Ravi Solr <ra...@gmail.com>.

If you have multi-word synonyms, you could use
tokenizerFactory="solr.KeywordTokenizerFactory" in the
SynonymFilterFactory declaration. This assumes that the tokenizer
for that field keeps each phrase as a single token (achieved by
using solr.KeywordTokenizerFactory instead of the standard
tokenizer). If it does not, you might miss the synonym altogether.
See the configuration below:


      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory"
                tokenizerFactory="solr.KeywordTokenizerFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>


Then you can use synonyms like this (written as a single line in synonyms.txt):

Barack Obama,Barak Obama,Barack H. Obama,Barack Hussein Obama,Barak Hussein Obama => Barack Obama

Ravi Kiran Bhaskar
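
Applied to the "lms" example from the original question, the same approach might look like the sketch below. It is illustrative and untested, not part of the original reply. The field type name text_syn_phrase and the synonyms.txt entry shown in the comment are assumptions. expand="true" is used here, unlike the configuration above, so that each entry in the line is expanded to all of the entries.

      <!-- synonyms.txt (hypothetical entry, one rule per line):
           lms,learning management system -->
      <fieldType name="text_syn_phrase" class="solr.TextField">
        <analyzer>
          <!-- the whole field value becomes a single token -->
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory"
                  tokenizerFactory="solr.KeywordTokenizerFactory"
                  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        </analyzer>
      </fieldType>

Because KeywordTokenizerFactory does not split its input, a synonym only matches when the entire field value (or query string) equals one of the entries. "lms" embedded in a longer sentence would not be expanded by this chain.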

On Thu, Aug 18, 2011 at 3:21 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> How about escaping white\ space?
>
> cheers
>
>> Hmmm, why doesn't the multi word synonym syntax in your
>> synonym.txt handle this case? Or am I missing something
>> totally?
>>
>> Best
>> Erick

Re: Synonym and Whitespaces and optional TokenizerFactory

Posted by Markus Jelsma <ma...@openindex.io>.

How about escaping white\ space?

cheers 

> Hmmm, why doesn't the multi word synonym syntax in your
> synonym.txt handle this case? Or am I missing something
> totally?
> 
> Best
> Erick

Re: Synonym and Whitespaces and optional TokenizerFactory

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, why doesn't the multi word synonym syntax in your
synonym.txt handle this case? Or am I missing something
totally?

Best
Erick
