Posted to solr-user@lucene.apache.org by Jason Brown <Ja...@sjp.co.uk> on 2010/11/26 11:15:24 UTC

Synonym Filtering on String Fields

I have the following field type set up in my schema. The idea is to fire phrases of text such as 'fund manager summary' (without the quotes) at it, have the synonym processing recognise the phrase, and add the rest of its synonyms from my synonym file (example below) to the index. This is index-time synonym processing with expansion.

<fieldType name="synonymstring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>


In synonyms.txt:

fund manager summary, fund manager report
guide, product guide

I run into 2 issues:

(1) After analysing the field in Solr, I find that both

fund manager summary
fund manager report

are NOT getting picked up by the synonym filter (after processing I just get the source term output from the synonym filter).

(2) If I analyse guide, I do get product and guide (*2) output from the synonym filter, but as separate terms (3 terms in total). I expected it to generate just 1 additional term, i.e. product guide.

It seems that it is able to pick up a single word and output two (as separate terms), but it fails to pick up multiple words.

Can anyone help? (Incidentally, when I use this approach on a Solr text field type it all works fine, but I can't use a Solr text field type for this as I use this field for faceting.)



If you wish to view the St. James's Place email disclaimer, please use the link below

http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer

RE: Synonym Filtering on String Fields

Posted by Jason Brown <Ja...@sjp.co.uk>.
Thanks Erick - I do want exactly that: multiple terms generated from my string field, i.e.

I want the single term fund manager summary to be turned into 2 terms -> fund manager summary, fund manager report
I want the single term guide to be turned into 2 terms -> guide, product guide

I am using "term" synonymously with what will be in the index. (I appreciate the outputs of the synonym filter won't be stored per se, just added as terms to the index.)

The problem is that I am doing this on the field I described in my original post and was having trouble with the multi-word terms. The behaviour is:

guide is getting turned into 3 terms: guide, product, guide (I only want 2: guide and product guide)
fund manager summary and fund manager report had no effect on the synonym filter; the output was the same as the input.

I need these as strings (I don't search on this field, it's just for faceting); I have another text field which I do the search on.

I will give Ahmet's comments a go. Thanks All.



-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Fri 26/11/2010 14:16
To: solr-user@lucene.apache.org
Subject: Re: Synonym Filtering on String Fields
 
Besides Ahmet's comments, I have to wonder if you want to do this in a single field?
The problem is that you're expanding your synonyms into a field. Let's say you expand "memory" into "memory", "recall" and "RAM". Now you have three tokens in your field. What does faceting mean now? Perhaps you would be better off using the <copyField> directive to make a field for faceting and use Solr text type for your searchable field? Of course this may be waaaay off base....

About your point (1), you say synonyms aren't getting picked up. You might be getting fooled by seeing the stored value. Look in the admin page under "schema browser" to see the terms in the index, which would have the synonyms. Just selecting the document via search will only show you the stored values which would NOT have the synonyms.

Best
Erick


Re: Synonym Filtering on String Fields

Posted by Erick Erickson <er...@gmail.com>.
Besides Ahmet's comments, I have to wonder if you want to do this in a single field?
The problem is that you're expanding your synonyms into a field. Let's say you expand "memory" into "memory", "recall" and "RAM". Now you have three tokens in your field. What does faceting mean now? Perhaps you would be better off using the <copyField> directive to make a field for faceting and use Solr text type for your searchable field? Of course this may be waaaay off base....
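
A rough sketch of the copyField arrangement described above, as it might appear in schema.xml. The field names are hypothetical, and the "text" and "string" types are assumed to be the ones defined in the example schema; this is untested.

```xml
<!-- Hypothetical field names; adjust to your own schema. -->
<!-- Analysed field (tokenized, synonyms and so on) used for searching. -->
<field name="doctype_text" type="text" indexed="true" stored="true"/>
<!-- Untokenized field used only for faceting, so each value stays whole. -->
<field name="doctype_facet" type="string" indexed="true" stored="false"/>
<!-- Copy the raw input of the search field into the facet field at index time. -->
<copyField source="doctype_text" dest="doctype_facet"/>
```

Searches would then go against doctype_text and facet.field would point at doctype_facet.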

About your point (1), you say synonyms aren't getting picked up. You might be getting fooled by seeing the stored value. Look in the admin page under "schema browser" to see the terms in the index, which would have the synonyms. Just selecting the document via search will only show you the stored values which would NOT have the synonyms.

Best
Erick


Re: Synonym Filtering on String Fields

Posted by Ahmet Arslan <io...@yahoo.com>.
Either of two things can be done.

1-) You can use the tokenizerFactory attribute of the synonym filter factory:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true" tokenizerFactory="KeywordTokenizerFactory"/>


2-) You can escape the white space in synonyms.txt:

fund\ manager\ summary, fund\ manager\ report
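
For reference, option 1 applied to the field type from the original post might look like the sketch below. It is untested. It assumes the cause of both symptoms is that the synonym filter factory splits the entries of synonyms.txt on white space unless told otherwise. On that assumption, 'fund manager summary' is read as a three-token rule that can never match the single token produced by KeywordTokenizerFactory, and 'product guide' is emitted as two tokens. The tokenizerFactory value is copied as given in option 1.

```xml
<!-- Sketch only: the original "synonymstring" type with option 1 applied. -->
<fieldType name="synonymstring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- tokenizerFactory controls how each entry in synonyms.txt is split;
         with a keyword tokenizer a multi-word entry should stay one token. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"
            tokenizerFactory="KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

If the analysis page still shows no match, it may be worth removing the spaces after the commas in synonyms.txt, since I am not certain that leading white space in an entry is trimmed when a keyword tokenizer is used.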


