Posted to solr-user@lucene.apache.org by mtdowling <mt...@gmail.com> on 2010/08/17 20:23:38 UTC

Solr synonyms format query time vs index time

My company recently started using Solr for site search and autocomplete. 
It's working great, but we're running into a problem with synonyms.  We are
generating a synonyms.txt file from a database table and using that
synonyms.txt file at index time on a text type field.  Here's an excerpt
from the synonyms file:

reebox => Reebok
shinguards => Shin Guards
shirt => T-Shirt,Shirt
shmak => Shmack
shocks => shox
skateboard => Skate
skateboarding => Skate
skater => Skate
skates => Skate
skating => Skate
skirt => Dresses

When we do a search for reebox, we want the term to be mapped to "Reebok"
through explicit mapping, but for some reason this isn't happening.  We do
have multi-word synonyms, and from what I've read on the mailing list, those
only work at index time, so we are only using the synonym filter factory at
index time:

<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

Here are some more relevant schema.xml settings:

<field name="mashup" type="search" indexed="true" stored="false" multiValued="true"/>
<copyField source="keywords" dest="mashup"/>
<copyField source="category" dest="mashup"/>
<copyField source="name" dest="mashup"/>
<copyField source="brand" dest="mashup"/>
<copyField source="description_overview" dest="mashup"/>
<copyField source="sku" dest="mashup"/>
<!-- other copy fields... -->

The output of the query analyzer shows the following:

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload

So "reebox" is never converted to "Reebok" at query time.  I thought that
with index-time synonyms and expansion configured, I wouldn't need
query-time synonyms.  Maybe the synonyms file I'm generating isn't
formatted correctly for the result I want?

If I use the same synonyms.txt file and use the index analyzer, reebox is
mapped to Reebok and then indexed correctly:

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position 	1
term text 	reebox
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
term position 	1
term text 	Reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
term position 	1
term text 	Reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
term position 	1
term text 	Reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position 	1
term text 	reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
term position 	1
term text 	reebok
term type 	word
source start,end 	0,6
payload 	
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position 	1
term text 	reebok
term type 	word
source start,end 	0,6
payload 	


Should I use equivalent mapping instead of explicit mapping if I'm only
using index-time synonyms?  Or should I turn query time synonyms on for my
search field?
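
To make the two options concrete, here is a rough Python sketch of the
matching logic. It is a toy model, not Solr code: the helper names
(parse_rules, analyze, hits) are made up for illustration, only
single-token synonyms are handled, stemming and stop words are ignored,
and lowercasing up front stands in for ignoreCase="true" plus the
LowerCaseFilterFactory. Equivalent (comma-only) lines are treated as if
expand="true".

```python
def parse_rules(lines):
    """Parse single-token synonyms.txt-style lines into {term: [replacements]}."""
    mapping = {}
    for line in lines:
        if "=>" in line:
            # Explicit form: terms on the left are replaced by terms on the right.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",")]
            targets = [t.strip().lower() for t in right.split(",")]
        else:
            # Equivalent form with expansion: every term maps to all terms.
            sources = targets = [t.strip().lower() for t in line.split(",")]
        for source in sources:
            mapping[source] = list(targets)
    return mapping


def analyze(text, mapping=None):
    """Whitespace-tokenize, lowercase, and optionally apply synonym rules."""
    terms = set()
    for token in text.lower().split():
        terms.update((mapping or {}).get(token, [token]))
    return terms


def hits(query, doc, index_rules=None, query_rules=None):
    """True if any analyzed query term is among the document's indexed terms."""
    return bool(analyze(query, query_rules) & analyze(doc, index_rules))


explicit = parse_rules(["reebox => Reebok"])
equivalent = parse_rules(["reebox, Reebok"])
doc = "Reebok running shoes"

print(hits("reebox", doc, index_rules=explicit))    # False
print(hits("reebox", doc, index_rules=equivalent))  # True
print(hits("reebox", doc, query_rules=explicit))    # True
```

In this model the explicit rule only helps when it is applied to the
query, while the equivalent rule helps at index time because the
document itself picks up the extra term "reebox". Multi-word entries
such as "Shin Guards" are outside what this sketch covers.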

Thanks,
Michael

Re: Solr synonyms format query time vs index time

Posted by Lance Norskog <go...@gmail.com>.

solr/admin/analysis.jsp lets you see how this works. Use the index boxes.

Lance

On Tue, Aug 17, 2010 at 11:56 AM, Steven A Rowe <sa...@syr.edu> wrote:
> Hi Michael,
>
> I think the problem you're seeing is that no document contains "reebox", and you've used the "explicit" syntax (source=>dest) instead of the "equivalent" syntax (term,term,term).
>
> I'm guessing that if you convert your synonym file from:
>
>        reebox => Reebok
>
> to:
>
>        reebox, Reebok
>
> and leave expand=true, and then reindex, everything will work: your indexed documents containing "Reebok" will be made to include "reebox", so queries for "reebox" will produce hits on those documents.
>
> Steve



-- 
Lance Norskog
goksron@gmail.com

RE: Solr synonyms format query time vs index time

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Michael,

I think the problem you're seeing is that no document contains "reebox", and you've used the "explicit" syntax (source=>dest) instead of the "equivalent" syntax (term,term,term). 

I'm guessing that if you convert your synonym file from:

	reebox => Reebok

to:

	reebox, Reebok

and leave expand=true, and then reindex, everything will work: your indexed documents containing "Reebok" will be made to include "reebox", so queries for "reebox" will produce hits on those documents.
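
Since the synonyms file is generated from a database table, the
conversion could be done in the generator. A minimal Python sketch,
assuming the table yields (variant, canonical) pairs; the table layout
is not shown in this thread, so the row shape and the function name
equivalent_lines are hypothetical:

```python
def equivalent_lines(rows):
    """Group (variant, canonical) rows into comma-separated synonym lines."""
    groups = {}
    for variant, canonical in rows:
        # Start each group with the canonical term, then append its variants.
        members = groups.setdefault(canonical, [canonical])
        if variant not in members:
            members.append(variant)
    return [", ".join(members) for members in groups.values()]


rows = [
    ("reebox", "Reebok"),
    ("skater", "Skate"),
    ("skates", "Skate"),
]
lines = equivalent_lines(rows)
print("\n".join(lines))
# Reebok, reebox
# Skate, skater, skates
```

Note that the equivalent form is symmetric: every term on a line matches
every other term. That may not be wanted for category-style rules such
as "skirt => Dresses", which might be better left in explicit form.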

Steve
