Posted to solr-user@lucene.apache.org by mtdowling <mt...@gmail.com> on 2010/08/17 20:23:38 UTC
Solr synonyms format query time vs index time
My company recently started using Solr for site search and autocomplete.
It's working great, but we're running into a problem with synonyms. We are
generating a synonyms.txt file from a database table and using that
synonyms.txt file at index time on a text type field. Here's an excerpt
from the synonyms file:
reebox => Reebok
shinguards => Shin Guards
shirt => T-Shirt,Shirt
shmak => Shmack
shocks => shox
skateboard => Skate
skateboarding => Skate
skater => Skate
skates => Skate
skating => Skate
skirt => Dresses
When we do a search for reebox, we want the term to be mapped to "Reebok"
through explicit mapping, but for some reason this isn't happening. We do
have multi-word synonyms, and from what I've read on the mailing list, those
only work at index time, so we are only using the synonym filter factory at
index time:
<fieldType name="search" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Here are some more relevant schema.xml settings:
<field name="mashup" type="search" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="keywords" dest="mashup"/>
<copyField source="category" dest="mashup"/>
<copyField source="name" dest="mashup"/>
<copyField source="brand" dest="mashup"/>
<copyField source="description_overview" dest="mashup"/>
<copyField source="sku" dest="mashup"/>
<!-- other copy fields... -->
The output of the query analyzer shows the following:
Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position     1
  term text         reebox
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  term position     1
  term text         reebox
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
  term position     1
  term text         reebox
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.LowerCaseFilterFactory {}
  term position     1
  term text         reebox
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
  term position     1
  term text         reebox
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position     1
  term text         reebox
  term type         word
  source start,end  0,6
  payload
So "reebox" is never converted to "Reebok" at query time. I thought that
with index-time synonyms and expansion configured, I wouldn't need
query-time synonyms. Maybe my generated synonyms file isn't formatted
correctly for the result I want?
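For reference, the two synonyms.txt line formats can be produced from the same rows. The sketch below is only an illustration: the post does not show the database table, so the (source, targets) row shape and the function name to_synonym_lines are assumptions, not the actual generator.

```python
def to_synonym_lines(rows, equivalent=False):
    """Render synonym rows as synonyms.txt lines.

    rows is a hypothetical shape: an iterable of (source, [targets]) pairs.
    equivalent=False emits explicit mappings:  source => target1,target2
    equivalent=True  emits equivalent groups:  source, target1, target2
    """
    lines = []
    for source, targets in rows:
        if equivalent:
            # Drop targets that only differ from the source by case,
            # since the filter in this thread uses ignoreCase="true".
            others = [t for t in targets if t.lower() != source.lower()]
            lines.append(", ".join([source] + others))
        else:
            lines.append(f"{source} => {','.join(targets)}")
    return lines


# Rows taken from the excerpt earlier in this message.
rows = [("reebox", ["Reebok"]), ("shirt", ["T-Shirt", "Shirt"])]
explicit_lines = to_synonym_lines(rows)
equivalent_lines = to_synonym_lines(rows, equivalent=True)
```

Note that the equivalent form is symmetric: a line such as "skirt, Dresses" would also make "Dresses" expand to "skirt", which may or may not be wanted for every row.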
If I use the same synonyms.txt file and use the index analyzer, reebox is
mapped to Reebok and then indexed correctly:
Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position     1
  term text         reebox
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
  term position     1
  term text         Reebok
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  term position     1
  term text         Reebok
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
  term position     1
  term text         Reebok
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.LowerCaseFilterFactory {}
  term position     1
  term text         reebok
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
  term position     1
  term text         reebok
  term type         word
  source start,end  0,6
  payload
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position     1
  term text         reebok
  term type         word
  source start,end  0,6
  payload
Should I use equivalent mapping instead of explicit mapping if I'm only
using index-time synonyms? Or should I turn on query-time synonyms for my
search field?
Thanks,
Michael
Re: Solr synonyms format query time vs index time
Posted by Lance Norskog <go...@gmail.com>.
solr/admin/analysis.jsp lets you see how this works. Use the index boxes.
Lance
On Tue, Aug 17, 2010 at 11:56 AM, Steven A Rowe <sa...@syr.edu> wrote:
> Hi Michael,
>
> I think the problem you're seeing is that no document contains "reebox", and you've used the "explicit" syntax (source=>dest) instead of the "equivalent" syntax (term,term,term).
>
> I'm guessing that if you convert your synonym file from:
>
> reebox => Reebok
>
> to:
>
> reebox, Reebok
>
> and leave expand=true, and then reindex, everything will work: your indexed documents containing "Reebok" will be made to include "reebox", so queries for "reebox" will produce hits on those documents.
>
> Steve
--
Lance Norskog
goksron@gmail.com
RE: Solr synonyms format query time vs index time
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Michael,
I think the problem you're seeing is that no document contains "reebox", and you've used the "explicit" syntax (source=>dest) instead of the "equivalent" syntax (term,term,term).
I'm guessing that if you convert your synonym file from:
reebox => Reebok
to:
reebox, Reebok
and leave expand=true, and then reindex, everything will work: your indexed documents containing "Reebok" will be made to include "reebox", so queries for "reebox" will produce hits on those documents.
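To make the difference concrete, here is a minimal sketch of the two syntaxes applied at index time only. It is a simplified model and not Solr's SynonymFilter: it handles single-word terms only, lowercases everything (standing in for ignoreCase plus the LowerCaseFilterFactory), and the function names are made up for illustration.

```python
def parse_synonyms(lines, expand=True):
    """Parse a small subset of the synonyms.txt format into {term: [outputs]}."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by right-hand terms.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",")]
            targets = [t.strip().lower() for t in right.split(",")]
        else:
            # Equivalent terms: with expand=True every term maps to all terms.
            terms = [t.strip().lower() for t in line.split(",")]
            sources = terms
            targets = terms if expand else terms[:1]
        for source in sources:
            outputs = rules.setdefault(source, [])
            for target in targets:
                if target not in outputs:
                    outputs.append(target)
    return rules


def analyze(text, rules=None):
    """Whitespace-tokenize, lowercase, and apply synonym rules if given."""
    tokens = set()
    for token in text.lower().split():
        tokens.update(rules.get(token, [token]) if rules else [token])
    return tokens


def matches(doc, query, rules):
    """Synonyms are applied to the document only, as in the schema in this thread."""
    return bool(analyze(doc, rules) & analyze(query))


explicit = parse_synonyms(["reebox => Reebok"])
equivalent = parse_synonyms(["reebox, Reebok"])
```

In this model, matches("Reebok running shoes", "reebox", explicit) is false because the explicit rule only fires on documents that contain "reebox", while the same call with the equivalent rules is true because "reebox" is added alongside "reebok" in the indexed tokens.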
Steve