Posted to solr-user@lucene.apache.org by Christian Zambrano <cz...@gmail.com> on 2009/09/11 07:53:26 UTC

What Tokenizerfactory/TokenFilterFactory can/should I use so a search for "wal mart" matches "walmart"(quotes not included in search or index)?

There are a lot of company names whose correct spelling people are
unsure of. A few examples are:
1. best buy, bestbuy
2. walmart, wal mart, wal-mart
3. Holiday Inn, HolidayInn

What TokenizerFactory and/or TokenFilterFactory should I use so that
somebody typing "wal mart" (quotes not included) will find both "wal mart"
and "walmart" (again, quotes not included)?

Thanks,

Christian

Re: What Tokenizerfactory/TokenFilterFactory can/should I use so a search for "wal mart" matches "walmart"(quotes not included in search or index)?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Sep 11, 2009 at 11:23 AM, Christian Zambrano <cz...@gmail.com> wrote:

> There are a lot of company names whose correct spelling people are
> unsure of. A few examples are:
> 1. best buy, bestbuy
> 2. walmart, wal mart, wal-mart
> 3. Holiday Inn, HolidayInn
>
> What TokenizerFactory and/or TokenFilterFactory should I use so that
> somebody typing "wal mart" (quotes not included) will find both "wal mart"
> and "walmart" (again, quotes not included)?
>
>
Look at the intra-word delimiters section of the SolrRelevancyCookbook.
WordDelimiterFilterFactory can help here.

http://wiki.apache.org/solr/SolrRelevancyCookbook#head-353fcfa33e5c4a0a5959aa3d8d33c5a3a61f2683

If you need to provide spelling suggestions, see the SpellCheckComponent:

http://wiki.apache.org/solr/SpellCheckComponent

-- 
Regards,
Shalin Shekhar Mangar.

Re: What Tokenizerfactory/TokenFilterFactory can/should I use so a search for "wal mart" matches "walmart"(quotes not included in search or index)?

Posted by Christian Zambrano <cz...@gmail.com>.
Ahmet,

Thanks a lot. Your suggestion was really helpful. I had tried using synonyms
before and for some reason it didn't work, but this time around it did.

On 09/11/2009 02:55 AM, AHMET ARSLAN wrote:
>> There are a lot of company names whose correct spelling
>> people are unsure of. A few examples are:
>> 1. best buy, bestbuy
>> 2. walmart, wal mart, wal-mart
>> 3. Holiday Inn, HolidayInn
>>
>> What TokenizerFactory and/or TokenFilterFactory should I
>> use so that somebody typing "wal mart" (quotes not included)
>> will find both "wal mart" and "walmart" (again, quotes not
>> included)?
>>      
> I faced a similar requirement before. I solved it by hardcoding those names in synonyms_index.txt and using SynonymFilterFactory at index time.
>
> synonyms_index.txt will contain:
>
> best buy, bestbuy
> walmart, wal mart
> Holiday Inn, HolidayInn
>
> <!-- Index time: expand each synonym group so every variant is indexed -->
> <analyzer type="index">
>    <tokenizer class="solr.StandardTokenizerFactory" />
>    <filter class="solr.LowerCaseFilterFactory" />
>    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true" />
> </analyzer>
> <!-- Query time: no synonym filter, the expansion already happened at index time -->
> <analyzer type="query">
>    <tokenizer class="solr.StandardTokenizerFactory" />
>    <filter class="solr.LowerCaseFilterFactory" />
> </analyzer>
>
> Since the Solr wiki [1] advises using index-time synonyms when dealing with multi-word synonyms, I am using index-time synonym expansion only.
>
> [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46
>
> With StandardAnalyzer, wal-mart is broken into two tokens, wal and mart, so you don't need to list the hyphenated forms in synonyms_index.txt.
>
>
> If all of your examples were similar to HolidayInn, you could use solr.WordDelimiterFilterFactory (without writing all these company names to a file), but it can't handle "wal mart" vs. "walmart": it works on one token at a time, and there is no delimiter or case change inside "walmart" to split on.
>
> Hope this helps.

Re: What Tokenizerfactory/TokenFilterFactory can/should I use so a search for "wal mart" matches "walmart"(quotes not included in search or index)?

Posted by AHMET ARSLAN <io...@yahoo.com>.
> There are a lot of company names whose correct spelling
> people are unsure of. A few examples are:
> 1. best buy, bestbuy
> 2. walmart, wal mart, wal-mart
> 3. Holiday Inn, HolidayInn
> 
> What TokenizerFactory and/or TokenFilterFactory should I
> use so that somebody typing "wal mart" (quotes not included)
> will find both "wal mart" and "walmart" (again, quotes not
> included)?

I faced a similar requirement before. I solved it by hardcoding those names in synonyms_index.txt and using SynonymFilterFactory at index time.

synonyms_index.txt will contain:

best buy, bestbuy
walmart, wal mart
Holiday Inn, HolidayInn

<!-- Index time: expand each synonym group so every variant is indexed -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true" />
</analyzer>
<!-- Query time: no synonym filter, the expansion already happened at index time -->
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
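
For context, here is a rough sketch of where the two analyzers above would sit in schema.xml. The names text_company and company_name are hypothetical placeholders chosen for illustration, and the indexed/stored settings are assumptions rather than anything this setup requires:

```xml
<!-- Sketch only: "text_company" and "company_name" are made-up names -->
<fieldType name="text_company" class="solr.TextField" positionIncrementGap="100">
  <!-- the <analyzer type="index"> and <analyzer type="query"> elements shown above go here -->
</fieldType>

<!-- A field that uses the type; indexed/stored values are assumptions -->
<field name="company_name" type="text_company" indexed="true" stored="true" />
```

Because the expansion happens at index time, documents need to be reindexed after any change to synonyms_index.txt before the new entries take effect.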

Since the Solr wiki [1] advises using index-time synonyms when dealing with multi-word synonyms, I am using index-time synonym expansion only.

[1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

With StandardAnalyzer, wal-mart is broken into two tokens, wal and mart, so you don't need to list the hyphenated forms in synonyms_index.txt.


If all of your examples were similar to HolidayInn, you could use solr.WordDelimiterFilterFactory (without writing all these company names to a file), but it can't handle "wal mart" vs. "walmart": it works on one token at a time, and there is no delimiter or case change inside "walmart" to split on.
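
For the HolidayInn-style cases, a WordDelimiterFilterFactory setup might look roughly like the sketch below. The type name text_wdf is a hypothetical placeholder, and the attribute values are one plausible combination rather than a recommendation. The sketch assumes WhitespaceTokenizerFactory so that the hyphen in wal-mart actually reaches the filter, and it puts the lowercase filter after the word delimiter filter so that case changes are still visible when splitting:

```xml
<!-- Sketch only: "text_wdf" is a made-up type name -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Whitespace tokenizer leaves "wal-mart" and "HolidayInn" as single tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- Split on hyphens and case changes, and also index the joined form -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- Same splitting at query time, but without adding joined forms -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

With something like this, an indexed HolidayInn or wal-mart should also be findable as holiday inn or wal mart, but an indexed walmart still would not match a search for wal mart, which is why the synonym file is needed for those names.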

Hope this helps.