Posted to solr-user@lucene.apache.org by Daniel Persson <ma...@gmail.com> on 2012/04/19 10:25:23 UTC

Abbreviations with KeywordTokenizerFactory

Hi Solr users.

I'm trying to create an index of geographic data to search with Solr.

I have a problem with searches that contain abbreviations.

At the moment I use an index analyzer with

      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory" />
      </analyzer>

This is because my searches currently need to match full keywords to
give correct hits and ranking.

I have other tokenizers for other types of searches.

The problem I have now is with streets that have names like

East Saint James Street.

This could be abbreviated as

E St James St

Does anyone have a suggestion for what to try?

My guess was to use synonyms, but that seems to work only with
WhitespaceTokenizer. I've thought about PatternReplaceCharFilter, but that
would take a lot of rules to cover all abbreviations.

Best regards

Daniel

Re: Abbreviations with KeywordTokenizerFactory

Posted by Erick Erickson <er...@gmail.com>.
Yeah, this is a pretty ugly problem. You have two
problems, neither of which is all that amenable to
simple solutions.

1> Context at index time. "St", in your example, is
    either Saint or Street. Solr has nothing built
    in to distinguish the two, so you need to do some
    processing "somewhere else" to get the proper
    substitutions.
2> Query time. Same issue, but you have virtually no
    context to figure this out...
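
To make the context problem concrete, here is a rough sketch of
position-based substitution using PatternReplaceCharFilterFactory. It is
a heuristic under stated assumptions, not a tested configuration: the
three rules (trailing "St" means Street, "St" followed by another word
means Saint, a leading "E" means East) are illustrative guesses, and the
same char filters would have to be applied at both index and query time.

      <analyzer>
        <!-- Rule 1 (assumed): "St" at the very end of the input is "Street" -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(?i)\bSt\.?\s*$" replacement="Street"/>
        <!-- Rule 2 (assumed): any remaining "St" followed by a word is "Saint" -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(?i)\bSt\.?\s+(?=\S)" replacement="Saint "/>
        <!-- Rule 3 (assumed): a leading "E" is "East" -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(?i)^E\.?\s+" replacement="East "/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>

With these rules "E St James St" becomes "East Saint James Street", but
"Main St East" becomes "Main Saint East", which is exactly the kind of
mistake that position alone cannot rule out.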

But, it is NOT the case that "Synonyms only work
with Whitespace tokenizer". Synonyms will work
with any tokenizer; the catch is that the tokens
produced have to match when they get to the
SynonymFilter. Even KeywordTokenizer will
"work with synonyms", with the caveat that it emits
the whole input as one token, so in practice
you'd have to have single-word input....
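
A minimal sketch of that caveat, not a tested configuration. It assumes
a Solr version whose SynonymFilterFactory accepts the tokenizerFactory
attribute, and the file name street_synonyms.txt is hypothetical. Since
KeywordTokenizer emits the whole field value as a single token, each
synonym rule has to cover a complete street name:

      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <!-- tokenizerFactory: read each side of a rule as one whole token -->
        <filter class="solr.SynonymFilterFactory"
                synonyms="street_synonyms.txt"
                ignoreCase="true" expand="false"
                tokenizerFactory="solr.KeywordTokenizerFactory"/>
      </analyzer>

The matching synonyms file would then hold one rule per full name:

      # street_synonyms.txt (hypothetical)
      e st james st => east saint james street

Every abbreviated form of every street needs its own line, so this is
only practical if the list can be generated from the data.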

The admin/analysis page will help you
see how all this fits together. For instance,
if you have the stemmer _before_ the
synonym filter, and your original input contains, say,
"story", by the time it gets to the synonym filter, the
word being matched will be something like "stori".
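
A minimal sketch of the ordering point, assuming a whitespace-tokenized
text field and a hypothetical synonyms.txt: with the synonym filter
placed ahead of the stemmer, it sees "story" rather than "stori", so
entries in the file can be written in their unstemmed form.

      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Synonyms first: matches are made against unstemmed tokens -->
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <!-- Stemming last: applied to originals and injected synonyms alike -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>

If the stemmer has to come first for other reasons, the entries in the
synonyms file would need to be written in their stemmed form instead.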

But even getting synonyms working with other
tokenizers won't help you with the context problem....

Best
Erick
