Posted to solr-user@lucene.apache.org by Dirk Högemann <di...@googlemail.com> on 2012/02/06 11:44:22 UTC

Phonetic search and matching

Hi,

I have a question about phonetic search and matching in Solr.
In our application, the entire content of an article is written to a full-text
search field that applies stemming and a phonetic filter (Cologne phonetic,
for German).
This is the relevant part of the configuration for the index analyzer
(search is analogous):

        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
        <filter class="solr.PhoneticFilterFactory"
                encoder="ColognePhonetic" inject="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Unfortunately, this sometimes produces strange, though explainable, matches.
For example:

The content field indexes the following string: Donnerstag von 13 bis 17 Uhr.

A search for "puf" matches this content, because the phonetic filter encodes
"puf" as 13.
(As a consequence, the 13 is then also highlighted.)

Does anyone have an idea how to handle this in a reasonable way, so that a
search for "puf" does not match 13 in the content?

Thanks in advance!

Dirk

Re: Phonetic search and matching

Posted by Erick Erickson <er...@gmail.com>.
Yes, you could do that. I suspect numbers will give you trouble
in any case.

You may be able to search against your non-phonetic field as well,
with a higher boost, so that matches in that field rank first.
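
A rough sketch of what this two-field setup might look like in schema.xml. The
field and type names (content, content_exact, text_phonetic, text_exact) are
hypothetical and not taken from the original configuration, and the
non-phonetic analyzer chain is shortened for illustration:

```xml
<!-- Sketch only: all field and type names here are assumptions -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stemming is kept, but there is no PhoneticFilterFactory in this chain -->
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>

<!-- "text_phonetic" stands for the existing type that contains the phonetic filter -->
<field name="content"       type="text_phonetic" indexed="true" stored="true"/>
<field name="content_exact" type="text_exact"    indexed="true" stored="false"/>

<!-- Copy the article text into the non-phonetic field at index time -->
<copyField source="content" dest="content_exact"/>
```

With the dismax query parser, a setting such as qf=content_exact^5 content
would then weight matches in the non-phonetic field more heavily (the boost
value 5 is arbitrary). Note that this only affects ranking: a phonetic match
such as "puf" against 13 would still be returned, just below exact matches.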

Best
Erick

On Tue, Feb 7, 2012 at 2:30 PM, Dirk Högemann
<di...@googlemail.com> wrote:
> Thanks, Erick.
> Our first idea was to remove numbers with a pattern filter.
> Setting inject to false would have the "same" effect.
> If we want to be able to search for numbers in the content, this solution
> will not work, but another field without phonetic filtering, with searches
> running against both fields, would be OK, right?
>
> Dirk

Re: Phonetic search and matching

Posted by Dirk Högemann <di...@googlemail.com>.
Thanks, Erick.
Our first idea was to remove numbers with a pattern filter.
Setting inject to false would have the "same" effect.
If we want to be able to search for numbers in the content, this solution
will not work, but another field without phonetic filtering, with searches
running against both fields, would be OK, right?
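
A minimal sketch of the number-removal idea, assuming a char filter is used
instead of a token filter so that digits are stripped before tokenizing. The
type name text_phonetic_nonum is hypothetical, and the chain is shortened
compared to the original (no WordDelimiterFilterFactory):

```xml
<!-- Sketch only: hypothetical field type that never indexes digits -->
<fieldType name="text_phonetic_nonum" class="solr.TextField">
  <analyzer>
    <!-- Remove digit runs from the raw text, so "13" never becomes a token -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[0-9]+" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    <filter class="solr.PhoneticFilterFactory"
            encoder="ColognePhonetic" inject="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at query time, a search for 13 against such a
field would produce no tokens at all, which is why a second, non-phonetic
field is needed for number searches.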

Dirk
On 07.02.2012 at 14:01, "Erick Erickson" <er...@gmail.com> wrote:

> What happens if you do NOT inject? Setting  inject="false"
> stores only the phonetic reduction, not the original text. In that
> case your false match on "13" would go away....
>
> Not sure what that means for the rest of your app though.
>
> Best
> Erick

Re: Phonetic search and matching

Posted by Erick Erickson <er...@gmail.com>.
What happens if you do NOT inject? Setting inject="false"
stores only the phonetic reduction, not the original text. In that
case your false match on "13" would go away.
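
As a sketch of this suggestion, only the inject attribute of the phonetic
filter from the original configuration would change. How a purely numeric
token such as 13 is handled when the encoder cannot produce a code for it may
depend on the Solr and encoder versions, so it is worth checking the resulting
tokens on the Solr analysis page before relying on this:

```xml
<!-- Sketch only: same filter as in the original post, with inject switched off -->
<filter class="solr.PhoneticFilterFactory"
        encoder="ColognePhonetic" inject="false"/>
```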

Not sure what that means for the rest of your app though.

Best
Erick
