Posted to users@solr.apache.org by Luoni Cornelia <Co...@salt.ch> on 2023/03/16 14:34:52 UTC

phonetic search and accents

Hi,



I'm using Solr for a search in a name database and get the best results using the standard query parser with a phonetic search. The only downside of it is that the phonetic search - as the name says - looks for matches that sound similar. Therefore, if there is a typo in a letter with an accent that changes the pronunciation, there is no match.



Examples:

- Search with Muller doesn't find Müller

- Search with Francois doesn't find François



I'm using the Solr UI for my tests, setting q=phonetic_full_name:Francois for example. I have also tried to do a fuzzy search adding a tilde to the name (phonetic_full_name:Francois~), but that didn't change the result.



Is there a way to use Solr's phonetic search while also adding a mapping between accented and non-accented letters, so that they are treated as equivalent (ç<->c, ü<->u, è<->e, ñ<->n etc)?



Thanks for any tips.


Re: phonetic search and accents

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
There is also a "hammer" of ICUTransformFilterFactory.

For a fun demo, I did phonetic English search against Thai text:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55

Regards,
   Alex.

On Thu, 16 Mar 2023 at 10:51, Mikhail Khludnev <mk...@apache.org> wrote:
>
> Diacritics are handled via
> https://solr.apache.org/guide/solr/latest/indexing-guide/charfilterfactories.html#solr-mappingcharfilterfactory
> Literal phonetic matching is handled well by
> https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#beider-morse-filter
> You may also check the other options at
> https://solr.apache.org/guide/solr/latest/indexing-guide/phonetic-matching.html
> I remember that I had to combine BMPM with Soundex.
> Use the Solr Admin Analysis page to evaluate the results.

Re: phonetic search and accents

Posted by Mikhail Khludnev <mk...@apache.org>.
Diacritics are handled via
https://solr.apache.org/guide/solr/latest/indexing-guide/charfilterfactories.html#solr-mappingcharfilterfactory
Literal phonetic matching is handled well by
https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html#beider-morse-filter
You may also check the other options at
https://solr.apache.org/guide/solr/latest/indexing-guide/phonetic-matching.html
I remember that I had to combine BMPM with Soundex.
Use the Solr Admin Analysis page to evaluate the results.
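
The ordering suggested here (map diacritics first, then apply the phonetic encoder) can be sketched outside Solr. The following is a minimal Python illustration, not Solr's implementation: the mapping table and the function names are hypothetical, and classic Soundex stands in for whichever phonetic encoder the field actually uses.

```python
import unicodedata

# Hypothetical mapping table, similar in spirit to a char-filter mapping file.
ACCENT_MAP = str.maketrans({"ç": "c", "ü": "u", "è": "e", "ñ": "n"})


def fold(text: str) -> str:
    """Apply the explicit mappings, then strip any remaining combining marks."""
    mapped = text.translate(ACCENT_MAP)
    decomposed = unicodedata.normalize("NFKD", mapped)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name: str) -> str:
    """Classic four-character American Soundex over ASCII letters."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            previous = digit
    return (code + "000")[:4]


def phonetic_key(name: str) -> str:
    """Fold accents first, then encode, mirroring char filter -> token filter."""
    return soundex(fold(name))
```

With this ordering, `phonetic_key("François")` and `phonetic_key("Francois")` produce the same key, as do `phonetic_key("Müller")` and `phonetic_key("Muller")`. In Solr itself, the Analysis page shows the equivalent step-by-step output for a real field type.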

-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

Re: phonetic search and accents

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I think the common approach was multi-indexing with increasingly less
precise mappings and searching those alternative fields with different
weights (e.g. by expanding field name aliases to manage those weights).
This is similar to the issue of searching some Asian names, where the
first name and second name may be entered in an unexpected order.

But that will not combine well with any simple sort.
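
As a rough illustration of the weighted multi-field idea, here is a minimal Python sketch that builds request parameters for an eDisMax query. The three field names and their weights are hypothetical, and it uses a plain `qf` list rather than the field aliases mentioned above.

```python
from urllib.parse import urlencode

# Hypothetical field names, ordered from most to least precise analysis.
FIELD_WEIGHTS = {
    "name_exact": 10.0,
    "name_folded": 5.0,
    "name_phonetic": 1.0,
}


def build_params(user_input: str) -> dict:
    """Build eDisMax request parameters that search all weighted fields."""
    qf = " ".join(f"{field}^{weight:g}" for field, weight in FIELD_WEIGHTS.items())
    return {"defType": "edismax", "q": user_input, "qf": qf}


params = build_params("Francois")
query_string = urlencode(params)  # ready to append to a /select URL
```

An exact match then scores highest, a match that only differs by accents comes next, and a purely phonetic match ranks last. The actual weights would need tuning against real name data.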

Regards,
   Alex


Re: phonetic search and accents

Posted by dmitri maziuk <dm...@gmail.com>.
On 2023-03-16 2:40 PM, Mikhail Khludnev wrote:
> Dima, I did a simple exercise with BMPM. It seems it handles these cases
> well.
> BMPM Rocks!!! – Telegraph <https://telegra.ph/BMPM-Rocks-03-16>

Thank you! Now I've something new to play with

D


Re: phonetic search and accents

Posted by Mikhail Khludnev <mk...@apache.org>.
Dima, I did a simple exercise with BMPM. It seems it handles these cases
well.
BMPM Rocks!!! – Telegraph <https://telegra.ph/BMPM-Rocks-03-16>



Re: phonetic search and accents

Posted by dmitri maziuk <dm...@gmail.com>.
On 2023-03-16 10:33 AM, Andy C wrote:
> A perhaps simplistic option would be to map accented letters to their
> unaccented versions using either the ASCII Folding Filter or the ICU
> Folding Filter.

Or the equivalent of
'''
import unicodedata
unicodedata.normalize("NFKD", v).encode("ascii", "ignore").decode()
'''
(v.2 python) when importing into the index.

Unfortunately that's not going to help: we had people complain that 
"Muller" does not find "Mueller" -- which is/was a common English way to 
transcribe "Müller".
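
The gap can be shown in a few lines of Python. This is a minimal sketch: `ascii_fold` wraps the one-liner above, while the umlaut expansion table and the `variants` helper are hypothetical additions for illustration, not something Solr provides under these names.

```python
import unicodedata


def ascii_fold(value: str) -> str:
    """Decompose accented characters and drop everything outside ASCII."""
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()


# Hypothetical extra rule: expand German umlauts as some transcriptions do.
UMLAUT_EXPANSIONS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def variants(value: str) -> set:
    """Return the plain folded form plus the umlaut-expanded folded form."""
    composed = unicodedata.normalize("NFC", value)  # so "ü" is one code point
    return {ascii_fold(composed), ascii_fold(composed.translate(UMLAUT_EXPANSIONS))}
```

Plain folding turns "Müller" into "Muller" and never produces "Mueller", whereas `variants("Müller")` yields both spellings. Indexing several variants like this is one possible workaround, at the cost of a larger index and language-specific rules.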

It gets worse: e.g. with the Slavic "short i" and "open i" that "Zelenskyy"
spells as "-yy". They are two different sounds and neither has a Latin
letter for it. Russians would usually transcribe it as "-iy" because "i"
isn't that "open", but Poles would more likely use a single "-i" because
the "y" is "almost silent".

If anyone knows of a usable implementation of name search in Solr, I 
would very much like to hear about it too because we do have lots of 
name records in our index and genealogy researchers are complaining.

Dima


Re: phonetic search and accents

Posted by Andy C <an...@gmail.com>.
A perhaps simplistic option would be to map accented letters to their
unaccented versions using either the ASCII Folding Filter or the ICU
Folding Filter.

- Andy -
