Posted to users@solr.apache.org by Peter Tyrrell <PT...@andornot.com> on 2021/06/10 18:57:05 UTC

Approaches to indexing indigenous languages?

I'm quite familiar with indexing English and French languages in Solr, but has anybody got any tips on indexing and querying (Canadian) indigenous First Nations languages? Depending on the language, terms may be written in a syllabic script (https://en.wikipedia.org/wiki/Canadian_Aboriginal_syllabics) or in Americanist phonetic notation (https://en.wikipedia.org/wiki/Americanist_phonetic_notation).


Peter

Peter Tyrrell, MLIS
Lead Developer at Andornot
1-866-266-2525 x706 / ptyrrell@andornot.com

Re: Approaches to indexing indigenous languages?

Posted by Michael Gibney <mi...@michaelgibney.net>.

+1 to ICU, and I'd also be interested in any follow-up. In case
transliteration might also be helpful for your case, I took a cursory
glance at the out-of-the-box transliteration IDs
(https://github.com/unicode-org/icu/tree/main/icu4c/source/data/translit)
and I don't think there's anything for the scripts you're interested
in (but I didn't really know what I was looking for, so you may
want to look yourself). If you _do_ find yourself wanting
transliteration for these scripts and can't find an out-of-the-box
implementation, I'll also note that I _think_ it may be more
straightforward than one might initially assume to write, load,
register, and employ a custom transliterator rule file. I haven't
actually tried this yet, but the possibility occurred to me in the
course of working on LUCENE-8972 and I thought I'd share the idea.
Feel free to reach out if you decide to tackle the custom
transliteration; I have some preliminary ideas about how to proceed
with it.

Michael


On Fri, Jun 11, 2021 at 10:21 AM Alexandre Rafalovitch
<ar...@gmail.com> wrote:

Re: Approaches to indexing indigenous languages?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Hi Peter,

This is a fascinating problem. I would love to see whatever solution
you arrive at fed back to the list.

I think your best bet lies in exploring the ICU4J library, which ships
with Solr (in the analysis-extras contrib) but needs to be enabled in
solrconfig.xml. A little bit is explained at
https://solr.apache.org/guide/8_8/language-analysis.html#unicode-collation
and https://solr.apache.org/guide/8_8/charfilterfactories.html#solr-icunormalizer2charfilterfactory
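
As a rough sketch of what "enabled in solrconfig.xml" can look like on
a Solr 8.x install: the fragment below assumes the stock directory
layout, where the ICU jars live under contrib/analysis-extras. The
relative default path (../../../..) is an assumption and depends on
where the core directory sits, so adjust it to your install.

```xml
<!-- solrconfig.xml fragment (Solr 8.x layout assumed; adjust paths to your install) -->
<config>
  <!-- Third-party jars used by analysis-extras, including icu4j -->
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar" />
  <!-- Lucene analysis modules, including the ICU tokenizer and filters -->
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
  <!-- Solr-side classes such as ICUCollationField -->
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-analysis-extras-\d.*\.jar" />
  <!-- ... rest of solrconfig.xml ... -->
</config>
```

If the jars are not picked up, Solr will typically fail to load the
core with a class-not-found error for the ICU factory named in the
schema, which is a quick way to tell that the paths are wrong.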

After that, it is basically "the shoulders of giants". If you want to
trace where the support actually comes from: ICU4J is the Java
implementation of ICU (International Components for Unicode,
http://site.icu-project.org/), which implements the Unicode standard,
and Unicode covers the script you mention:
https://www.unicode.org/charts/#scripts (Unified Canadian Aboriginal
Syllabics). That seems to imply that word and sentence boundary rules
(which is what I assume you are after) are defined in Unicode,
therefore in ICU, therefore in ICU4J, therefore in Solr.

And that brings us back to finding the right magical invocation. The
specific invocation depends on the exact search problem you are trying
to solve, and on figuring out the language codes/names for your
languages/locales.
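
For a first experiment, a minimal starting point might look like the
field type below. This is only a sketch: the names "text_syllabics"
and "title_syl" are made up for illustration, and whether the folding
filter is appropriate for a given language's orthography is something
to verify against real text.

```xml
<!-- Hypothetical field type; "text_syllabics" is a made-up name for this sketch -->
<fieldType name="text_syllabics" class="solr.TextField" positionIncrementGap="100">
  <!-- Same chain at index and query time -->
  <analyzer>
    <!-- Script-aware word segmentation based on Unicode rules -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization plus case and diacritic folding; may be too
         aggressive for some orthographies, so check it against real data -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the type above -->
<field name="title_syl" type="text_syllabics" indexed="true" stored="true"/>
```

Pasting sample terms into the Analysis screen of the Solr Admin UI
shows exactly which tokens each stage produces, which is the quickest
way to see whether the tokenizer splits words where you expect.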

A very long time ago I did a demo of phonetic search against Thai
text. It is old enough that you should not copy/paste it, but it is
still relevant. This is an excerpt:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34-L55

        <!--
            During indexing:
            1) tokenize Thai text with built-in rules+dictionary
            2) map it to Latin characters (with special accents indicating tones)
            3) get rid of tone marks, as nobody uses them
            4) do some phonetic (BMF) broadening to match possible
               alternative spellings in English

            During querying, we don't want this field type matching
            Thai text on query (BMF is a little too aggressive for that).
            So, we are doing an English-specific query chain
        -->
        <fieldType name="thai_english" class="solr.TextField">
            <analyzer type="index">
                <tokenizer class="solr.ICUTokenizerFactory"/>
                <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
                <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC" />
                <filter class="solr.BeiderMorseFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.BeiderMorseFilterFactory" />
            </analyzer>
        </fieldType>
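
The mark-stripping step from the Thai chain could plausibly be reused
for text in Americanist phonetic notation, where searchers may not
type the diacritics. The sketch below is an assumption-laden
adaptation, not something tested against real data: the field type
name "text_americanist" is hypothetical, and dropping diacritics can
conflate words that are distinct in the language (for example, "č"
would match "c"), so it may be worth keeping an unstripped field
alongside it.

```xml
<!-- Hypothetical diacritic-insensitive field type for Latin-based phonetic notation -->
<fieldType name="text_americanist" class="solr.TextField" positionIncrementGap="100">
  <!-- Same chain at index and query time, so both sides are stripped alike -->
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Decompose, drop combining marks, recompose (same transform as the Thai demo) -->
    <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Only combining marks are removed here; letters that are separate code
points without a decomposition are left as they are.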

Hope this helps,
    Alex.
P.S. If you make progress but still get stuck, feel free to reach out
directly as well. I am in Montreal, and the question resonated with me.
