Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2018/10/26 10:53:00 UTC

[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

    [ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665014#comment-16665014 ] 

Robert Muir commented on LUCENE-8548:
-------------------------------------

As far as the suggested fix goes, why reinvent the wheel? In Unicode, each character is assigned a script integer value, but there are also special values such as "Common" and "Inherited".

See [https://unicode.org/reports/tr24/] or the ICUTokenizer's ScriptIterator code: [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java#L141]
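
For illustration only, here is a minimal sketch of that idea using ICU4J's
UScript (the class name and sample text are made up for this example; it is
not Nori's or the ICU tokenizer's actual code). It relies on COMMON (0) and
INHERITED (1) sorting below all concrete script codes, the same trick
ScriptIterator uses:

{noformat}
import com.ibm.icu.lang.UScript;

/** Sketch: split text into script runs, treating COMMON/INHERITED code points
 *  (spaces, punctuation, combining diacritics, ...) as compatible with any
 *  surrounding script, similar in spirit to ScriptIterator. */
public class ScriptRunSketch {

  // COMMON and INHERITED never break a run on their own
  static boolean sameScript(int a, int b) {
    return a <= UScript.INHERITED || b <= UScript.INHERITED || a == b;
  }

  public static void main(String[] args) {
    String text = "Ба̀лтичко̄ εἰμί don't";
    int runStart = 0;
    int runScript = UScript.COMMON;
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      int script = UScript.getScript(cp);
      if (!sameScript(runScript, script)) {
        // the script really changed: emit the previous run, start a new one
        System.out.println("[" + text.substring(runStart, i) + "] "
            + UScript.getName(runScript));
        runStart = i;
      }
      if (script > UScript.INHERITED) {
        runScript = script; // remember the concrete script of the current run
      }
      i += Character.charCount(cp);
    }
    System.out.println("[" + text.substring(runStart) + "] "
        + UScript.getName(runScript));
  }
}
{noformat}

With this kind of run detection, the combining marks in the Cyrillic and IPA
examples below stay attached to their base letters instead of forcing a break.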

 

> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
>                 Key: LUCENE-8548
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8548
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don’t (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.
> Workaround: use a character filter to strip combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. Combining diacritics should not trigger token splits. Non-CJK text should be tokenized on spaces and punctuation, not by character type shifts. Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).{noformat}
>  
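
For reference, the workaround mentioned in the quoted issue (strip combining
diacritics before Nori sees the text) could be sketched roughly as below. This
is an illustrative, untested example, not a recommended configuration; it
assumes the nori and analysis-common modules are on the classpath, and the
Greek, Hebrew, and apostrophe cases are still not covered:

{noformat}
import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

/** Sketch of the workaround: delete nonspacing combining marks with a
 *  char filter before KoreanTokenizer (Nori) tokenizes the text. */
public class NoriStripDiacriticsSketch {

  private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{Mn}+");

  public static Analyzer analyzer() {
    return new Analyzer() {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // remove combining marks before the tokenizer runs
        return new PatternReplaceCharFilter(COMBINING_MARKS, "", reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KoreanTokenizer(
            TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
            null,                                    // no user dictionary
            KoreanTokenizer.DecompoundMode.DISCARD,  // default decompound mode
            false);                                  // no unknown unigrams
        return new TokenStreamComponents(tokenizer);
      }
    };
  }
}
{noformat}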





Re: [jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

Posted by Michael Sokolov <ms...@gmail.com>.
I agree w/ Robert; let's not reinvent solutions that are already solved
elsewhere. In an ideal world, wouldn't you want to be able to delegate
tokenization of Latin-script portions to StandardTokenizer? I know that's not
possible today, and I wouldn't derail the work here to try to make it happen,
since it would be a big shift, but personally I'd like to see some more
discussion about composing Tokenizers.
