Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2008/12/11 16:27:44 UTC

[jira] Created: (LUCENE-1488) issues with standardanalyzer on multilingual text

issues with standardanalyzer on multilingual text
-------------------------------------------------

                 Key: LUCENE-1488
                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
             Project: Lucene - Java
          Issue Type: Wish
          Components: contrib/analyzers
            Reporter: Robert Muir
            Priority: Minor


The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking text into words, especially for non-alphabetic scripts. This is because it is unaware of the Unicode word-boundary properties.

I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the JFlex rules and saw that the codepoint range for most of the Thai block had been added to the alphanum specification. Defining exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the boundary properties already stated in the Unicode standard.

In general this kind of behavior is bad in Lucene even for Latin: for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.)

I've got a partially tested StandardAnalyzer replacement that uses ICU's rule-based BreakIterator instead of JFlex. With this approach you can define word boundaries according to the Unicode boundary properties. After getting it into good shape I'd be happy to contribute it to contrib, but I wonder if there's a better solution so that out-of-the-box Lucene will be friendlier to non-ASCII text. Unfortunately it seems JFlex does not support properties such as [\p{Word_Break = Extend}], so this is probably the major barrier.
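For illustration (a minimal sketch, assuming ICU4J is on the classpath, not part of any patch), this is the behavior in question: a UAX #29 word BreakIterator keeps a decomposed letter + combining accent together as one segment, exactly the case where the JFlex grammar splits.

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class DecomposedWordBreak {
        public static void main(String[] args) {
            String text = "un cafe\u0301 noir"; // "cafe" + U+0301 COMBINING ACUTE ACCENT (decomposed e-acute)
            BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
            words.setText(text);
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String segment = text.substring(start, end);
                if (segment.trim().length() > 0) {
                    System.out.println(segment); // "un", "cafe" + accent (kept as one segment), "noir"
                }
            }
        }
    }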

Thanks,
Robert







[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Vilaythong Southavilay (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802568#action_12802568 ] 

Vilaythong Southavilay commented on LUCENE-1488:
------------------------------------------------

I am developing an IR system for Lao. I've been searching for this kind of analyzer to index documents containing languages like Lao, French, and English in a single passage.

I tested it for Lao with Lucene 2.9 and 3.0 using a short passage. It worked correctly for both versions as I expected, especially for segmenting single Lao syllables. I also tried the bigram filter option for two syllables, which worked fine for simple words. The result contained some two-syllable words which do not make sense in Lao, but I guess this is not a big issue. As Robert pointed out (in an email to me), we still need dictionary-based word segmentation for Lao, which could be integrated into ICU and used by this analyzer.

Anyway, thanks for your assistance. This work will be helpful not only for Lao but for other languages as well, because it's good to have a common analyzer for Unicode text.

I'll continue testing it and report any problems I find.



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703645#action_12703645 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

What version of ICU4J are you using? It needs to be >= 4.0.



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716686#action_12716686 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

Here's a simple description of what the current functionality buys you:

All Indic languages (Hindi, Bengali, Tamil, ...) and Middle Eastern languages (Arabic, Hebrew, etc.) will work pretty well here (by that I mean tokenized, normalized, etc.). Most of these Lucene cannot handle correctly with any of the built-in analyzers.

Obviously Lucene already handles European languages quite well, but Unicode still brings some improvements here, e.g. better case folding.

And finally, of course, the situation where you have data in a bunch of these different languages!

In general, the Unicode defaults work quite well for almost all languages, with the exception of CJK and Southeast Asian languages. It's not my intent to really solve those harder cases, only to provide a mechanism for someone else to deal with them if they don't like the defaults.

A great example is the Arabic tokenizer: it should not exist, since the Unicode defaults work great for that language. And it would be silly to think about a HindiTokenizer, BengaliTokenizer, etc., when the Unicode defaults will tokenize those correctly as well.

There's still some annoying complexity here, and any comments are appreciated. Especially tricky is the complexity-performance-maintenance balance: e.g. the case-folding filter could be a lot faster, but then it would have to be updated whenever a new Unicode version is released... Another thing is that I didn't optimize the BMP case anywhere [i.e. everything works at the 32-bit codepoint level to ensure surrogate data works], and I think that's worth considering... something like 99.9% of data is in the BMP :)
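To make the BMP point concrete, the codepoint-level loop in question looks roughly like this (plain JDK, just an illustration): every character, BMP or not, takes the same codePointAt/charCount path.

    public class CodePointLoop {
        public static void main(String[] args) {
            String s = "a\uD800\uDC00b"; // 'a', U+10000 (a supplementary character), 'b'
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);        // joins surrogate pairs into one 32-bit codepoint
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);     // 1 for BMP characters, 2 for supplementary ones
            }
        }
    }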

Thanks,
Robert



[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802596#action_12802596 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

Thanks for sharing those results! Yes, the bigram behavior (right now enabled for Han, Lao, Khmer, and Myanmar) is an attempt to boost relevance in a consistent way, since we do not have dictionary-based word segmentation for those writing systems, only the ability to segment into syllables.

In the next patch I'll make it easier to configure this behavior, and to turn it off if you want, without writing your own analyzer.

I am glad to hear the syllable segmentation algorithm is working well!
The credit really belongs to the PAN Localization Project; I simply implemented the algorithm described here: http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf
You can see the code in Lao.rbbi in the patch. A warning, as it mentions: I am pretty sure Lao numeric digits are not yet working correctly, but hopefully I will fix those too in the next version.




[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726571#action_12726571 ] 

Earwin Burrfoot commented on LUCENE-1488:
-----------------------------------------

bq. There is no morphological processing or any other language-specific functionality in this patch... 
I'm speaking of the stemming in ArabicAnalyzer. Why can't you use its stemming TokenFilter on top of all the ICU goodness from this patch? Everything else ArabicAnalyzer consists of might as well be deleted right after.



[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: LUCENE-1488.txt

Add analysis tests for a few languages to demonstrate what this does.




[jira] Updated: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])
    Fix Version/s: 3.1
         Assignee: Robert Muir
       Issue Type: New Feature  (was: Wish)
          Summary: multilingual analyzer based on icu  (was: issues with standardanalyzer on multilingual text)

Setting a fix version and correcting the description of the issue.



[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Vilaythong Southavilay (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803058#action_12803058 ] 

Vilaythong Southavilay commented on LUCENE-1488:
------------------------------------------------

I tested Lao numbers. It only worked for two-digit numbers (because of the two-syllable segmentation), but the resulting tokens were converted to Arabic digits (instead of Lao). This is not too bad for analyzing heading numbers and ordered lists with fewer than 100 items (the meaning and order are preserved).

In the documents I encountered, most scientific numerals and financial figures (longer strings of digits) were written using Arabic digits.

Nevertheless, recognizing long Lao numbers is a "must-have" to complete this set for Lao.





[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719322#action_12719322 ] 

Earwin Burrfoot commented on LUCENE-1488:
-----------------------------------------

bq. But this can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems arabic text in a language-specific way, which has a huge effect on retrieval quality for Arabic language text.
What about separating word tokenization from morphological processing?



[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: LUCENE-1488.patch

This is the latest copy of my code (in response to a java-user discussion).

Not many changes except TokenStream API updates and work for writing systems with no word separation: Lao, Myanmar, CJK, etc.
For these, the tokenizer does not break text into words but into subwords (syllables), and bigrams of these are indexed.
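To make that concrete, a hypothetical sketch (illustrative only, not the patch code) of forming bigrams over segmented syllables, which is what gets indexed for these scripts:

    import java.util.ArrayList;
    import java.util.List;

    class SyllableBigrams {
        // e.g. ["sa", "bai", "dee"] -> ["sabai", "baidee"] (romanized placeholders)
        static List<String> bigrams(List<String> syllables) {
            List<String> out = new ArrayList<String>();
            for (int i = 0; i + 1 < syllables.size(); i++) {
                out.add(syllables.get(i) + syllables.get(i + 1));
            }
            return out;
        }
    }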




[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655840#action_12655840 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

That's a good idea. You know, currently trying to get it to pass all the StandardAnalyzer unit tests causes some problems, since Lucene has some rather obscure definitions of 'number' (I think IP addresses, etc. are included) which differ dramatically from the basic Unicode definition.

Other things of note:

Instantiating the analyzer takes a long time (a couple of seconds) because ICU must "compile" the rules. I'm not sure of the specifics, but by compile I think that means building a massive FSM or similar based on all the Unicode data. It's possible to precompile the rules into a binary format, but I think this is not currently exposed in ICU.

The Lucene tokenization pipeline makes the implementation a little hairy. I hack around it by tokenizing on whitespace first, then acting as a token filter (just like the Thai analyzer does, which also uses RBBI). I don't think this is really that bad from a linguistic standpoint, because the rare cases where a 'token' can have whitespace inside it (Persian, etc.) need serious muscle somewhere else and should be handled by a language-specific analyzer.
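Schematically, the hack looks like this (an illustrative sketch assuming ICU4J, not the patch itself); the break iterator is built once and reused per token, since building it is the expensive part:

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;
    import java.util.ArrayList;
    import java.util.List;

    class WhitespaceThenICU {
        // Built once (expensive), then reused for every whitespace-delimited token.
        private final BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);

        List<String> subTokens(String whitespaceToken) {
            words.setText(whitespaceToken);
            List<String> out = new ArrayList<String>();
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String piece = whitespaceToken.substring(start, end);
                if (piece.trim().length() > 0) {
                    out.add(piece);
                }
            }
            return out;
        }
    }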

I'll try to get this thing into reasonable shape, at least to document the approach.




[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851714#action_12851714 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

bq. I have a possibly naive question on the bigram filter

It's not naive at all, really! I think we need to do exactly what you suggest.

Change it slightly to behave just like CJKTokenizer (except, of course, working with mixed-language text and supporting UCS-4).




[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: LUCENE-1488.patch

Updated patch; not ready yet, but you can see where I am going.

ICUTokenizer: Breaks text into words according to UAX #29 (Unicode Text Segmentation). Text is first divided at script boundaries so that segmentation can be tailored per writing system; for example, Thai text is segmented with a different method. Both the default and the script-specific rules can be tailored; in the resources folder I have some examples for Southeast Asian scripts, etc. Since I need script boundaries for tailoring anyway, I stuff the ISO 15924 script code constant into the token flags; this could be useful for downstream consumers.

ICUCaseFoldingFilter: Folds case according to Unicode Default Caseless Matching, using full case folding. This may change the length of the token; for example, German sharp s is folded to 'ss' (see the small folding sketch below). This filter interacts with the downstream normalization filter in a special way, so you can provide a hint as to what the desired normalization form will be. In the NFKC or NFKD case it applies the NFKC_Closure set so that you do not have to compute Normalize(Fold(Normalize(Fold(x)))).

ICUDigitFoldingFilter: Standardizes digits from different scripts to the ASCII values 0-9.

ICUFormatFilter: Removes identifier-ignorable codepoints, specifically those from the Format category.

ICUNormalizationFilter: Applies Unicode normalization to the text. This is accelerated with a quick check.

ICUAnalyzer ties all of this together. All of these components should also work correctly with surrogate-pair data.
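For illustration only (not code from the patch, assuming ICU4J on the classpath), the two folding behaviors above amount to simple per-string or per-codepoint mappings:

    import com.ibm.icu.lang.UCharacter;

    public class FoldingDemo {
        public static void main(String[] args) {
            // Full (default, non-Turkic) case folding may grow the token:
            // "Stra\u00DFe" (German sharp s) -> "strasse"
            System.out.println(UCharacter.foldCase("Stra\u00DFe", true));

            // Digit folding: U+0ED2 LAO DIGIT TWO carries the decimal value 2
            System.out.println(Character.digit(0x0ED2, 10));
        }
    }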

Needs more docs and tests. Any comments are appreciated.




[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759731#action_12759731 ] 

Uwe Schindler commented on LUCENE-1488:
---------------------------------------

Hi Robert: If you do a restoreState(), no clearAttributes() is needed beforehand, as restoreState overwrites all attributes. Everything else looks good.
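For context, a schematic TokenFilter sketch (illustrative only, with hypothetical class and field names, not the patch code) of the pattern being discussed:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.util.AttributeSource;

    // Schematic only: buffers one token's state and replays it once.
    final class ReplayOnceFilter extends TokenFilter {
        private AttributeSource.State buffered; // hypothetical field name

        ReplayOnceFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (buffered != null) {
                restoreState(buffered); // overwrites all attributes; no clearAttributes() needed first
                buffered = null;
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            buffered = captureState(); // remember this token's attributes to replay once
            return true;
        }
    }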



[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845917#action_12845917 ] 

Uwe Schindler commented on LUCENE-1488:
---------------------------------------

The attribute looks good! I would only fix toString() to match the default impl by using the syntax variableName + "=" + value, here "code=" + getName(code). This makes AttributeSource.toString() look nice.



[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: LUCENE-1488.txt

Just an update; still more work to be done.

Some of the components are javadoc'ed and have pretty good tests (case folding and normalization). These might be useful to someone in the meantime.

Also added some tests to TestICUAnalyzer for various JIRA issues (LUCENE-1032, LUCENE-1215, LUCENE-1343, LUCENE-1545, etc.) that are solved here.




[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: ICUAnalyzer.patch

I've attached a patch for 'ICUAnalyzer'. I see that some things involving Token have changed since, but I created it before that point.

I stole the unit tests from StandardAnalyzer, added comments as to why certain ones aren't appropriate, and disabled those.

I added some unit tests that demonstrate some of the value: correct analysis of Arabic numerals, Hindi text, decomposed Latin diacritics, Hebrew punctuation, Cantonese and Linear B text outside the BMP, etc.

One issue is that setMaxTokenLength() doesn't work correctly for values > 255, because CharTokenizer has a hardcoded private limit of 255 that I can't override. This is a problem since I use WhitespaceTokenizer first and then break down those tokens with the RBBI.




[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656040#action_12656040 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

As soon as I figure out how to invoke the ICU RBBI compiler, I'll see if I can update the patch with compiled rules so that instantiating this thing is cheap...
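
If I recall the ICU4J API correctly, the compile-once/load-later flow would look roughly like this; treat the two static method names as assumptions until I've verified which ICU releases expose them.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import com.ibm.icu.text.RuleBasedBreakIterator;

public class PrecompiledRulesDemo {
  public static void main(String[] args) throws IOException {
    String rules = "$Letters = [\\p{Letter}]; $Letters+;";   // toy rule set

    // Compile the rule text once (e.g. at build time) into a binary blob...
    ByteArrayOutputStream compiled = new ByteArrayOutputStream();
    RuleBasedBreakIterator.compileRules(rules, compiled);     // assumed entry point

    // ...then instantiate from the binary without re-parsing the rules.
    RuleBasedBreakIterator rbbi = RuleBasedBreakIterator.getInstanceFromCompiledRules(
        new ByteArrayInputStream(compiled.toByteArray()));    // assumed entry point
    rbbi.setText("abc def");
    System.out.println(rbbi.next());   // first boundary
  }
}
{code}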

[jira] Issue Comment Edited: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803061#action_12803061 ] 

Robert Muir edited comment on LUCENE-1488 at 1/20/10 11:46 PM:
---------------------------------------------------------------

Hi, splitting them into two digits is not intentional; it only happens because RBBI rule-chaining is turned off.
So now ໐໑໒໓ stays a single token, which later becomes 0123.

I've written tests for, and fixed, numeric handling in my local copy for Lao, Myanmar, and Khmer. I will post an updated patch with all the improvements soon.

bq. but the result tokens were converted to Arabic numbers (instead of Lao).

Yes, this is intentional: a later filter converts all numeric digits to Arabic digits, so a search will match either form.
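
A rough sketch of that kind of digit folding (not the patch's actual filter, and foldDigits is just an illustrative helper; UCharacter.digit is plain ICU4J):

{code:java}
import com.ibm.icu.lang.UCharacter;

public class DigitFoldDemo {
  /** Replace every decimal digit, in any script, with its ASCII 0-9 equivalent. */
  static String foldDigits(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      int value = UCharacter.digit(cp, 10);   // -1 if not a decimal digit
      if (value >= 0) {
        sb.append((char) ('0' + value));
      } else {
        sb.appendCodePoint(cp);
      }
      i += Character.charCount(cp);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(foldDigits("໐໑໒໓"));   // Lao digits -> "0123"
  }
}
{code}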

      was (Author: rcmuir):
    Hi Vilay, this is not intentional to split them into 2 digits, it is really only because of rbbi rule-chaining turned off.

I've written tests for, and fixed numerics in my local copy for lao, myanmar, and khmer. I will post an updated patch hopefully soon with all the improvements.

  
[jira] Updated: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: LUCENE-1488.patch

Uploading a dump of my workspace so that Uwe can review the new attribute.

[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759735#action_12759735 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

Uwe, thanks for taking a look! I'll fix this.


[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719126#action_12719126 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

Michael, I don't think it will be ready for 2.9; here are some answers to your questions.

Going with your Arabic example:
The only thing this absorbs is language-specific tokenization (like ArabicLetterTokenizer), because, as mentioned, I think that is generally the wrong approach.
But it can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems Arabic text in a language-specific way, which has a huge effect on retrieval quality for Arabic text.

On the other hand, it also does some things the language-specific analyzers don't.

In this specific example, it would be nice if ArabicAnalyzer used the functionality here first and then did its Arabic-specific work!
This functionality normalizes 'Arabic Presentation Forms', handles Arabic digits, and does other things that ArabicAnalyzer doesn't. It will also treat any non-Arabic text in your corpus very nicely!
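
For example, plain NFKC normalization in ICU4J already folds the presentation-form ligatures back to the regular letters (nothing patch-specific in this snippet):

{code:java}
import com.ibm.icu.text.Normalizer;

public class PresentationFormsDemo {
  public static void main(String[] args) {
    String presentation = "\uFEFB";   // ARABIC LIGATURE LAM WITH ALEF, ISOLATED FORM
    String normalized = Normalizer.normalize(presentation, Normalizer.NFKC);
    for (int i = 0; i < normalized.length(); i++) {
      // prints U+0644 (LAM) then U+0627 (ALEF): the regular letters
      System.out.printf("U+%04X%n", (int) normalized.charAt(i));
    }
  }
}
{code}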

Yes, you are correct about the difference from StandardAnalyzer, and I would argue there are tokenization bugs in how StandardAnalyzer handles European languages too; just see LUCENE-1545!

I know StandardAnalyzer does these things. This tokenizer already has some built-in types, such as number. If you want to add more types, it's easy: just make a .txt file with your grammar, create a RuleBasedBreakIterator from it, and pass it to the tokenizer constructor. You will have to subclass the tokenizer's getType() for any new types, though, because RBBI 'types' are really just integer codes in the rule file, and you have to map them to some text such as "WORD".
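
Roughly like this; the grammar below is a toy example, not the patch's actual rules, but the ICU classes and the {200}-style status tags are real RBBI syntax:

{code:java}
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

public class RuleTypesDemo {
  // Toy grammar: runs of letters get rule status 200, runs of digits get 100.
  static final String RULES =
      "$Letters = [\\p{Letter}];\n" +
      "$Digits  = [\\p{Nd}];\n" +
      "$Letters+ {200};\n" +
      "$Digits+  {100};\n";

  public static void main(String[] args) {
    RuleBasedBreakIterator rbbi = new RuleBasedBreakIterator(RULES);
    String text = "abc 123";
    rbbi.setText(text);
    int start = rbbi.first();
    for (int end = rbbi.next(); end != BreakIterator.DONE; start = end, end = rbbi.next()) {
      // getRuleStatus() is the integer tag of the rule that matched the segment;
      // a getType() override would map 200 -> "WORD", 100 -> "NUMBER", and would
      // drop unmatched segments like the space, which come back with status 0.
      System.out.println("'" + text.substring(start, end) + "' status=" + rbbi.getRuleStatus());
    }
  }
}
{code}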

Yes, case folding will work better than lowercasing for a few European languages.
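
A quick illustration of the difference (plain ICU4J): full case folding maps German ß to "ss", which simple lowercasing leaves alone, so "Straße" and "STRASSE" fold to the same term.

{code:java}
import com.ibm.icu.lang.UCharacter;

public class CaseFoldDemo {
  public static void main(String[] args) {
    String word = "Stra\u00DFe";                           // German "Straße"
    System.out.println(word.toLowerCase());                // "straße"  - the sharp s survives
    System.out.println(UCharacter.foldCase(word, true));   // "strasse" - full case folding
  }
}
{code}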


[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719342#action_12719342 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

Earwin, I don't understand your question... 
There is no morphological processing or any other language-specific functionality in this patch...


[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "uday kumar maddigatla (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703598#action_12703598 ] 

uday kumar maddigatla commented on LUCENE-1488:
-----------------------------------------------

Hi,

I am facing the same problem: my document contains English as well as Danish elements.

I tried to use this analyzer, and when I do I get this error:

Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.lucene.analysis.icu.ICUAnalyzer.tokenStream(ICUAnalyzer.java:74)
	at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:48)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:117)
	at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765)
	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:743)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1918)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1895)
	at com.IndexFiles.indexDocs(IndexFiles.java:87)
	at com.IndexFiles.indexDocs(IndexFiles.java:80)
	at com.IndexFiles.main(IndexFiles.java:57)
Caused by: java.lang.IllegalArgumentException: Error 66063 at line 2 column 17
	at com.ibm.icu.text.RBBIRuleScanner.error(RBBIRuleScanner.java:505)
	at com.ibm.icu.text.RBBIRuleScanner.scanSet(RBBIRuleScanner.java:1047)
	at com.ibm.icu.text.RBBIRuleScanner.doParseActions(RBBIRuleScanner.java:484)
	at com.ibm.icu.text.RBBIRuleScanner.parse(RBBIRuleScanner.java:912)
	at com.ibm.icu.text.RBBIRuleBuilder.compileRules(RBBIRuleBuilder.java:298)
	at com.ibm.icu.text.RuleBasedBreakIterator.compileRules(RuleBasedBreakIterator.java:316)
	at com.ibm.icu.text.RuleBasedBreakIterator.<init>(RuleBasedBreakIterator.java:71)
	at org.apache.lucene.analysis.icu.ICUBreakIterator.<init>(ICUBreakIterator.java:53)
	at org.apache.lucene.analysis.icu.ICUBreakIterator.<init>(ICUBreakIterator.java:45)
	at org.apache.lucene.analysis.icu.ICUTokenizer.<clinit>(ICUTokenizer.java:58)
	... 12 more

Please help me with this.

[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845951#action_12845951 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

Thanks for the review, Uwe! Moving forward...

[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803061#action_12803061 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

Hi Vilay, splitting them into two digits is not intentional; it only happens because RBBI rule-chaining is turned off.

I've written tests for, and fixed, numeric handling in my local copy for Lao, Myanmar, and Khmer. I will post an updated patch with all the improvements soon.


[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726670#action_12726670 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

Earwin, you are absolutely correct.

Though I would also want to keep the ArabicNormalizationFilter, as it does "non-standard" normalization that is usually helpful for Arabic text.

[jira] Issue Comment Edited: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752331#action_12752331 ] 

Robert Muir edited comment on LUCENE-1488 at 9/7/09 10:09 PM:
--------------------------------------------------------------

This is the latest copy of my code (in response to the java-user discussion).

Not many changes, apart from tokenstream changes and work for writing systems with no word separation: Lao, Myanmar, CJK, etc.
For these, the tokenizer does not break text into words but into subwords (syllables), and unigrams and bigrams of these are indexed.
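
The unigram-plus-bigram step can be pictured with the stock ShingleFilter; this is only an equivalent sketch (the patch has its own filter), and WhitespaceTokenizer here just stands in for the syllable tokenizer:

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public class SyllableBigramSketch extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Placeholder: the real tokenizer emits syllables, not whitespace chunks.
    TokenStream ts = new WhitespaceTokenizer(reader);
    // maxShingleSize = 2 with unigram output on (the default), so both
    // single syllables and adjacent pairs end up in the index.
    return new ShingleFilter(ts, 2);
  }
}
{code}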


      was (Author: rcmuir):
    this is latest copy of my code (in response to java-user discussion).

not many changes except tokenstream changes and work for writing systems with no word separation: lao, myanmar, cjk, etc.
for these, the tokenizer does not break text into words, but subwords (syllables), and bigrams of these are indexed.

  
[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785054#action_12785054 ] 

Robert Muir commented on LUCENE-1488:
-------------------------------------

DM, I really appreciate your review. You have brought up some good ideas that I haven't yet thought about.

bq. All I see is a bit of JavaDoc and an extraneous unused variable (ICUTokenizer: private PositionIncrementAttribute posIncAtt

Yeah, there are some TODOs, plus cleanup of the tokenstreams and the API in general. It's not yet as easy to customize as it's supposed to be: you as a user should be able to supply BreakIterator impls to the tokenizer and say "use these rules/dictionary/whatever for tokenizing XYZ script only".

bq. I'm wondering whether it would make sense to have multiple representations of a token with the same position in the index. Specifically, transliterations and case-folding. That is, the one is a "synonym" for the other. Is that possible and does it make sense? I'm imagining a use case where a end user enters for a search request a Latin script transliteration of Greek "uios" but might also enter "υιος".

Yeah, this is something to consider. I don't think it makes sense for the case-folding filter, but maybe for the transform filter? I will have to think about it.
There are use cases like the one you mentioned, and also real-world ones like Serbian-Latin, where you want users to be able to search in either writing system and there actually is a clearly defined transformation.
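
For what it's worth, a rough sketch of what such a "same position" synonym filter could look like; this is not in the patch, and the class name and the choice of the Greek-Latin transform are made up for illustration:

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import com.ibm.icu.text.Transliterator;

/** Sketch: also emit a Latin transliteration of each token at the same position. */
public final class TransliterationSynonymFilter extends TokenFilter {
  private final TermAttribute termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posAtt =
      (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  private final Transliterator translit = Transliterator.getInstance("Greek-Latin");
  private String pending;   // transliterated form waiting to be emitted
  private State state;      // saved attributes of the original token

  public TransliterationSynonymFilter(TokenStream input) { super(input); }

  public boolean incrementToken() throws IOException {
    if (pending != null) {                 // emit the synonym, stacked on the original
      restoreState(state);
      termAtt.setTermBuffer(pending);
      posAtt.setPositionIncrement(0);      // same position as the previous token
      pending = null;
      return true;
    }
    if (!input.incrementToken()) return false;
    String folded = translit.transliterate(termAtt.term());
    if (!folded.equals(termAtt.term())) {  // only stack when the form actually differs
      pending = folded;
      state = captureState();
    }
    return true;
  }
}
{code}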

I guess on the other hand, you could always use a separate field (with different analysis/transforms) for each and search both.

bq. The other question on my mind is that given a text of German, Greek and Hebrew (three distinct scripts) does it make sense to apply stop words to them based on script? And should stop words be normalized on load with the ICUNormalizationFilter? Or is it a given that they work as is?

You could put them all in one list with the regular StopFilter right now. They won't clash, since they are different Unicode strings. Obviously I would normalize this list with the same steps (normalization form, case folding, whatever) that your analyzer uses.

I don't put any stopwords in this, because that is language-dependent; I am trying to stick to language-independent behavior (either things that apply to Unicode as a whole, or to specific writing systems, which can be detected accurately).

bq. Can/How does all this integrate with stemmers?

Right, this is just supposed to do what "StandardTokenizer"-type stuff does, and you would add stemming on top of it. The idea is that you would use this even if you think you only have English text, then maybe apply your Porter English stemmer. But if it happens to stumble upon some CJK or Thai or something along the way, everything will still be OK.
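
As a sketch of that layering: the ICUTokenizer and ICUNormalizationFilter constructor signatures below are assumptions about the patch's API, while PorterStemFilter is the stock Lucene filter.

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUTokenizer;            // from the patch
import org.apache.lucene.analysis.icu.ICUNormalizationFilter;  // from the patch

public class EnglishOnTopOfICU extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new ICUTokenizer(reader);        // assumed constructor signature
    ts = new ICUNormalizationFilter(ts);              // assumed constructor signature
    return new PorterStemFilter(ts);                  // language-specific step on top
  }
}
{code}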

In all honesty, I probably put 90% of the work into the Khmer, Myanmar, Lao, etc. cases. I think good tokenization alone makes a usable search engine; for a lot of languages stemming is only a bonus.

However, one thing it also does is put the script value in the flags of each token. This can work pretty well: if it's Greek script, it's probably Greek language; but if it's Hebrew script, well, it could be Yiddish too, and if it's Latin script, it could be English, German, etc. It's intended only to make life easier, since the information is already available... but I don't know yet how to make use of it in a nice way.
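
The script tag itself comes straight from ICU data; for example (plain ICU4J, class name made up, and the script-to-language guess is only the heuristic described above):

{code:java}
import com.ibm.icu.lang.UScript;

public class ScriptTagDemo {
  public static void main(String[] args) {
    String[] tokens = { "word", "λόγος", "\u05D3\u05D1\u05E8" /* Hebrew */ };
    for (String t : tokens) {
      int script = UScript.getScript(t.codePointAt(0));          // script of the first code point
      System.out.println(t + " -> " + UScript.getName(script));  // Latin, Greek, Hebrew
    }
  }
}
{code}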

bq. Again, many thanks! (Btw, special thanks for this working with 2.9 and Java 1.4!)

Yeah, I haven't updated it to Java 5/Lucene 3.x yet; I started working on it but kind of forgot about it so far. I guess this is a good thing, so you can play with it if you want.



[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785036#action_12785036 ] 

DM Smith commented on LUCENE-1488:
----------------------------------

Robert, just finished reviewing the code. Looks great! Doesn't look like there's too much left. All I see is a bit of JavaDoc and an extraneous unused variable (ICUTokenizer: private PositionIncrementAttribute posIncAtt;)

The documentation in ICUNormalizationFilter is very instructive. Kudos. The only part that's hard for me to understand is the filter order dependency, but then again that's a hard topic in the first place.

I'm wondering whether it would make sense to have multiple representations of a token at the same position in the index; specifically, transliterations and case foldings, where the one is a "synonym" for the other. Is that possible, and does it make sense? I'm imagining a use case where an end user enters a Latin-script transliteration of Greek, "uios", as a search request, but might also enter "υιος".

The other question on my mind: given a text of German, Greek, and Hebrew (three distinct scripts), does it make sense to apply stop words to them based on script? And should stop words be normalized on load with the ICUNormalizationFilter, or is it a given that they work as-is?

Can/How does all this integrate with stemmers?

Again, many thanks! (Btw, special thanks for this working with 2.9 and Java 1.4!)



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655808#action_12655808 ] 

Grant Ingersoll commented on LUCENE-1488:
-----------------------------------------

Very interesting, Robert.  I'd like to see your patch.  I don't think we need to think of it as a Std. Analyzer replacement, but I could totally see offering it as the ICUAnalyzer or some other better name.  In other words, I'd approach this as another Analyzer in the arsenal of Analyzers; otherwise we'll have to deal with back-compatibility issues, etc.

[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719109#action_12719109 ] 

Michael McCandless commented on LUCENE-1488:
--------------------------------------------

ICUAnalyzer looks very useful!  Good work Robert.  (And, thanks!).

Do you think this'll be ready to go in time for 2.9 (which we are
trying to wrap up soonish)?

It seems like this absorbs the functionality of many of Lucene's
current analyzers.  EG you mentioned ArabicAnalyzer already.  What
other analyzers (eg in contrib/analyzers/*) would you say are
logically subsumed by this?

Also, this seems quite different from StandardAnalyzer, in that it
focuses entirely on doing "good" tokenization, by relying on the
Unicode standard (defaults) instead of fixed char ranges in
StandardAnalyzer.  So it fixes many bugs in how StandardAnalyzer
tokenizes, especially on non-European languages.

Also, StandardAnalyzer goes beyond producing the initial tokens: it
also tries to label things as acronym, host name, number, etc., and it
tries to filter out stop words.

I assume ICUCaseFoldingFilter logically subsumes LowercaseFilter?

bq. Especially tricky is the complexity-performance-maintenance balance, i.e. the case-folding filter could be a lot faster, but then it would have to be updated when a new unicode version is released.

I think it's fine to worry about this later.  Correctness is more
important than performance at this point.


[jira] Commented: (LUCENE-1488) multilingual analyzer based on icu

Posted by "David Bowen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851713#action_12851713 ] 

David Bowen commented on LUCENE-1488:
-------------------------------------

I have a possibly naive question on the bigram filter: why would you want to index the individual one-character tokens as well as the bigrams?  The CJK Tokenizer just emits the bigrams.  Wouldn't indexing and searching on the unigrams as well as the bigrams just slow down search?



[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: LUCENE-1488.patch

Here I complete Lao support (fully implementing http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf).
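
The patch drives tokenization from custom RBBI grammars. Purely as an illustration of the break-iterator approach (not the grammar shipped in the patch, and with made-up sample text), ICU4J's BreakIterator can be asked to walk the boundaries it finds in Lao text like this:

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

// Illustration only: print the segments ICU's stock word BreakIterator
// reports for a short Lao string.  The analyzer in the patch plugs custom
// RBBI rules into a tokenizer rather than using the stock word instance.
public class LaoBreakDemo {
  public static void main(String[] args) {
    String text = "ສະບາຍດີທຸກຄົນ"; // sample Lao text
    BreakIterator bi = BreakIterator.getWordInstance(new ULocale("lo"));
    bi.setText(text);
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
      String token = text.substring(start, end);
      if (token.trim().length() > 0) { // skip whitespace-only segments
        System.out.println(token);
      }
    }
  }
}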

Also fixes a TokenStream bug (not a back-compat issue!) in the bigram filter.
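
The comment doesn't say what the TokenStream bug was, so the following is only a generic sketch of the contract such a filter has to respect (Lucene 3.1-style attribute API), not the contrib filter or the actual fix; the class name and the choice to stack the bigram at position increment 0 are mine, and offset handling is omitted to keep the sketch short:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Generic sketch: after each token, also emit previous+current as a bigram
// stacked on the same position.  The usual contract pitfalls are noted below.
public final class NaiveBigramFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  private String previous;      // last unigram seen
  private String pendingBigram; // bigram waiting to be emitted

  public NaiveBigramFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Drain any buffered token *before* pulling from the input again.
    if (pendingBigram != null) {
      clearAttributes();                 // always clear stale attribute state first
      termAtt.setEmpty().append(pendingBigram);
      posIncAtt.setPositionIncrement(0); // same position as the unigram just emitted
      pendingBigram = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;                      // nothing buffered, stream is done
    }
    String current = termAtt.toString();
    if (previous != null) {
      pendingBigram = previous + current;
    }
    previous = current;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();                       // reset buffered state along with the stream
    previous = null;
    pendingBigram = null;
  }
}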

I think all the language/Unicode features are done. Basically, we can get better language support from ICU automatically in the future, but I think all languages are handled in a reasonable way for now.

IMHO all that is left is:
* fix the docs, improve the tests, polish the Java API and RBBI grammars, address remaining bugs and TODOs
* decide if we want to merge this with the collation contrib (I think it might be a good idea)
* test various versions of ICU to know which ones it works with

It works and the tests pass, but some tests are slow (10+ seconds, though I made them faster).
That said, these slow tests have found bugs and will help test version compatibility, so I like them.


> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards to breaking text into words, especially with respect to non-alphabetic scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be working until i looked at the jflex rules and saw that codepoint range for most of the Thai block was added to the alphanum specification. defining the exact codepoint ranges like this for every language could help with the problem but you'd basically be reimplementing the bounds properties already stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even latin, for instance, the analyzer will break words around accent marks in decomposed form. While most latin letter + accent combinations have composed forms in unicode, some do not. (this is also an issue for asciifoldingfilter i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based BreakIterator instead of jflex. Using this method you can define word boundaries according to the unicode bounds properties. After getting it into some good shape i'd be happy to contribute it for contrib but I wonder if theres a better solution so that out of box lucene will be more friendly to non-ASCII text. Unfortunately it seems jflex does not support use of these properties such as [\p{Word_Break = Extend}] so this is probably the major barrier.
> Thanks,
> Robert
