
Posted to dev@lucene.apache.org by "Andreas Hauser (JIRA)" <ji...@apache.org> on 2009/02/20 16:22:01 UTC

[jira] Created: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E
-------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1545
                 URL: https://issues.apache.org/jira/browse/LUCENE-1545
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.4
         Environment: Linux x86_64, Sun Java 1.6
            Reporter: Andreas Hauser
             Fix For: 2.9


The standard analyzer does not correctly tokenize the combining character U+0364 COMBINING LATIN SMALL LETTER E.
The word "moͤchte" is incorrectly split into two tokens, "mo" and "chte", and the combining character is lost.
The expected result is a single token, "moͤchte".
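
A minimal sketch of why a letter-based rule would break the word at this position, using only the JDK's `Character` class. It does not reproduce StandardTokenizer's grammar; the class and method names (`CombiningMarkCheck`, `isNonSpacingMark`, `firstNonLetterIndex`) are hypothetical and exist only for illustration. U+0364 has the general category Mn (non-spacing mark), so it is not a letter, and any rule that accepts only letters ends the token in front of it.

```java
public class CombiningMarkCheck {

    /** True if the code point has Unicode general category Mn (non-spacing mark). */
    public static boolean isNonSpacingMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    /** Returns the char index of the first code point that is not a letter, or -1 if there is none. */
    public static int firstNonLetterIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isLetter(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // The word from the report, with U+0364 written as an escape.
        String word = "mo\u0364chte";
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            System.out.printf("U+%04X letter=%b nonSpacingMark=%b%n",
                    cp, Character.isLetter(cp), isNonSpacingMark(cp));
            i += Character.charCount(cp);
        }
        System.out.println("first non-letter at index " + firstNonLetterIndex(word));
    }
}
```

The first non-letter sits at index 2, directly after "mo", which is where the reported split occurs.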

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718825#action_12718825 ] 

Robert Muir commented on LUCENE-1545:
-------------------------------------

Michael, I don't see a way to do it in the JFlex manual.

It's not just the rules but also the JRE used to compile the rules (and its underlying Unicode definitions), so you might need separate StandardTokenizerImpl classes to really control the thing...



[jira] Updated: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

Posted by "Andreas Hauser (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Hauser updated LUCENE-1545:
-----------------------------------

    Attachment: AnalyzerTest.java

$ java -Dfile.encoding=UTF-8 -cp lib/lucene-core-2.4-20090219.021329-1.jar:. AnalyzerTest    
(mo,0,2,type=<ALPHANUM>)
(chte,3,7,type=<ALPHANUM>)
(m,8,9,type=<ALPHANUM>)
(mo,10,12,type=<ALPHANUM>)
(chte,13,17,type=<ALPHANUM>)
$ locale
LANG=de_DE.UTF-8
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE=de_DE.UTF-8
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES=de_DE.UTF-8
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=




[jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718795#action_12718795 ] 

Michael McCandless commented on LUCENE-1545:
--------------------------------------------

> but if you want, i'm willing to come up with some minor grammar changes for StandardAnalyzer that could help things like this.

Is it possible to conditionalize, at runtime, certain parts of a JFlex grammar? I.e., with matchVersion (LUCENE-1684) we could preserve back-compat on this issue, but I'm not sure how to cleanly push that matchVersion (provided at runtime to StandardAnalyzer's constructor) "down" into the grammar so that, e.g., we're not forced to make a new full copy of the grammar for each fix. (Though perhaps that's an OK solution, since it would make it easy to strongly guarantee back-compat.)
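
A rough sketch of the "one full copy per fix" option, in plain Java with no Lucene or JFlex classes. All names here (`VersionedTokenizerSketch`, `CompatVersion`, `TokenRule`, `LegacyRule`, `MarkAwareRule`) are hypothetical, and each rule class only stands in for a separately generated scanner. The version is passed to the constructor at runtime and selects one frozen implementation, so the old behaviour never changes underneath existing indexes.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionedTokenizerSketch {

    /** Hypothetical stand-in for the matchVersion idea from LUCENE-1684. */
    public enum CompatVersion { V24, V31 }

    /** Hypothetical rule interface; each implementation stands in for one frozen grammar copy. */
    public interface TokenRule {
        boolean isTokenChar(int codePoint, boolean insideToken);
    }

    /** Old behaviour: only letters and digits belong to a token. */
    static final class LegacyRule implements TokenRule {
        public boolean isTokenChar(int codePoint, boolean insideToken) {
            return Character.isLetterOrDigit(codePoint);
        }
    }

    /** New behaviour: a non-spacing mark that follows a token character stays attached. */
    static final class MarkAwareRule implements TokenRule {
        public boolean isTokenChar(int codePoint, boolean insideToken) {
            return Character.isLetterOrDigit(codePoint)
                    || (insideToken && Character.getType(codePoint) == Character.NON_SPACING_MARK);
        }
    }

    private final TokenRule rule;

    /** The version arrives at runtime and picks exactly one implementation. */
    public VersionedTokenizerSketch(CompatVersion version) {
        this.rule = (version == CompatVersion.V24) ? new LegacyRule() : new MarkAwareRule();
    }

    /** Splits the text into maximal runs of code points accepted by the selected rule. */
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (rule.isTokenChar(cp, current.length() > 0)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "mo\u0364chte";
        System.out.println(new VersionedTokenizerSketch(CompatVersion.V24).tokenize(word));
        System.out.println(new VersionedTokenizerSketch(CompatVersion.V31).tokenize(word));
    }
}
```

With `V24` the word is still split into "mo" and "chte"; with `V31` it comes back as one token. The cost of this design is duplicated rule code, which is the trade-off raised in the comment above.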



[jira] Updated: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1545:
---------------------------------------

    Fix Version/s:     (was: 3.0)
                   3.1

Mark, when we push, we should push to 3.1 not 3.0 (I just added a 3.1 version to Jira for Lucene)... because 3.0 will come quickly after 2.9 and will "only" remove deprecations, etc.



[jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718300#action_12718300 ] 

Robert Muir commented on LUCENE-1545:
-------------------------------------

If you are looking for a more short-term solution (since I think LUCENE-1488 will take quite a bit more time), it would be possible to make StandardAnalyzer more 'unicode-friendly'.

It's not possible to make it 'correct', and adding additional Unicode friendliness would make backwards compatibility a much more complex issue (different Unicode versions across JVM versions, etc.).

But if you want, I'm willing to come up with some minor grammar changes for StandardAnalyzer that could help things like this.




[jira] Updated: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1545:
--------------------------------

         Priority: Minor  (was: Major)
    Fix Version/s:     (was: 2.9)
                   3.0

Feel free to switch back, but for now I'm going to mark this as part of LUCENE-1488, as offhand, that looks like the best solution for this issue. As that issue is not marked 2.9 at the moment, I'm pushing this off to 3.0.



[jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675381#action_12675381 ] 

Robert Muir commented on LUCENE-1545:
-------------------------------------

This is an example of why I started messing with LUCENE-1488.

