Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2011/08/03 12:48:28 UTC

[jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078689#comment-13078689 ] 

Robert Muir commented on LUCENE-3358:
-------------------------------------

Remember, things in StandardTokenizer are only bugs if they differ from http://unicode.org/cldr/utility/breaks.jsp

But in the hiragana case, that's definitely a bug in the jflex grammar: we shouldn't be splitting a base character from its combining mark here.
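For reference, the UAX #29 grapheme-cluster rules keep a combining mark attached to its base character, and NFC normalization composes this particular pair into a single precomposed code point. A JDK-only sketch (class name is mine, not from Lucene) illustrating both points:

```java
import java.text.BreakIterator;
import java.text.Normalizer;

public class DakutenCheck {
    public static void main(String[] args) {
        // HIRAGANA LETTER SA (U+3055) + COMBINING KATAKANA-HIRAGANA
        // VOICED SOUND MARK (U+3099)
        String s = "\u3055\u3099";

        // Character (grapheme) boundaries: U+3099 has
        // Grapheme_Cluster_Break=Extend, so the pair is one cluster.
        BreakIterator bi = BreakIterator.getCharacterInstance();
        bi.setText(s);
        int clusters = 0;
        while (bi.next() != BreakIterator.DONE) {
            clusters++;
        }
        System.out.println(clusters); // 1, not 2

        // NFC composes the sequence into precomposed U+3056 (ZA).
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        System.out.println(nfc.equals("\u3056")); // true
    }
}
```

So whatever the tokenizer emits, the two code points form a single user-perceived character; splitting them (or dropping the mark) changes the text.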

> StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3358
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3358
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.3
>            Reporter: Trejkaz
>
> Lucene 3.3 (possibly 3.1 onwards) exhibits less-than-great behaviour when tokenising hiragana if combining marks are in use.
> Here's a unit test:
> {code}
>     @Test
>     public void testHiraganaWithCombiningMarkDakuten() throws Exception
>     {
>         // Hiragana SA (U+3055) followed by the combining mark dakuten (U+3099)
>         TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
>         // Should be kept together.
>         List<String> expectedTokens = Arrays.asList("\u3055\u3099");
>         List<String> actualTokens = new LinkedList<String>();
>         CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>         while (stream.incrementToken())
>         {
>             actualTokens.add(term.toString());
>         }
>         assertEquals("Wrong tokens", expectedTokens, actualTokens);
>     }
> {code}
> This code fails with:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
> {noformat}
> It seems as if the tokeniser is throwing away the combining mark entirely.
> 3.0's behaviour was also undesirable:
> {noformat}
> java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
> {noformat}
> But at least the token was there, so it was possible to write a filter to work around the issue.
> Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run seem to be lumped into a single token (this is a problem in its own right, but I'm not sure whether it's really a bug).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org