Posted to dev@lucene.apache.org by "Trejkaz (JIRA)" <ji...@apache.org> on 2011/08/03 07:09:27 UTC

[jira] [Created] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to
--------------------------------------------------------------------------------------------------------------------

                 Key: LUCENE-3358
                 URL: https://issues.apache.org/jira/browse/LUCENE-3358
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 3.3
            Reporter: Trejkaz


Lucene 3.3 (possibly 3.1 onwards) exhibits poor behaviour when tokenising hiragana if combining marks are in use.

Here's a unit test:

{code}
    @Test
    public void testHiraganaWithCombiningMarkDakuten() throws Exception
    {
        // Hiragana 'sa' (U+3055) followed by the combining mark dakuten (U+3099)
        TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));

        // Should be kept together.
        List<String> expectedTokens = Arrays.asList("\u3055\u3099");
        List<String> actualTokens = new LinkedList<String>();
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        while (stream.incrementToken())
        {
            actualTokens.add(term.toString());
        }

        assertEquals("Wrong tokens", expectedTokens, actualTokens);

    }
{code}

This code fails with:
{noformat}
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
{noformat}

It seems as if the tokeniser is throwing away the combining mark entirely.

3.0's behaviour was also undesirable:
{noformat}
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
{noformat}

But at least the token was there, so it was possible to write a filter to work around the issue.
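(For illustration, outside the Lucene API: with the plain JDK, NFC normalization composes a base character and the combining dakuten into the precomposed form, which is one way such a workaround filter could repair the split tokens. The class name below is mine, not from Lucene.)

```java
import java.text.Normalizer;

public class DakutenCompose {
    public static void main(String[] args) {
        // HIRAGANA LETTER SA (U+3055) + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK (U+3099)
        String decomposed = "\u3055\u3099";
        // NFC composes the pair into the precomposed HIRAGANA LETTER ZA (U+3056)
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals("\u3056")); // true
    }
}
```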

Katakana seems to avoid this particular problem, because all katakana characters and combining marks found in a single run are lumped into a single token (which is a problem in its own right, but I'm not sure it's really a bug.)


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078689#comment-13078689 ] 

Robert Muir commented on LUCENE-3358:
-------------------------------------

Remember, things in StandardTokenizer are only bugs if they differ from http://unicode.org/cldr/utility/breaks.jsp

But in the hiragana case, that's definitely a bug in the JFlex grammar, because we shouldn't be splitting a base character from its combining mark here.
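(For reference, a plain-JDK check of why the split is wrong: U+3099 is a nonspacing combining mark, which the default Unicode segmentation rules treat as extending the preceding character, so it should never be separated from its base.)

```java
public class CombiningMarkCheck {
    public static void main(String[] args) {
        // U+3099 has general category Mn (nonspacing mark); in UAX #29 word breaking
        // such marks carry the Extend property and attach to the preceding character.
        System.out.println(Character.getType('\u3099') == Character.NON_SPACING_MARK); // true
    }
}
```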



[jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078878#comment-13078878 ] 

Steven Rowe commented on LUCENE-3358:
-------------------------------------

+1 Robert's patch looks good.



[jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079578#comment-13079578 ] 

Steven Rowe commented on LUCENE-3358:
-------------------------------------

+1 to commit.  

I applied the patch, then ran 'ant jflex' and 'ant test' in {{modules/analysis/common/}}.  All succeeded.



[jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078726#comment-13078726 ] 

Robert Muir commented on LUCENE-3358:
-------------------------------------

The rules are wrong here for Han also.



[jira] [Updated] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3358:
--------------------------------

    Attachment: LUCENE-3358.patch

Here's a patch, without re-generation or backwards compat yet.

We should fix the URL+Email tokenizer also, and add backwards compatibility for both.



[jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079747#comment-13079747 ] 

Robert Muir commented on LUCENE-3358:
-------------------------------------

{quote}
It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable.
{quote}

I'm not concerned about this. While your users may not like it, I think we should stick with the standard, for these reasons:
# It's not desirable to deviate from the standard here; anyone can customize the behavior to do what they want.
# It's not shown that what you say is true; experiments have been done here (see below), and I would say that as a default, what is happening here is just fine.
# Splitting this katakana up in some non-standard way leaves me with performance concerns about long postings lists for common terms.

{noformat}
For the Japanese collection (Table 4), it is not clear whether bigram generation should have
been done for both Kanji and Katakana characters (left part) or only for Kanji characters
(right part of Table 4). When using title-only queries, the Okapi model provided the best
mean average precision of 0.2972 (bigram on Kanji only) compared to 0.2873 when
generating bigrams on both Kanji and Katakana. This difference is rather small, and is even
smaller in the opposite direction for long queries (0.3510 vs. 0.3523). Based on these results
we cannot infer that for the Japanese language one indexing procedure is always significantly
better than another.
{noformat}

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6738
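(For context, the character-bigram indexing compared in the quoted study can be sketched as follows. This is an illustrative helper of my own, not Lucene's implementation.)

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigrams {
    // Generate overlapping character bigrams from a run of CJK text,
    // the indexing scheme whose precision the study compares.
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("トークン")); // [トー, ーク, クン]
    }
}
```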



[jira] [Updated] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3358:
--------------------------------

    Attachment: LUCENE-3358.patch

Here's a patch with sophisticated backwards compatibility.

I'd like to commit this and open a follow-up issue for the URL+Email one; that one is more complicated and needs to first be ported to Standard's interface.



[jira] [Updated] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3358:
--------------------------------

    Fix Version/s: 3.4



[jira] [Resolved] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3358.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
         Assignee: Robert Muir

Thanks Trejkaz!

I opened LUCENE-3361 for the URL+Email variant.



[jira] [Commented] (LUCENE-3358) StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Posted by "Trejkaz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079727#comment-13079727 ] 

Trejkaz commented on LUCENE-3358:
---------------------------------

Thanks for such a fast fix! :D  (I will still wait for 3.4 because it will make backwards-compat much simpler.)

I am aware of the Unicode word-breaking rules and read the standard through, which is where I discovered that the non-breaking of katakana is part of the standard (which is why I haven't filed a bug or improvement about that as well.)  It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable. When I brought the change up with Japanese users, they were 100% against that behaviour, so it's a wonder that the standard got past the Japanese without any objections (I am, of course, assuming that they actually consulted an expert in the language.)  But breaking it up in a separate filter isn't so hard. It's only a single Unicode block with few combining marks, so the logic is not that difficult, and StandardTokenizer even marks the token as katakana for us.
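(The splitting logic such a filter would need can be sketched with the plain JDK. This is a hypothetical helper of my own; a real solution would wrap it in a TokenFilter, which is omitted here.)

```java
import java.util.ArrayList;
import java.util.List;

public class KatakanaSplitter {
    // Split a katakana run into single-character units, keeping each combining
    // mark (e.g. U+3099 dakuten, U+309A handakuten) attached to the base
    // character immediately before it.
    static List<String> splitKeepingMarks(String run) {
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK && current.length() > 0) {
                current.append(c); // attach the mark to its base character
            } else {
                if (current.length() > 0) units.add(current.toString());
                current = new StringBuilder().append(c);
            }
        }
        if (current.length() > 0) units.add(current.toString());
        return units;
    }

    public static void main(String[] args) {
        // KATAKANA KA (U+30AB) + combining dakuten (U+3099) + KATAKANA TA (U+30BF)
        // yields two units: "ka"+mark together, then "ta" alone.
        System.out.println(splitKeepingMarks("\u30AB\u3099\u30BF"));
    }
}
```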

