Posted to dev@lucene.apache.org by "Adrian Nistor (JIRA)" <ji...@apache.org> on 2013/06/24 16:16:20 UTC

[jira] [Created] (LUCENE-5076) Incorrect behavior for TestLaoBreakIterator.isWord()

Adrian Nistor created LUCENE-5076:
-------------------------------------

             Summary: Incorrect behavior for TestLaoBreakIterator.isWord()
                 Key: LUCENE-5076
                 URL: https://issues.apache.org/jira/browse/LUCENE-5076
             Project: Lucene - Core
          Issue Type: Bug
    Affects Versions: 4.3.1
         Environment: any
            Reporter: Adrian Nistor


The incorrect behavior appears in version 4.3.1 and in revision
1496055.

Method "TestLaoBreakIterator.isWord" contains this loop:

{code:java|borderStyle=solid}
for (int i = start; i < end; i += UTF16.getCharCount(codepoint)) {
    codepoint = UTF16.charAt(text, 0, end, start);

    if (UCharacter.isLetterOrDigit(codepoint))
        return true;
}
{code}

The loop body never uses "i": it always reads the code point at
offset "start", so every iteration examines the same character.  This
looks incorrect.  I think the code inside the loop should use "i",
e.g., pass "i" as the offset when reading the character.

If the intended behavior is to read only one character, then the loop
should not be necessary.
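
To illustrate the difference, here is a minimal, self-contained sketch.  It is not the Lucene code: it assumes the JDK's Character.codePointAt, Character.charCount and Character.isLetterOrDigit as stand-ins for ICU's UTF16.charAt, UTF16.getCharCount and UCharacter.isLetterOrDigit, and the class and method names are hypothetical.  The final "return false" is also an assumption, since the quoted loop does not show what follows it.

```java
// Hypothetical sketch: JDK calls stand in for the ICU calls in the report.
final class IsWordSketch {

    // Mirrors the reported loop: the offset passed in is always "start".
    static boolean isWordBuggy(char[] text, int start, int end) {
        int codepoint;
        for (int i = start; i < end; i += Character.charCount(codepoint)) {
            codepoint = Character.codePointAt(text, start, end); // ignores i
            if (Character.isLetterOrDigit(codepoint))
                return true;
        }
        return false; // assumed fall-through result
    }

    // Proposed change: read the code point at "i" instead of "start".
    static boolean isWordFixed(char[] text, int start, int end) {
        int codepoint;
        for (int i = start; i < end; i += Character.charCount(codepoint)) {
            codepoint = Character.codePointAt(text, i, end);
            if (Character.isLetterOrDigit(codepoint))
                return true;
        }
        return false; // assumed fall-through result
    }

    public static void main(String[] args) {
        char[] text = "-a".toCharArray();
        System.out.println(isWordBuggy(text, 0, text.length)); // false
        System.out.println(isWordFixed(text, 0, text.length)); // true
    }
}
```

With the input "-a", the first variant inspects "-" twice and never reaches "a", while the second variant advances to "a" and reports a word.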

A similar problem appears in method
"BreakIteratorWrapper.BIWrapper.calcStatus" for this loop:

{code:java|borderStyle=solid}
for (int i = begin; i < end; i += UTF16.getCharCount(codepoint)) {
    codepoint = UTF16.charAt(text, 0, end, begin);

    if (UCharacter.isDigit(codepoint))
        return RuleBasedBreakIterator.WORD_NUMBER;
    else if (UCharacter.isLetter(codepoint)) {
        // TODO: try to separately specify ideographic, kana? 
        // [currently all bundled as letter for this case]
        return RuleBasedBreakIterator.WORD_LETTER;
    }
}
{code}

Again, the computation inside the loop does not use "i", which seems
incorrect.  The code always reads the code point at offset "begin",
so every iteration examines the same character.
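
A possible shape for this second loop once it indexes by "i" is sketched below.  As before, this is an illustration under assumptions rather than the actual patch: the JDK's Character methods stand in for ICU's UTF16 and UCharacter, the three status constants are local stand-ins for the RuleBasedBreakIterator constants, and returning WORD_NONE after the loop is assumed because the quoted code does not show it.

```java
// Hypothetical sketch: local constants and JDK calls replace the ICU ones.
final class CalcStatusSketch {
    static final int WORD_NONE = 0;     // stand-in for RuleBasedBreakIterator.WORD_NONE
    static final int WORD_NUMBER = 100; // stand-in for RuleBasedBreakIterator.WORD_NUMBER
    static final int WORD_LETTER = 200; // stand-in for RuleBasedBreakIterator.WORD_LETTER

    // Classifies text[begin, end) by the first digit or letter found.
    static int calcStatus(char[] text, int begin, int end) {
        int codepoint;
        for (int i = begin; i < end; i += Character.charCount(codepoint)) {
            codepoint = Character.codePointAt(text, i, end); // uses i, not begin
            if (Character.isDigit(codepoint))
                return WORD_NUMBER;
            else if (Character.isLetter(codepoint))
                return WORD_LETTER;
        }
        return WORD_NONE; // assumed result when nothing matches
    }

    public static void main(String[] args) {
        char[] text = "-7".toCharArray();
        System.out.println(calcStatus(text, 0, text.length)); // 100
    }
}
```

With the original offset ("begin"), the input "-7" would fall through the loop without ever seeing the digit; indexing by "i" lets the scan reach it.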
