Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/11/24 15:29:39 UTC

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781950#action_12781950 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Hi Simon, at a glance your patch looks OK.

I wonder, though, whether we should improve both this patch and the LowerCaseFilter patch in the same way.
I have two ideas that might make that easier. I am very inconsistent with these things myself, so I guess we can try to settle on one style.

1.
{code}
for (int i = 0; i < len; i++) {
  // old: if (Character.toLowerCase(text1[off+i]) != text2[i])
  final int codePointAt = Character.codePointAt(text1, off+i);
  if (Character.toLowerCase(codePointAt) != Character.codePointAt(text2, i))
    return false;
  if (codePointAt >= Character.MIN_SUPPLEMENTARY_CODE_POINT) {
    ++i;
  }
}
{code}

I wonder whether loops like this should instead look like this:
{code}
for (int i = 0; i < len; ) {
  ...
  i += Character.charCount(codePointAt);
}
{code}
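
A minimal sketch of how the charCount-based loop might look for the comparison above. The class name CodePointEquals and the method name equalsIgnoreCase are made up for illustration. The sketch assumes, as the patch appears to, that text2 is already lowercased and that len counts chars rather than code points.

```java
final class CodePointEquals {
    // Hypothetical helper, not the actual CharArraySet code.
    // Assumes text2 is already lowercased and len is a length in chars.
    static boolean equalsIgnoreCase(char[] text1, int off, int len, char[] text2) {
        if (len != text2.length) {
            return false;
        }
        final int limit = off + len;
        for (int i = 0; i < len; ) {
            // Read a whole code point, never reading past the compared region.
            final int codePoint = Character.codePointAt(text1, off + i, limit);
            if (Character.toLowerCase(codePoint) != Character.codePointAt(text2, i, len)) {
                return false;
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+10400 (Deseret capital) lowercases to U+10428 (Deseret small).
        char[] upper = new StringBuilder("A").appendCodePoint(0x10400).toString().toCharArray();
        char[] lower = new StringBuilder("a").appendCodePoint(0x10428).toString().toCharArray();
        System.out.println(equalsIgnoreCase(upper, 0, upper.length, lower));
    }
}
```

With this shape there is no separate supplementary check in the loop body; the increment alone decides how far to step.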

2. I wonder whether we even need to add an if (supplementary) check for things like lowercasing.
toLowerCase(char) and toLowerCase(int) are most likely the same code anyway,
so we could skip the check and just make the code easier to read.
{code}
for (int i = 0; i < len; ) {
  i += Character.toChars(
          Character.toLowerCase(
             Character.codePointAt(...)), arr, ...);
}
{code}
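
As a rough illustration of idea 2, here is a sketch that lowercases a char[] region in place without any explicit supplementary check. The class and method names are hypothetical, and the sketch assumes the lowercased code point occupies the same number of chars as the original. Note that Character.toChars(int, char[], int) takes the code point first and returns the number of chars it wrote.

```java
final class CodePointLowerCase {
    // Hypothetical helper, not the actual LowerCaseFilter code.
    // Lowercases buffer[off .. off+len) in place, one code point at a time.
    static void toLowerCase(char[] buffer, int off, int len) {
        final int limit = off + len;
        for (int i = off; i < limit; ) {
            final int lowered =
                Character.toLowerCase(Character.codePointAt(buffer, i, limit));
            // toChars writes 1 or 2 chars at index i and returns how many it wrote.
            i += Character.toChars(lowered, buffer, i);
        }
    }

    public static void main(String[] args) {
        char[] buf = new StringBuilder("A").appendCodePoint(0x10400).append('Z')
            .toString().toCharArray();
        toLowerCase(buf, 0, buf.length);
        System.out.println(new String(buf));
    }
}
```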


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt
>
>
> CharArraySet lowercases its entries if created with the corresponding flag. As a result, a String / char[] containing Unicode 4.0 characters that is in the set cannot be retrieved in "ignorecase" mode.
