You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Simon Willnauer (JIRA)" <ji...@apache.org> on 2009/11/24 15:19:39 UTC

[jira] Created: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Prepare CharArraySet for Unicode 4.0
------------------------------------

                 Key: LUCENE-2094
                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.9.1, 2.9, 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 2.0.0, 1.9, 2.3.3, 2.4.2, 2.9.2, 3.0, 3.0.1, 3.1
            Reporter: Simon Willnauer
             Fix For: 3.1


CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783772#action_12783772 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

It is recommeneded to turn it on. But if its off, it must be also disabled in the QueryParser. Becazuse of that QP now also have a version ctor.

But you are right, Mike preserved the setPositionIncrement method in QP.

At other places like in StandardAnalyzer there is no longer a posIncr setting, and that's good!

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781969#action_12781969 ] 

Simon Willnauer edited comment on LUCENE-2094 at 11/24/09 3:00 PM:
-------------------------------------------------------------------

bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char)
The JDK version of toLowerCase(char) for instance casts to int and calls the overloaded method.
{code}
    public static boolean isLowerCase(char ch) {
        return isLowerCase((int)ch);
    }
{code}

That is the case all over the place as far as I can see.

      was (Author: simonw):
    bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char)
The JDK version of toLowerCase(char) for instance casts to int and calls the overloaded method.
    public static boolean isLowerCase(char ch) {
        return isLowerCase((int)ch);
    }

That is the case all over the place as far as I can see.
  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783954#action_12783954 ] 

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

bq. So you see, I think its dependent upon language and how you want the application to work.

OK, indeed, the issue is not simple -- thanks for the examples!

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797772#action_12797772 ] 

Simon Willnauer edited comment on LUCENE-2094 at 1/7/10 8:07 PM:
-----------------------------------------------------------------

bq. Simon, I think yonik refers to this code in stopfilter itself: 
Thanks god jira lets me edit my comments :)
My X60 was too small to spot the comment about charArraySet and ingoreCase. This is absolutely true - this issue introduced this change and it should 100% use CharArraySet.copy instead of constructing a new CharArraySet

I will create a new issue and upload a patch.

      was (Author: simonw):
    bq. Simon, I think yonik refers to this code in stopfilter itself: 
I see, the problem with this piece of code is that it has the caseinsensitive flag which would be ignored if I would not create such a set though. As far as I can see even previous version did not really do what the javadoc says. 
{code}
  if (stopWords instanceof CharArraySet) {
      this.stopWords = (CharArraySet)stopWords;
    } else {
      this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
      this.stopWords.addAll(stopWords);
    }
{code}

I agree we should prevent this costly operation but it doesn't seem to be easy though. My first impression is to deprecate the ctors which have the ignorecase boolean and fix documentation to use charArraySet if case should be ignored. At the same time we should introduce a getter to charArraySet and only create a new set if the boolean given and the ignorecase member in CharArraySet does not match, provided it is an instance of charArraySet.

This should also be backported to 2.9 / 3.0 to enable solr to at least fix things where possible.


  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783807#action_12783807 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. I'm still failing to see why it shouldn't just always be enabled in the query parser. Solr forces it to always be enabled. Will this cause a bug in any scenarios?

it won't cause any bugs as far as I can see. The root cause for all this compatibility - we try hard to preserve bw compat with version all over the place. The reason for this setter is more or less some kind of "expert convenience" My personal feeling would be to make it always true / let version do it.



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781994#action_12781994 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. It would not hurt, the Set is only used for analyzers that all take a version param... It is not really a public API. 
So the thing here is that lowercasing for supplementary characters does only apply to a hand ful of chars see this link http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ACase_Sensitive%3DTrue%3A]%26[^[\u0000-\uFFFF]]]&esc=on
Those characters are from the Deseret Alphabet (mormons) which means we are introducing a "pain in the neck" Version flag into CharArraySet for about 40 chars which would be broken?! I don't see this here! Nothing personal related to the Deseret Alphabet or anyone who is using it but this seem a bit too much of a hassle. It would make the code very ugly though.

simon

 

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783921#action_12783921 ] 

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

bq. The Version should allow us having the capability to change defaults (while still providing options), not just fix bugs

Right, Version allows us to change defaults and fix bugs w/o breaking back compat.

The patch looks good to me, and I think deprecating the setter makes sense -- being able to specify this on ctor is enough.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781998#action_12781998 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

I would also break compat in LowerCaseFilter and bring out a large NOTE that if you index mormon you need to reindex.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

    Attachment: LUCENE-2094.patch

Changed the StopFilter(..,posInc,..) ctor to private for convenience.



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Assigned: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler reassigned LUCENE-2094:
-------------------------------------

    Assignee: Uwe Schindler

I take this one as communicated in private chat.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783934#action_12783934 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

A godd idea might be to use two StopFilters:
- One with the real stop words that use the Version-default setting for posIncr
- One for the plural suffixes and so on, that should simply be removed. This StopFilter would use false for posIncr.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783871#action_12783871 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Hello, my proposal is still the same:
* we use Version to control StopFilter's *default* behavior
* we deprecate the static getDefault... method and the setter
* we add an explicit, even "expert" if  you want, ctor that still uses Version, but also has this boolean param.

this would mean we do not have to have a boolean in all of our analyzers (it is just Version)
also that there is no setter behavior (i do not like these from a ts reusability perspective)
and finally that people still get to change to non-default behavior for this param if they want.

I'm sorry i havent been able to keep up with this today (busy), but if there's consensus I will create the patch, etc.
I think all we have to do is change one of Simon's ctors from private to public and add javadocs.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781965#action_12781965 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

simon yeah,

I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char)
in trunk ICU there is not even char-based methods, it is all int, where its a trie lookup, with a special fast-path array for linear access to Latin-1


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783868#action_12783868 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Sorry, the commit is not the problem.

We are discussing only about one line, if we deprecate the explicit boolean arg or not. The rest of the patch is not affected, Because of that I committed, because Robert and Simon want to go forward with other analyzer/unicode work.

So this commit does not remove anything. And Version was introduced already in 2.9.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781995#action_12781995 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Another option would be to list a back break in changes:

if you are indexing Deseret language, you should reindex.

we could remove the Version from LowerCaseFilter this way, too.
If you are indexing this language, things werent working right before so you surely wrote your own filters...?!

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783806#action_12783806 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Yes it causes. If you have an old index without posincr, the query parser would produce queries that do not work (we had this issue in 2.9.1 shortly before release, one of the reasons why it was delayed).

The version flag is for backwards compatibility. If you do not reinex with a new Version constant you should use the old version constant everywhere and things will play happy together. Even solr users will have old indexes, and for them there should be a property to specify the version constant (using this valueOf of enums). Solr should then create all components that require a version (and since 3.0 *all* analyzers need this) using this property. And then everything will play wonderful together (anayzers, query parser and so on).

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

    Attachment: LUCENE-2094.patch

updated patch to trunk - uwe on heavy committing

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784175#action_12784175 ] 

DM Smith commented on LUCENE-2094:
----------------------------------

bq. I would like to open another issue for roberts patch. The reason for this is that I feel that issues like that get sidetracked quite often and its hard to follow once this happens. This would make discussions more clear and would help to prevent situations like this.

Just my opinion:

I don't like committing part of an issue. I think that when/if there is a point at which a commit is needed, for whatever reason, and there is more to do or to discuss, the issue needs to be split. I think a JIRA issue should be represented by a single commit.

This issue pertains to making CharSetArray properly handle surrogates when lowercasing. The use case in Lucene are the stop word lists. These are used by the StopFilter, which has an ugliness that needed fixing.

I understand that sometimes more than one thing gets done in an issue because it is to hard to manage as multiple issues. What I call a ripple effect. It appears that this is happening here.

I think changes other than that should be another issue, a sub-issue, or a linked issue? As it stands, Robert's patch, having the same name as Simon's, makes it appear that it supersedes the prior with the same name. It is confusing without the context of reading the thread.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783399#action_12783399 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Uwe, yeah, that is what I was thinking. 
I guess I think an alternate ctor that allows explicit control of this with a boolean is ok,
but I think if you want the "defaults" it should just be with Version.

This really doesn't have a lot to do with Simon's patch but it becomes noticeable now.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783924#action_12783924 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

bq. But if the PhraseQuery is generated with QueryParser also preserving holes, then it works properly?

what is "properly" ?

If I search on english for "book for sale", it will match "books for sale"
this is considered ok for english.

If I am using persian analyzer, such a thing will not work, because the plural form of book (کتاب) is formed by adding an additional word afterwards (کتاب ها).

So the way plural forms get "stemmed" to their singular form in persian is implemented with stopwords (ها is in the list). I think this is a clean simple approach, which is why I did it this way.

For english, its attached to the word with an s... should we bump the posinc gap after stemmed words in english too?

So you see, I think its dependent upon language and how you want the application to work.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784182#action_12784182 ] 

DM Smith commented on LUCENE-2094:
----------------------------------

In reviewing Simon's latest patch, I see a lot of Version.LUCENE_CURRENT. For example, all the default stop word lists are created with LUCENE_CURRENT.

By using LUCENE_CURRENT, it means that the most recent behavior should always be used. That might change in the future. If it does, then it would silently invalidate an index.

Is this right? Seems to me that a specific version should be used.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781950#action_12781950 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Hi simon, at a glance your patch is ok.

I wonder though if we should try to consistently improve both this and LowerCaseFilter patch in the same way.
i have two ideas that might make it easier...? I am very inconsistent with these things myself so I guess we can try to make it consistent.

1.
{code}  
   for(int i=0;i<len;i++) {
        if (Character.toLowerCase(text1[off+i]) != text2[i])
        final int codePointAt = Character.codePointAt(text1, off+i);
        if (Character.toLowerCase(codePointAt) != Character.codePointAt(text2, i))
           return false;
        if(codePointAt >= Character.MIN_SUPPLEMENTARY_CODE_POINT){
          ++i;
         }
      }
{code}

I wonder if instead loops like this should look like
{code}
 for (int i =0; i < len; ) {
  ...
  i += Character.charCount(codepoint);
 }
{code}

2. I wonder if we should even add an if (supplementary) for things like lowercasing.
toLowerCase(ch) and toLowerCase(int) are most likely the same code anyway, 
so we could just make the code easier to read.
{code}
for (int i = 0; i < len; ) {
 i += Character.toChars(arr, ... 
          Character.toLowerCase(
             Character.codePointAt(...)))
}
{code}


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784023#action_12784023 ] 

Mark Miller commented on LUCENE-2094:
-------------------------------------

bq. Revert?

Not from me. I don't think its a huge deal, certainly not something that requires a revert. I just worry some times about the pace of things - and that the more active one becomes, the more they/we should worry about allowing proper time for comments/objections when committing. I like how people have tended to err on the side of waiting for solid consensus myself. Its not a big issue here - but as we move away from that I think it will be. Its something that will spread as new users look at old users when determining how they act.

The more heavy committing one does, the easier I think it is to just decide stuff and cram it in - personally (and I'm just one voice).  The more you do, I think its also more important to allow brief time periods between saying what you are going to do and doing it (though that should always be done). Its easy to say, well we can just change it, or pull it out - but with lazy consensus and how the community works, I think thats conducive to worse code. Its much easier for someone to debate and have questions than it is to hound changes or code out of trunk. In my mind its better if the bottleneck is on the going in, as it has been, rather then shifting things to fixing whats in. Especially if there is debate in an issue still - whether it belongs there or not - I think there should be warning and consensus before a commit.

I realize thats a bit of a tough sell based on this little issue alone - but its a general feeling I've been having as lucene dev has really been ramping up in recent times. I think its important we stick to being conservative about waiting for consensus - giving others a chance to voice their opinion - no matter how sure you are about your decision. I think its an important example for new users, and an important characteristic of Lucene development.

Thats just me though - I don't speak for anyone but myself.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783829#action_12783829 ] 

Yonik Seeley commented on LUCENE-2094:
--------------------------------------

bq. Yes it causes. If you have an old index without posincr, the query parser would produce queries that do not work

Oh, wait, is this because things like StandardAnalyzer changed the default?  Seems like that's where the back comat break should have been addressed... water under the bridge at this point though.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783773#action_12783773 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

bq. its not clear from the Solr issue that added this option either.

Because they wanted to preserve bw compatibility for old indexes. And at this time Version was not available. Newer versions of Solr should just add a property to their factories giving the version (or a global solr option automatically applied to the whole Solr installation; that how I do it in panFMP, my own Solr-like project).

bq. Or if its considered a bug that posInc wasn't working before. 

It is a bug.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

    Attachment: LUCENE-2094.patch

I hope we made it with this patch - don't want to keep this growing. 
I fixed a problem in CharArraySet (equals / getHashCode) with limits which is also the reason why CharacterUtils now has a codePointAt(char[], offset, limit) method.
This patch also moves Version into StopFilter but exposes an expert ctor to set the posInc manually.

happy reviewing

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783848#action_12783848 ] 

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

I think if indeed there are valid reasons to have StopFilter throw away the holes, then, we shouldn't hide this setting behind Version.  Ie, we should keep the explicit setters / separate param to ctor.  So I think that's the question... is it a bug or a feature?



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783922#action_12783922 ] 

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

bq. Mike, in my opinion the holes can have some impact on phrase queries.

But if the PhraseQuery is generated with QueryParser also preserving holes, then it works properly?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781971#action_12781971 ] 

Robert Muir edited comment on LUCENE-2094 at 11/24/09 3:09 PM:
---------------------------------------------------------------

Simon, yeah i just checked.
all the properties, behind the scenes are stored as int.
we shouldn't use any char-based methods pretending it will buy us any faster performance.
it will just make the code ugly and probably slower.

slower meaning, the "if" itself in the lowercasefilter patch, it can now be removed.


      was (Author: rcmuir):
    Simon, yeah i just checked.
all the properties, behind the scenes are stored as int.
we shouldn't use any char-based methods pretending it will buy us any faster performance.
it will just make the code ugly and probably slower.
  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783863#action_12783863 ] 

Mark Miller commented on LUCENE-2094:
-------------------------------------

bq. It is a bug. 

It was never considered a bug before. It was well known - its in Lucene In Action that you can leave gaps if you'd like to.

bq. Committed revision: 885592

Sucks to rush a commit when an issue is under discussion. Easy to say we can come back to this, easy not to. I'm against such heavy committing myself, without some consensus to do so. In the old days, there was a bias towards not committing.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783395#action_12783395 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Hi Simon,

One thing I noticed is with this patch we get:
{code}
public StopFilter(Version matchVersion, boolean enablePositionIncrements, TokenStream input, Set<?> stopWords, boolean ignoreCase)
{code}

I know this is really not related to what you are doing here, but I wonder if instead StopFilter should look like this:
{code}
public StopFilter(Version matchVersion, TokenStream input, Set<?> stopWords, boolean ignoreCase)
{code}

and use matchVersion to determine enablePositionIncrements. 

I think its already wierd how to create a stopfilter, you have to pass version to a static method getEnablePositionIncrementsVersionDefault. I don't think the user should have to pass Version twice:
{code}
new StopFilter(Version.WHATEVER, StopFilter.getEnablePositionIncrementsVersionDefault(Version.WHATEVER), ...)
{code}

I guess i think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797772#action_12797772 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. Simon, I think yonik refers to this code in stopfilter itself: 
I see, the problem with this piece of code is that it has the caseinsensitive flag which would be ignored if I would not create such a set though. As far as I can see even previous version did not really do what the javadoc says. 
{code}
  if (stopWords instanceof CharArraySet) {
      this.stopWords = (CharArraySet)stopWords;
    } else {
      this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
      this.stopWords.addAll(stopWords);
    }
{code}

I agree we should prevent this costly operation but it doesn't seem to be easy though. My first impression is to deprecate the ctors which have the ignorecase boolean and fix documentation to use charArraySet if case should be ignored. At the same time we should introduce a getter to charArraySet and only create a new set if the boolean given and the ignorecase member in CharArraySet does not match, provided it is an instance of charArraySet.



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783396#action_12783396 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Mike didn't wanted to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method or deprecate it and not use it anymore in the code. Instead use the matchVersion everywhere.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783410#action_12783410 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

bq. This is one thing I thought about too - I did not change it to keep the noise as low as possible in the patch but if we want to do it we can do in this patch too.

well I think it will be noisy either way (updating all the analyzers, etc), but will make things a lot more consistent and easier to maintain...
if you do this then StopFilter takes version so it can be modified / bugfixed in the future in other ways too, with less noise.
I also think it will make it easier to write an analyzer.

because even completely ignoring the unicode issue, with the current codebase:

{code}
streams.source = new StandardTokenizer(matchVersion, reader);
streams.result = new StandardFilter(streams.source);
streams.result = new LowerCaseFilter(matchVersion, streams.result);
streams.result = new StopFilter(matchVersion, streams.result, stoptable);
...
{code}

reads a lot easier to me than
{code}
streams.source = new StandardTokenizer(matchVersion, reader);
streams.result = new StandardFilter(streams.source);
streams.result = new LowerCaseFilter(matchVersion, streams.result);
streams.result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                                      streams.result, stoptable);
...
{code}


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781980#action_12781980 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

question of the day - should we use Version or not :)



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783762#action_12783762 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

fwiw Solr uses .setEnablePositionIncrements method to accomplish this.

So to me it doesn't matter, as there is a way to explicity do this, I think?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783938#action_12783938 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

bq. A godd idea might be to use two StopFilters: 

in theory, but sometimes these terms are ambiguous, and the computer (especially a very simple analyzer) does not know which one it is, sometimes it can be both. 

sometimes its a real word too, but on average its better to ignore it.

I don't think we need to go to this effort optimal phrasequeries either. A user who really cares can do this themself... and thats my whole point, they should be able to do something liek what you said, and explicitly say 'no i don't want posIncr for this stopfilter, but yes I'll take the real bugfixes, thanks'

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784025#action_12784025 ] 

Mark Miller commented on LUCENE-2094:
-------------------------------------

bq. The term "heavy committing" is just a joke

I know, I wasn't using it very seriously myself ;)

bq. So I am sorry for lot's of commits today!

No worries - I don't mean to frame anything is a way that you should have to apologize for. Lots of commits are still good from my point view! I just think there should be something of a warning before a commit in an issue that is being actively discussed. 

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783396#action_12783396 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/29/09 12:56 PM:
------------------------------------------------------------------

Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method or deprecate it and not use it anymore in the code. Instead use only matchVersion everywhere and eliminate the enablePosIncr setting at all.

      was (Author: thetaphi):
    Mike didn't wanted to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method or deprecate it and not use it anymore in the code. Instead use the matchVersion everywhere.
  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783895#action_12783895 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

bq. I'm against quick commits like this. 

Revert?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783920#action_12783920 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

I like the patch if it solves this issue and we are all happy again. The updates to javadocs are also fine, the 2.9 thing was missing.

The problem in this issue was, that some participants were not fully informed about the Version parameter at all and that it prevents from breaking backwards. My opinion is: Please also add matchVersion as a factory property for query parser and analyzers! Solr would profit from it, too. Fewer options and you can preserve your config file even after a major Solr update without breaking any existing indexes. That is the lesson out of this issue.

Discussing about a separate get/set for this posIncr stuff is another discussion for a separate issue.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783392#action_12783392 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. Why do you use Version.LUCENE_CURRENT for all predefined stop word sets (ok, they do not need a match version, because they are already lowercased). 

1. the do not ignore case at all so the version will not affect those sets.
2. they are private and we have the full control over the sets. The are all lowercased (as you figured correctly) and none of them contains any supplementary character.
3. The are static and private so passing any usersupplied version is not feasible.

bq. In my opinion the whole stuff is only needed for chararrayssets, which are not already lowercased. So is there any chararrayset in lucene with predefined stop-words, that is not lowercased)?
Either way, if the set is lowercased or not the lowercaseing is also applied to the values checked against the set.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784153#action_12784153 ] 

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

bq. I consider this as a misunderstanding caused by too much instant messaging.

We need to strive to keep technical discussions public, for this reason.  If you ever find yourself discussing technical direction of changes to Lucene, privately, kick yourself and others near you and then take it back to the list :)

You all probably felt that you reached consensus (over IM), but to the rest of us, it suddenly felt like you committed while we were still debating the approach.  Like Mark, I'm certainly not saying anything bad was done, but, it's a dangerous precedent that "consensus reached in private is then acted on in public without corresponding public consensus".

Erring on the side of caution, moving more slowly, when consensus is iffy, is best.  Mark's original idea -- to separately open the question of "hey let's commit this as it stands now and then debate the Version vs separate boolean separately" would've been a good approach, I think.

Also, I think other Apache communities have public chats, where devs/contributors/users can hang out and discuss stuff, and it's well publicized that this is the standard place where things are discussed, anyone can join in, etc.  We could consider taking that step for Lucene?  Those of us who are online can then hang out.  I don't like that it's not archived/searchable (or... is it?  Anyone know what other communities do?  EG, is there a way to set up an IRC channel and have it archived?).  IM clearly is a nice, fast interactive way to discuss things, far better than email / appending to Jira issues.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781962#action_12781962 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Simon definitely, it is not a problem with your patch...
Thinking we can fix both to be clean.

btw, I have no idea if there is any performance difference between doing things this way.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783765#action_12783765 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

...Solr can also use the version to do this.

We removed the posIncr ctors also for a lot of Analyzers (StandardAnalzer,...), so why not also remove (deprecate and remove in 4.0)  from StopFilter? There is another issue open, that says: Remove all per-instance setters and make all filters final (I think it was the hell issue). All parameters should be passed on ctor and that prferably using version. Alternate an ctor only taking booleans, but most of the were removed in 3.0. The only relict in core is StopFilter.

So the ctor taking version should not make posIncr available and the other way round. If you want to control the falgs yourself, create an ctor with posIncr and smartjava5unicode switches (ugly).

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783974#action_12783974 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Uwe, the problem I think is still what DM/Mike said before:

{quote}
I think if indeed there are valid reasons to have StopFilter throw away the holes, then, we shouldn't hide this setting behind Version. Ie, we should keep the explicit setters / separate param to ctor. So I think that's the question... is it a bug or a feature?
{quote}



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783872#action_12783872 ] 

Mark Miller commented on LUCENE-2094:
-------------------------------------

bq. We are discussing only about one line, if we deprecate the explicit boolean arg or not. 

One line thats part of this patch. By committing, you remove incentive to deal with the issue as the patch works in line with this being a bug. Now its in the code, now everyone can go home and forget.

bq. because Robert and Simon want to go forward with other analyzer/unicode work.

Whats the rush? They can do other work without this being in trunk today. Thats not a valid reason for any commit in my mind.

bq. So this commit does not remove anything. And Version was introduced already in 2.9.

I don't think it matters - where is the consensus to do this commit now after discussion around it (one line or not) started? I don't see it.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781964#action_12781964 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. btw, I have no idea if there is any performance difference between doing things this way.
The change to charCount is pretty much the same as the if statement - this at least would not kill any performance.
The increment by 2 should also not be an issue. it is slightly slower than a ++ but this will be fine I guess.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783915#action_12783915 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

And Simon wanted to also work on the "massive code duplication" (LUCENE-2034)  issue which would break this patch and vice versa. You never get these two patches to merge, because the code duplication issue does large refactoring of almost all analyzers. And xmas is coming, so we want to have a nice xmas present for all analyzer writers...

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797763#action_12797763 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Simon, I think yonik refers to this code in stopfilter itself:

{code}
  private StopFilter(Version matchVersion, boolean enablePositionIncrements, TokenStream input, Set<?> stopWords, boolean ignoreCase){
    super(input);
    this.stopWords = CharArraySet.unmodifiableSet(new CharArraySet(matchVersion, stopWords, ignoreCase));
{code}

it seems to me that if you pass a charArraySet to stopfilter, no matter what, you get an expensive copy with "new"...?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783402#action_12783402 ] 

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

bq. I guess i think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity

OK, I agree, let's also push Version down into StopFilter (to get posIncr setting).

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783967#action_12783967 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Oh, you are right! All posIncr ctors are deprecated, the matchVersion ones bringing defaults are the new ones. And you can change this default later - prefect. Just more documentation! :-)

+1 from my side.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783834#action_12783834 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

No they did not break. If you use Version.LUCENE_24 in the ctor of StandardAnalyzer it behaves like in 2.4. Because of that we have Version! We preserver BW comp by *requiring* a matchVersion parameter to *all* ctors of Analyzers.

In 2.9 the deprecated non-version ctors default to version 2.4 (from 3.0 on you *have to* specify the version)

If you use always Version.LUCENE_CURRENT then you have to reindex after each version upgrade.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783926#action_12783926 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 11:08 PM:
------------------------------------------------------------------

bq. But if the PhraseQuery is generated with QueryParser also preserving holes, then it works properly?

Yes, I tested this before 2.9.1 (one reason why you had to respin).

QueryParser also still has the get/set for posIncr but also takes the matchVersion. Here it is the other way round, the ctor uses the default with Version and you can change it by a setter later (which is still not deprecated and available in 3.0).

In my opinion we should go that way (which is against Robert's opinion). The ctor taking two booleans is very bad...

      was (Author: thetaphi):
    bq. But if the PhraseQuery is generated with QueryParser also preserving holes, then it works properly?

Yes, I tested this before 2.9.1 (one reason why you had to respin).

QueryParser also still has the get/set for posIncr but also takes the matchVersion. Here it is the other way round, the ctor uses the default with Version and you can change it by a setter later (which is still not deprecated and available in 3.0).
  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783878#action_12783878 ] 

Mark Miller commented on LUCENE-2094:
-------------------------------------

IMHO the strongest this should have happened is: you propose that the current discussion is not pertinent to committing this patch. You then ask what others think about committing and keeping it open. You then say, if no one objects, you will commit in a day or two. I'm against quick commits like this.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797740#action_12797740 ] 

Yonik Seeley commented on LUCENE-2094:
--------------------------------------

I obviously haven't had the chance to keep up with changes in Lucene... but as part of LUCENE-2195 that caught my eye, I was looking at StopFilter.
In the past, pains were made to keep creation fast - if you passed a CharArraySet, it would be directly used.  It now looks like a new set is created each time (actually two - a CharArraySet and then an UnmodifiableCharArraySet).  This new behavior also no longer matches the javadoc for the constructor.

It looks like it was committed as part of this issue, but I can't find any comments here about either the need to make a copy or the need to make a unmodifiable set.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783855#action_12783855 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

I will commit the patc now and we can later think about undeprecating. Simon wants to go forward with other patches and there are heavy cahnges in it, so I need to do heavy committing.

The discussion should have been here before 2.9, because most individual setters are now removed, this is the really only relict. All others are subsumed under version.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783913#action_12783913 ] 

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

I personally don't like that StopFilter can discard the holes.  It
loses information, that you can never get back, once indexed.

That said, it's clearly not black & white -- enough people feel it's a
feature (not a bug), and should be preserved, so I think we should
preserve it as a standalone option.

But I think we should keep the default as "don't discard the holes".

{quote}
Hello, my proposal is still the same:
  * we use Version to control StopFilter's default behavior
  * we deprecate the static getDefault... method and the setter
  * we add an explicit, even "expert" if you want, ctor that still uses Version, but also has this boolean param.
{quote}

I think this is a good approach!


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783760#action_12783760 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

bq. As discussed before, please deprecate this. The posIncr stuff was deprecated everywhere else too (in 2.9 already).

I think i disagree, only because Solr StopFilterFactory allows the user to explicitly set this.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783770#action_12783770 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Uwe, yeah as long as they have some way to do it.

I guess I don't understand if users view this posInc flag / versioning thing as really itself an option, and the version use is just about having just a better default? Or if its considered a bug that posInc wasn't working before. I think there are some tradeoffs in behavior between the two and I'm not sure one size fits all.

its not clear from the Solr issue that added this option either.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783906#action_12783906 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

bq. Really discussions like this should happen in public.

Actually, what I asked Uwe was, if he could take this issue for me, since I will be busy at work this week and its holding Simon up.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781971#action_12781971 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Simon, yeah i just checked.
all the properties, behind the scenes are stored as int.
we shouldn't use any char-based methods pretending it will buy us any faster performance.
it will just make the code ugly and probably slower.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "DM Smith (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783842#action_12783842 ] 

DM Smith commented on LUCENE-2094:
----------------------------------

bq. If you create the StopFilter with Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, too. Very simple?

Yes. But, IMHO, this seems like advocating that a desired behavior be gained by a backward compatibility mechanism.

I see two problems in using Version to enable position increments (or any other particular behavior):
a) If a prior behavior is desired now, one should not need to use a prior Version to get it.
b) Version codifies a particular combination of behavior. It does not allow for rolling one's combination.

Make that 3 problems:
c) At some point a prior version's behavior will/should be removed.

It seems like this was discussed at length for creating a Settings object. I'd rather see Attribute/AttributeSources used for such a thing than Version.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783761#action_12783761 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. I think i disagree, only because Solr StopFilterFactory allows the user to explicitly set this.
+1

bq.     As discussed before, please deprecate this. The posIncr stuff was deprecated everywhere else too (in 2.9 already).
Except of StopFilter, this class had a none-deprected  posInc constructor. I also thing this one should be accessible and not deprecated.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

    Attachment: LUCENE-2094.patch

This patch uses CharacterUtils and Version to preserve backwards compatibility. It has grown to a very large patch and changes a lot of stuff in core too. I'm not sure if this is the best way to go with the limited usecase in mind. - Only Deseret language has upper / lowercase pairs which are not in the BMP. Yet this could change in the future - who knows that way we could get rid of the deprecated methodes little quicker...

>From a backwards policy perspective we should do it that way.



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781958#action_12781958 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Maybe we put this into UnicodeUtils (handling of toLowerCase etc for char[]).

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783963#action_12783963 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

bq. In my opinion, this is more clear than supplying both in one ctor (they are two params that seem to interact with each other, but they don't!). I would also prefer to initialize StopFilter with the defaults in the ctor, and later change it using setters.

if this is better, then we need not do anything (except I still think we should fix up some minor unrelated javadocs problems i had in the patch). The setter is not deprecated currently.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783904#action_12783904 ] 

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

bq.  Robert, Simon and me were chatting private a lot in parallel and came to the opinion, that we should commit this first and then discuss about this one ctor more.

Really discussions like this should happen in public.

bq. Discussing in this JIRA issue is a pain because of long page loading time.

We can carry it over to java-dev, in general.  I agree page load time gets annoying for big issues...

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783282#action_12783282 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Why do you use Version.LUCENE_CURRENT for all predefined stop word sets (ok, they do not need a match version, because they are already lowercased).

In my opinion the whole stuff is only needed for chararrayssets, which are not already lowercased. So is there any chararrayset in lucene with predefined stop-words, that is not lowercased)?

How about deprecating lowercasing at all and enforcing the stop lists to be lowercased before adding to an chararrayset? For current hard-coded sets, its no problem. And all File/Reader/... params to analyzers with lowercase could be deprecated and the user told to use the new ones which need already lowercased stop word sets.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784189#action_12784189 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. I think changes other than that should be another issue, a sub-issue, or a linked issue? As it stands, Robert's patch, having the same name as Simon's, makes it appear that it supersedes the prior with the same name. It is confusing without the context of reading the thread.
+1  - I created LUCENE-2099 for that purpose and added Roberts latest patch to it.
I will like those two in a second.

bq. By using LUCENE_CURRENT, it means that the most recent behavior should always be used. That might change in the future. If it does, then it would silently invalidate an index.
This has many reasons. The most important one is that the version will not affect those sets as the pass false to ingoreCase and we have full control over the stopwords. But I agree this would be more "secure" if it would use Version.LUCENE-31 just to make sure nobody changes the internal behavior of CharArraySet. I still would expect anybody changing the behavior of this class to revise their usage.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797772#action_12797772 ] 

Simon Willnauer edited comment on LUCENE-2094 at 1/7/10 7:53 PM:
-----------------------------------------------------------------

bq. Simon, I think yonik refers to this code in stopfilter itself: 
I see, the problem with this piece of code is that it has the caseinsensitive flag which would be ignored if I would not create such a set though. As far as I can see even previous version did not really do what the javadoc says. 
{code}
  if (stopWords instanceof CharArraySet) {
      this.stopWords = (CharArraySet)stopWords;
    } else {
      this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
      this.stopWords.addAll(stopWords);
    }
{code}

I agree we should prevent this costly operation but it doesn't seem to be easy though. My first impression is to deprecate the ctors which have the ignorecase boolean and fix documentation to use charArraySet if case should be ignored. At the same time we should introduce a getter to charArraySet and only create a new set if the boolean given and the ignorecase member in CharArraySet does not match, provided it is an instance of charArraySet.

This should also be backported to 2.9 / 3.0 to enable solr to at least fix things where possible.



      was (Author: simonw):
    bq. Simon, I think yonik refers to this code in stopfilter itself: 
I see, the problem with this piece of code is that it has the caseinsensitive flag which would be ignored if I would not create such a set though. As far as I can see even previous version did not really do what the javadoc says. 
{code}
  if (stopWords instanceof CharArraySet) {
      this.stopWords = (CharArraySet)stopWords;
    } else {
      this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
      this.stopWords.addAll(stopWords);
    }
{code}

I agree we should prevent this costly operation but it doesn't seem to be easy though. My first impression is to deprecate the ctors which have the ignorecase boolean and fix documentation to use charArraySet if case should be ignored. At the same time we should introduce a getter to charArraySet and only create a new set if the boolean given and the ignorecase member in CharArraySet does not match, provided it is an instance of charArraySet.


  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782009#action_12782009 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Simon, yeah. its tricky you know, like many suppl. char issues.

even if we provide perfect backwards compatibility with what 3.0 did, if you care about these languages, you *WANT* to reindex, because stuff wasn't working at all before.
and if you really care, you weren't using any of lucene's analysis components anyway (except maybe WhitespaceTokenizer).
For example, StandardAnalyzer currently discards these characters anyway.

but we don't want to screw over CJK users where things might have been "mostly" working before, either.
In this case, CJK is completely unaffected, I think we should not use version here or in any other lowercasing fixes, including LowerCaseFilter itself.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783901#action_12783901 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

That's much easier to discuss, everybody sees in one small patch whats happending. The other one was too big and unrelated. A new issue would have been better at all.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2094:
--------------------------------

    Attachment: LUCENE-2094.patch

attached is my proposal mentioned in the comments above.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783959#action_12783959 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Before I go to bed, how about ctor vs get/set:
QueryParser currently only has a ctor taking matchVersion which sets the default. If somebody wants to change the default, he can later call setEnablePositionIncrements().
In my opinion, this is more clear than supplying both in one ctor (they are two params that seem to interact with each other, but they don't!). I would also prefer to initialize StopFilter with the defaults in the ctor, and later change it using setters.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

    Attachment: LUCENE-2094.txt

Changed loop to use Charater.charCount()

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783409#action_12783409 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

+1 for pushing version downto StopFilter (it is there already, but hidden in this getDefault() method! Its presence was justified by Lucene 2.9/3.0 migration. Now it should just take a matchVersion and no more setters inside StopFilter.

The noise is the same, as all analyzers using stopfilter then need the version arg / need to be changed anyhow.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782007#action_12782007 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

+1 for breaking backwards for these chars. From the web: there are only 4 books written in this charset (the books of mormon, see [http://en.wikipedia.org/wiki/Deseret_alphabet], [http://www.omniglot.com/writing/deseret.htm]), so it is rather seldom. People affected by this will for sure have their own analyzers.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783406#action_12783406 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. I guess i think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity

This is one thing I thought about too - I did not change it to keep the noise as low as possible in the patch but if we want to do it we can do in this patch too. 

The question if we want to drop bw. compat and simply update CharArraySet to Unicode 4.0 seems to be more important. But IMO if we push Version to StopFilter we can also make CharArraySet using Version though.

thoughts?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

    Attachment: LUCENE-2094.txt

Added some more tests including single highsurrogate chars.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783789#action_12783789 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

bq. Why? What undesirable things happen if the QueryParser has enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

We coupled it to Version in 2.9. If you create the StopFilter with Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, too. Very simple?

Solr should make Version a property to all factories and create all Filters/Parsers using that flag. Thats why we implemented Version (to get rid of all these strange boolean flags). Just use Version.valueOf(property) and use the result to create your filters. It is now implemented everywhere in Lucene Core and Contrib.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Closed: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer closed LUCENE-2094.
-----------------------------------

    Resolution: Fixed

Since nobody commented on the proposal to close this issue I will close it now.
Issues for remaining discussions are available.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783829#action_12783829 ] 

Yonik Seeley edited comment on LUCENE-2094 at 11/30/09 8:15 PM:
----------------------------------------------------------------

bq. Yes it causes. If you have an old index without posincr, the query parser would produce queries that do not work

Oh, wait, is this because things like StandardAnalyzer changed the default?  Seems like that's where the back comat break should have been addressed (and it was)... water under the bridge at this point though.

      was (Author: yseeley@gmail.com):
    bq. Yes it causes. If you have an old index without posincr, the query parser would produce queries that do not work

Oh, wait, is this because things like StandardAnalyzer changed the default?  Seems like that's where the back comat break should have been addressed... water under the bridge at this point though.

  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783796#action_12783796 ] 

Yonik Seeley commented on LUCENE-2094:
--------------------------------------

> > Why? What undesirable things happen if the QueryParser has enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

> We coupled it to Version in 2.9. If you create the StopFilter with Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, too. Very simple?

I'm still failing to see why it shouldn't just always be enabled in the query parser.  Solr forces it to always be enabled.  Will this cause a bug in any scenarios?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783394#action_12783394 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. If the LowerCaseFilter is applied before the stopwords, there is no need for doing irgnore-case-checking.

no doubt! :) But if you do not want your terms to be lowercased but you do not care if "The" is at has an uppercase "T" you want this behaviour. Yet, either way we go we need the version somehow to preserve bw. compat. 

We should rather think about breaking bw. compat for this particular language (deseret) but we have no idea what happens with unicode in the future. Its tough.



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783393#action_12783393 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

bq. Either way, if the set is lowercased or not the lowercaseing is also applied to the values checked against the set.

If the LowerCaseFilter is applied before the stopwords, there is no need for doing irgnore-case-checking.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783930#action_12783930 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

bq. So the way plural forms get "stemmed" to their singular form in persian is implemented with stopwords (ها is in the list). I think this is a clean simple approach, which is why I did it this way.

But if this is so, you should have initialized the stop filter in persian analyzer with a fixed "false". Bt it also used StopFilter.getEnablePositionIncrementsVersionDefault() and used the version default. Should we fix this?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783917#action_12783917 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

Mike, in my opinion the holes can have some impact on phrase queries.
Personally I think the situation is complex (and I hate to say but language dependent), but I think "holes" are a good default.

But I should be able to change them explicitly, overriding the default.
The Version should allow us having the capability to change defaults (while still providing options), not just fix bugs

if anyone has time to glance at the patch, let me know what you think. We don't have to deprecate the setter, thats just me being anal.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783806#action_12783806 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 7:25 PM:
-----------------------------------------------------------------

Yes it causes. If you have an old index without posincr, the query parser would produce queries that do not work (we had this issue in 2.9.1 shortly before release, one of the reasons why it was delayed).

The version flag is for backwards compatibility. If you do not reinex with a new Version constant you should use the old version constant everywhere and things will play happy together. Even solr users will have old indexes, and for them there should be a property to specify the version constant (using this valueOf of enums). Solr should then create all components that require a version (and since 3.0 *all* analyzers need this) using this property. And then everything will play wonderful together (anayzers, query parser and so on).

Also Highlighter had a problem with it (same issue with the QP problem in pre-2.9.1)!

      was (Author: thetaphi):
    Yes it causes. If you have an old index without posincr, the query parser would produce queries that do not work (we had this issue in 2.9.1 shortly before release, one of the reasons why it was delayed).

The version flag is for backwards compatibility. If you do not reinex with a new Version constant you should use the old version constant everywhere and things will play happy together. Even solr users will have old indexes, and for them there should be a property to specify the version constant (using this valueOf of enums). Solr should then create all components that require a version (and since 3.0 *all* analyzers need this) using this property. And then everything will play wonderful together (anayzers, query parser and so on).
  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781960#action_12781960 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. Maybe we put this into UnicodeUtils (handling of toLowerCase etc for char[]). 
I think calling those 3 methods should be fine without a utils method. We will see how it goes until the "end" of this whole issues I might change my mind.

simon

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

    Attachment: LUCENE-2094.txt

This patch contains a testcase and a fixed CharArraySet. Yet this does not use Version to preserve compatibility. I bring this patch up to start the discussion how we should handle this particular case.
Using version would not be that much of an issue as all Analyzers using a CharArraySet do have the Version class already.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783784#action_12783784 ] 

Yonik Seeley commented on LUCENE-2094:
--------------------------------------

Preserving gaps from stopword removal isn't always desirable... seems it should remain an option to enable/disable it.
No biggie if y'all don't agree in Lucene land though - Solr's factory could just switch between alternate classes to enable/disable position increments.

Solr's query parser (the one that just extends Lucene's QueryParser) always enables position increments.  That allows the true control to rest with the filters for specific fields.

bq. It is recommeneded to turn it on. But if its off, it must be also disabled in the QueryParser.

Why?  What undesirable things happen if the QueryParser has enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784164#action_12784164 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. You all probably felt that you reached consensus (over IM), but to the rest of us, it suddenly felt like you committed while we were still debating the approach.
Actually this is almost what happened. We discussed over days and after this was committed I realized that most of the discussion was private. While this was discussed on JIRA the fact that we did so much IM wasn't that obvious to me and I guess this applies to Uwe too. I hope I made my point with saying this should not happen in the future though.

bq. We could consider taking that step for Lucene? 
I thing it would be an improvement in any way! +1 for this

bq. I don't like that it's not archived/searchable
I think it is a good place to discuss things but results of the discussion should go to JIRA / mailing list. That way we have an archive for the decisions. I would not consider a IM discussion as a consensus as it always depends on the time of the day, people leaving the chat when they go offline etc.
Having a public "realtime" chat room for lucene would be nice as many discussions can move quicker - results and descisions must go to a official place where they can be archived. This would give others, not on the chat during the discussion, the possibility to join again.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783926#action_12783926 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

bq. But if the PhraseQuery is generated with QueryParser also preserving holes, then it works properly?

Yes, I tested this before 2.9.1 (one reason why you had to respin).

QueryParser also still has the get/set for posIncr but also takes the matchVersion. Here it is the other way round, the ctor uses the default with Version and you can change it by a setter later (which is still not deprecated and available in 3.0).

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784248#action_12784248 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

I have created separate issues for the outstanding discussions. I suggest to close this issue and move the discussions to the new issues [LUCENE-2101] and [LUCENE-2099]

throughts?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797758#action_12797758 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

Hi Yonik,

bq. It looks like it was committed as part of this issue, but I can't find any comments here about either the need to make a copy or the need to make a unmodifiable set.
I try to help you to reconstruct the whole thing a bit. UnmodifiableCharArraySet was introduces with LUCENE-1688 as far as I recall to replace the static string array (stopwords) in StopAnalyzer. 
During the refactoring / improvements in contrib/analyzers we decided to make analyzers and tokenfilters immutable and use chararrayset whereever we can. To prevent provided set from being modified while they are in use in a filter the given set is copied and wrapped in an immutable instance of chararrayset. At the same time (still ongoing) we try to convert every set which is likely to be used in a TokenFilter into a charArraySet.  Wordlistloader is not done yet but on the list, the plan is to change the return values from HashSet<?> into Set<?> and create CharArraySet instances internally. 
With LUCENE-2034 we introduced StopwordAnalyzerBase which also uses the UnmodifiableCharArraySet with a copy of the given set.
The copy of a charArraySet is very fast even for large sets and the creation of a unmodifiableCharArraySet from a CharArraySet instance is basically just an object creation. The background is, again to prevent any modification to those sets while they are in use.

bq. This new behavior also no longer matches the javadoc for the constructor. 
I agree we should adjust the javadoc for ctors expecting stopwords to reflect the behavior.



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2094:
----------------------------------

    Affects Version/s:     (was: 3.0.1)
                           (was: 2.9.2)
                           (was: 2.9.1)
                           (was: 3.1)
                           (was: 2.4.2)
                           (was: 2.4.1)
                           (was: 2.3.3)
                           (was: 2.3.2)
                           (was: 2.3.1)
                           (was: 2.9)
                           (was: 2.4)
                           (was: 2.3)
                           (was: 2.2)
                           (was: 2.1)
                           (was: 2.0.0)
                           (was: 1.9)

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

    Attachment: LUCENE-2094.patch

I updated the patch to use Version in StopFilter. This seems to be reasonable though.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781983#action_12781983 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

It would not hurt, the Set is only used for analyzers that all take a version param... It is not really a public API.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783757#action_12783757 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

bq. This patch also moves Version into StopFilter but exposes an expert ctor to set the posInc manually. 

As discussed before, please deprecate this. The posIncr stuff was deprecated everywhere else too (in 2.9 already).

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783789#action_12783789 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 6:55 PM:
-----------------------------------------------------------------

bq. Why? What undesirable things happen if the QueryParser has enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

We coupled it to Version in 2.9. If you create the StopFilter with Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, too. Very simple?

Solr should make Version a property to all factories and create all Filters/Parsers using that flag. Thats why we implemented Version (to get rid of all these strange boolean flags). Just use Version.valueOf(property) and use the result to create your filters. It is now implemented everywhere in Lucene Core and Contrib (Version.valueOf() would not work in 2.9, because Version extends Parameter there, but in 3.0 it's an enum)

      was (Author: thetaphi):
    bq. Why? What undesirable things happen if the QueryParser has enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

We coupled it to Version in 2.9. If you create the StopFilter with Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, too. Very simple?

Solr should make Version a property to all factories and create all Filters/Parsers using that flag. Thats why we implemented Version (to get rid of all these strange boolean flags). Just use Version.valueOf(property) and use the result to create your filters. It is now implemented everywhere in Lucene Core and Contrib.
  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783932#action_12783932 ] 

Robert Muir commented on LUCENE-2094:
-------------------------------------

bq. But if this is so, you should have initialized the stop filter in persian analyzer with a fixed "false". Bt it also used StopFilter.getEnablePositionIncrementsVersionDefault() and used the version default. Should we fix this?

I don't think so. I think its up to the user to decide how they want the search to work, even in this example.
If they don't like the defaults for how phrasequery works, they can create an analyzer that uses the stopfilter differently.

I don't think the issue is clear for any given language, I think it always depends on how your application works.
I mean we add a hole for "the" in english, but in bulgarian (LUCENE-2062) this is a suffix attached to the end of a noun.
With arabic its always a prefix. I don't think we need to have options to add a posinc gap if we stem leading ال off an arabic word.

I'm just trying to show some examples of why a user might want to change the defaults.


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783861#action_12783861 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Committed revision: 885592

I keep this open for further discussing. The Version ctor param is now everywhere and it is better than goiving a boolean to *every* analyzer that uses StopFilter. And that was the reason for creating the Version constants in 2.9.

bq. So I think that's the question... is it a bug or a feature? Everybody should update the code and raise the version constant to 31

It is a bug.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781959#action_12781959 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

Robert, I tried to make it consistent to the LowerCaseFilter issues but I would vote +1 for both! This makes it much cleaner but we need to change the LowerCaseFilter one too!
I will quickly change my patch.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784146#action_12784146 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. Sucks to rush a commit when an issue is under discussion. Easy to say we can come back to this, easy not to. I'm against such heavy committing myself, without some consensus to do so. In the old days, there was a bias towards not committing.
I have to give uwe a bit of a backup for this. At least to my eyes there was a consensus on this patch and to the relation between the deprecation discussion and the patch. But I understand you feelings about this commit. I consider this as a misunderstanding caused by too much instant messaging. We should not consider IM as a consensus at all and I guess both of us have learned from this issue. I completely agree that we always should wait until the discussion has lead to a consensus on the list or in JIRA - otherwise it should always be a bias towards committing. 

I would like to open another issue for roberts patch. The reason for this is that I feel that issues like that get sidetracked quite often and its hard to follow once this happens. This would make discussions more clear and would help to prevent situations like this. 

Thoughts?

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783882#action_12783882 ] 

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

The term "heavy committing" is just a joke, I did not do any real cuncontrolled heavy committing today, I just called it so, because the patch was very large and affected lot's of files. We are sorry; Robert, Simon and me were chatting private a lot in parallel and came to the opinion, that we should commit this first and then discuss about this one ctor more. Discussing in this JIRA issue is a pain because of long page loading time.

The addition of Version to StopFilter was agreed already, the only thing was the deprecation of the boolean flag. Let's open another issue for it and solve it separate. We should have opened another issue for it already, but we merged/developed both patches (add matchVersion to CharArraySet and StopFilter), as always the same files were affected and because of that not so many patches can get out of sync.

So I am sorry for lot's of commits today!

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781969#action_12781969 ] 

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char)
The JDK version of toLowerCase(char) for instance casts to int and calls the overloaded method.
    public static boolean isLowerCase(char ch) {
        return isLowerCase((int)ch);
    }

That is the case all over the place as far as I can see.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783861#action_12783861 ] 

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 9:52 PM:
-----------------------------------------------------------------

Committed revision: 885592

I keep this open for further discussing. The Version ctor param is now everywhere and it is better than goiving a boolean to *every* analyzer that uses StopFilter. And that was the reason for creating the Version constants in 2.9.

bq. So I think that's the question... is it a bug or a feature?

It is a bug.  Everybody should update the code and raise the version constant to 31.

      was (Author: thetaphi):
    Committed revision: 885592

I keep this open for further discussing. The Version ctor param is now everywhere and it is better than goiving a boolean to *every* analyzer that uses StopFilter. And that was the reason for creating the Version constants in 2.9.

bq. So I think that's the question... is it a bug or a feature? Everybody should update the code and raise the version constant to 31

It is a bug.
  
> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercaseing if created with the correspondent flag. This causes that  String / char[] with uncode 4 chars which are in the set can not be retrieved in "ignorecase" mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org