Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/11/16 17:41:39 UTC

[jira] Created: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

fix LowerCaseFilter for unicode 4.0
-----------------------------------

                 Key: LUCENE-2069
                 URL: https://issues.apache.org/jira/browse/LUCENE-2069
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Robert Muir
            Priority: Minor
             Fix For: 3.1
         Attachments: LUCENE-2069.patch

Lowercase supplementary characters correctly.

This only fixes the filter; LowerCaseTokenizer is part of a more complex issue (CharTokenizer).
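
For readers unfamiliar with the problem, here is a minimal sketch of what "lowercasing supplementary characters correctly" means for a char[] term buffer. It is an illustration built only on java.lang.Character and is not the attached patch; the class and method names are made up for this example.

```java
final class SupplementaryLowerCase {
    /** Lowercases buffer[0..length) in place, one code point at a time. */
    static void lowerCase(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            // The limit form never reads past the valid part of the buffer.
            int cp = Character.codePointAt(buffer, i, length);
            // toChars writes one or two chars at i and returns how many it wrote.
            i += Character.toChars(Character.toLowerCase(cp), buffer, i);
        }
    }

    /** Lowercases buffer[0..length) in place, one UTF-16 code unit at a time. */
    static void lowerCasePerChar(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            // A lone surrogate has no case mapping, so surrogate pairs pass through unchanged.
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I (a surrogate pair), followed by 'A'.
        char[] perChar = "\uD801\uDC00A".toCharArray();
        char[] perCodePoint = perChar.clone();

        lowerCasePerChar(perChar, perChar.length);
        lowerCase(perCodePoint, perCodePoint.length);

        System.out.println(Integer.toHexString(Character.codePointAt(perChar, 0)));      // 10400
        System.out.println(Integer.toHexString(Character.codePointAt(perCodePoint, 0))); // 10428
    }
}
```

Both variants lowercase the trailing 'A'; only the code point loop maps U+10400 to its lowercase form U+10428.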


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2069:
------------------------------------

    Attachment: LUCENE-2069.patch

I updated the patch with another test case for a trailing surrogate left over in the term buffer. I also added a missing @Override and fixed some wording in the javadoc.

> fix LowerCaseFilter for unicode 4.0
> -----------------------------------
>
>                 Key: LUCENE-2069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch
>
>
> lowercase suppl. characters correctly. 
> this only fixes the filter, the LowerCaseTokenizer is part of a more complex issue (CharTokenizer)



[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2069:
------------------------------------

    Attachment: LUCENE-2069.patch

Added a CHANGES.TXT entry for this new feature.
We both agreed that we can deprecate CharacterUtils later, once we are close to getting rid of it.



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782439#action_12782439 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, I took a quick look at the contrib analyzers, for example.
This utility class could make back compat easier for a lot of that code, e.g. unicode block calculations in the CJK code, greek diacritic/lowercase folding in the greek code, ...
I think we should go this route.




[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778401#action_12778401 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, if you have a moment maybe you can review this one for me?



[jira] Issue Comment Edited: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782637#action_12782637 ] 

Simon Willnauer edited comment on LUCENE-2069 at 11/25/09 9:54 PM:
-------------------------------------------------------------------

bq. damn we have to use the limit form of codePointAt, just to be sure. 
no we don't - at least not in this particular case

bq. if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token...

If this rare situation occurs, the term length will still prevent the changed trail surrogate from being part of the token. This adds a super tiny overhead, but I guess we can simply ignore it. The lead surrogate will not be changed at all in this case; if there is a situation where that could happen, I'm not aware of it!

      was (Author: simonw):
    bq. damn we have to use the limit form of codePointAt, just to be sure. 
no we don't - at least not in this particular case

bq. if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token...

if this rare situation occurs the term length will still prevent the changed trail surrogate from being part of the token. This includes a super tiny overhead but I guess we can simply ignore this.
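
The scenario discussed in this comment can be reproduced with plain JDK calls. The sketch below is only an illustration under stated assumptions: the buffer contents and the class and method names are made up, and the loop is a guess at the shape of the lowercasing loop, not a copy of the patch. It shows the two-argument codePointAt pairing an unpaired lead surrogate with a stale trail surrogate, and why the characters inside the term length still come out unchanged.

```java
final class TrailingSurrogateDemo {
    /** Lowercases buffer[0..length) using the two-argument codePointAt (no limit). */
    static void lowerCaseNoLimit(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            // May read buffer[length] when the term ends in a lead surrogate.
            int cp = Character.codePointAt(buffer, i);
            i += Character.toChars(Character.toLowerCase(cp), buffer, i);
        }
    }

    public static void main(String[] args) {
        // Reused term buffer: the current term is just buffer[0], an unpaired lead
        // surrogate. buffer[1] is a trail surrogate left over from a previous token.
        char[] buffer = { '\uD801', '\uDC00', 'X' };
        int length = 1;

        // Without a limit the lead surrogate is paired with the leftover trail surrogate.
        System.out.println(Integer.toHexString(Character.codePointAt(buffer, 0)));         // 10400
        // With the limit form the lookup stops at the end of the term.
        System.out.println(Integer.toHexString(Character.codePointAt(buffer, 0, length))); // d801

        lowerCaseNoLimit(buffer, length);
        System.out.println(Integer.toHexString(buffer[0])); // d801 (inside the term, unchanged)
        System.out.println(Integer.toHexString(buffer[1])); // dc28 (outside the term, rewritten)
    }
}
```

In this example only the slot past the term length is modified, which matches the argument that the term length keeps the changed trail surrogate out of the token.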
  


[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2069:
------------------------------------

    Attachment: LUCENE-2069.patch

I revised the patch and fixed some issues:
- replaced real characters in tests
- extended tests to boundaries
- removed "code duplication" in LowerCaseFilter

The latter is the most important issue. I figured that if we implement a factory with the basic codePointAt method based on a version, we can implement most of the algorithms / methods just by obtaining the instance of CharacterUtils (a new class I introduced) that corresponds to the version. What this class does is pretty simple: if version >= 3.1 it delegates to the corresponding Character method, while for earlier versions it converts a char to a code point without checking for high surrogates. Once we have done this conversion we can simply use all the Character.*(int) methods as they are.
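
A rough sketch of the factory idea described above, for illustration only. The names CharUtilsSketch and MatchVersion are hypothetical stand-ins (the patch itself uses CharacterUtils and Lucene's Version), and the real class may differ in shape; the point is that a single lowercasing loop serves both behaviors once codePointAt is abstracted.

```java
abstract class CharUtilsSketch {
    /** Hypothetical stand-in for a version switch; not Lucene's Version class. */
    enum MatchVersion { V30, V31 }

    abstract int codePointAt(char[] chars, int offset);

    // Version >= 3.1: delegate to Character, so surrogate pairs are combined.
    private static final CharUtilsSketch CODE_POINT_AWARE = new CharUtilsSketch() {
        @Override
        int codePointAt(char[] chars, int offset) {
            return Character.codePointAt(chars, offset);
        }
    };

    // Earlier versions: widen the single char, never checking for high surrogates.
    private static final CharUtilsSketch LEGACY = new CharUtilsSketch() {
        @Override
        int codePointAt(char[] chars, int offset) {
            return chars[offset];
        }
    };

    static CharUtilsSketch getInstance(MatchVersion version) {
        return version.compareTo(MatchVersion.V31) >= 0 ? CODE_POINT_AWARE : LEGACY;
    }

    /** One loop for both behaviors; only the codePointAt strategy differs. */
    static void lowerCase(CharUtilsSketch utils, char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            int cp = utils.codePointAt(buffer, i);
            i += Character.toChars(Character.toLowerCase(cp), buffer, i);
        }
    }

    public static void main(String[] args) {
        char[] oldBehavior = "\uD801\uDC00A".toCharArray();
        char[] newBehavior = oldBehavior.clone();
        lowerCase(getInstance(MatchVersion.V30), oldBehavior, oldBehavior.length);
        lowerCase(getInstance(MatchVersion.V31), newBehavior, newBehavior.length);
        System.out.println(Integer.toHexString(Character.codePointAt(oldBehavior, 0))); // 10400
        System.out.println(Integer.toHexString(Character.codePointAt(newBehavior, 0))); // 10428
    }
}
```

With the legacy instance each surrogate is looked up on its own and has no case mapping, so the old (broken) output is preserved without a second copy of the loop.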





[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779164#action_12779164 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues! :)

doesn't matter to me, I just present both alternatives! all i want is for us to make a decision.




[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782637#action_12782637 ] 

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

bq. damn we have to use the limit form of codePointAt, just to be sure. 
no we don't - at least not in this particular case

bq. if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token...

If this rare situation occurs, the term length will still prevent the changed trail surrogate from being part of the token. This adds a super tiny overhead, but I guess we can simply ignore it.



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778508#action_12778508 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, those "weird" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0!



[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2069:
--------------------------------

    Attachment: LUCENE-2069.patch

forgot javadocs describing what the version does, sorry.




[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779156#action_12779156 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

If you want my vote, it is that we treat issues like this as bugs and not do all this Version stuff.

I supplied this patch (22KB versus 2KB) to show how even the smallest issue creates more complexity.
Also, read the javadocs for what Version does, it reads just like a bug:
* As of 3.1, supplementary characters are properly lowercased.

I mean, honestly, it's not like we provided a back compat mechanism for 3.0,
where this behavior changed for lots of contrib that uses String-based methods, such as String.toLowerCase (they return different results on JRE5 than JRE4).

But we can go either way, doesn't matter to me.



[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2069:
--------------------------------

    Attachment: LUCENE-2069.patch

here is a patch that supports the old broken behavior also via Version.



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782432#action_12782432 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Hi Simon, this is a cool idea!

I need to think this through, can you think of other places (non-lowercasing) where we could use this?
Even if we can only use it there, I think it might still be a good idea to keep things simple.

I do think we should mark the class deprecated and only used for lucene back compat purposes if we decide to use it.




[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778514#action_12778514 ] 

Uwe Schindler commented on LUCENE-2069:
---------------------------------------

we can change it whenever we want, we must only supply a matchVersion switch....



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782649#action_12782649 ] 

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

If we are too desperate about it I would suggest to have something like the following just above the loop:
{code}
 // clear the slot just past the term; '>' (not '>=') keeps buffer[length] in bounds
 if (buffer.length > length)
        buffer[length] = 0x00;
{code}





[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783158#action_12783158 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon you are right, there is no problem.

maybe for other things in the future we will need codePointAt() with the limit param, we could just add it to CharacterUtils if/when we need it.




[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782393#action_12782393 ] 

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

btw. this also works for CharArraySet - that way we can easily implement it with Version without duplicating any code. Readable, clean and compatible.

I will update the CharArraySet patch once I get comments on this.



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778509#action_12778509 ] 

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

we might need a changes.txt entry here too?!



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778515#action_12778515 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Uwe, we can use matchVersion for all of this, this is true, and I will help.

but see my comment on LUCENE-1689 (since i feel it affects all the issues), it will result in a lot of code complexity. Just a warning.



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779319#action_12779319 ] 

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

bq. Simon, those "weird" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0! 
That's what I guessed :D otherwise it would not work :). I was just wondering if there are some more expressive ones out there.

bq. Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues! 
+1


[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778510#action_12778510 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, yes, see LUCENE-1689.
This is my question of the day: how do we handle this? In a way it is really a backwards break, but honestly it is a bugfix, because we should have supported Unicode 4.0 in Lucene 3.0, since that's the Unicode version of Java 5.


[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2069:
--------------------------------

    Assignee: Robert Muir

Thanks for your work here Simon. I will commit soon if no one objects.



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782449#action_12782449 ] 

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

I also found some others like
BrazilianStemmer
ChineseTokenizer
FrenchStemmer
DutchStemmer

and many more.... +1 for this from my side.
As this seems to be fundamental we should try to get it in sooner or later so we can get the rest going.

simon


[jira] Updated: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2069:
--------------------------------

    Attachment: LUCENE-2069.patch


[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782623#action_12782623 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

Damn, we have to use the limit form of codePointAt, just to be sure.

If the term text truly ends with an unpaired lead surrogate, codePointAt could pair it with a leftover trash trail surrogate from a previous token...
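
To illustrate the concern, here is a minimal sketch (not the committed patch) of lowercasing a reused term buffer with the limit form of codePointAt. The class name, the lowerCase helper, and the buffer contents are made up for illustration. It assumes that lowercasing a code point does not change how many chars it occupies.

```java
public class LimitedCodePointDemo {

    // Lowercases buffer[0..length) in place, one code point at a time.
    static void lowerCase(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            // The three-argument form never reads a char at or beyond 'length',
            // so a lead surrogate at the very end is returned unpaired.
            int cp = Character.codePointAt(buffer, i, length);
            i += Character.toChars(Character.toLowerCase(cp), buffer, i);
        }
    }

    public static void main(String[] args) {
        // Term text is 'A' followed by the lead surrogate U+D801 (length 2).
        // Index 2 holds a stale trail surrogate left by an earlier, longer token.
        char[] buffer = {'A', '\uD801', '\uDC00', 'x'};
        int length = 2;

        // Without a limit, the stale char is paired into U+10400.
        System.out.println(Integer.toHexString(Character.codePointAt(buffer, 1)));
        // With the limit, only the unpaired lead surrogate is seen.
        System.out.println(Integer.toHexString(Character.codePointAt(buffer, 1, length)));

        lowerCase(buffer, length);
        // The char beyond the term length is left untouched.
        System.out.println(buffer[2] == '\uDC00');
    }
}
```

With the two-argument form, the loop would lowercase U+10400 and write its trail surrogate at index 2, which is past the end of the term.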



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779160#action_12779160 ] 

Mark Miller commented on LUCENE-2069:
-------------------------------------

But we try and maintain index back compatibility with bugs too? We don't want terms to be lost in an index.

But it depends as always - if something has long been a problem and broken, then perhaps it doesn't make sense to bend over backwards about it now.  We just have to look at everything, put the priority on making life best for users while balancing somewhat with dev/maintenance headaches and come to a consensus - easy ! :)


[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778504#action_12778504 ] 

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

Robert, I assume you did use those weird chars in the test on purpose - I wonder if there are some "real" codepoints that we could use in the test? 

The code looks good to me; this is the way to go for char lowercasing with Unicode 4.0.
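
For readers unfamiliar with the difference, the following sketch contrasts char-based and code-point-based lowercasing on a supplementary character. It is only an illustration, not code from the patch: the class and method names are hypothetical, and the Deseret capital letter U+10400 is just one example of a supplementary code point with a lowercase mapping.

```java
public class SupplementaryLowerCaseDemo {

    // Char-by-char lowercasing: each surrogate is looked at in isolation,
    // so supplementary characters pass through unchanged.
    static String lowerCaseByChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Code-point lowercasing: surrogate pairs are decoded first,
    // so supplementary characters are lowercased as well.
    static String lowerCaseByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = "\uD801\uDC00"; // U+10400 as a surrogate pair
        System.out.println(lowerCaseByChar(upper).equals(upper));
        System.out.println(Integer.toHexString(lowerCaseByCodePoint(upper).codePointAt(0)));
    }
}
```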


[jira] Resolved: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2069.
---------------------------------

    Resolution: Fixed

Committed revision 885024.


[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779499#action_12779499 ] 

Robert Muir commented on LUCENE-2069:
-------------------------------------

bq. But we try and maintain index back compatibility with bugs too?

Mark, you are right. The Version description says this: "Match settings and bugs in Lucene's 3.0 release."
I guess we should at least try; I think we can do it.
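
As a rough sketch of what such version gating could look like, the example below keeps the old char-based behavior for a 3.0 match version and uses code points from 3.1 on. The MatchVersion enum is a hypothetical stand-in, not Lucene's actual Version class, and the method is an assumption about the general shape, not the committed code.

```java
public class MatchVersionSketch {

    // Hypothetical stand-in for Lucene's version constants.
    enum MatchVersion { V3_0, V3_1 }

    static String lowerCase(MatchVersion matchVersion, String term) {
        char[] buf = term.toCharArray();
        if (matchVersion.compareTo(MatchVersion.V3_1) >= 0) {
            // New behavior: decode surrogate pairs, never reading past the term.
            for (int i = 0; i < buf.length; ) {
                int cp = Character.codePointAt(buf, i, buf.length);
                i += Character.toChars(Character.toLowerCase(cp), buf, i);
            }
        } else {
            // Old behavior, kept so existing 3.0 indexes still match:
            // supplementary characters are left as they are.
            for (int i = 0; i < buf.length; i++) {
                buf[i] = Character.toLowerCase(buf[i]);
            }
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        String upper = "\uD801\uDC00"; // U+10400 as a surrogate pair
        System.out.println(lowerCase(MatchVersion.V3_0, upper).equals(upper));
        System.out.println(lowerCase(MatchVersion.V3_1, upper).equals(upper));
    }
}
```

The cost is the code complexity warned about earlier in the thread: every affected analysis component carries two code paths.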



[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783094#action_12783094 ] 

Uwe Schindler commented on LUCENE-2069:
---------------------------------------

Looks good, +1 to commit!
