You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/06/12 20:36:07 UTC

[jira] Created: (LUCENE-1689) supplementary character handling

supplementary character handling
--------------------------------

                 Key: LUCENE-1689
                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Robert Muir
            Priority: Minor


for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.

supplementary character support should be fixed for code that works with char/char[]

For example:
StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
LowercaseFilter should be modified to lowercase suppl. characters correctly.
CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.

in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-1689) supplementary character handling

Posted by Simon Willnauer <si...@googlemail.com>.

+1

On 11/16/09, Robert Muir (JIRA) <ji...@apache.org> wrote:
>
>     [
> https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778378#action_12778378
> ]
>
> Robert Muir commented on LUCENE-1689:
> -------------------------------------
>
> a couple people have asked me about this issue lately, I would prefer to
> spin off smaller issues rather than create large patches that become out of
> date.
>
> also I think Simon is interested in working on some of this, so more jira
> spam but i think easier to make progress.
>
>
>> supplementary character handling
>> --------------------------------
>>
>>                 Key: LUCENE-1689
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>>             Project: Lucene - Java
>>          Issue Type: Improvement
>>            Reporter: Robert Muir
>>            Priority: Minor
>>             Fix For: 3.1
>>
>>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch,
>> LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt,
>> testCurrentBehavior.txt
>>
>>
>> for Java 5. Java 5 is based on unicode 4, which means variable-width
>> encoding.
>> supplementary character support should be fixed for code that works with
>> char/char[]
>> For example:
>> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be
>> changed so they don't actually remove suppl characters, or modified to
>> look for surrogates and behave correctly.
>> LowercaseFilter should be modified to lowercase suppl. characters
>> correctly.
>> CharTokenizer should either be deprecated or changed so that isTokenChar()
>> and normalize() use int.
>> in all of these cases code should remain optimized for the BMP case, and
>> suppl characters should be the exception, but still work.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719158#action_12719158 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

i forgot to answer your question Michael: 

it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints
Is it invalid to make this assumption? Ie, does the unicode standard not guarantee it?

I do not think it guarantees this for all future unicode versions. In my opinion, we should exploit things like this if I can show a test case that proves its true for all codepoint in the current version of unicode :)
And it should be documented that this could possibly change in some future version.
In this example, its a nice simplification because it guarantees the length (in code units) will not change!

I think for a next step on this issue I will create and upload a test case showing the issues and detailing some possible solutions.
For some of them, maybe a javadoc update is the most appropriate, but for others, maybe an API change is the right way to go.
Then we can figure out what should be done.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737576#action_12737576 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Simon, not yet. I was trying to solicit some opinions on what the best approach should be. I realize this is a very special case.

I haven't seen any questions on supp. characters on the lists, do you have a link?


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737732#action_12737732 ] 

Michael McCandless commented on LUCENE-1689:
--------------------------------------------

That sounds right.

We would also deprecate the char based methods?  And note that in 3.0 we'll remove them, and make the int-based methods abstract?

And, then fix CharTokenizer to properly handle the extended chars?

We may need to use reflection to see if the current subclass "wants" chars or ints, and fallback to the old (current) char processing if it's a subclass that doesn't define the int-based methods.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778527#action_12778527 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

bq. We can fix that too? If so, I think we should. The tokenstream backcompat was a bitch as well - but we had to slog through it.

Mark honestly, I do not yet know how this one could be reasonably done.
The problem is the behavior is dependent a lot upon the users JVM, how in the hell do we know which JVM version was used to create some given index????

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1689:
--------------------------------

    Attachment: LUCENE-1689.patch

patch with some more work in contrib for this issue.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719127#action_12719127 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Simon, I want to address your comment on performance.
I think that surrogate detection is cheap when done right and I don't think there's a ton of places that need changes.
But I don't think any indicator is really appropriate, for example my TokenFilter might want to convert one chinese character in the BMP to another one outside of the BMP. It is all unicode.

But there is more than just analysis involved here, for example I have not tested WildcardQuery: ? operator.
I'm not trying to go berzerko and be 'ultra-correct', but basic things like that should work.
For situations where its not worth it, i.e. FuzzyQuery's scoring, we should just doc that the calculation is based on 'code units', and leave it alone.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1689:
--------------------------------

    Attachment: testCurrentBehavior.txt

this is just a patch with testcases showing the existing behavior.

perhaps these should be fixed:
Simple/Standard/StopAnalyzer/etc: deletes all supp. characters completely.
LowerCaseFilter: doesn't lowercase supp. characters correctly.
WildcardQuery: ? operator does not work correctly.

perhaps these just need some javadocs:
FuzzyQuery: scoring is strange because its based upon surrogates, leave alone and javadoc it.
LengthFilter: length is calculated based on utf-16 code units, leave alone and javadoc it.


... and theres always the option to not change any code, but just javadoc all the behavior as a "fix", providing stuff in contrib or elsewhere that works correctly.
let me know what you think.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778528#action_12778528 ] 

Steven Rowe commented on LUCENE-1689:
-------------------------------------

I don't know if this is the right place to point this out, but: JFlex-generated scanners (e.g. StandardAnalyzer) do not properly handle supplementary characters.

Unfortunately, it looks like the as-yet-unreleased JFlex 1.5 will not support supplementary characters either, so this will be a gap in Lucene's Unicode handling for a while.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740896#action_12740896 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Michael, I do think that would be the simplest, but most invasive (we would have to make new duplicate classes for Arabic & Russian even!) 
i think i will try to create a patch for my 2nd method, and if it cannot be made simple, we can always do the new class/deprecation route?


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719174#action_12719174 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Yonik, I think the problem is where method signatures must change, such as CharTokenizer, required to fix LetterTokenizer and friends.

These are probably the worst offenders, as a lot of these tokenizers just remove >BMP chars, which is really nasty behavior for CJK. 

I agree its a collection of issues but maybe there can be an overall strategy?



> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741091#action_12741091 ] 

Uwe Schindler commented on LUCENE-1689:
---------------------------------------

For me it is strange, that getMethod() does not work for protected methods, because they can be inherited. That it does not work for private methods is clear, because they are only known to the class itsself (you cannot even override a private methods). But the JDK defines it in that way and we must life with it :-(

I would stop the iteration when getSuperclass() returns null and then return null (which cannot happen, as the method is defined at least in CharTokenizer, you wrote this in the comment).

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1689:
--------------------------------

    Attachment: LUCENE-1689.patch

just a dump to show where I am trying to go.
I havent done anything to Whitespace/Letter yet, but you can see the changes to CharTokenizer.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737740#action_12737740 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

michael yes thats what I had in mind.

like you mentioned, the non-final CharTokenizer-based things (Whitespace, Letter) are a problem.

The easiest way (it looks) is to put the reflection inside the non-final CharTokenizers: Whitespace and Letter.
I want them to work correctly if you have not subclassed them, but if you have, it should call your char based method (which wont work for supp characters, but it didnt before)!

thoughts?


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738312#action_12738312 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

I think instead of the way I prototyped in the first patch, it might be better to have the original chartokenizer incrementToken definition still available in the code.
this is some temporary code duplication but would perform better for the backwards compat case, and the backwards compatibility would be more clear to me at least.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1689:
--------------------------------

    Attachment: LUCENE-1689.patch

patch with a different technique for CharTokenizer and friends. I like this much more.

for non-final subclasses that are moved to the new int api, I had to expose a protected boolean switch so that any old subclasses of them work as expected: see Arabic/Russian/Letter/Whitespace for examples.



> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719146#action_12719146 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Simon, a response to your other comment 'I think we need to act on this issue asap and release it together with 3.0. -> full support for unicode 4.0 in lucene 3.0'

Full unicode support would be great!

But this involves a whole lot more, I'm trying to create a contrib under LUCENE-1488 for "full unicode support". 
A lot of what is necessary isn't available in the Java 1.5 JDK, such as unicode normalization.

Maybe we can settle for 'full support for UCS (Unicode Character Set)' as a start!


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741815#action_12741815 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

this is just a reminder that I think if we go this route, we will need caching similar to what Uwe added for TokenStream back compat so that construction performance is not bad.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778421#action_12778421 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Yonik, or anyone else, please let me know your thoughts on the following:

bq. I don't see a real back compat issue... I can't imagine anyone relying on the fact that >BMP chars wouldn't be lowercased. To rely on that would also be relying on undocumented behavior.
bq. Ah, OK. Actually it just occurred to me that this would also require reindexing, otherwise queries that hit documents in the past would mysteriously start missing them (for text outside the BMP).

what should be our approach here wrt index back compat?
For the issues mentioned here, I cant possibly see >BMP working currently for anyone, but you are right it will change results.

I don't want to break index back compat, just wanted to mention that introducing Unicode 4 support, still with API back compat, with no performance degradation, is going to be somewhat challenging already.
If we want to somehow support the "broken" analysis components for index back compat, then we have to also have a broken implementation available on top of the correct impl (probably using Version to handle this).
In my opinion, this would introduce a lot of complexity, I will help do it though, if we must.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741085#action_12741085 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Uwe, unfortunately Class.getMethod only works for public methods!

So I must use Class.getDeclaredMethod in this case, because the methods in question are protected. The problem with Class.getDeclaredMethod is that it does not check superclasses like Class.getMethod does, thus the loop.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741089#action_12741089 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Uwe, and I thank you for the good example! I tried to apply your same logic but was frustrated when it wasn't working only to discover this annoyance with java reflection.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778524#action_12778524 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

bq. If there is nothing we can do here, then we just have to do the best we can -

Oh there is definitely a choice. and as I said, I am willing to help either way we go.
I am just warning there will be a lot more complexity if we want to use say, Version to support both broken implementation and fixed implementation.
lots of tests, and hopefully no bugs :(

Uwe already commented he would like it controlled with Version, just saying suppl. characters are notoriously tricky as it is!


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1689:
--------------------------------

    Attachment: LUCENE-1689_lowercase_example.txt

an example of how LowercaseFilter might be fixed. 
I only changed the new-api method for demonstration purposes.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741088#action_12741088 ] 

Uwe Schindler commented on LUCENE-1689:
---------------------------------------

OK, I did not know this. I read again the Class.getMethod() javadocs and it states that it only returns public methods. In the case of TokenStream bw compatibility, only public methods are affected, so I used the "simple and faster" way.

I was already wondering why you changed this call, because the rest of the reflection code seemed to be copied from TokenStream e.g. the name of the parameter type arrays...  :-)

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737576#action_12737576 ] 

Robert Muir edited comment on LUCENE-1689 at 7/31/09 9:07 AM:
--------------------------------------------------------------

Simon, not yet. I was trying to solicit some opinions on what the best approach should be. I realize this is a very special case.



      was (Author: rcmuir):
    Simon, not yet. I was trying to solicit some opinions on what the best approach should be. I realize this is a very special case.

I haven't seen any questions on supp. characters on the lists, do you have a link?

  
> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778532#action_12778532 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Steven, no its definitely the right place to point it out! I know this is true, even with 1.5 :)

One reason I wanted to split this issue up was to try to make 'improvements', maybe we do not fix everything.
there are also other options for StandardTokenizer/Jflex
For instance, we could not break between any surrogate pairs and classify them as CJK (index individual character) for the time being.
While technically incorrect, it would handle the common cases, i.e. ideographs from Big5-HKSCS, etc.

but right now the topic is i guess on unicode and index back compat in general... trying to figure out what is the reasonable approach to handling this (supporting the old broken behavior/indexes created with them)

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741097#action_12741097 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Uwe, thanks. I will modify the loop in that way, it is much easier to read.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1689) supplementary character handling

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741091#action_12741091 ] 

Uwe Schindler edited comment on LUCENE-1689 at 8/9/09 8:30 AM:
---------------------------------------------------------------

For me it is strange, that getMethod() does not work for protected methods, because they can be inherited. That it does not work for private methods is clear, because they are only known to the class itsself (you cannot even override a private methods). But the JDK defines it in that way and we must life with it :-(

I would stop the iteration when getSuperclass() returns null and then return null (which cannot happen, as the method is defined at least in CharTokenizer, you wrote this in the comment).

{code}
private Class getDeclaringClass(String name, Class[] params) {
  for (Class clazz = this.getClass(); clazz != null; clazz = clazz.getSuperclass()) {
    try {
      clazz.getDeclaredMethod(name, params);
      return clazz;
    } catch (NoSuchMethodException e) {}
  }
  return null; /* impossible */
}
{code}

      was (Author: thetaphi):
    For me it is strange, that getMethod() does not work for protected methods, because they can be inherited. That it does not work for private methods is clear, because they are only known to the class itsself (you cannot even override a private methods). But the JDK defines it in that way and we must life with it :-(

I would stop the iteration when getSuperclass() returns null and then return null (which cannot happen, as the method is defined at least in CharTokenizer, you wrote this in the comment).
  
> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778378#action_12778378 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

a couple people have asked me about this issue lately, I would prefer to spin off smaller issues rather than create large patches that become out of date.

also I think Simon is interested in working on some of this, so more jira spam but i think easier to make progress.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1689) supplementary character handling

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1689:
---------------------------------

    Fix Version/s:     (was: 2.9)
                   3.1

bq. I am curious how you plan on approaching backwards compat? 
I don't see a real back compat issue... I can't imagine anyone relying on the fact that >BMP chars wouldn't be lowercased.  To rely on that would also be relying on undocumented behavior.

It seems like this (handling code points beyond the BMP) is really a collection of issues that could be committed independently when ready?

Moving to 3.1 since it requires Java 1.5 for much of it.
If there's something desirable to slip in for 2.9 that doesn't depend on Java5, we can still do that.



> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778427#action_12778427 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

btw, its worth mentioning that this whole concept of index back compat wrt Unicode is already completely broken in Lucene.

if someone upgrades to lucene 3.0 and uses java 1.5, Character.* operations already return different results than on java 1.4, so they are already broken from their previous index.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1689) supplementary character handling

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1689:
---------------------------------------

    Fix Version/s: 2.9

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741083#action_12741083 ] 

Uwe Schindler commented on LUCENE-1689:
---------------------------------------

In your latest patch you have this method getDeclaringClass() using the while loop. To get the class declaring a method, it is simplier to call {code}this.getClass().getMethod(...).getDeclaringClass(){code}
See TokenStream.java for example where the same is done for the new TokenStream API.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719104#action_12719104 ] 

Michael McCandless commented on LUCENE-1689:
--------------------------------------------

Robert, could you flesh this patch out to a committable point?  Ie, handle unpaired surrogates, add test case that first shows that LowercaseFilter incorrectly breaks up surrogates?  Thanks!

bq. it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints

Is it invalid to make this assumption?  Ie, does the unicode standard not guarantee it?

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718942#action_12718942 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

icu uses the following idiom to check if a char ch is a surrogate. might be the fastest way to ensure the performance impact is minimal:

(ch & 0xFFFFF800) == 0xD800



> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778526#action_12778526 ] 

Mark Miller commented on LUCENE-1689:
-------------------------------------

I'm speaking in regards to:

{quote}
btw, its worth mentioning that this whole concept of index back compat wrt Unicode is already completely broken in Lucene.
if someone upgrades to lucene 3.0 and uses java 1.5, Character.* operations already return different results than on java 1.4, so they are already broken from their previous index.
{quote}

We can fix that too? If so, I think we should. The tokenstream backcompat was a bitch as well - but we had to slog through it.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719125#action_12719125 ] 

Simon Willnauer commented on LUCENE-1689:
-----------------------------------------

The scary thing is that this happens already if you run lucene on a 1.5 VM even without introducing 1.5 code. 
I think we need to act on this issue asap and release it together with 3.0. -> ful support for unicode 4.0 in lucene 3.0 
I also thought about the implementation a little bit. The need to detect chars > BMP and operate on those might be spread out across lucene (quite a couple of analyzers and filters etc). Performance could truely suffer from this if it is done "wrong" or even more than once. It might be considerable to make the detection pluggable with an initial filter that only checks where surrogates are present in a token and sets an indicator to the token represenation so that subsequent TokenStreams can operate on it without rechecking. This would also preserve performance for those who do not need chars > BMP (which could be quite a large amout of people). Those could then simply not supply such a initial filter.

Just a couple of random thoughts.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737747#action_12737747 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

by the way i plan to replace any scary bit operations with UnicodeUtil functions. JDK doesnt have methods to yank the lead or trail surrogate out of an int, except in sun.text.normalizer of course :)

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737679#action_12737679 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

one tricky part of this is CharTokenizer. a few things subclass this guy.

* Sometimes it is ok (WhitespaceTokenizer, because all whitespace in unicode <= 0xFFFF)
* Sometimes it is not (LetterTokenizer)

It depends how its being used. So I think instead of just
{code}
abstract boolean isTokenChar(char)
char normalize(char)
{code}

we can add additional int-based methods, where the default implementation is to defer to the char-based ones.

This would preserve backwards compatibility.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778531#action_12778531 ] 

Mark Miller commented on LUCENE-1689:
-------------------------------------

bq. Mark honestly, I do not yet know how this one could be reasonably done.

Then thats what I am saying we should be documenting.

bq. how in the hell do we know which JVM version was used to create some given index????

If something like this made sense long term, we can start putting it in as index metadata - not helpful now, but could end up being so.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778536#action_12778536 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

bq. Then thats what I am saying we should be documenting.

+1 I think it should be in the release notes that if you upgrade Java version to either 1.5 or 1.7, you should reindex because Unicode versions change.
Because of this, character properties change and some tokenizers, etc will emit different results.

This is really only somewhat related to this issue, but thats what my email thread on java-dev was all about.


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737535#action_12737535 ] 

Simon Willnauer commented on LUCENE-1689:
-----------------------------------------

Robert, are you still working on this issue? I have seen a couple of questions about supplementary character support on the mailinig-lists though - I will see how much time I have during the next weeks to get this going.

simon

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719139#action_12719139 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

I am curious how you plan on approaching backwards compat?

#1 The patch is needed for correct java 1.5 behavior, its a 1.5 migration issue.
#2 The patch won't work on java 1.4 because the jdk does not supply the needed functionality
#3 Its my understanding you like to deprecate things (say CharTokenizer) and give people a transition time. How will this work?

i intend to supply patch that fixes the problems, but I wanted to bring up those issues...


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719179#action_12719179 ] 

Yonik Seeley commented on LUCENE-1689:
--------------------------------------

bq. onik, I think the problem is where method signatures must change, such as CharTokenizer, required to fix LetterTokenizer and friends. 

Ah, OK.  Actually it just occurred to me that this would also require reindexing, otherwise queries that hit documents in the past would mysteriously start missing them (for text outside the BMP).


> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718938#action_12718938 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

my example has some "issues" (i.e. it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints).
it also doesn't correctly handle invalid unicode data (unpaired surrogates).

Such complexity can be all added to the slower-path supplementary case. the intent is just to show it wouldnt be a terribly invasive change.
Thanks!

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740866#action_12740866 ] 

Michael McCandless commented on LUCENE-1689:
--------------------------------------------

{quote}
I think instead of the way I prototyped in the first patch, it might be better to have the original chartokenizer incrementToken definition still available in the code.
this is some temporary code duplication but would perform better for the backwards compat case, and the backwards compatibility would be more clear to me at least.
{quote}

I suppose we could simply make an entirely new class, which properly handles surrogates, and deprecate CharTokenizer in favor of it?  Likewise we'd have to make new classes for the current subclasses of CharTokenizer (Whitespace,LetterTokenizer).  That would simplify being back compatible.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719123#action_12719123 ] 

Robert Muir commented on LUCENE-1689:
-------------------------------------

Michael: LowerCaseFilter doesn't incorrectly break surrogates, it just won't lowercase supplementary codepoints.

But I can get it in shape, sure.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1689) supplementary character handling

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778516#action_12778516 ] 

Mark Miller commented on LUCENE-1689:
-------------------------------------

If there is nothing we can do here, then we just have to do the best we can -

such as a prominent notice alerting that if you transition JVM's between building and searching the index and you are using or doing X, things will break.

We should put this in a spot that is always pretty visible - perhaps even a new readme file titlted something like IndexBackwardCompatibility or something, to which we can add other tips and gotchyas as they come up. Or MaintainingIndicesAcrossVersions, or FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent entry/sticky entry at the top of Changes.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689.patch, LUCENE-1689_lowercase_example.txt, testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org