Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/10/29 19:19:59 UTC

[jira] Created: (LUCENE-2019) map unicode process-internal codepoints to replacement character

map unicode process-internal codepoints to replacement character
----------------------------------------------------------------

                 Key: LUCENE-2019
                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Robert Muir
            Priority: Minor


A spinoff from LUCENE-2016.

There are several process-internal codepoints in Unicode; we should not store these in the index.
Instead they should be mapped to the replacement character (U+FFFD), so they can be used process-internally.

An example of this is how Lucene Java currently uses U+FFFF process-internally: it can't be in the index or it will cause problems.
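
As a rough illustration of the proposal (this is not the attached patch, which is not reproduced in this thread), a mapping over Java strings might look like the sketch below. The class and method names are hypothetical. The noncharacter ranges used, U+FDD0..U+FDEF plus the last two code points of every plane, are the ones defined by the Unicode standard.

```java
// Hypothetical sketch: replace Unicode noncharacters with U+FFFD.
// Not Lucene code; names are made up for illustration.
final class NoncharacterMapper {
    static final int REPLACEMENT = 0xFFFD;

    // Noncharacters are U+FDD0..U+FDEF and U+nFFFE / U+nFFFF for planes 0..16.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Walks the string by code point so supplementary noncharacters
    // (encoded as surrogate pairs) are handled too. Unpaired surrogates
    // are passed through unchanged in this sketch.
    static String mapNoncharacters(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, a term such as "a", U+FFFF, "b" would come out as "a", U+FFFD, "b", which leaves U+FFFF free for in-process use.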

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772246#action_12772246 ] 

Earwin Burrfoot commented on LUCENE-2019:
-----------------------------------------

bq. if you disagree with this patch, then you should also disagree with treating U+FFFF special! I don't see how in the world U+FFFF is different than any other codepoint in the noncharacter category in this regard!
Yes! The best treatment a library can offer your data, if you're not explicitly requesting a transformation, is to pass it in and out transparently.
If Lucene had a cleanly separated text/binary data API, it would be okay to mangle text input. But as it stands, such mangling just messes up other people's attempts at building such a type-safe API on top of Lucene.

> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally, it can't be in the index or will cause problems. 


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772164#action_12772164 ] 

Steven Rowe commented on LUCENE-2019:
-------------------------------------

Lucene is not an application.

Again, quoting from section 16.7 (emphasis mine):

bq. *Applications* are free to use any of these noncharacter code points internally but should never attempt to exchange them.

The forbidden operation is exchanging non-characters across the *application* boundary.  

Asking Lucene to store non-characters for you is not a violation of the Unicode standard.  Lucene agreeing to do so is not a violation of the Unicode standard.

If a Lucene user later uses a Lucene index to exchange data (of whatever form) across the application boundary, that's on the user, not on Lucene.

(I'll skip the Lucene-as-a-weapon metaphor.  You can thank me later.)



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772180#action_12772180 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

bq. So you think that enforcing consistency is worth the cost of disallowing some usages, and I don't.

No, I think this is a myth. I think this is the original bug that caused index corruption:
* lucene used a noncharacter (happened to be U+FFFF) process-internally
* lucene also treated this noncharacter as an abstract character (it later got truncated by some encoding routine, but basically it didn't correctly handle this case)

By disallowing all noncharacters as term text, lucene is *more free* to use them as delimiters, sentinel values, and such, as specified in chapter 3 of the standard.

Right now you only have one you treat correctly, and that's U+FFFF.


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772147#action_12772147 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

bq. And if I'm indexing to a RAM directory? The point is, the private-use char is never seen external to the process (which includes both Lucene and its index).

Whoa, let's not confuse private-use characters with noncharacters. There is a huge difference!
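
To make the distinction concrete: private-use code points (for example U+E000..U+F8FF) are assigned characters whose meaning is set by private agreement, while noncharacters (U+FDD0..U+FDEF and the last two code points of each plane) are reserved for internal use only. The sketch below is one way to tell them apart in Java; the class and method names are hypothetical, and `Character.getType` is the standard JDK call.

```java
// Hypothetical helper contrasting private-use characters with noncharacters.
final class CodePointKinds {
    // Private-use characters have general category Co in Unicode,
    // which the JDK reports as Character.PRIVATE_USE.
    static boolean isPrivateUse(int cp) {
        return Character.getType(cp) == Character.PRIVATE_USE;
    }

    // Noncharacters are a fixed set of 66 code points:
    // U+FDD0..U+FDEF and U+nFFFE / U+nFFFF for planes 0..16.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }
}
```

Under these definitions U+E000 is private-use but not a noncharacter, and U+FFFF is a noncharacter but not private-use.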


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772247#action_12772247 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

Earwin, it won't mess with anyone's attempt if they use IndexableBinaryStringTools.

If instead they write some routine that generates invalid Unicode, well, I think they get what they deserve.


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772104#action_12772104 ] 

Michael McCandless commented on LUCENE-2019:
--------------------------------------------

The patch looks good, and is non-intrusive, but I think we need to somehow answer the larger question about whether Lucene should in fact be in the business of replacing all "invalid for interchange" characters.  Really it comes down to the semantics question of whether Lucene is in fact "process internal" to an application (or, whether we want to allow applications to treat Lucene that way).


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772153#action_12772153 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

Steven, I argue that your quote from the standard agrees with this issue more than it disagrees.

Again, it's open to interpretation though :)


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772118#action_12772118 ] 

Steven Rowe commented on LUCENE-2019:
-------------------------------------

Lucene indexes can be used both process-internally and across processes (e.g. Solr).

This patch enforces the Lucene-index-as-process-external view, and excludes the possibility that a Lucene index is used process-internally.

Since Lucene itself uses U+FFFF internally, no clients can use it for their own purposes.  This patch rationalizes handling of internal-use-only characters, such that Lucene's behavior is made consistent for all of them.

Instituting this consistency precludes Lucene-index-as-process-internal use cases.  I would argue that the price of consistency is in this case too high.

My vote: document the crap out of the U+FFFF Lucene-internal-use character and drop this patch.

If people want to use internal-use-only characters in Lucene indexes, as long as Lucene doesn't reserve them for its own use, why stop them?



[jira] Issue Comment Edited: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772180#action_12772180 ] 

Robert Muir edited comment on LUCENE-2019 at 10/30/09 11:41 PM:
----------------------------------------------------------------

bq. So you think that enforcing consistency is worth the cost of disallowing some usages, and I don't.

No, I think this is a myth. I think this is the original bug that caused index corruption:
* lucene used a noncharacter (happened to be U+FFFF) process-internally
* lucene also treated this noncharacter as an abstract character (it later got truncated by some encoding routine, but basically it didn't correctly handle this case)

By disallowing all noncharacters as term text, lucene is *more free* to use them as delimiters, sentinel values, and such, as specified in chapter 3 of the standard.

Right now you only have one you treat correctly, and that's U+FFFF.

<edit>

Steven, by the way, I think something I haven't been able to communicate properly is that I feel very strongly that storing noncharacters in *term text*, where they are treated as abstract characters, is very different from using them as sentinel values / delimiters / etc. in the index format. I think the latter is OK and is what they are for.

But term text is different: search engines index human language, and by putting noncharacters in term text you are treating them as abstract characters.

  

[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772133#action_12772133 ] 

Steven Rowe commented on LUCENE-2019:
-------------------------------------

bq. Steven, the only reason I might disagree is that a Lucene Index is supposed to be portable across different languages other than Lucene Java.

Right, but not all Lucene indexes in-the-wild are accessed from more than one language.  The vast majority of Lucene index uses, I'd venture to guess, are single-language, single-process uses.

bq. in my opinion, if you are to store process-internal codepoints as abstract characters in terms, then you should not claim that Lucene indexes are in any Unicode format, because then they violate the standard.

I strongly disagree with the assumption that interchange and serialization are synonymous.

bq. By *not* storing them in terms, then you are free to use them as delimiters, or other purposes. right now U+FFFF is used as a delimiter, but who knows, maybe someday you might need more?

I actually agree with this argument.  What if Lucene needs more process-internal characters?  I don't have any way of gauging the probability that it will in the future (other than the last eight years of history, during which only one was deemed necessary).  But what does Mike M. say? "Design for now" or something like that?


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772135#action_12772135 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

bq. I strongly disagree with the assumption that interchange and serialization are synonymous.

Actually I won't argue with you too much about this. I only care about lucene-java.

bq. I actually agree with this argument. What if Lucene needs more process-internal characters? I don't have any way of gauging the probability that it will in the future (other than the last eight years of history, during which only one was deemed necessary). But what does Mike M. say? "Design for now" or something like that?

Right, the point is that in my processing as a user, I might need to have delimiters or whatever.
I should not have to worry about Lucene treating them as an *abstract character*, because the Unicode standard says it should not.
So for example, if I create a MultiTermQuery, I should be able to use U+FFFE and U+FFFF both internally, perhaps to delimit things for different reasons, without any concern that they are stored in term text.
By storing them in term text, by definition they are being treated as abstract characters.
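
The delimiter concern can be sketched as follows. This is a hypothetical illustration, not Lucene's actual code: it assumes an in-process routine that packs a field name and term text into one string separated by U+FFFF, and shows why the packing only round-trips if the inputs never contain the delimiter.

```java
// Hypothetical in-process packing that uses U+FFFF as a delimiter.
final class DelimiterSketch {
    static final char DELIM = '\uFFFF';

    static String pack(String field, String text) {
        return field + DELIM + text;
    }

    // Splits at the first delimiter; only correct if 'field' has no U+FFFF.
    static String[] unpack(String packed) {
        int i = packed.indexOf(DELIM);
        return new String[] { packed.substring(0, i), packed.substring(i + 1) };
    }

    // Maps the delimiter to U+FFFD before packing, as the issue proposes
    // for stored text, so the delimiter stays unambiguous.
    static String sanitize(String s) {
        return s.replace(DELIM, '\uFFFD');
    }
}
```

If a field value containing U+FFFF is packed unsanitized, `unpack` splits in the wrong place; after `sanitize`, the round trip is unambiguous, at the cost of losing the original code point.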


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772174#action_12772174 ] 

Steven Rowe commented on LUCENE-2019:
-------------------------------------

bq. if you disagree with this patch, then you should also disagree with treating U+FFFF special! 

Quoting myself from an earlier comment on this issue (apologies):

bq. Instituting this consistency precludes Lucene-index-as-process-internal use cases. I would argue that the price of consistency is in this case too high.

So you think that enforcing consistency is worth the cost of disallowing some usages, and I don't.


[jira] Updated: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2019:
--------------------------------

    Attachment: LUCENE-2019.patch

Here is a prototype patch.
Because U+DFFE and U+DFFF are process-internal for all planes, the surrogate case is easy (just treat these process-internal points in the unpaired-surrogates code path).

For the additional BMP chars, they happen to be > UNI_SUR_HIGH_START, so again it shouldn't be in the main branch.

Mike, if you have a moment, maybe check this out; if it's OK I'll fix formatting and add tests.
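
The surrogate observation can be checked with a small sketch. In UTF-16, every supplementary-plane noncharacter U+nFFFE or U+nFFFF ends in the low surrogate 0xDFFE or 0xDFFF. The code below is a hypothetical illustration using only standard JDK calls, not the patch itself. Note that the low surrogate alone is also shared with ordinary code points such as U+103FE, so this sketch decodes the pair before deciding; how the patch handles that case is not shown in this thread.

```java
// Hypothetical check for supplementary noncharacters in UTF-16 form.
final class SurrogateNoncharacters {
    // Returns the low (trailing) surrogate of a supplementary code point.
    static char lowSurrogateOf(int supplementaryCodePoint) {
        return Character.toChars(supplementaryCodePoint)[1];
    }

    // True if (high, low) is a valid pair that decodes to U+nFFFE or U+nFFFF.
    static boolean isNoncharacterPair(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            return false;
        }
        int cp = Character.toCodePoint(high, low);
        return (cp & 0xFFFE) == 0xFFFE;
    }
}
```

For example, U+1FFFE encodes as 0xD83F 0xDFFE and is reported as a noncharacter pair, while U+103FE encodes as 0xD800 0xDFFE and is not.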


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772125#action_12772125 ] 

Yonik Seeley commented on LUCENE-2019:
--------------------------------------

If someone purposefully hands lucene internal-use-only characters, doesn't that imply they are using lucene in a process-internal manner?


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772170#action_12772170 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

Steven, yes you are arguing the interchange part.

I am arguing the treatment of a *noncharacter* as an *abstract character*.
If a Lucene index stores a noncharacter as if it were any other character (i.e. within a Term), then it's treated as an abstract character.

If you disagree with this patch, then you should also disagree with treating U+FFFF specially!
I don't see how in the world U+FFFF is different from any other codepoint in the noncharacter category in this regard!



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772110#action_12772110 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

Michael, well if we go by the unicode standard:
Section 3.2

C2 A process shall not interpret a noncharacter code point as an abstract character.
• The noncharacter code points may be used internally, such as for sentinel values
or delimiters, but should not be exchanged publicly.

This makes me think they should not be in terms, but I'll take anyone's interpretation.
If people disagree, just close the issue as won't-fix. I don't think this approach will hurt performance.
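
For illustration, the mapping proposed in this issue could look roughly like the sketch below. It is not the attached LUCENE-2019.patch: the class and method names are hypothetical, and it works on a plain String rather than on Lucene's internal term buffers.

```java
// Hypothetical sketch, not Lucene code: map every Unicode noncharacter to U+FFFD.
final class NoncharacterMapper {
    static final int REPLACEMENT = 0xFFFD;

    // Noncharacters are U+FDD0..U+FDEF plus the last two code points of each plane.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Walk the text by code point so supplementary noncharacters
    // (stored as surrogate pairs) are replaced as a single unit.
    static String mapNoncharacters(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Text without noncharacters passes through unchanged, so the cost in this sketch is one extra check per code point.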



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771541#action_12771541 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

easiest way to get the complete list: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]


[jira] Issue Comment Edited: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771541#action_12771541 ] 

Robert Muir edited comment on LUCENE-2019 at 10/29/09 6:39 PM:
---------------------------------------------------------------

easiest way to get the complete list: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B:Noncharacter_Code_Point=True:%5D


      was (Author: rcmuir):
    easiest way to get the complete list: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
  

[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772151#action_12772151 ] 

Steven Rowe commented on LUCENE-2019:
-------------------------------------

bq. process-internal is something that won't be stored or interchanged in any way (internal to the process)

Right, this is the crux of the disagreement: you think storage (with the exception of in-memory usage) means interchange.  Yonik and I think that storage does not necessarily mean interchange.

Section 16.7 (_Noncharacters_) of the Unicode 5.0.0 standard (the latest version for which an electronic version of this chapter is available) says:

{quote}
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data. See Section 3.4, Characters and Encoding, for the formal definition of noncharacters and conformance requirements related to their use.

The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not "Arabic noncharacters" or "right-to-left noncharacters," and are not distinguished in any other way from the other noncharacters, except in their code point values.

Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters. (See conformance clause C7 in Section 3.2, Conformance Requirements.)

In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.

*U+FFFF and U+10FFFF.*  These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.
{quote}

(I left out the last part about U+FFFE.)

Again, the crux of the matter is the definition of "open interchange".
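
The numbers in the quoted passage can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for the example, and prefixUpperBound assumes that no stored term itself contains U+FFFF and that terms are compared in UTF-16 code unit order (as String.compareTo does).

```java
// Hypothetical sketch, not Lucene code.
final class NoncharacterFacts {
    // U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Count noncharacters over the whole code space: 32 + 17 * 2.
    static int countNoncharacters() {
        int n = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                n++;
            }
        }
        return n;
    }

    // Sentinel use of U+FFFF: a string that sorts after every term
    // starting with the given prefix (assuming terms never contain U+FFFF).
    static String prefixUpperBound(String prefix) {
        return prefix + '\uFFFF';
    }
}
```

The sentinel only works as long as U+FFFF never appears in real term text, which is the reason it is already kept out of the index.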


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772138#action_12772138 ] 

Yonik Seeley commented on LUCENE-2019:
--------------------------------------

Here's a process-internal use-case (as I understand the meaning):
User hands me two tokens. I catenate and separate them with an internal-use char, then index.
Later, I get this term somehow from lucene, split on my internal-use char and hand back to the user.
If lucene converts internal-use chars, this becomes impossible.

What's the use-case for handing lucene internal-use characters and not wanting it preserved?  Couldn't you always use your internal-use characters, and then convert or remove them before handing to lucene?
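
A minimal sketch of the use case above, to make the failure mode concrete. The delimiter U+FDD0 is an arbitrary choice of noncharacter, the class is hypothetical, and mapToReplacement only stands in for what an index that rewrites noncharacters would do to the term.

```java
// Hypothetical sketch, not Lucene code.
final class InternalDelimiterDemo {
    // Arbitrary noncharacter picked as an application-internal delimiter.
    static final char DELIM = '\uFDD0';

    static String join(String first, String second) {
        return first + DELIM + second;
    }

    static String[] split(String term) {
        int i = term.indexOf(DELIM);
        if (i < 0) {
            throw new IllegalArgumentException("delimiter missing from term");
        }
        return new String[] { term.substring(0, i), term.substring(i + 1) };
    }

    // Stand-in for an index that maps the noncharacter to U+FFFD on the way in.
    static String mapToReplacement(String term) {
        return term.replace(DELIM, '\uFFFD');
    }
}
```

With the term stored unchanged, split(join("foo", "bar")) returns the two original tokens. After mapToReplacement the delimiter is gone and split throws, which is the "becomes impossible" case.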


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772154#action_12772154 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

I guess the final thing I will say is that it is inconsistent to treat U+FFFF specially without treating the entire category (noncharacters) the same way.

This is what led to the index corruption bug in the first place, after all, if you look at it from a Unicode perspective and not from a Java perspective.



[jira] Issue Comment Edited: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772147#action_12772147 ] 

Robert Muir edited comment on LUCENE-2019 at 10/30/09 10:53 PM:
----------------------------------------------------------------

bq. And if I'm indexing to a RAM directory? The point is, the private-use char is never seen external to the process (which includes both Lucene and its index).

Whoa, let's not confuse private-use characters with noncharacters. There is a huge difference!!!!

<edit addition>

I see private-use characters as available to the end user, i.e. someone like DM Smith trying to index Myanmar encoded in private-use chars (he mentioned this before).
These are available for his use.

Control characters are available for your use to concatenate or do strange things.

Process-internal codepoints (noncharacters) should not be stored and are available for processing, without concern that they will be treated as an abstract character.

      was (Author: rcmuir):
    bq. And if I'm indexing to a RAM directory? The point is, the private-use char is never seen external to the process (which includes both Lucene and it's index).

whoah, lets not confuse private-use characters with non-characters. there is a huge difference!!!!
  

[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772128#action_12772128 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

Yonik, no. This is how I see it.

Process-internal means just that: internal to processing.
So if I want to use U+FDDF in some Lucene process, say as a syllable delimiter or something like that, I should not have to worry about it being stored in a *portable index* as an abstract character (inside some term text).

Again, this is my interpretation, so if you guys disagree, please mark the issue as won't-fix, but I enjoy the discussion.



[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772211#action_12772211 ] 

Steven Rowe commented on LUCENE-2019:
-------------------------------------

{quote}
Steven, by the way, I think something I haven't been able to communicate properly is that I feel very strongly that storing noncharacters in term text, where they are treated as abstract characters, is very different from using them as sentinel values / delimiters / etc. in the index format. I think the latter is OK and is what they are for.

But term text is different: search engines index human language, and by putting noncharacters in term text you are treating them as abstract characters.
{quote}

Robert, you are a proponent of the (ICU)CollationKeyFilter functionality, which uses IndexableBinaryStringTools to store arbitrary binary data in a Lucene index.  These filters store non-human-readable terms in the index.  I can think of several other examples of using Lucene indexes to store non-human-language terms.

Character data, in addition to representing characters, is *data*.  Bits.  I would argue that you *always* need context to figure out what bits represent.


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772215#action_12772215 ] 

Uwe Schindler commented on LUCENE-2019:
---------------------------------------

... the same with trie (numeric) fields ...


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772120#action_12772120 ] 

Yonik Seeley commented on LUCENE-2019:
--------------------------------------

bq. If people want to use internal-use-only characters in Lucene indexes, as long as Lucene doesn't reserve them for its own use, why stop them?

+1


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772142#action_12772142 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

Yonik, I would argue that your process-internal use case is in fact process-external.

In this case of concatenation you should instead use a control character, or something more suitable for this purpose.

Process-internal is something that won't be stored or interchanged in any way (internal to the process).


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772121#action_12772121 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

Steven, the only reason I might disagree is that a Lucene index is supposed to be portable to implementations in languages other than Java.

In my opinion, if you store process-internal codepoints as abstract characters in terms, then you should not claim that Lucene indexes are in any Unicode format, because they then violate the standard.

By *not* storing them in terms, you are free to use them as delimiters or for other purposes. Right now U+FFFF is used as a delimiter, but who knows, maybe someday you might need more?


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772214#action_12772214 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

bq. Robert, you are a proponent of the (ICU)CollationKeyFilter functionality, which uses IndexableBinaryStringTools to store arbitrary binary data in a Lucene index.

I'm actually only a fan of the idea of using Unicode collation for searching and sorting.
As a default, binary comparison for most languages is absolute madness.

I'm not so much a fan of how it encodes this binary data into what people expect to be a character field, but I understand the limitations and don't blame the implementation :)
And this implementation never uses any of the characters in question, although that's not really relevant.


> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally, it can't be in the index or will cause problems. 


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771623#action_12771623 ] 

Robert Muir commented on LUCENE-2019:
-------------------------------------

I think this code won't be so intrusive or hairy.
Here is the list of noncharacters in UTF-16 representation (surrogate pairs for the points above the BMP).
Note that for the points above the BMP, the trail surrogate is always U+DFFE or U+DFFF.

BMP points:
{noformat}
\uFDD0-\uFDEF
\uFFFE
\uFFFF <-- already handled
{noformat}

Points above the BMP:
{noformat}
\uD83F\uDFFE
\uD83F\uDFFF
\uD87F\uDFFE
\uD87F\uDFFF
\uD8BF\uDFFE
\uD8BF\uDFFF
\uD8FF\uDFFE
\uD8FF\uDFFF
\uD93F\uDFFE
\uD93F\uDFFF
\uD97F\uDFFE
\uD97F\uDFFF
\uD9BF\uDFFE
\uD9BF\uDFFF
\uD9FF\uDFFE
\uD9FF\uDFFF
\uDA3F\uDFFE
\uDA3F\uDFFF
\uDA7F\uDFFE
\uDA7F\uDFFF
\uDABF\uDFFE
\uDABF\uDFFF
\uDAFF\uDFFE
\uDAFF\uDFFF
\uDB3F\uDFFE
\uDB3F\uDFFF
\uDB7F\uDFFE
\uDB7F\uDFFF
\uDBBF\uDFFE
\uDBBF\uDFFF
\uDBFF\uDFFE
\uDBFF\uDFFF
{noformat}
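
As a rough illustration of the list above, here is a minimal Java sketch of the proposed mapping. It is not the attached patch: the class and method names (Noncharacters, isNoncharacter, replaceNoncharacters) are made up for this example, and it works on a plain String rather than on Lucene's internal term buffers. It relies on the fact that every noncharacter outside U+FDD0-U+FDEF has its low 16 bits equal to FFFE or FFFF, which is why the trail surrogate in the list is always U+DFFE or U+DFFF.

```java
class Noncharacters {
    // U+FFFD REPLACEMENT CHARACTER, the mapping target proposed in this issue.
    static final int REPLACEMENT = 0xFFFD;

    /** True if cp is one of the 66 Unicode noncharacters. */
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            return false; // not a Unicode code point at all
        }
        // U+FDD0..U+FDEF, plus the last two code points of every plane
        // (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of s with every noncharacter replaced by U+FFFD. */
    static String replaceNoncharacters(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // decodes a surrogate pair if present
            i += Character.charCount(cp);   // advance by 1 or 2 chars
            sb.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp);
        }
        return sb.toString();
    }
}
```

Working on code points rather than on individual chars keeps the check to two comparisons, at the cost of decoding surrogate pairs; a real implementation inside the indexer would probably test the UTF-16 units directly.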


> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally, it can't be in the index or will cause problems. 


[jira] Issue Comment Edited: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772142#action_12772142 ] 

Robert Muir edited comment on LUCENE-2019 at 10/30/09 10:47 PM:
----------------------------------------------------------------

Yonik, I'd argue that your process-internal use case is in fact process-external.

In this case of concatenation you should instead use a control character, or something more suitable for the purpose.
For example, you could use the "information separator" controls U+001C-U+001F, which is much nicer than using, say, U+FFFE for this purpose:
some other process could recognize that you were separating information, rather than just using a noncharacter for process-internal use.

Process-internal means something that won't be stored or interchanged in any way (internal to the process).
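
To make the separator suggestion concrete, here is a small Java sketch of concatenating fields with U+001F (INFORMATION SEPARATOR ONE, the "unit separator") instead of a noncharacter. The class and method names are hypothetical and nothing here is Lucene API; it also assumes the field values themselves never contain U+001F, which the join method checks.

```java
import java.util.Arrays;
import java.util.List;

class SeparatorFields {
    // U+001F INFORMATION SEPARATOR ONE: an ordinary control character that is
    // valid for storage and interchange, unlike a noncharacter such as U+FFFE.
    static final char UNIT_SEPARATOR = '\u001F';

    /** Joins fields with U+001F; rejects fields that already contain it. */
    static String join(List<String> fields) {
        for (String field : fields) {
            if (field.indexOf(UNIT_SEPARATOR) >= 0) {
                throw new IllegalArgumentException("field contains U+001F");
            }
        }
        return String.join(String.valueOf(UNIT_SEPARATOR), fields);
    }

    /** Splits a joined value back into its fields, keeping empty fields. */
    static List<String> split(String joined) {
        // Limit -1 keeps trailing empty strings, so empty fields survive.
        return Arrays.asList(joined.split(String.valueOf(UNIT_SEPARATOR), -1));
    }
}
```

For example, SeparatorFields.split(SeparatorFields.join(Arrays.asList("title", "", "body"))) gives back the same three fields, and the stored value contains only characters that another process is allowed to see.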

      was (Author: rcmuir):
    Yonik, I'd argue that your process-internal use case is in fact process-external.

In this case of concatenation you should instead use a control character, or something more suitable for the purpose.

Process-internal means something that won't be stored or interchanged in any way (internal to the process).
  
> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally, it can't be in the index or will cause problems. 


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772184#action_12772184 ] 

Steven Rowe commented on LUCENE-2019:
-------------------------------------

bq. by disallowing all noncharacters as term text, lucene is *more free* to use them as delimiters, and sentinel values, and such, as specified in chapter 3 of the standard.

Lucene is more free, but Lucene's users are not.  Quite the contrary.

IMHO, Lucene's users (applications that incorporate the Lucene library) should be able to use Unicode data in ways that the standard allows ("Applications are free to use any of these noncharacter code points internally").

U+FFFF was chosen for Lucene-internal use for reasons very similar to those you're bringing up, Robert: something like "who would ever want to use non-characters in an index?"  However, this choice does not obligate Lucene to take the same action for all other non-characters.

I think the fix here is documentation, not proscription.


> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally, it can't be in the index or will cause problems. 


[jira] Issue Comment Edited: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772128#action_12772128 ] 

Robert Muir edited comment on LUCENE-2019 at 10/30/09 10:30 PM:
----------------------------------------------------------------

Yonik, no. This is how I see it.

Process-internal means just that: internal to processing.
So if I want to use U+FDDF in some Lucene process, say as a syllable delimiter or something like that, I should not have to worry about it being stored in a *portable index* as an abstract character (inside some term text).

Again, this is my interpretation, so if you guys disagree, please mark the issue as "won't fix", but I enjoy the discussion.

<edit addition>

Here is an example use case: perhaps I want to make a Query that needs to do some fuzzy matching or something crazy for a difficult language.
I should be able to use any of these process-internal codepoints internally as delimiters in my processing (process-internal)
without worrying that they will be in the Term text from TermEnum.

      was (Author: rcmuir):
    Yonik, no. This is how I see it.

Process-internal means just that: internal to processing.
So if I want to use U+FDDF in some Lucene process, say as a syllable delimiter or something like that, I should not have to worry about it being stored in a *portable index* as an abstract character (inside some term text).

Again, this is my interpretation, so if you guys disagree, please mark the issue as "won't fix", but I enjoy the discussion.

  
> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally, it can't be in the index or will cause problems. 


[jira] Commented: (LUCENE-2019) map unicode process-internal codepoints to replacement character

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772146#action_12772146 ] 

Yonik Seeley commented on LUCENE-2019:
--------------------------------------

bq. process-internal is something that won't be stored or interchanged in any way (internal to the process)
And if I'm indexing to a RAM directory?  The point is, the private-use char is never seen external to the process (which includes both Lucene and its index).

> map unicode process-internal codepoints to replacement character
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2019
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2019
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2019.patch
>
>
> A spinoff from LUCENE-2016.
> There are several process-internal codepoints in unicode, we should not store these in the index.
> Instead they should be mapped to replacement character (U+FFFD), so they can be used process-internally.
> An example of this is how Lucene Java currently uses U+FFFF process-internally, it can't be in the index or will cause problems. 
