Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2009/10/29 18:06:59 UTC

[jira] Created: (LUCENE-2016) replace invalid U+FFFF character during indexing

replace invalid U+FFFF character during indexing
------------------------------------------------

                 Key: LUCENE-2016
                 URL: https://issues.apache.org/jira/browse/LUCENE-2016
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 2.9, 2.4.1, 2.4
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 2.9.1, 3.0


If the invalid U+FFFF character is embedded in a token, it actually causes indexing to silently corrupt the index by writing duplicate terms into the terms dict.  CheckIndex will catch the error, and merging will hit exceptions (I think).

We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll just do the same with U+FFFF.
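To make the proposal concrete, here is a minimal sketch of that kind of replacement. It is not the attached patch and does not use Lucene's own code; the class and method names (TermSanitizer, sanitize) are made up for illustration, and it works on a plain String rather than on the indexer's internal buffers.

```java
// Hypothetical helper, not part of Lucene: shows the replacement rule only.
final class TermSanitizer {
    private static final char REPLACEMENT = '\uFFFD';

    /** Returns a copy of the token with U+FFFF and unpaired surrogates replaced by U+FFFD. */
    static String sanitize(String token) {
        char[] out = token.toCharArray();
        for (int i = 0; i < out.length; i++) {
            char c = out[i];
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < out.length && Character.isLowSurrogate(out[i + 1])) {
                    i++; // valid pair: keep both halves and skip the low surrogate
                } else {
                    out[i] = REPLACEMENT; // high surrogate without a partner
                }
            } else if (Character.isLowSurrogate(c) || c == '\uFFFF') {
                out[i] = REPLACEMENT; // stray low surrogate, or the U+FFFF noncharacter
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // Prints the code units of the sanitized token in hex.
        for (char c : sanitize("foo\uFFFFbar").toCharArray()) {
            System.out.printf("%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Valid surrogate pairs pass through untouched; only U+FFFF and surrogates without a partner are rewritten.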

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771559#action_12771559 ] 

Earwin Burrfoot commented on LUCENE-2016:
-----------------------------------------

Being one of those hit by U+FFFF earlier, I'd rather see the remapping happen in some filter, with IW throwing an exception on whatever it deems 'illegal'. Or at the very least a big fat documentation entry that jumps in your face somehow and lists all code points that will be remapped.
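A rough sketch of the "throw instead of remap" half of this suggestion, assuming a standalone check that runs before a token reaches the writer. StrictTermCheck and requireIndexable are hypothetical names, not Lucene API, and the only code point treated as illegal here is U+FFFF.

```java
// Hypothetical strict check, not part of Lucene.
final class StrictTermCheck {
    /** Returns the token unchanged, or throws if it contains U+FFFF. */
    static String requireIndexable(String token) {
        int bad = token.indexOf('\uFFFF');
        if (bad >= 0) {
            // Fail loudly so the caller can decide how to remap the token.
            throw new IllegalArgumentException(
                "token contains U+FFFF at char offset " + bad);
        }
        return token;
    }
}
```

With a check like this, any remapping would be left to an upstream filter chosen by the application, and nothing would be changed silently.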



[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771534#action_12771534 ] 

Robert Muir commented on LUCENE-2016:
-------------------------------------

Michael, one last question: with your patch, is there still a possibility of index problems if you had foobar<U+FFFF> but also foobar<U+FFFF><U+FFFF>? Will it create duplicate terms of foobar<U+FFFD>?


[jira] Issue Comment Edited: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771486#action_12771486 ] 

Robert Muir edited comment on LUCENE-2016 at 10/29/09 5:22 PM:
---------------------------------------------------------------

even disregarding the problem, I think FFFD is much better than truncation... it's what I expect.

but I think we should handle U+FFFE also (and FDD0-FDEF).

there are actually a few more noncharacters ('guaranteed not to be a character, not for interchange') outside of the BMP, but that invalid surrogate logic looks pretty hairy already :)
to see the full list, look at http://www.unicode.org/charts/

under subheading: 
Noncharacters in Charts

Reserved range

Noncharacters at end of ...
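
For reference, the full set being described is small enough to test directly. The sketch below assumes the Unicode definition of noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The class and method names are made up for illustration and are not part of Lucene.

```java
// Hypothetical helper, not part of Lucene.
final class Noncharacters {
    /** True for the Unicode noncharacters: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF in every plane. */
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        // The contiguous block of 32 noncharacters inside the BMP.
        if (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) {
            return true;
        }
        // The last two code points of each plane have bits 1..15 of the low 16 bits all set.
        return (codePoint & 0xFFFE) == 0xFFFE;
    }
}
```

Counting the matches over the whole code point range gives 32 + 2 * 17 = 66.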


[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771531#action_12771531 ] 

Yonik Seeley commented on LUCENE-2016:
--------------------------------------

bq. This is not true. if you map them to replacement characters, then my app is free to use them "process-internally" 

Tricky semantics :-)  It rather depends on whether you consider Lucene part of your "process-internally". Depending on the use case, it could be either.



[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771486#action_12771486 ] 

Robert Muir commented on LUCENE-2016:
-------------------------------------

even disregarding the problem, I think FFFD is much better than truncation... it's what I expect.

but I think we should handle U+FFFE also.

there are actually a few more noncharacters ('guaranteed not to be a character, not for interchange') outside of the BMP, but that invalid surrogate logic looks pretty hairy already :)
to see the full list, look at http://www.unicode.org/charts/

under subheading: 
Noncharacters in Charts

Reserved range

Noncharacters at end of ...


[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771565#action_12771565 ] 

Robert Muir commented on LUCENE-2016:
-------------------------------------

Earwin, take a look at LUCENE-2019. I added a hyperlink to the list there... 


[jira] Updated: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2016:
---------------------------------------

    Attachment: LUCENE-2016.patch

Fix is trivial.


[jira] Issue Comment Edited: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771533#action_12771533 ] 

Robert Muir edited comment on LUCENE-2016 at 10/29/09 6:23 PM:
---------------------------------------------------------------

bq. Tricky semantics It rather depends on whether you consider Lucene part of your "process-internally". Depending on the use case, it could be either.

Not really, Lucene-java uses U+FFFF process-internally, but wasn't mapping it to something valid in the index. So when U+FFFF was stored in the index (or rather, wasn't being stored but handled incorrectly), it created an issue. This is a perfect example of this.




[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771545#action_12771545 ] 

Michael McCandless commented on LUCENE-2016:
--------------------------------------------

bq. with your patch, is there still a possibility of index problems if you had foobar<U+FFFF> but also foobar<U+FFFF><U+FFFF>? Will it create duplicate terms of foobar<U+FFFD>?

I think this won't cause problems -- that term will just be rewritten to foobar\ufffd\ufffd.
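
The reasoning can be checked with a few lines of plain Java. This is only an illustration of a one-for-one replacement, using a made-up class name, not the patched indexing code: because each U+FFFF becomes exactly one U+FFFD, the two inputs keep different lengths and stay distinct.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class, not part of Lucene.
final class ReplacementDemo {
    /** Replaces every U+FFFF with U+FFFD, one char for one char. */
    static String replaceFFFF(String term) {
        return term.replace('\uFFFF', '\uFFFD');
    }

    /** Applies the replacement to each term and collects the distinct results. */
    static Set<String> distinctTerms(String... terms) {
        Set<String> result = new TreeSet<>();
        for (String term : terms) {
            result.add(replaceFFFF(term));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> terms = distinctTerms("foobar\uFFFF", "foobar\uFFFF\uFFFF");
        System.out.println(terms.size() + " distinct terms after replacement");
    }
}
```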


[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771517#action_12771517 ] 

Robert Muir commented on LUCENE-2016:
-------------------------------------

Mike, I disagree.

Here is my reasoning: Lucene Java happens to use U+FFFF as an internal identifier for processing.
However, this is your choice; you could just as easily have used U+FFFE, or some other code point even outside the BMP, for this purpose. The standard gives you several options, and you might need multiple process-internal characters to accomplish what you want to do internally.

It's my understanding that Lucene indexes should be portable to different programming languages: perhaps my implementation in C/perl/python decides to use a different process-internal character. This is allowed by Unicode and I think we should adhere to it; I don't think it's being anal.

Finally, I completely disagree with the nontrivial performance comment. The trick is to make sure the execution branch / checks for the process-internal characters outside the BMP only occur for surrogate pairs. They are statistically very rare and, if done right, it will not affect performance of BMP content.
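
A sketch of that branch structure, under the assumption that supplementary noncharacters are the code points U+nFFFE and U+nFFFF above the BMP. The names are hypothetical and this is not Lucene code; handling of BMP noncharacters such as U+FFFF is deliberately left out so that the fast path stays a single surrogate test.

```java
// Hypothetical sketch, not part of Lucene.
final class FastPathSanitizer {
    private static final char REPLACEMENT = '\uFFFD';

    /** True for U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. */
    static boolean isSupplementaryNoncharacter(int codePoint) {
        return codePoint > 0xFFFF && (codePoint & 0xFFFE) == 0xFFFE;
    }

    static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isSurrogate(c)) {
                sb.append(c); // fast path: BMP content pays for one test only
                continue;
            }
            // Rare path: only reached when a surrogate code unit shows up.
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                char low = s.charAt(++i);
                if (isSupplementaryNoncharacter(Character.toCodePoint(c, low))) {
                    sb.append(REPLACEMENT); // the whole pair becomes one U+FFFD
                } else {
                    sb.append(c).append(low); // ordinary supplementary character
                }
            } else {
                sb.append(REPLACEMENT); // unpaired surrogate
            }
        }
        return sb.toString();
    }
}
```

The noncharacter test sits entirely inside the surrogate branch, so text without surrogates never reaches it.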




[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771526#action_12771526 ] 

Robert Muir commented on LUCENE-2016:
-------------------------------------

{quote}
But if we forcefully map all invalid-for-interchange unicode characters to the replacement character (I think that's what's being proposed, right?), then your app no longer has any characters it can use for its own "internal" purposes?
{quote}

This is not true. If you map them to replacement characters, then my app is free to use them "process-internally" as specified by the standard, without any concern that they will appear in the "interchange" (Lucene index data).

I agree with you, let's open a separate "anal unicode issue". Let's go with your U+FFFF fix for Lucene 2.9, since it fixes Lucene Java, but correct this for 3.x in the future?




[jira] Resolved: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-2016.
----------------------------------------

    Resolution: Fixed


[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771508#action_12771508 ] 

Michael McCandless commented on LUCENE-2016:
--------------------------------------------

Lucene has "traditionally" not enforced the "not for interchange"
characters, ie, just let them through.

But then with the indexing speedups (LUCENE-843), we no longer allowed
U+FFFF, and with the cutover to true UTF-8 in the index, we no longer
allowed invalid surrogate pairs.

And we know apps use these characters (because they hit problems with
U+FFFF on upgrading to 2.3).

So I think it would be too anal to suddenly replace all of these
invalid interchange chars, starting today?  (Though it would
obviously be more "standards compliant".)  Plus, it would cost us
nontrivial indexing CPU to do so!



[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771549#action_12771549 ] 

Robert Muir commented on LUCENE-2016:
-------------------------------------

Michael, duh :) I think smart chinese has damaged my brain for the rest of the day.

Thanks for fixing this.


[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771525#action_12771525 ] 

Michael McCandless commented on LUCENE-2016:
--------------------------------------------

bq. Finally, I completely disagree with the nontrivial performance comment. The trick is to make sure the execution branch / checks for the process-internal characters outside the BMP only occur for surrogate pairs. They are statistically very rare and, if done right, it will not affect performance of BMP content.

OK I agree, you're right: we could in fact do this with negligible impact to performance.

bq. It's my understanding that Lucene indexes should be portable to different programming languages: perhaps my implementation in C/perl/python decides to use a different process-internal character. This is allowed by Unicode and I think we should adhere to it; I don't think it's being anal.

But if we forcefully map all invalid-for-interchange unicode characters to the replacement character (I think that's what's being proposed, right?), then your app no longer has any characters it can use for its own "internal" purposes?

Can you open a new issue to track this?  This is a wider discussion than preventing index corruption :)


[jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771533#action_12771533 ] 

Robert Muir commented on LUCENE-2016:
-------------------------------------

bq. Tricky semantics It rather depends on whether you consider Lucene part of your "process-internally". Depending on the use case, it could be either.

Not really, Lucene-java uses U+FFFF process-internally, but wasn't mapping it to something valid in the index. So when U+FFFF was stored in the index, it created an issue. This is a perfect example of this.


