Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/12/14 01:28:28 UTC

[lucy-dev] utf8proc, control chars and non-character code points

Greets,

I just committed a test to trunk which verifies that utf8proc's normalization
works properly, in that normalizing a second time is a no-op.  However, I had
to disable the test because utf8proc chokes when fed strings which contain
either control characters or non-character code points.

    http://svn.apache.org/viewvc?view=revision&revision=1213996

The test uses random UTF-8 data, generated by TestUtils_random_string().  With
the hack below my sig, the test passes.
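
The property being tested, that normalizing already-normalized text changes
nothing, can be sketched without utf8proc.  In the sketch below, fold_ascii()
is a hypothetical stand-in for the real normalizer and is_idempotent() is an
illustrative helper; neither is part of Lucy's test suite.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* A normalizer takes a NUL-terminated string and returns a newly
 * allocated, normalized copy, or NULL on failure. */
typedef char *(*normalize_fn)(const char *input);

/* Hypothetical stand-in for a real normalizer: lowercases ASCII letters
 * and copies every other byte unchanged. */
static char *fold_ascii(const char *input) {
    size_t len = strlen(input);
    char *out = malloc(len + 1);
    if (out == NULL) { return NULL; }
    for (size_t i = 0; i < len; i++) {
        unsigned char byte = (unsigned char)input[i];
        out[i] = byte < 0x80 ? (char)tolower(byte) : input[i];
    }
    out[len] = '\0';
    return out;
}

/* Returns true if normalizing twice yields the same bytes as
 * normalizing once.  A failed normalization counts as false. */
static bool is_idempotent(normalize_fn normalize, const char *input) {
    char *once  = normalize(input);
    char *twice = once ? normalize(once) : NULL;
    bool same   = once && twice && strcmp(once, twice) == 0;
    free(once);
    free(twice);
    return same;
}
```

A randomized test would call is_idempotent() once per generated string and
report the first input for which it returns false.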

Strings which contain control characters are valid UTF-8, as are strings which
contain noncharacters.  Noncharacters are not supposed to be used for
interchange, but Lucy is a library, not an application, and thus should pass
noncharacters cleanly.

    http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters

Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
reports an error, we simply leave the token alone.  That seems appropriate in
the case of malformed UTF-8, but I question whether it is appropriate for
valid UTF-8 sequences containing control characters or non-character code
points.
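
To make the distinction concrete, here is a rough sketch of a classifier that
separates malformed UTF-8 from valid UTF-8 which happens to contain
noncharacters.  The enum and function names are hypothetical; this is neither
Lucy's nor utf8proc's code, only an illustration of the three cases.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    UTF8_MALFORMED,    /* not valid UTF-8 at all */
    UTF8_HAS_NONCHAR,  /* valid UTF-8 containing a noncharacter */
    UTF8_CLEAN         /* valid UTF-8, no noncharacters */
} utf8_class;

static utf8_class classify_utf8(const unsigned char *s, size_t len) {
    utf8_class result = UTF8_CLEAN;
    size_t i = 0;
    while (i < len) {
        unsigned char lead = s[i++];
        uint32_t cp;     /* code point being decoded */
        uint32_t min;    /* smallest code point legal for this length */
        size_t extra;    /* continuation bytes expected */
        if (lead < 0x80)                { cp = lead;        extra = 0; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else { return UTF8_MALFORMED; }  /* stray continuation or bad lead */

        if (extra > len - i) { return UTF8_MALFORMED; }  /* truncated */
        for (size_t k = 0; k < extra; k++) {
            unsigned char c = s[i++];
            if ((c & 0xC0) != 0x80) { return UTF8_MALFORMED; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF)    { return UTF8_MALFORMED; }  /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) { return UTF8_MALFORMED; }  /* surrogate */

        /* U+nFFFE, U+nFFFF in every plane, plus U+FDD0..U+FDEF. */
        if ((cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF)) {
            result = UTF8_HAS_NONCHAR;
        }
    }
    return result;
}
```

With something like this, a caller could leave the token alone only for
UTF8_MALFORMED and treat UTF8_HAS_NONCHAR as a separate case.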

Marvin Humphrey

Index: core/Lucy/Test/TestUtils.c
===================================================================
--- core/Lucy/Test/TestUtils.c  (revision 1213967)
+++ core/Lucy/Test/TestUtils.c  (working copy)
@@ -17,6 +17,7 @@
 #define C_LUCY_TESTUTILS
 #include "Lucy/Util/ToolSet.h"
 #include <string.h>
+#include <ctype.h>
 
 #include "Lucy/Test/TestUtils.h"
 #include "Lucy/Test.h"
@@ -106,6 +107,15 @@
         if (code_point > 0xD7FF && code_point < 0xE000) {
             continue; // UTF-16 surrogate.
         }
+        if (iscntrl(code_point)) {
+            continue;
+        }
+        if ((code_point & 0xFFFF) == 0xFFEF
+            || (code_point & 0xFFFF) == 0xFFFF
+            || (code_point >= 0xFDD0 && code_point <= 0xFDEF)
+           ) {
+            continue; // Unicode non-character code point.
+        }
         break;
     }
     return code_point;
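
One caveat about the hack above: in standard C, iscntrl() is only defined for
arguments representable as unsigned char (or EOF), so calling it on arbitrary
code points is not portable.  A range-based check avoids that.  The sketch
below assumes "control character" means Unicode general category Cc
(U+0000..U+001F and U+007F..U+009F); the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true for C0 controls, DELETE, and C1 controls. */
static bool is_control_code_point(uint32_t code_point) {
    return code_point <= 0x1F
           || (code_point >= 0x7F && code_point <= 0x9F);
}
```

In the loop above, this would take the place of the iscntrl() call.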



Re: [lucy-dev] utf8proc, control chars and non-character code points

Posted by Nick Wellnhofer <we...@aevum.de>.
On 14/12/2011 01:28, Marvin Humphrey wrote:
> I just committed a test to trunk which verifies that utf8proc's normalization
> works properly, in that normalizing a second time is a no-op.  However, I had
> to disable the test because utf8proc chokes when fed strings which contain
> either control characters or non-character code points.

You're right that utf8proc doesn't allow non-characters, but I don't
think that control characters are blocked.

> contain noncharacters.  Noncharacters are not supposed to be used for
> interchange, but Lucy is a library, not an application, and thus should pass
> noncharacters cleanly.

By that argument we could also remove the check for Unicode surrogates. 
OTOH, passing UTF-8 to a library is a kind of interchange.

> Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
> reports an error, we simply leave the token alone.  That seems appropriate in
> the case of malformed UTF-8, but I question whether it is appropriate for
> valid UTF-8 sequences containing control characters or non-character code
> points.

We should either remove the check for non-characters from utf8proc or 
disallow non-characters in the rest of Lucy. I'm fine with either solution.

> +        if ((code_point & 0xFFFF) == 0xFFEF

This should check for 0xFFFE.
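
With 0xFFFE in place, the test matches the 66 Unicode noncharacters: the last
two code points of each of the 17 planes, plus U+FDD0..U+FDEF.  A sketch as a
standalone predicate follows; the function names are hypothetical and this is
not the committed code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if code_point is a Unicode noncharacter. */
static bool is_noncharacter(uint32_t code_point) {
    uint32_t low16 = code_point & 0xFFFF;
    return low16 == 0xFFFE
           || low16 == 0xFFFF
           || (code_point >= 0xFDD0 && code_point <= 0xFDEF);
}

/* Counts noncharacters across the whole code space, U+0000..U+10FFFF,
 * as a sanity check on the predicate. */
static unsigned count_noncharacters(void) {
    unsigned count = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        if (is_noncharacter(cp)) { count++; }
    }
    return count;
}
```

Note that U+FFEF itself is not a noncharacter, so the original check both
skipped a code point it did not need to and let U+nFFFE through.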

Nick

Re: [lucy-dev] utf8proc, control chars and non-character code points

Posted by "David E. Wheeler" <da...@kineticode.com>.
On Dec 14, 2011, at 2:18 AM, Peter Karman wrote:

> Swish3 uses the \003 control character as an internal field delimiter, so passing
> that through is pretty vital. Are you saying that utf8proc chokes on that valid
> UTF-8 sequence?

I do the same thing to index lists of things on Lucy in PGXN:

  https://github.com/pgxn/pgxn-api/blob/master/lib/PGXN/API/Indexer.pm#L77

Best,

David


Re: [lucy-dev] utf8proc, control chars and non-character code points

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 12/13/11 6:28 PM:
> Greets,
> 
> I just committed a test to trunk which verifies that utf8proc's normalization
> works properly, in that normalizing a second time is a no-op.  However, I had
> to disable the test because utf8proc chokes when fed strings which contain
> either control characters or non-character code points.
> 
>     http://svn.apache.org/viewvc?view=revision&revision=1213996
> 
> The test uses random UTF-8 data, generated by TestUtils_random_string().  With
> the hack below my sig, the test passes.
> 
> Strings which contain control characters are valid UTF-8, as are strings which
> contain noncharacters.  Noncharacters are not supposed to be used for
> interchange, but Lucy is a library, not an application, and thus should pass
> noncharacters cleanly.
> 
>     http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
> 
> Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
> reports an error, we simply leave the token alone.  That seems appropriate in
> the case of malformed UTF-8, but I question whether it is appropriate for
> valid UTF-8 sequences containing control characters or non-character code
> points.


Swish3 uses the \003 control character as an internal field delimiter, so passing
that through is pretty vital. Are you saying that utf8proc chokes on that valid
UTF-8 sequence?
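
For illustration, the delimiter convention looks roughly like the sketch
below.  This is not Swish3's code; the function names and buffer handling are
hypothetical.  U+0003 encodes in UTF-8 as the single byte 0x03, so the joined
string stays valid UTF-8 as long as each field is.

```c
#include <stddef.h>
#include <string.h>

#define FIELD_DELIM '\003'  /* U+0003 END OF TEXT, one byte in UTF-8 */

/* Joins `count` fields into `out`, separated by FIELD_DELIM.
 * Returns the joined length in bytes, or -1 if `out` is too small. */
static long join_fields(const char *const *fields, size_t count,
                        char *out, size_t out_size) {
    if (out_size == 0) { return -1; }
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        size_t len  = strlen(fields[i]);
        size_t need = len + (i > 0 ? 1 : 0);
        if (need > out_size - 1 - used) { return -1; }  /* keep room for NUL */
        if (i > 0) { out[used++] = FIELD_DELIM; }
        memcpy(out + used, fields[i], len);
        used += len;
    }
    out[used] = '\0';
    return (long)used;
}

/* Counts the fields in a delimited string of `len` bytes. */
static size_t count_fields(const char *joined, size_t len) {
    size_t fields = 1;
    for (size_t i = 0; i < len; i++) {
        if (joined[i] == FIELD_DELIM) { fields++; }
    }
    return fields;
}
```

If the normalizer rejected or altered the 0x03 bytes, count_fields() on the
normalized text would no longer match the number of fields that went in.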





-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com