Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/12/14 01:28:28 UTC
[lucy-dev] utf8proc, control chars and non-character code points
Greets,
I just committed a test to trunk which verifies that utf8proc's normalization
works properly, in that normalizing a second time is a no-op. However, I had
to disable the test because utf8proc chokes when fed strings which contain
either control characters or non-character code points.
http://svn.apache.org/viewvc?view=revision&revision=1213996
The test uses random UTF-8 data, generated by TestUtils_random_string(). With
the hack below my sig, the test passes.
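The property being tested is idempotence: normalizing already-normalized text should change nothing. As a minimal sketch of that check, assuming a hypothetical `normalize_fn` signature (this is not utf8proc's or Lucy's actual API) and using a toy ASCII lowercaser as a stand-in normalizer:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature: returns a newly malloc'd, NUL-terminated
 * normalized copy of `input`, or NULL on error. */
typedef char *(*normalize_fn)(const char *input);

/* Stand-in normalizer for illustration only: lowercases ASCII letters. */
char *
ascii_lowercase(const char *input) {
    size_t len = strlen(input);
    char *out = malloc(len + 1);
    if (out == NULL) { return NULL; }
    for (size_t i = 0; i <= len; i++) { /* <= also copies the NUL. */
        char c = input[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
    }
    return out;
}

/* True if normalizing the output of `normalize` a second time is a no-op. */
bool
normalization_is_idempotent(normalize_fn normalize, const char *input) {
    char *once = normalize(input);
    if (once == NULL) { return false; }
    char *twice = normalize(once);
    bool same = twice != NULL && strcmp(once, twice) == 0;
    free(twice); /* free(NULL) is a no-op. */
    free(once);
    return same;
}
```

Because the sketch relies on NUL-terminated strings, it cannot exercise inputs containing U+0000. A real test over random code points would need length-based buffers.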
Strings which contain control characters are valid UTF-8, as are strings which
contain noncharacters. Noncharacters are not supposed to be used for
interchange, but Lucy is a library, not an application, and thus should pass
noncharacters cleanly.
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
reports an error, we simply leave the token alone. That seems appropriate in
the case of malformed UTF-8, but I question whether it is appropriate for
valid UTF-8 sequences containing control characters or non-character code
points.
Marvin Humphrey
Index: core/Lucy/Test/TestUtils.c
===================================================================
--- core/Lucy/Test/TestUtils.c (revision 1213967)
+++ core/Lucy/Test/TestUtils.c (working copy)
@@ -17,6 +17,7 @@
 #define C_LUCY_TESTUTILS
 #include "Lucy/Util/ToolSet.h"
 #include <string.h>
+#include <ctype.h>
 
 #include "Lucy/Test/TestUtils.h"
 #include "Lucy/Test.h"
@@ -106,6 +107,15 @@
         if (code_point > 0xD7FF && code_point < 0xE000) {
             continue; // UTF-16 surrogate.
         }
+        if (iscntrl(code_point)) {
+            continue;
+        }
+        if ((code_point & 0xFFFF) == 0xFFEF
+            || (code_point & 0xFFFF) == 0xFFFF
+            || (code_point >= 0xFDD0 && code_point <= 0xFDEF)
+           ) {
+            continue; // Unicode non-character code point.
+        }
         break;
     }
     return code_point;
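One caveat about calling iscntrl() on a full code point: the C standard only defines the <ctype.h> functions for arguments representable as unsigned char (or equal to EOF), so larger values are undefined behavior. A minimal sketch of a range-based alternative, assuming the intent is to skip Unicode control characters (general category Cc). The helper name `is_control_code_point` is hypothetical and not part of Lucy or utf8proc:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of Lucy or utf8proc.
 * Returns true for code points in Unicode general category Cc:
 * U+0000..U+001F (C0 controls), U+007F (DEL), U+0080..U+009F (C1 controls). */
bool
is_control_code_point(uint32_t code_point) {
    return code_point <= 0x1F
           || (code_point >= 0x7F && code_point <= 0x9F);
}
```

Unlike iscntrl(), this check does not depend on the current locale and is defined for every uint32_t value.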
Re: [lucy-dev] utf8proc, control chars and non-character code points
Posted by Nick Wellnhofer <we...@aevum.de>.
On 14/12/2011 01:28, Marvin Humphrey wrote:
> I just committed a test to trunk which verifies that utf8proc's normalization
> works properly, in that normalizing a second time is a no-op. However, I had
> to disable the test because utf8proc chokes when fed strings which contain
> either control characters or non-character code points.
You're right that utf8proc doesn't allow non-characters but I don't
think that control characters are blocked.
> contain noncharacters. Noncharacters are not supposed to be used for
> interchange, but Lucy is a library, not an application, and thus should pass
> noncharacters cleanly.
By that argument we could also remove the check for Unicode surrogates.
OTOH, passing UTF-8 to a library is a kind of interchange.
> Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
> reports an error, we simply leave the token alone. That seems appropriate in
> the case of malformed UTF-8, but I question whether it is appropriate for
> valid UTF-8 sequences containing control characters or non-character code
> points.
We should either remove the check for non-characters from utf8proc or
disallow non-characters in the rest of Lucy. I'm fine with either solution.
> + if ((code_point & 0xFFFF) == 0xFFEF
This should check for 0xFFFE.
Nick
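For reference, Unicode defines 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes (U+nFFFE and U+nFFFF). A minimal sketch of a check covering all of them, with the 0xFFFE correction applied. The function name `is_noncharacter` is hypothetical and not taken from Lucy or utf8proc:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of Lucy or utf8proc.
 * Returns true if `code_point` is one of the 66 Unicode noncharacters. */
bool
is_noncharacter(uint32_t code_point) {
    if (code_point > 0x10FFFF) {
        return false; /* Outside the Unicode code space entirely. */
    }
    /* Masking with 0xFFFE ignores the lowest bit, so this single test
     * matches both U+nFFFE and U+nFFFF in every plane. */
    return (code_point & 0xFFFE) == 0xFFFE
           || (code_point >= 0xFDD0 && code_point <= 0xFDEF);
}
```

Note that U+FFEF, the value in the original patch, is not a noncharacter and is correctly rejected here.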
Re: [lucy-dev] utf8proc, control chars and non-character code points
Posted by "David E. Wheeler" <da...@kineticode.com>.
On Dec 14, 2011, at 2:18 AM, Peter Karman wrote:
> Swish3 uses the \003 control character as an internal field delimiter, so
> passing that through is pretty vital. Are you saying that utf8proc chokes on
> that valid UTF-8 sequence?
I do the same thing to index lists of things on Lucy in PGXN:
https://github.com/pgxn/pgxn-api/blob/master/lib/PGXN/API/Indexer.pm#L77
Best,
David
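The delimiter convention described above works because U+0003 encodes in UTF-8 as the single byte 0x03, and every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a 0x03 byte can only ever be the delimiter itself. A minimal sketch of scanning for it, where `count_fields` and `FIELD_DELIMITER` are hypothetical names and not the Swish3 or PGXN API:

```c
#include <stddef.h>

/* Assumed delimiter: the \003 control character (U+0003). */
#define FIELD_DELIMITER '\003'

/* Hypothetical helper: counts delimiter-separated fields in a UTF-8
 * buffer of `len` bytes. An empty buffer counts as one empty field. */
size_t
count_fields(const char *buf, size_t len) {
    size_t fields = 1;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == FIELD_DELIMITER) {
            fields++;
        }
    }
    return fields;
}
```

A byte-wise scan like this is safe without decoding, but only as long as every stage of the pipeline, including normalization, passes the control character through unchanged.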
Re: [lucy-dev] utf8proc, control chars and non-character code points
Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 12/13/11 6:28 PM:
> Greets,
>
> I just committed a test to trunk which verifies that utf8proc's normalization
> works properly, in that normalizing a second time is a no-op. However, I had
> to disable the test because utf8proc chokes when fed strings which contain
> either control characters or non-character code points.
>
> http://svn.apache.org/viewvc?view=revision&revision=1213996
>
> The test uses random UTF-8 data, generated by TestUtils_random_string(). With
> the hack below my sig, the test passes.
>
> Strings which contain control characters are valid UTF-8, as are strings which
> contain noncharacters. Noncharacters are not supposed to be used for
> interchange, but Lucy is a library, not an application, and thus should pass
> noncharacters cleanly.
>
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
>
> Looking at the code for Lucy::Analysis::Normalizer, it seems that if utf8proc
> reports an error, we simply leave the token alone. That seems appropriate in
> the case of malformed UTF-8, but I question whether it is appropriate for
> valid UTF-8 sequences containing control characters or non-character code
> points.
Swish3 uses the \003 control character as an internal field delimiter, so
passing that through is pretty vital. Are you saying that utf8proc chokes on
that valid UTF-8 sequence?
>
> Index: core/Lucy/Test/TestUtils.c
> ===================================================================
> --- core/Lucy/Test/TestUtils.c (revision 1213967)
> +++ core/Lucy/Test/TestUtils.c (working copy)
> @@ -17,6 +17,7 @@
> #define C_LUCY_TESTUTILS
> #include "Lucy/Util/ToolSet.h"
> #include <string.h>
> +#include <ctype.h>
>
> #include "Lucy/Test/TestUtils.h"
> #include "Lucy/Test.h"
> @@ -106,6 +107,15 @@
> if (code_point > 0xD7FF && code_point < 0xE000) {
> continue; // UTF-16 surrogate.
> }
> + if (iscntrl(code_point)) {
> + continue;
> + }
> + if ((code_point & 0xFFFF) == 0xFFEF
> + || (code_point & 0xFFFF) == 0xFFFF
> + || (code_point >= 0xFDD0 && code_point <= 0xFDEF)
> + ) {
> + continue; // Unicode non-character code point.
> + }
> break;
> }
> return code_point;
>
>
--
Peter Karman . http://peknet.com/ . peter@peknet.com