Posted to user@cassandra.apache.org by "James A. Robinson" <ji...@gmail.com> on 2019/10/29 20:13:02 UTC

Cassandra and UTF-8 BOM?

Hi folks,

I'm looking at a table that has a primary key defined as "publisher_id
text".  I've noticed some of the entries begin with what appears to be
a UTF-8 BOM (byte order mark) and some do not.

https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cql_data_types_c.html
says text is a UTF-8 encoded string.  If I look at the first 3 bytes
of one of these columns:

$ dd if=~/tmp/sample.data of=/dev/stdout bs=1 count=3 2>/dev/null | hexdump
0000000 bbef 00bf
0000003

When I swap the byte order:

$ dd if=~/tmp/sample.data of=/dev/stdout bs=1 count=3 conv=swab 2>/dev/null | hexdump
0000000 efbb 00bf
0000003

And I think this matches the UTF-8 BOM.
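(As an aside: hexdump's default format groups bytes into little-endian
16-bit words, which is what made the swap necessary.  The -C flag
prints bytes in file order, so the BOM is readable directly.  A
minimal sketch, using printf to synthesize the three BOM bytes rather
than reading the real file:)

```shell
# hexdump -C shows bytes in the order they appear in the input,
# so the UTF-8 BOM (EF BB BF) reads left to right with no swab.
printf '\xef\xbb\xbf' | hexdump -C
```

Against the real file that would be "dd if=~/tmp/sample.data bs=1
count=3 2>/dev/null | hexdump -C".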

However, not all the rows have this prefix, and I'm wondering if this
is a client issue (a client being inconsistent about how it handles
strings) or if Cassandra is doing something special on its own.  The
rest of each value falls within the US-ASCII-compatible range of
UTF-8, e.g., something as simple as 'abc', but in some cases it has
this marker in front of it.

Cassandra treats '<BOM>abc' as a value distinct from 'abc', which
certainly makes sense: for efficiency I assume it just compares the
byte-for-byte values without layering meaning on top of them.  But
that means I'll need to clean the data up to be consistent, and I
need to figure out how to prevent the BOM from being reintroduced in
the future.
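(For the cleanup side, a leading BOM can be stripped from values
client-side before they are written or compared.  A minimal sketch in
shell, assuming GNU sed, which understands \xNN escapes; BSD sed does
not:)

```shell
# Strip a leading UTF-8 BOM (EF BB BF) from a value if present.
# Assumes GNU sed; '\xef\xbb\xbf' are the three BOM bytes.
printf '\xef\xbb\xbfabc\n' | sed 's/^\xef\xbb\xbf//'
# prints: abc
```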

Jim

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org