Posted to user@cassandra.apache.org by Edmond Lau <ed...@ooyala.com> on 2009/12/07 23:43:08 UTC

cassandra mangling non-ascii keys

I'm using non-ascii keys on Cassandra, relatively close to trunk at
r880926, and some of my keys get mangled.

As a simple test case, if I insert a one-byte key anywhere between
\200 and \377 (octal for 128 to 255) through the thrift interface, and
then query back my data with multi get, I get a hash back that has
"\357\277\275" as the key.  All those one-byte keys get mapped to the
same bucket, so if I insert with the key \205, I get the data back
when querying for \300.  So either a) there's a bug in thrift, b)
Cassandra doesn't support non-ascii keys, or c) Cassandra is mangling
my key somewhere.

Has anyone else run into this issue?

Edmond

Re: cassandra mangling non-ascii keys

Posted by Jonathan Ellis <jb...@gmail.com>.
I don't remember, but it was definitely wrong in hindsight :(

On Mon, Dec 7, 2009 at 6:22 PM, Edmond Lau <ed...@ooyala.com> wrote:
> Ok - so my understanding from reading the two jira issues is that
> python and ruby treat the "string" thrift type as unencoded bytes
> whereas java treats them as utf-8 encoded bytes.  What was the
> rationale behind declaring keys to be of type "string" rather than of
> type "binary"?  With "binary", presumably java wouldn't treat keys as
> utf-8 encoded bytes.
>
> Edmond

Re: cassandra mangling non-ascii keys

Posted by Edmond Lau <ed...@ooyala.com>.
Ok - so my understanding from reading the two jira issues is that
python and ruby treat the "string" thrift type as unencoded bytes
whereas java treats them as utf-8 encoded bytes.  What was the
rationale behind declaring keys to be of type "string" rather than of
type "binary"?  With "binary", presumably java wouldn't treat keys as
utf-8 encoded bytes.

Edmond
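
A minimal sketch of the symptom described in this thread, assuming the
Java side decodes the key bytes as UTF-8 and substitutes U+FFFD for
malformed input. The helper names are hypothetical, and the code only
simulates the round trip in Python; it does not call Thrift or
Cassandra.

```python
def java_style_roundtrip(raw_key: bytes) -> bytes:
    """Simulate a server that decodes a key as UTF-8 and re-encodes it.

    Hypothetical helper: malformed bytes are replaced with U+FFFD,
    which is an assumption about how the Java side handles the key.
    """
    text = raw_key.decode("utf-8", errors="replace")
    return text.encode("utf-8")


def octal_escape(data: bytes) -> str:
    """Render bytes as backslash-octal escapes, as used in the report."""
    return "".join("\\%03o" % b for b in data)


# The one-byte key \205 (0x85) is not valid UTF-8 on its own.
mangled = java_style_roundtrip(b"\x85")
print(octal_escape(mangled))  # \357\277\275

# Every one-byte key from \200 to \377 collapses to the same bytes,
# which is why a value stored under \205 comes back for \300.
collapsed = {java_style_roundtrip(bytes([b])) for b in range(0x80, 0x100)}
print(len(collapsed))  # 1

# Plain ASCII keys pass through unchanged.
print(java_style_roundtrip(b"abc"))  # b'abc'
```

If this assumption holds, the bytes "\357\277\275" are simply the UTF-8
encoding of the replacement character U+FFFD, so the original key is
lost before it is ever used for a lookup.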

On Mon, Dec 7, 2009 at 3:09 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> I suspect you will need to explicitly encode to UTF8 first, then.
> (And decode when reading.)
>
> My reading of the relevant issues
> (https://issues.apache.org/jira/browse/THRIFT-395,
> https://issues.apache.org/jira/browse/THRIFT-414) is that this won't
> be fixed any time soon.
>
> -Jonathan

Re: cassandra mangling non-ascii keys

Posted by Jonathan Ellis <jb...@gmail.com>.
I suspect you will need to explicitly encode to UTF8 first, then.
(And decode when reading.)

My reading of the relevant issues
(https://issues.apache.org/jira/browse/THRIFT-395,
https://issues.apache.org/jira/browse/THRIFT-414) is that this won't
be fixed any time soon.

-Jonathan
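
One possible reading of the encode/decode suggestion above, sketched in
Python under stated assumptions: each raw key byte is treated as a
Latin-1 code point and the result is sent as UTF-8, so the bytes on the
wire are always valid UTF-8. The function names wrap_key and unwrap_key
are hypothetical, and this is not necessarily the exact scheme that was
intended.

```python
def wrap_key(raw_key: bytes) -> bytes:
    """Map arbitrary bytes to valid UTF-8 before sending (hypothetical).

    Each byte 0x00-0xFF becomes the code point with the same number,
    which is then UTF-8 encoded.
    """
    return raw_key.decode("latin-1").encode("utf-8")


def unwrap_key(wire_key: bytes) -> bytes:
    """Reverse wrap_key on keys read back from the server."""
    return wire_key.decode("utf-8").encode("latin-1")


wire = wrap_key(b"\x85")
print(wire)  # b'\xc2\x85'

# The wrapped key is valid UTF-8, so a strict decode succeeds and a
# decode/re-encode round trip leaves it untouched.
assert wire.decode("utf-8").encode("utf-8") == wire

# All 256 one-byte keys stay distinct and round-trip exactly.
wrapped = {wrap_key(bytes([b])) for b in range(256)}
assert len(wrapped) == 256
assert all(unwrap_key(wrap_key(bytes([b]))) == bytes([b]) for b in range(256))
```

The cost of this mapping is that every byte from \200 to \377 takes two
bytes on the wire, and every client reading the keys has to apply the
same unwrap step.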

On Mon, Dec 7, 2009 at 4:56 PM, Edmond Lau <ed...@ooyala.com> wrote:
> This particular client was in Ruby.

Re: cassandra mangling non-ascii keys

Posted by Edmond Lau <ed...@ooyala.com>.
This particular client was in Ruby.

On Mon, Dec 7, 2009 at 2:49 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> (bugs in thrift, that is)
>
> On Mon, Dec 7, 2009 at 4:49 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> what language are your clients in?  there are definitely some bugs
>> there when communicating b/t client and server of different languages.
>> :(

Re: cassandra mangling non-ascii keys

Posted by Jonathan Ellis <jb...@gmail.com>.
(bugs in thrift, that is)

On Mon, Dec 7, 2009 at 4:49 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> what language are your clients in?  there are definitely some bugs
> there when communicating b/t client and server of different languages.
> :(

Re: cassandra mangling non-ascii keys

Posted by Jonathan Ellis <jb...@gmail.com>.
what language are your clients in?  there are definitely some bugs
there when communicating b/t client and server of different languages.
:(

On Mon, Dec 7, 2009 at 4:43 PM, Edmond Lau <ed...@ooyala.com> wrote:
> I'm using non-ascii keys on Cassandra, relatively close to trunk at
> r880926, and some of my keys get mangled.
>
> As a simple test case, if I insert a one-byte key anywhere between
> \200 and \377 (octal for 128 to 255) through the thrift interface, and
> then query back my data with multi get, I get a hash back that has
> "\357\277\275" as the key.  All those one-byte keys get mapped to the
> same bucket, so if I insert with the key \205, I get the data back
> when querying for \300.  So either a) there's a bug in thrift, b)
> Cassandra doesn't support non-ascii keys, or c) Cassandra is mangling
> my key somewhere.
>
> Has anyone else run into this issue?
>
> Edmond
>