Posted to user@cassandra.apache.org by Kevin Burton <bu...@spinn3r.com> on 2014/06/24 05:49:32 UTC

Can I call getBytes on a text column to get the raw (already encoded UTF8)

I'm building a webservice whereby I read the data from cassandra, then
write it over the wire.

It's going to push LOTS of content, and encoding/decoding performance has
really bitten us in the past.  So I try to avoid transparent
encoding/decoding wherever I can.

So right now, I have a huge blob of text that's a 'text' column.

Logically it *should* be text, because that's what it is...

Can I just keep it as text so our normal tools work on it, but get it as
raw UTF8 if I call getBytes?

This way I can call getBytes and then send it right over the wire as
pre-encoded UTF8 data.

... and of course the question is whether it will continue working in the
future :-P

I'll write a test of it of course but I wanted to see what you guys thought
of this idea.

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.

Re: Can I call getBytes on a text column to get the raw (already encoded UTF8)

Posted by Robert Stupp <sn...@snazy.de>.
You can use getBytesUnsafe on the UTF8 column


> On 24.06.2014 at 09:13, Olivier Michallat <ol...@datastax.com> wrote:
> 
> Assuming we're talking about the DataStax Java driver:
> 
> getBytes will throw an exception, because it validates that the column is of type BLOB. But you can use getBytesUnsafe:
> 
>     ByteBuffer b = row.getBytesUnsafe("aTextColumn");
>     // if you want to check it:
>     Charset.forName("UTF-8").decode(b);
> 
> Regarding whether this will continue working in the future: from the driver's perspective, the fact that the native protocol uses UTF-8 is an implementation detail, but I doubt this will change any time soon.

Re: Can I call getBytes on a text column to get the raw (already encoded UTF8)

Posted by Kevin Burton <bu...@spinn3r.com>.
Yes… I confirmed that getBytesUnsafe works…

I also have a unit test for it, so if Cassandra ever changes anything we'll
pick it up.

One point about your code above: I still think Charset.forName lookups sit
behind a synchronized code block.

So that code wouldn't be super fast on multi-core machines.  I
usually use Guava's Charsets class, since it has static references to the
standard charsets.

… just wanted to point that out since it could bite someone :-P …
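
For anyone landing on this thread later, here is a minimal sketch of the
static-reference approach described above. It assumes Java 7 or later, where
java.nio.charset.StandardCharsets offers the same kind of constants as Guava's
Charsets. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetConstants {
    // Resolved once when the class is loaded; no by-name lookup per call.
    public static final Charset UTF8 = StandardCharsets.UTF_8;

    // Decodes a duplicate of the buffer so the caller's position is untouched.
    public static String decodeUtf8(ByteBuffer raw) {
        return UTF8.decode(raw.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer raw = ByteBuffer.wrap("h\u00e9llo".getBytes(UTF8));
        System.out.println(decodeUtf8(raw));
    }
}
```

Whether the by-name lookup is actually a bottleneck depends on the JDK in use,
so it is worth measuring before relying on either claim.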




On Tue, Jun 24, 2014 at 12:13 AM, Olivier Michallat <
olivier.michallat@datastax.com> wrote:

> Assuming we're talking about the DataStax Java driver:
>
> getBytes will throw an exception, because it validates that the column is
> of type BLOB. But you can use getBytesUnsafe:
>
>     ByteBuffer b = row.getBytesUnsafe("aTextColumn");
>     // if you want to check it:
>     Charset.forName("UTF-8").decode(b);
>
> Regarding whether this will continue working in the future: from the
> driver's perspective, the fact that the native protocol uses UTF-8 is an
> implementation detail, but I doubt this will change any time soon.

Re: Can I call getBytes on a text column to get the raw (already encoded UTF8)

Posted by Olivier Michallat <ol...@datastax.com>.
Assuming we're talking about the DataStax Java driver:

getBytes will throw an exception, because it validates that the column is
of type BLOB. But you can use getBytesUnsafe:

    ByteBuffer b = row.getBytesUnsafe("aTextColumn");
    // if you want to check it:
    Charset.forName("UTF-8").decode(b);

Regarding whether this will continue working in the future: from the
driver's perspective, the fact that the native protocol uses UTF-8 is an
implementation detail, but I doubt this will change any time soon.
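
A minimal sketch of the pass-through idea, not driver code: a locally built
ByteBuffer stands in for the value returned by row.getBytesUnsafe, and the
class name, method name and in-memory sink are hypothetical. It also shows
that Charset.decode advances the position of the buffer it is given, so the
check is done on a duplicate.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

public class RawUtf8Passthrough {
    // Writes the remaining bytes of raw to out without decoding them.
    // Works on a duplicate so the caller's buffer position is not advanced.
    public static void writeRaw(ByteBuffer raw, WritableByteChannel out) throws IOException {
        ByteBuffer view = raw.duplicate();
        while (view.hasRemaining()) {
            out.write(view);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for row.getBytesUnsafe("aTextColumn"); no cluster is contacted.
        ByteBuffer raw = ByteBuffer.wrap("caf\u00e9".getBytes(StandardCharsets.UTF_8));

        // In-memory sink standing in for the socket or HTTP response body.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeRaw(raw, Channels.newChannel(sink));

        // Optional sanity check, again on a duplicate because decode consumes its input.
        String decoded = StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
        System.out.println(decoded + " -> " + sink.size() + " bytes written");
    }
}
```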





Re: Can I call getBytes on a text column to get the raw (already encoded UTF8)

Posted by DuyHai Doan <do...@gmail.com>.
Good idea. The bytes are only minimally processed by the server, so you're
saving a lot of CPU. AFAIK getBytes should work fine.