
Posted to user@cassandra.apache.org by Ben McCann <be...@benmccann.com> on 2012/04/02 17:42:13 UTC

Compression on client side vs server side

Hi,

I'm curious whether compressing my data on the client side with Snappy
differs in any way from compressing it on the server side. The wiki said
that compression works best where each row has the same columns. Does this
mean compression will be more efficient on the server side, since it can
look at multiple rows at once instead of only the row being inserted? The
reason I was thinking about doing it client side is that it would save CPU
on the datastore machine. However, does this matter? Is CPU typically the
bottleneck on a machine, or is it some other resource? (Of course this will
vary for each person, but I'm wondering if there's a rule of thumb. I'm
making a web app, which hopefully will store about 5TB of data and have
tens of millions of page views per month.)

Thanks,
Ben

Re: Compression on client side vs server side

Posted by Віталій Тимчишин <ti...@gmail.com>.

We are using client-side compression for the following reasons. Can you
confirm they are valid?
1) Server-side compression uses more CPU by a factor of the replication
factor (3 times more with a replication factor of 3).
2) Network usage is higher by the compression factor (as you are sending
uncompressed data over the wire).
3) Any server utility operation, like repair or move (not sure about the
latter), will decompress and recompress the data.
So client-side compression looks much cheaper and can be very efficient
for long columns.
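
As a rough illustration of points 1 and 2 above, here is a minimal sketch of the arithmetic, assuming the claims hold as stated. The function name, the payload size, and the compression ratio are hypothetical values chosen for illustration, not measurements from any cluster.

```python
def compression_cost(payload_bytes, compression_ratio, replication_factor):
    """Rough model of points 1 and 2; every input is a hypothetical value."""
    compressed_bytes = payload_bytes / compression_ratio
    return {
        # Client side: compress once, send the smaller payload to each replica.
        "client_side": {
            "compress_runs": 1,
            "bytes_per_replica": compressed_bytes,
        },
        # Server side: each replica receives the raw payload and compresses it.
        "server_side": {
            "compress_runs": replication_factor,
            "bytes_per_replica": payload_bytes,
        },
    }


# Assumed example: 1 MB payload, 4:1 compression ratio, replication factor 3.
costs = compression_cost(1_000_000, 4, 3)
print(costs["client_side"])
print(costs["server_side"])
```

This only restates the two claims as numbers; it does not model disk I/O or how well each approach actually compresses.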

Best regards, Vitalii Tymchyshyn

2012/4/2 Jeremiah Jordan <JE...@morningstar.com>

> The server side compression can compress across columns/rows, so it will
> most likely be more efficient.
> Whether you are CPU bound or IO bound depends on your application and node
> setup. Unless your working set fits in memory you will be IO bound, and in
> that case server side compression helps because there is less to read from
> disk. In many cases it is actually faster to read a compressed file from
> disk and decompress it than to read an uncompressed file from disk.
>
> See Ed's post:
> "Cassandra compression is like more servers for free!"
>
> http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/cassandra_compression_is_like_getting
>



RE: Compression on client side vs server side

Posted by Jeremiah Jordan <JE...@morningstar.com>.

The server side compression can compress across columns/rows, so it will most likely be more efficient.
Whether you are CPU bound or IO bound depends on your application and node setup. Unless your working set fits in memory you will be IO bound, and in that case server side compression helps because there is less to read from disk. In many cases it is actually faster to read a compressed file from disk and decompress it than to read an uncompressed file from disk.

See Ed's post:
"Cassandra compression is like more servers for free!"
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/cassandra_compression_is_like_getting
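
The point about compressing across rows can be seen in a small experiment. This is a sketch only: it uses Python's standard zlib as a stand-in for Snappy, and the rows are made-up JSON records that share the same column names. Cassandra itself compresses SSTable data in chunks rather than in one block, so the numbers here only illustrate the effect.

```python
import json
import zlib

# Hypothetical rows that all share the same column names.
rows = [
    json.dumps(
        {"name": f"user{i}", "email": f"user{i}@example.com", "visits": i}
    ).encode("utf-8")
    for i in range(1000)
]

raw_total = sum(len(row) for row in rows)

# Client-side style: every row value is compressed on its own,
# so the compressor never sees the repetition between rows.
per_row_total = sum(len(zlib.compress(row)) for row in rows)

# Server-side style: many rows are compressed together in one block,
# so repeated column names and similar values are stored only once.
block_total = len(zlib.compress(b"".join(rows)))

print("raw bytes:           ", raw_total)
print("compressed per row:  ", per_row_total)
print("compressed as block: ", block_total)
```

Small values compressed one at a time gain little, while the block that spans many similar rows shrinks considerably.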


Re: Compression on client side vs server side

Posted by Ben McCann <be...@benmccann.com>.

Thanks Jeremiah, that's what I had suspected. I appreciate the
confirmation.

Martin, there's no built-in support for doing compression client side, but
it'd be easy for me to do manually since I just have one column with all my
serialized data, which is why I was considering it.
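
A minimal sketch of what doing it manually could look like, assuming the single column holds a JSON-serialized record. The function names are hypothetical, and zlib from the Python standard library stands in for Snappy here. The resulting bytes would be written as the column value through whatever client library is in use.

```python
import json
import zlib


def pack_value(record: dict) -> bytes:
    """Serialize a record and compress it for storage in a single column."""
    raw = json.dumps(record, sort_keys=True).encode("utf-8")
    return zlib.compress(raw)


def unpack_value(blob: bytes) -> dict:
    """Reverse pack_value: decompress the column value and deserialize it."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Made-up record with repetitive content so the compression is visible.
record = {"user": "ben", "pages": ["/home", "/about"] * 50}
blob = pack_value(record)

print(len(json.dumps(record)), "bytes serialized,", len(blob), "bytes compressed")
print(unpack_value(blob) == record)
```

With this approach the server only ever sees opaque bytes, so every reader of the column has to know to decompress it.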


On Mon, Apr 2, 2012 at 8:54 AM, Martin Junghanns <m.junghanns@googlemail.com> wrote:

> Hi,
>
> How do you select between client-side and server-side compression? I'm
> using Hector and I set compression when creating a CF, so the compression
> executes when inserting the data "on the server" oO
>
> greetings, martin

Re: Compression on client side vs server side

Posted by Martin Junghanns <m....@googlemail.com>.

Hi,

How do you select between client-side and server-side compression? I'm
using Hector and I set compression when creating a CF, so the compression
executes when inserting the data "on the server" oO

greetings, martin
