Posted to user@cassandra.apache.org by E R <pc...@gmail.com> on 2011/06/17 00:41:46 UTC

compression for regular column names?

Hi all,

As a way of gaining familiarity with Cassandra I am migrating a table
that is currently stored in a relational database and mapping it into
a Cassandra column family. We add about 700,000 new rows a day to this
table, and the average disk space used per row is ~ 300 bytes
including indexes.

The mapping from table to column family is straightforward - there is
a one-to-one relationship between table columns and column family column
names. The relational table has 19 columns. The combined length of the
column names is nearly 200 bytes, whereas the average amount of data per
row is only 130 bytes.
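
To put those figures in proportion, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (700,000 rows a day, about 200 bytes of names and 130 bytes of data per row) and deliberately ignores any per-column overhead Cassandra itself adds, so treat the results as a lower bound rather than a measurement. The function names are made up for illustration.

```python
# Figures quoted in the message; everything else here is illustrative.
ROWS_PER_DAY = 700_000
NAME_BYTES_PER_ROW = 200   # combined length of the 19 column names ("nearly 200")
DATA_BYTES_PER_ROW = 130   # average amount of data per row


def name_fraction(name_bytes, data_bytes):
    """Fraction of the name+data bytes in a row that is taken by column names."""
    return name_bytes / (name_bytes + data_bytes)


def daily_bytes(per_row_bytes, rows_per_day=ROWS_PER_DAY):
    """Bytes written per day for a given per-row byte count."""
    return per_row_bytes * rows_per_day


if __name__ == "__main__":
    share = name_fraction(NAME_BYTES_PER_ROW, DATA_BYTES_PER_ROW)
    print(f"share of bytes spent on names: {share:.0%}")
    print(f"name bytes per day: {daily_bytes(NAME_BYTES_PER_ROW) / 1e6:.0f} MB")
    print(f"data bytes per day: {daily_bytes(DATA_BYTES_PER_ROW) / 1e6:.0f} MB")
```

Because every column in a row carries its own name, the names cost more than the data they label under these assumptions.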

Initially I used the identity map for this translation, i.e. my
Cassandra column names were the same as the relational column names. I
then found out I could save a lot of disk space by using single-letter
column names instead of the original relational names, e.g. 'L'
instead of 'LINK_IDENTIFIER'.
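
A minimal sketch of what such a client-side translation could look like in Python. Only the LINK_IDENTIFIER -> 'L' pair comes from this message; the other column names and the helper names `shorten` and `expand` are hypothetical, and the dicts stand in for whatever row structure the client library actually uses.

```python
# Long relational column name -> short Cassandra column name.
# Only LINK_IDENTIFIER -> L is from the message; the rest are hypothetical.
SHORT_NAMES = {
    "LINK_IDENTIFIER": "L",
    "CREATED_TIMESTAMP": "C",  # hypothetical column
    "SOURCE_ADDRESS": "S",     # hypothetical column
}

# Two long names sharing one short name would silently overwrite data.
if len(set(SHORT_NAMES.values())) != len(SHORT_NAMES):
    raise ValueError("short column names must be unique")

# Reverse mapping, used when reading rows back.
LONG_NAMES = {short: long for long, short in SHORT_NAMES.items()}


def shorten(row):
    """Translate a row keyed by relational names into short column names."""
    return {SHORT_NAMES[name]: value for name, value in row.items()}


def expand(columns):
    """Translate columns read from Cassandra back to relational names."""
    return {LONG_NAMES[name]: value for name, value in columns.items()}
```

With 19 single-letter names the names take 19 bytes per row instead of nearly 200. The cost is that the mapping has to be kept somewhere outside the data and can never be changed for rows already written.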

The procedure I use to determine space used is:

1. rm -rf the cassandra var-lib directory
2. start cassandra, create keyspace, column families, etc.
3. insert records
4. stop cassandra
5. re-start cassandra
6. measure disk space by running du -s on the cassandra var-lib directory

The restart seems to replace the commit logs with .db files.
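
For the measurement in step 6, a small Python sketch that sums file sizes under a directory and breaks the total down by file extension, which makes it easier to see how much is commit log and how much is .db files. This is an approximation of du -s, not a replacement: it adds up apparent file sizes rather than allocated disk blocks, so the numbers can differ. The function names are my own, and the data directory path is whatever your installation uses.

```python
import os
from collections import defaultdict


def size_by_extension(root):
    """Return {extension: total bytes} for all regular files under root."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so nothing is counted twice
            extension = os.path.splitext(name)[1]  # e.g. ".db" or ".log"
            totals[extension] += os.path.getsize(path)
    return dict(totals)


def directory_size_bytes(root):
    """Total apparent size in bytes of all regular files under root."""
    return sum(size_by_extension(root).values())
```

Calling `size_by_extension` on the data directory before and after the restart should show the bytes moving from the commit log files to the .db files.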

My questions are:

1. Is this a common practice (i.e. making the client responsible for
shortening the column names) when dealing with a large number of fixed
column names and a high volume of inserts? Is there any way that
Cassandra can help out here?

2. Is there another way to transform the commit logs into .db files
without stopping and starting the server?

Thanks,
ER

Re: compression for regular column names?

Posted by Ryan King <ry...@twitter.com>.
On Thu, Jun 16, 2011 at 3:41 PM, E R <pc...@gmail.com> wrote:
> My questions are:
>
> 1. Is this a common practice (i.e. making the client responsible for
> shortening the column names) when dealing with a large number of fixed
> column names and a high volume of inserts? Is there any way that
> Cassandra can help out here?

Yes, we're working on a new, compressed format (CASSANDRA-674).

> 2. Is there another way to transform the commit logs into .db files
> without stopping and starting the server?

nodetool flush.
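
If you want to script that between inserting and measuring, a minimal Python sketch could look like the following. It assumes nodetool is on the PATH and that your version accepts an optional host via -h and an optional keyspace after flush; the wrapper function names are made up for illustration.

```python
import subprocess


def flush_command(keyspace=None, host=None):
    """Build the nodetool argument list for a flush (assumed CLI layout)."""
    command = ["nodetool"]
    if host is not None:
        command += ["-h", host]
    command.append("flush")
    if keyspace is not None:
        command.append(keyspace)
    return command


def flush(keyspace=None, host=None):
    """Run nodetool flush; raises CalledProcessError if nodetool fails."""
    subprocess.run(flush_command(keyspace, host), check=True)
```

This would replace steps 4 and 5 in the procedure above, so the server keeps running while the memtables are written out.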

-ryan