Posted to user@cassandra.apache.org by Kevin Burton <bu...@spinn3r.com> on 2014/05/16 20:41:55 UTC

Storing globally sorted data

Let's say I have an external job (MapReduce, Pig, etc.) sorting a Cassandra
table by some complicated mechanism.

We want to store the sorted records BACK into cassandra so that clients can
read the records sorted.

What I was just thinking of doing was storing the records as pages.

So page 0 would have records 0-999….

We would just have the key be the page ID and then the values be the
primary keys for the records so that they can be fetched. I could also
denormalize the data and store them inline as a materialized view but of
course this would require much more disk space.

Thoughts on this strategy?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+
profile<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.

Re: Storing globally sorted data

Posted by DuyHai Doan <do...@gmail.com>.

What you describe is basically the idea of bucketing data. One bucket = one
physical partition. Within each bucket, there is a fixed number of "columns"
(1000 in your example).

This strategy works fine and avoids overly large partitions. The only drawback
I can see is the need to fetch data across several buckets, but it seems that
in your case you fetch data one partition at a time, so it should be OK.
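
To make the bucketing arithmetic concrete, here is a minimal Python sketch. It
is not Cassandra client code: the function names and the in-memory dict are
hypothetical stand-ins, and it assumes the external job has already produced
the primary keys in their final sort order.

```python
from typing import Dict, Iterable, List, Tuple

PAGE_SIZE = 1000  # records per page, as in the example from the original post


def build_pages(sorted_keys: Iterable[str],
                page_size: int = PAGE_SIZE) -> Dict[int, List[str]]:
    """Group already-sorted primary keys into pages keyed by page ID."""
    pages: Dict[int, List[str]] = {}
    for rank, key in enumerate(sorted_keys):
        # Integer division maps ranks 0-999 to page 0, 1000-1999 to page 1, ...
        pages.setdefault(rank // page_size, []).append(key)
    return pages


def locate(rank: int, page_size: int = PAGE_SIZE) -> Tuple[int, int]:
    """Return (page ID, offset within the page) for a global sort rank."""
    return divmod(rank, page_size)


def pages_for_range(first: int, last: int,
                    page_size: int = PAGE_SIZE) -> List[int]:
    """Page IDs a reader must fetch to cover ranks first..last inclusive."""
    return list(range(first // page_size, last // page_size + 1))
```

A reader that only ever asks for whole pages touches one bucket per request.
A range that straddles a page boundary, such as ranks 990 to 1010, needs two
buckets, which is the cross-bucket fetch mentioned above.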

As for denormalizing, it's the way to go. Disk space is sometimes cheaper
than the high read latency caused by a normalized data model.
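
As a rough illustration of that trade-off, the Python sketch below counts
lookups for the two layouts. CountingTable is a hypothetical in-memory
stand-in for a table, not a Cassandra API, and the count ignores that a real
client could issue the per-record reads concurrently.

```python
from typing import Any, Dict, Hashable, List


class CountingTable:
    """In-memory stand-in for a table that counts key lookups."""

    def __init__(self, rows: Dict[Hashable, Any]) -> None:
        self.rows = dict(rows)
        self.lookups = 0

    def get(self, key: Hashable) -> Any:
        self.lookups += 1
        return self.rows[key]


def read_page_normalized(pages: CountingTable, records: CountingTable,
                         page_id: int) -> List[Any]:
    """Page holds only primary keys: one page read plus one read per record."""
    keys = pages.get(page_id)
    return [records.get(key) for key in keys]


def read_page_denormalized(pages_inline: CountingTable,
                           page_id: int) -> List[Any]:
    """Page holds the records themselves: a single read returns everything."""
    return pages_inline.get(page_id)
```

With 1000 records per page, the normalized layout performs 1001 lookups to
serve one page, while the denormalized layout performs 1 at the cost of
storing every record a second time.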
