Posted to user@cassandra.apache.org by Owen Davies <Ow...@logmein.com> on 2012/08/02 12:47:04 UTC

Is large number of columns per row a problem?

We want to store a large number of columns in a single row (up to about 100,000,000), where each value is roughly 10 bytes.

We also need to be able to get slices of columns from any point in the row.
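For illustration, the kind of slice read we mean might look like the following (a minimal sketch using the pycassa Python client; the keyspace, column family, row key, and column names are all placeholders, not our real schema):

    import pycassa

    # Connect to the cluster; keyspace and host are placeholders.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'WideRows')

    # Slice up to 1000 columns starting from an arbitrary column name
    # anywhere in the row; column_start (and optionally column_finish)
    # bound the slice.
    columns = cf.get('row-key-1',
                     column_start='col-0005000000',
                     column_count=1000)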

We haven't found a problem with smaller amounts of data so far, but can anyone think of a reason why this would be a bad idea, or cause large performance problems?

If breaking up the row is something we should do, what is the maximum number of columns we should have?

We are not too worried about a small performance decrease; adding more nodes to the cluster would be an option if it helps keep the code simpler.

Thanks,

Owen Davies

Re: Is large number of columns per row a problem?

Posted by Filippo Diotalevi <fi...@ntoklo.com>.
Hi,

On Thursday, 2 August 2012 at 11:47, Owen Davies wrote:

> We want to store a large number of columns in a single row (up to about 100,000,000), where each value is roughly 10 bytes.
>  
> We also need to be able to get slices of columns from any point in the row.
>  
> We haven't found a problem with smaller amounts of data so far, but can anyone think of a reason why this would be a bad idea, or cause large performance problems?

My experience with wide rows and Cassandra has not been positive. We used to have rows of a few hundred megabytes each, read during MapReduce computations, and that caused many issues, especially timeouts when reading the rows (with Cassandra under a medium write load) and OutOfMemory exceptions.

The solution in our case was to "shard" (time-bucket) the rows into smaller pieces (a few megabytes each).
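As a minimal sketch of that time-bucketing approach (in Python; the bucket size, key format, and helper names here are illustrative assumptions, not what we actually ran):

    BUCKET_MS = 3600 * 1000  # one physical row per hour; tune so rows stay small

    def bucketed_row_key(base_key, timestamp_ms):
        # One wide logical row becomes many bounded physical rows by
        # appending the bucket number to the logical key.
        return '%s:%d' % (base_key, timestamp_ms // BUCKET_MS)

    def bucket_keys_for_range(base_key, start_ms, end_ms):
        # To read a time range, enumerate every bucket the range spans,
        # slice each physical row, and concatenate the results.
        return ['%s:%d' % (base_key, b)
                for b in range(start_ms // BUCKET_MS, end_ms // BUCKET_MS + 1)]

Reads that cross bucket boundaries then become a handful of small slice queries instead of one slice over a huge row.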

The situation might have changed with Cassandra 1.1.0, which claims to have some "wide row" support, but I haven't been able to test that.

>  
> If breaking up the row is something we should do, what is the maximum number of columns we should have?
>  
> We are not too worried about a small performance decrease; adding more nodes to the cluster would be an option if it helps keep the code simpler.

I don't have a precise figure, but I'd limit row size to less than 100MB… much less, if possible. In general, my experience is that hundreds of millions of small rows don't cause issues, but just a few very wide rows will cause timeouts and, in the worst cases, OOM.
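To put the numbers from the original question in perspective, some back-of-the-envelope arithmetic (in Python; the per-column overhead is a rough assumed figure for Cassandra of this era, not an exact one):

    columns = 100 * 1000 * 1000   # columns per logical row, from the question
    value_bytes = 10              # per the question
    name_bytes = 8                # assuming long-style column names
    overhead_bytes = 23           # rough per-column storage overhead (assumption)

    row_bytes = columns * (value_bytes + name_bytes + overhead_bytes)
    print(row_bytes / (1024.0 ** 3))            # roughly 3.8 GB for one row
    print(row_bytes / (100 * 1024.0 * 1024.0))  # ~39 shards at a 100MB cap

So a single 100,000,000-column row would run to gigabytes, and even a 100MB cap implies splitting it into a few dozen rows.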


--  
Filippo Diotalevi