Posted to user@cassandra.apache.org by Kirk True <ki...@gmail.com> on 2013/07/01 18:19:33 UTC

10,000s of column families/keyspaces

Hi all,

I know it's an old topic, but I want to see if anything's changed on the
number of column families that C* supports, either in 1.2.x or 2.x.

For a number of reasons [1], we'd like to support multi-tenancy via
separate column families. The "problem" is that there are around 5,000
tenants to support, and each one needs a small handful of column
families.

The last I heard C* supports 'a couple of hundred' column families before
things start to bog down.

What will it take for C* to support 50,000 column families?

I'm about to dive into the code and run some tests, but I was curious about
how to quantify the overhead of a column family. Is the reason performance?
Memory? Does the off-heap work help here?

Thanks,
Kirk

[1] The main three reasons:


   1. ability to wholesale drop data for a given tenant via drop
   keyspace/drop CFs
   2. ability to have divergent schema for each tenant (partially affected
   by DSE Solr integration)
   3. secondary indexes per tenant (given requirement #2)

Re: 10,000s of column families/keyspaces

Posted by Edward Capriolo <ed...@gmail.com>.

There is another problem. You now need to run repair for a large number of
column families and keyspaces and manage that, look out for schema
mismatches etc.
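
To give a feel for the repair bookkeeping, here is a minimal sketch that only builds (and does not run) one `nodetool repair <keyspace> <column family>` command per column family. The schema layout, the `tenant_N` keyspace names, and the `cf_N` column family names are hypothetical; the 5,000 tenants come from the original question, and 10 column families per tenant is inferred from the 50,000 total.

```python
def repair_commands(schema):
    """Build one nodetool repair command per (keyspace, column family).

    schema is a hypothetical dict: keyspace name -> list of CF names.
    Commands are returned as argv lists; nothing is executed here.
    """
    commands = []
    for keyspace in sorted(schema):
        for cf in sorted(schema[keyspace]):
            commands.append(["nodetool", "repair", keyspace, cf])
    return commands


# Hypothetical layout: one keyspace per tenant, 10 CFs in each.
schema = {
    "tenant_%d" % t: ["cf_%d" % c for c in range(10)]
    for t in range(5000)
}

commands = repair_commands(schema)
print("%d repair commands per node" % len(commands))
```

Repairing a whole keyspace with a single command would cut the count to one per tenant, but that is still thousands of repairs to schedule and monitor on every node.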


On Mon, Jul 1, 2013 at 4:09 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Mon, Jul 1, 2013 at 9:19 AM, Kirk True <ki...@gmail.com> wrote:
>
>> What will it take for C* to support 50,000 column families?
>>
>
> As I understand it, a (the?) big problem with huge numbers of Column
> Families is that each ColumnFamily has a large number of MBeans associated
> with it, each of which consumes heap. So... a lot fewer MBeans per column
> family and/or MBean stuff not consuming heap? Then you still have the
> problem of each CF having at least one live Memtable, which even if empty
> will still consume heap...
>
> I'm thinking the real answer to "what it will take for C* to support 50k
> CFs" is "a JVM which can functionally support heap sizes over 8 GB" ...
> which seems unlikely to happen any time soon.
>
> =Rob
>

Re: 10,000s of column families/keyspaces

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Jul 1, 2013 at 9:19 AM, Kirk True <ki...@gmail.com> wrote:

> What will it take for C* to support 50,000 column families?
>

As I understand it, a (the?) big problem with huge numbers of Column
Families is that each ColumnFamily has a large number of MBeans associated
with it, each of which consumes heap. So... a lot fewer MBeans per column
family and/or MBean stuff not consuming heap? Then you still have the
problem of each CF having at least one live Memtable, which even if empty
will still consume heap...

I'm thinking the real answer to "what it will take for C* to support 50k
CFs" is "a JVM which can functionally support heap sizes over 8 GB" ...
which seems unlikely to happen any time soon.

=Rob
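
One rough way to put numbers on the argument above is to multiply the column family count by a per-CF baseline heap cost (MBeans plus one empty Memtable). The sketch below does only that arithmetic. The function name is made up for illustration, and both byte figures are placeholders rather than measurements; they would need to be replaced with values taken from a heap dump of a real node.

```python
def estimate_cf_heap(tenants, cfs_per_tenant, mbean_bytes_per_cf, memtable_bytes_per_cf):
    """Return (total column families, estimated baseline heap in bytes)."""
    total_cfs = tenants * cfs_per_tenant
    per_cf_bytes = mbean_bytes_per_cf + memtable_bytes_per_cf
    return total_cfs, total_cfs * per_cf_bytes


# 5,000 tenants is from the thread; 10 CFs per tenant is inferred
# from the 50,000 total mentioned in the question.
# PLACEHOLDERS, not measured values: substitute real per-CF costs.
MBEAN_BYTES = 64 * 1024
MEMTABLE_BYTES = 256 * 1024

total_cfs, heap_bytes = estimate_cf_heap(5000, 10, MBEAN_BYTES, MEMTABLE_BYTES)
print("%d CFs -> %.1f GiB baseline heap (placeholder inputs)"
      % (total_cfs, heap_bytes / 2.0 ** 30))
```

The point of the sketch is the shape of the calculation: the cost is linear in the number of column families, so any fixed per-CF overhead is multiplied by 50,000 before a single row is written.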

Re: 10,000s of column families/keyspaces

Posted by "Hiller, Dean" <De...@nrel.gov>.

Oh, and if you are using STCS, I don't think the issue below applies at
all, since STCS can already run in parallel if needed.

Dean

On 7/1/13 10:24 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

>We use playorm to run 80,000 virtual column families (a playorm feature,
>though the pattern could be copied). We found out later, and are working
>on this now, that we want to map the 80,000 virtual CFs into 10 real CFs
>so that leveled compaction can run more in parallel; otherwise we get
>stuck with single-threaded LCS at the last tier, which can take a while.
>We are about to map/reduce our dataset into our newest format.
>
>Dean

Re: 10,000s of column families/keyspaces

Posted by "Hiller, Dean" <De...@nrel.gov>.

We use playorm to run 80,000 virtual column families (a playorm feature, though the pattern could be copied). We found out later, and are working on this now, that we want to map the 80,000 virtual CFs into 10 real CFs so that leveled compaction can run more in parallel; otherwise we get stuck with single-threaded LCS at the last tier, which can take a while. We are about to map/reduce our dataset into our newest format.

Dean
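
For anyone wanting to copy the pattern, here is a minimal sketch of one way to fold many virtual column families into a few real ones. It is not playorm's actual implementation: the `data_NN` naming, the CRC32 bucketing, and the `:` key separator are all assumptions made for illustration, and the separator assumes virtual CF names never contain a colon.

```python
import zlib

REAL_CF_COUNT = 10  # from the thread: 80,000 virtual CFs into 10 real CFs


def real_cf_for(virtual_cf, real_cf_count=REAL_CF_COUNT):
    """Pick the real column family that stores a given virtual CF."""
    # crc32 is stable across processes and machines, unlike Python's
    # built-in hash() for strings, so every client agrees on the bucket.
    bucket = zlib.crc32(virtual_cf.encode("utf-8")) % real_cf_count
    return "data_%02d" % bucket  # hypothetical real CF naming scheme


def physical_row_key(virtual_cf, row_key):
    """Prefix the row key so virtual CFs sharing a real CF cannot collide."""
    return "%s:%s" % (virtual_cf, row_key)


virtual_cf = "tenant42_users"  # hypothetical virtual CF name
print(real_cf_for(virtual_cf), physical_row_key(virtual_cf, "alice"))
```

The trade-off is that dropping one tenant's data is no longer a single drop of a column family; the prefixed rows have to be deleted explicitly.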
