Posted to user@cassandra.apache.org by Jack Krupansky <ja...@gmail.com> on 2016/03/01 06:23:51 UTC

Re: Practical limit on number of column families

3,000 entries? What's an "entry"? Do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family.
With CQL3 you should be creating "tables". The practical recommendation of
an upper limit of a few hundred tables across all key spaces remains.

Technically you can go higher and technically you can reduce the overhead
per table (an undocumented Jira - intentionally undocumented since it is
strongly not recommended), but... it is unlikely that you will be happy
with the results.

What is the nature of the use case?

You basically have two choices: an additional clustering column to
distinguish categories of table, or a separate cluster for each few hundred
tables.
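A minimal CQL sketch of the first option, with hypothetical keyspace, table, and column names (a common variant of this suggestion folds the distinguishing column into the partition key, so each category lives in its own partitions of one shared table rather than in its own table):

```cql
-- Hypothetical example: instead of one table per category,
-- fold the category into the primary key of a single table.
CREATE TABLE my_keyspace.events_by_category (
    category text,        -- would otherwise have been a table name
    id       timeuuid,
    payload  text,
    PRIMARY KEY ((category), id)
);

-- Queries stay per-category by filtering on the partition key:
-- SELECT * FROM my_keyspace.events_by_category WHERE category = 'audit';
```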


-- Jack Krupansky

On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
fernando.jimenez@wealth-port.com> wrote:

> Hi all
>
> I have a use case for Cassandra that would require creating a large number
> of column families. I have found references to early versions of Cassandra
> where each column family would require a fixed amount of memory on all
> nodes, effectively imposing an upper limit on the total number of CFs. I
> have also seen rumblings that this may have been fixed in later versions.
>
> To put the question to rest, I have setup a DSE sandbox and created some
> code to generate column families populated with 3,000 entries each.
>
> Unfortunately I have now hit this issue:
> https://issues.apache.org/jira/browse/CASSANDRA-9291
>
> So I will have to retest against Cassandra 3.0 instead
>
> However, I would like to understand the limitations regarding creation of
> column families.
>
> * Is there a practical upper limit?
> * is this a fixed limit, or does it scale as more nodes are added into the
> cluster?
> * Is there a difference between one keyspace with thousands of column
> families, vs thousands of keyspaces with only a few column families each?
>
> I haven’t found any hard evidence/documentation to help me here, but if
> you can point me in the right direction, I will oblige and RTFM away.
>
> Many thanks for your help!
>
> Cheers
> FJ
>
>
>

Re: Practical limit on number of column families

Posted by Vlad <qa...@yahoo.com>.
>If your Jira search fu is strong enough
And it is! )

>you should be able to find it yourself
And I did! )

I see that this issue originates from a problem with the Java GC's design, but judging by the date, that was the Java 6 era. Now we have Java 8 with a new GC mechanism.
Does this problem still exist with Java 8? Any chance of using the original method to reduce the overhead and "be happy with the results"?
Regards, Vlad
 

    On Tuesday, March 1, 2016 4:07 PM, Jack Krupansky <ja...@gmail.com> wrote:
 

I'll defer to one of the senior committers as to whether they want that information disseminated any further than it already is. It was intentionally not documented since it is not recommended. If your Jira search fu is strong enough you should be able to find it yourself, but again, its use is strongly not recommended.
As the Jira notes, "having more than dozens or hundreds of tables defined is almost certainly a Bad Idea."
"Bad Idea" means not good. As in don't go there. And if you do, don't expect such a misadventure to be supported by the community.
-- Jack Krupansky
On Tue, Mar 1, 2016 at 8:39 AM, Vlad <qa...@yahoo.com> wrote:

Hi Jack,
>you can reduce the overhead per table (an undocumented Jira)
Can you please point to this Jira number?

>it is strongly not recommended
What are the consequences of this (besides performance degradation, if any)?
Thanks.


    On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky <ja...@gmail.com> wrote:
 

 3,000 entries? What's an "entry"? Do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all key spaces remains.
Technically you can go higher and technically you can reduce the overhead per table (an undocumented Jira - intentionally undocumented since it is strongly not recommended), but... it is unlikely that you will be happy with the results.
What is the nature of the use case?
You basically have two choices: an additional cluster column to distinguish categories of table, or separate clusters for each few hundred of tables.

-- Jack Krupansky
On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fe...@wealth-port.com> wrote:

Hi all
I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family would require a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.
To put the question to rest, I have setup a DSE sandbox and created some code to generate column families populated with 3,000 entries each.
Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291
So I will have to retest against Cassandra 3.0 instead
However, I would like to understand the limitations regarding creation of column families. 
 * Is there a practical upper limit?
 * Is this a fixed limit, or does it scale as more nodes are added into the cluster?
 * Is there a difference between one keyspace with thousands of column families, vs thousands of keyspaces with only a few column families each?
I haven’t found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.
Many thanks for your help!
Cheers
FJ


Re: Practical limit on number of column families

Posted by Jack Krupansky <ja...@gmail.com>.
I'll defer to one of the senior committers as to whether they want that
information disseminated any further than it already is. It was
intentionally not documented since it is not recommended. If your Jira
search fu is strong enough you should be able to find it yourself, but
again, its use is strongly not recommended.

As the Jira notes, "having more than dozens or hundreds of tables defined
is almost certainly a Bad Idea."

"Bad Idea" means not good. As in don't go there. And if you do, don't
expect such a misadventure to be supported by the community.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 8:39 AM, Vlad <qa...@yahoo.com> wrote:

> Hi Jack,
>
> >you can reduce the overhead per table  an undocumented Jira
> Can you please point to this Jira number?
>
> >it is strongly not recommended
> What is consequences of this (besides performance degradation, if any)?
>
> Thanks.
>
>
> On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky <
> jack.krupansky@gmail.com> wrote:
>
>
> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>
> You are using the obsolete terminology of CQL2 and Thrift - column family.
> With CQL3 you should be creating "tables". The practical recommendation of
> an upper limit of a few hundred tables across all key spaces remains.
>
> Technically you can go higher and technically you can reduce the overhead
> per table (an undocumented Jira - intentionally undocumented since it is
> strongly not recommended), but... it is unlikely that you will be happy
> with the results.
>
> What is the nature of the use case?
>
> You basically have two choices: an additional cluster column to
> distinguish categories of table, or separate clusters for each few hundred
> of tables.
>
>
> -- Jack Krupansky
>
> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
> fernando.jimenez@wealth-port.com> wrote:
>
> Hi all
>
> I have a use case for Cassandra that would require creating a large number
> of column families. I have found references to early versions of Cassandra
> where each column family would require a fixed amount of memory on all
> nodes, effectively imposing an upper limit on the total number of CFs. I
> have also seen rumblings that this may have been fixed in later versions.
>
> To put the question to rest, I have setup a DSE sandbox and created some
> code to generate column families populated with 3,000 entries each.
>
> Unfortunately I have now hit this issue:
> https://issues.apache.org/jira/browse/CASSANDRA-9291
>
> So I will have to retest against Cassandra 3.0 instead
>
> However, I would like to understand the limitations regarding creation of
> column families.
>
> * Is there a practical upper limit?
> * is this a fixed limit, or does it scale as more nodes are added into the
> cluster?
> * Is there a difference between one keyspace with thousands of column
> families, vs thousands of keyspaces with only a few column families each?
>
> I haven’t found any hard evidence/documentation to help me here, but if
> you can point me in the right direction, I will oblige and RTFM away.
>
> Many thanks for your help!
>
> Cheers
> FJ
>
>
>
>
>
>

Re: Practical limit on number of column families

Posted by Vlad <qa...@yahoo.com>.
Hi Jack,
>you can reduce the overhead per table (an undocumented Jira)
Can you please point to this Jira number?

>it is strongly not recommended
What are the consequences of this (besides performance degradation, if any)?
Thanks.


    On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky <ja...@gmail.com> wrote:
 

 3,000 entries? What's an "entry"? Do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all key spaces remains.
Technically you can go higher and technically you can reduce the overhead per table (an undocumented Jira - intentionally undocumented since it is strongly not recommended), but... it is unlikely that you will be happy with the results.
What is the nature of the use case?
You basically have two choices: an additional cluster column to distinguish categories of table, or separate clusters for each few hundred of tables.

-- Jack Krupansky
On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fe...@wealth-port.com> wrote:

Hi all
I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family would require a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.
To put the question to rest, I have setup a DSE sandbox and created some code to generate column families populated with 3,000 entries each.
Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291
So I will have to retest against Cassandra 3.0 instead
However, I would like to understand the limitations regarding creation of column families. 
 * Is there a practical upper limit?
 * Is this a fixed limit, or does it scale as more nodes are added into the cluster?
 * Is there a difference between one keyspace with thousands of column families, vs thousands of keyspaces with only a few column families each?
I haven’t found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.
Many thanks for your help!
Cheers
FJ


Re: Practical limit on number of column families

Posted by Jack Krupansky <ja...@gmail.com>.
It is the total table count, across all key spaces. Memory is memory.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 6:26 PM, Brian Sam-Bodden <bs...@integrallis.com>
wrote:

> Eric,
>   Is the keyspace as a multitenancy solution as bad as the many tables
> pattern? Is the memory overhead of keyspaces as heavy as that of tables?
>
> Cheers,
> Brian
>
>
> On Tuesday, March 1, 2016, Eric Stevens <mi...@gmail.com> wrote:
>
>> It's definitely not true for every use case of a large number of tables,
>> but for many uses where you'd be tempted to do that, adding whatever would
>> have driven your table naming instead as a column in your partition key on
>> a smaller number of tables will meet your needs.  This is especially true
>> if you're looking to solve multi-tenancy, unless you let your tenants
>> dynamically drive your schema (which is a separate can of worms).
>>
>> On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky <ja...@gmail.com>
>> wrote:
>>
>>> I don't think Cassandra was "purposefully developed" for some target
>>> number of tables - there is no evidence of any such an explicit intent.
>>> Instead, it would be fair to say that Cassandra was "not purposefully
>>> developed" with a goal of supporting "large numbers of tables." Sometimes
>>> features and capabilities come for free or as a side effect of the
>>> technologies used, but usually specific features and specific capabilities
>>> (such as large numbers of tables) require explicit intent and explicit
>>> effort.
>>>
>>> One could indeed endeavor to design a data store (I'm not even sure it
>>> would still be considered a database per se) that supported either large
>>> numbers of tables or an additional level of storage model in between table
>>> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
>>> not designed with that goal in mind.
>>>
>>> Traditionally, a "table" is a defined relation over a set of data.
>>> Relation and data are distinct concepts. And a relation name is not simply
>>> a Java-style "object". A relation (table) name is supposed to represent an
>>> abstraction or entity type, while essentially all of the cases I have heard
>>> of for wanting thousands (or even hundreds) of tables are trying to use
>>> table as more of a container for a group of rows for a specific entity
>>> instance rather than a distinct entity type. Granted, Cassandra is not
>>> obligated to be limited to the relational model, but Cassandra, especially
>>> CQL, is intentionally modeled reasonably closely with the relational model
>>> in terms of the data modeling abstractions even though the storage engine
>>> is designed to scale across nodes.
>>>
>>> You could file a Jira requesting such a feature improvement. And then we
>>> would see if sentiment has shifted over the years.
>>>
>>> The key thing is to offer up a use case that warrants support for large
>>> numbers of tables. So far, it has usually been the case that the perceived
>>> need for separate tables could easily be met using clustering columns of a
>>> single table.
>>>
>>> Seriously, if you guys can define a legitimate use case that can't
>>> easily be handled by a single table, that could get the discussion started.
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
>>> fernando.jimenez@wealth-port.com> wrote:
>>>
>>>> Hi Jack
>>>>
>>>> Being purposefully developed to only handle up to “a few hundred”
>>>> tables is reason enough. I accept that, and likely a use case with many
>>>> tables was never really considered. But I would still like to understand
>>>> the design choices made so perhaps we gain some confidence level in this
>>>> upper limit in the number of tables. The best estimate we have so far is “a
>>>> few hundred” which is a bit vague.
>>>>
>>>> Regarding scaling, I’m not talking about scaling in terms of data
>>>> volume, but on how the data is structured. One thousand tables with one row
>>>> each is the same data volume as one table with one thousand rows, excluding
>>>> any data structures required to maintain the extra tables. But whereas the
>>>> first seems likely to bring a Cassandra cluster to its knees, the second
>>>> will run happily on a single node cluster in a low end machine.
>>>>
>>>> We will design our code to use a single table to avoid having
>>>> nightmares with this issue. But if there is any authoritative documentation
>>>> on this characteristic of Cassandra, I would love to know more.
>>>>
>>>> FJ
>>>>
>>>>
>>>> On 01 Mar 2016, at 14:23, Jack Krupansky <ja...@gmail.com>
>>>> wrote:
>>>>
>>>> I don't think there are any "reasons behind it." It is simply empirical
>>>> experience - as reported here.
>>>>
>>>> Cassandra scales in two dimension - number of rows per node and number
>>>> of nodes. If some source of information lead you to believe otherwise,
>>>> please point out the source so that we can endeavor to correct it.
>>>>
>>>> The exact number of rows per node and tables per node will always have
>>>> to be evaluated empirically - a proof of concept implementation, since it
>>>> all depends on the mix of capabilities of your hardware combined with your
>>>> specific data model, your specific data values, your specific access
>>>> patterns, and your specific load. And it also depends on your own personal
>>>> tolerance for degradation of latency and throughput - some people might
>>>> find a given set of performance  metrics acceptable while other might not.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
>>>> fernando.jimenez@wealth-port.com> wrote:
>>>>
>>>>> Hi Tommaso
>>>>>
>>>>> It’s not that I _need_ a large number of tables. This approach maps
>>>>> easily to the problem we are trying to solve, but it’s becoming clear it’s
>>>>> not the right approach.
>>>>>
>>>>> At the moment I’m trying to understand the limitations in Cassandra
>>>>> regarding number of Tables and the reasons behind it. I’ve come to the
>>>>> email list as my Google-fu is not giving me what I’m looking for :(
>>>>>
>>>>> FJ
>>>>>
>>>>>
>>>>>
>>>>> On 01 Mar 2016, at 09:36, tommaso barbugli <tb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi Fernando,
>>>>>
>>>>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it
>>>>> was a real pain in terms of operations. Repairs were terribly slow, booting
>>>>> C* slowed down, and in general tracking table metrics became a bit more work.
>>>>> Why do you need this high number of tables?
>>>>>
>>>>> Tommaso
>>>>>
>>>>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
>>>>> fernando.jimenez@wealth-port.com> wrote:
>>>>>
>>>>>> Hi Jack
>>>>>>
>>>>>> By entry I mean row
>>>>>>
>>>>>> Apologies for the “obsolete terminology”. When I first looked at
>>>>>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>>>>>> defaulted to the terms I already knew. I will bear it in mind and call them
>>>>>> tables from now on.
>>>>>>
>>>>>> Is there any documentation about this limit? for example, I’d be keen
>>>>>> to know how much memory is consumed per table, and I’m also curious about
>>>>>> the reasons for keeping this in memory. I’m trying to understand the
>>>>>> limitations here, rather than challenge them.
>>>>>>
>>>>>> So far I found nothing in my search, hence why I had to resort to
>>>>>> some “load testing” to see what happens when you push the table count high
>>>>>>
>>>>>> Thanks
>>>>>> FJ
>>>>>>
>>>>>>
>>>>>> On 01 Mar 2016, at 06:23, Jack Krupansky <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>>>>
>>>>>> You are using the obsolete terminology of CQL2 and Thrift - column
>>>>>> family. With CQL3 you should be creating "tables". The practical
>>>>>> recommendation of an upper limit of a few hundred tables across all key
>>>>>> spaces remains.
>>>>>>
>>>>>> Technically you can go higher and technically you can reduce the
>>>>>> overhead per table (an undocumented Jira - intentionally undocumented since
>>>>>> it is strongly not recommended), but... it is unlikely that you will be
>>>>>> happy with the results.
>>>>>>
>>>>>> What is the nature of the use case?
>>>>>>
>>>>>> You basically have two choices: an additional cluster column to
>>>>>> distinguish categories of table, or separate clusters for each few hundred
>>>>>> of tables.
>>>>>>
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
>>>>>> fernando.jimenez@wealth-port.com> wrote:
>>>>>>
>>>>>>> Hi all
>>>>>>>
>>>>>>> I have a use case for Cassandra that would require creating a large
>>>>>>> number of column families. I have found references to early versions of
>>>>>>> Cassandra where each column family would require a fixed amount of memory
>>>>>>> on all nodes, effectively imposing an upper limit on the total number of
>>>>>>> CFs. I have also seen rumblings that this may have been fixed in later
>>>>>>> versions.
>>>>>>>
>>>>>>> To put the question to rest, I have setup a DSE sandbox and created
>>>>>>> some code to generate column families populated with 3,000 entries each.
>>>>>>>
>>>>>>> Unfortunately I have now hit this issue:
>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>>>>>
>>>>>>> So I will have to retest against Cassandra 3.0 instead
>>>>>>>
>>>>>>> However, I would like to understand the limitations regarding
>>>>>>> creation of column families.
>>>>>>>
>>>>>>> * Is there a practical upper limit?
>>>>>>> * is this a fixed limit, or does it scale as more nodes are added
>>>>>>> into the cluster?
>>>>>>> * Is there a difference between one keyspace with thousands of
>>>>>>> column families, vs thousands of keyspaces with only a few column families
>>>>>>> each?
>>>>>>>
>>>>>>> I haven’t found any hard evidence/documentation to help me here, but
>>>>>>> if you can point me in the right direction, I will oblige and RTFM away.
>>>>>>>
>>>>>>> Many thanks for your help!
>>>>>>>
>>>>>>> Cheers
>>>>>>> FJ
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>
> --
> Cheers,
> Brian
> http://www.integrallis.com
>
>

Re: Practical limit on number of column families

Posted by Brian Sam-Bodden <bs...@integrallis.com>.
Eric,
  Is keyspace-per-tenant as a multitenancy solution as bad as the many-tables
pattern? Is the memory overhead of keyspaces as heavy as that of tables?

Cheers,
Brian

On Tuesday, March 1, 2016, Eric Stevens <mi...@gmail.com> wrote:

> It's definitely not true for every use case of a large number of tables,
> but for many uses where you'd be tempted to do that, adding whatever would
> have driven your table naming instead as a column in your partition key on
> a smaller number of tables will meet your needs.  This is especially true
> if you're looking to solve multi-tenancy, unless you let your tenants
> dynamically drive your schema (which is a separate can of worms).
>
> On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky <jack.krupansky@gmail.com
> <javascript:_e(%7B%7D,'cvml','jack.krupansky@gmail.com');>> wrote:
>
>> I don't think Cassandra was "purposefully developed" for some target
>> number of tables - there is no evidence of any such an explicit intent.
>> Instead, it would be fair to say that Cassandra was "not purposefully
>> developed" with a goal of supporting "large numbers of tables." Sometimes
>> features and capabilities come for free or as a side effect of the
>> technologies used, but usually specific features and specific capabilities
>> (such as large numbers of tables) require explicit intent and explicit
>> effort.
>>
>> One could indeed endeavor to design a data store (I'm not even sure it
>> would still be considered a database per se) that supported either large
>> numbers of tables or an additional level of storage model in between table
>> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
>> not designed with that goal in mind.
>>
>> Traditionally, a "table" is a defined relation over a set of data.
>> Relation and data are distinct concepts. And a relation name is not simply
>> a Java-style "object". A relation (table) name is supposed to represent an
>> abstraction or entity type, while essentially all of the cases I have heard
>> of for wanting thousands (or even hundreds) of tables are trying to use
>> table as more of a container for a group of rows for a specific entity
>> instance rather than a distinct entity type. Granted, Cassandra is not
>> obligated to be limited to the relational model, but Cassandra, especially
>> CQL, is intentionally modeled reasonably closely with the relational model
>> in terms of the data modeling abstractions even though the storage engine
>> is designed to scale across nodes.
>>
>> You could file a Jira requesting such a feature improvement. And then we
>> would see if sentiment has shifted over the years.
>>
>> The key thing is to offer up a use case that warrants support for large
>> numbers of tables. So far, it has usually been the case that the perceived
>> need for separate tables could easily be met using clustering columns of a
>> single table.
>>
>> Seriously, if you guys can define a legitimate use case that can't easily
>> be handled by a single table, that could get the discussion started.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
>> fernando.jimenez@wealth-port.com
>> <javascript:_e(%7B%7D,'cvml','fernando.jimenez@wealth-port.com');>>
>> wrote:
>>
>>> Hi Jack
>>>
>>> Being purposefully developed to only handle up to “a few hundred” tables
>>> is reason enough. I accept that, and likely a use case with many tables was
>>> never really considered. But I would still like to understand the design
>>> choices made so perhaps we gain some confidence level in this upper limit
>>> in the number of tables. The best estimate we have so far is “a few
>>> hundred” which is a bit vague.
>>>
>>> Regarding scaling, I’m not talking about scaling in terms of data
>>> volume, but on how the data is structured. One thousand tables with one row
>>> each is the same data volume as one table with one thousand rows, excluding
>>> any data structures required to maintain the extra tables. But whereas the
>>> first seems likely to bring a Cassandra cluster to its knees, the second
>>> will run happily on a single node cluster in a low end machine.
>>>
>>> We will design our code to use a single table to avoid having nightmares
>>> with this issue. But if there is any authoritative documentation on this
>>> characteristic of Cassandra, I would love to know more.
>>>
>>> FJ
>>>
>>>
>>> On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupansky@gmail.com
>>> <javascript:_e(%7B%7D,'cvml','jack.krupansky@gmail.com');>> wrote:
>>>
>>> I don't think there are any "reasons behind it." It is simply empirical
>>> experience - as reported here.
>>>
>>> Cassandra scales in two dimension - number of rows per node and number
>>> of nodes. If some source of information lead you to believe otherwise,
>>> please point out the source so that we can endeavor to correct it.
>>>
>>> The exact number of rows per node and tables per node will always have
>>> to be evaluated empirically - a proof of concept implementation, since it
>>> all depends on the mix of capabilities of your hardware combined with your
>>> specific data model, your specific data values, your specific access
>>> patterns, and your specific load. And it also depends on your own personal
>>> tolerance for degradation of latency and throughput - some people might
>>> find a given set of performance  metrics acceptable while other might not.
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
>>> fernando.jimenez@wealth-port.com
>>> <javascript:_e(%7B%7D,'cvml','fernando.jimenez@wealth-port.com');>>
>>> wrote:
>>>
>>>> Hi Tommaso
>>>>
>>>> It’s not that I _need_ a large number of tables. This approach maps
>>>> easily to the problem we are trying to solve, but it’s becoming clear it’s
>>>> not the right approach.
>>>>
>>>> At the moment I’m trying to understand the limitations in Cassandra
>>>> regarding number of Tables and the reasons behind it. I’ve come to the
>>>> email list as my Google-fu is not giving me what I’m looking for :(
>>>>
>>>> FJ
>>>>
>>>>
>>>>
>>>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbugli@gmail.com
>>>> <javascript:_e(%7B%7D,'cvml','tbarbugli@gmail.com');>> wrote:
>>>>
>>>> Hi Fernando,
>>>>
>>>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it
>>>> was a real pain in terms of operations. Repairs were terribly slow, booting
>>>> C* slowed down, and in general tracking table metrics became a bit more work.
>>>> Why do you need this high number of tables?
>>>>
>>>> Tommaso
>>>>
>>>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
>>>> fernando.jimenez@wealth-port.com
>>>> <javascript:_e(%7B%7D,'cvml','fernando.jimenez@wealth-port.com');>>
>>>> wrote:
>>>>
>>>>> Hi Jack
>>>>>
>>>>> By entry I mean row
>>>>>
>>>>> Apologies for the “obsolete terminology”. When I first looked at
>>>>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>>>>> defaulted to the terms I already knew. I will bear it in mind and call them
>>>>> tables from now on.
>>>>>
>>>>> Is there any documentation about this limit? for example, I’d be keen
>>>>> to know how much memory is consumed per table, and I’m also curious about
>>>>> the reasons for keeping this in memory. I’m trying to understand the
>>>>> limitations here, rather than challenge them.
>>>>>
>>>>> So far I found nothing in my search, hence why I had to resort to some
>>>>> “load testing” to see what happens when you push the table count high
>>>>>
>>>>> Thanks
>>>>> FJ
>>>>>
>>>>>
>>>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupansky@gmail.com
>>>>> <javascript:_e(%7B%7D,'cvml','jack.krupansky@gmail.com');>> wrote:
>>>>>
>>>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>>>
>>>>> You are using the obsolete terminology of CQL2 and Thrift - column
>>>>> family. With CQL3 you should be creating "tables". The practical
>>>>> recommendation of an upper limit of a few hundred tables across all key
>>>>> spaces remains.
>>>>>
>>>>> Technically you can go higher and technically you can reduce the
>>>>> overhead per table (an undocumented Jira - intentionally undocumented since
>>>>> it is strongly not recommended), but... it is unlikely that you will be
>>>>> happy with the results.
>>>>>
>>>>> What is the nature of the use case?
>>>>>
>>>>> You basically have two choices: an additional cluster column to
>>>>> distinguish categories of table, or separate clusters for each few hundred
>>>>> of tables.
>>>>>
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
>>>>> fernando.jimenez@wealth-port.com
>>>>> <javascript:_e(%7B%7D,'cvml','fernando.jimenez@wealth-port.com');>>
>>>>> wrote:
>>>>>
>>>>>> Hi all
>>>>>>
>>>>>> I have a use case for Cassandra that would require creating a large
>>>>>> number of column families. I have found references to early versions of
>>>>>> Cassandra where each column family would require a fixed amount of memory
>>>>>> on all nodes, effectively imposing an upper limit on the total number of
>>>>>> CFs. I have also seen rumblings that this may have been fixed in later
>>>>>> versions.
>>>>>>
>>>>>> To put the question to rest, I have setup a DSE sandbox and created
>>>>>> some code to generate column families populated with 3,000 entries each.
>>>>>>
>>>>>> Unfortunately I have now hit this issue:
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>>>>
>>>>>> So I will have to retest against Cassandra 3.0 instead
>>>>>>
>>>>>> However, I would like to understand the limitations regarding
>>>>>> creation of column families.
>>>>>>
>>>>>> * Is there a practical upper limit?
>>>>>> * is this a fixed limit, or does it scale as more nodes are added
>>>>>> into the cluster?
>>>>>> * Is there a difference between one keyspace with thousands of column
>>>>>> families, vs thousands of keyspaces with only a few column families each?
>>>>>>
>>>>>> I haven’t found any hard evidence/documentation to help me here, but
>>>>>> if you can point me in the right direction, I will oblige and RTFM away.
>>>>>>
>>>>>> Many thanks for your help!
>>>>>>
>>>>>> Cheers
>>>>>> FJ
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>

-- 
Cheers,
Brian
http://www.integrallis.com

Re: Practical limit on number of column families

Posted by Eric Stevens <mi...@gmail.com>.
It's definitely not true for every use case involving a large number of
tables, but in many cases where you'd be tempted to do that, taking
whatever would have driven your table naming and adding it instead as a
column in the partition key of a smaller number of tables will meet your
needs. This is especially true if you're looking to solve multi-tenancy,
unless you let your tenants dynamically drive your schema (which is a
separate can of worms).
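
A rough sketch of that idea in CQL (the table and column names here are hypothetical, invented for illustration rather than taken from this thread): instead of one table per tenant, the would-be table name becomes part of the partition key of a single table:

```sql
-- Anti-pattern: one table per tenant
--   CREATE TABLE acme_events (...);
--   CREATE TABLE globex_events (...);

-- Sketch: one table, with the tenant in the partition key
CREATE TABLE events (
    tenant_id text,      -- what would otherwise have driven the table name
    event_id  timeuuid,
    payload   text,
    PRIMARY KEY ((tenant_id), event_id)
);

-- Each tenant's rows live in their own partitions
SELECT * FROM events WHERE tenant_id = 'acme';
```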

On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky <ja...@gmail.com>
wrote:

> I don't think Cassandra was "purposefully developed" for some target
> number of tables - there is no evidence of any such explicit intent.
> Instead, it would be fair to say that Cassandra was "not purposefully
> developed" with a goal of supporting "large numbers of tables." Sometimes
> features and capabilities come for free or as a side effect of the
> technologies used, but usually specific features and specific capabilities
> (such as large numbers of tables) require explicit intent and explicit
> effort.
>
> One could indeed endeavor to design a data store (I'm not even sure it
> would still be considered a database per se) that supported either large
> numbers of tables or an additional level of storage model in between table
> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
> not designed with that goal in mind.
>
> Traditionally, a "table" is a defined relation over a set of data.
> Relation and data are distinct concepts. And a relation name is not simply
> a Java-style "object". A relation (table) name is supposed to represent an
> abstraction or entity type, while essentially all of the cases I have heard
> of for wanting thousands (or even hundreds) of tables are trying to use
> table as more of a container for a group of rows for a specific entity
> instance rather than a distinct entity type. Granted, Cassandra is not
> obligated to be limited to the relational model, but Cassandra, especially
> CQL, is intentionally modeled reasonably closely with the relational model
> in terms of the data modeling abstractions even though the storage engine
> is designed to scale across nodes.
>
> You could file a Jira requesting such a feature improvement. And then we
> would see if sentiment has shifted over the years.
>
> The key thing is to offer up a use case that warrants support for large
> numbers of tables. So far, it has usually been the case that the perceived
> need for separate tables could easily be met using clustering columns of a
> single table.
>
> Seriously, if you guys can define a legitimate use case that can't easily
> be handled by a single table, that could get the discussion started.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
> fernando.jimenez@wealth-port.com> wrote:
>
>> Hi Jack
>>
>> Being purposefully developed to only handle up to “a few hundred” tables
>> is reason enough. I accept that, and likely a use case with many tables was
>> never really considered. But I would still like to understand the design
>> choices made so perhaps we gain some confidence level in this upper limit
>> in the number of tables. The best estimate we have so far is “a few
>> hundred” which is a bit vague.
>>
>> Regarding scaling, I’m not talking about scaling in terms of data volume,
>> but about how the data is structured. One thousand tables with one row each is
>> the same data volume as one table with one thousand rows, excluding any
>> data structures required to maintain the extra tables. But whereas the
>> first seems likely to bring a Cassandra cluster to its knees, the second
>> will run happily on a single node cluster in a low end machine.
>>
>> We will design our code to use a single table to avoid having nightmares
>> with this issue. But if there is any authoritative documentation on this
>> characteristic of Cassandra, I would love to know more.
>>
>> FJ
>>
>>
>> On 01 Mar 2016, at 14:23, Jack Krupansky <ja...@gmail.com>
>> wrote:
>>
>> I don't think there are any "reasons behind it." It is simply empirical
>> experience - as reported here.
>>
>> Cassandra scales in two dimensions - number of rows per node and number of
>> nodes. If some source of information led you to believe otherwise, please
>> point out the source so that we can endeavor to correct it.
>>
>> The exact number of rows per node and tables per node will always have to
>> be evaluated empirically - a proof of concept implementation, since it all
>> depends on the mix of capabilities of your hardware combined with your
>> specific data model, your specific data values, your specific access
>> patterns, and your specific load. And it also depends on your own personal
>> tolerance for degradation of latency and throughput - some people might
>> find a given set of performance metrics acceptable while others might not.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
>> fernando.jimenez@wealth-port.com> wrote:
>>
>>> Hi Tommaso
>>>
>>> It’s not that I _need_ a large number of tables. This approach maps
>>> easily to the problem we are trying to solve, but it’s becoming clear it’s
>>> not the right approach.
>>>
>>> At the moment I’m trying to understand the limitations in Cassandra
>>> regarding number of Tables and the reasons behind it. I’ve come to the
>>> email list as my Google-fu is not giving me what I’m looking for :(
>>>
>>> FJ
>>>
>>>
>>>
>>> On 01 Mar 2016, at 09:36, tommaso barbugli <tb...@gmail.com> wrote:
>>>
>>> Hi Fernando,
>>>
>>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was
>>> a real pain in terms of operations. Repairs were terribly slow, boot of C*
>>> slowed down, and in general tracking table metrics became a bit more work.
>>> Why do you need this high number of tables?
>>>
>>> Tommaso
>>>
>>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
>>> fernando.jimenez@wealth-port.com> wrote:
>>>
>>>> Hi Jack
>>>>
>>>> By entry I mean row
>>>>
>>>> Apologies for the “obsolete terminology”. When I first looked at
>>>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>>>> defaulted to the terms I already knew. I will bear it in mind and call them
>>>> tables from now on.
>>>>
>>>> Is there any documentation about this limit? for example, I’d be keen
>>>> to know how much memory is consumed per table, and I’m also curious about
>>>> the reasons for keeping this in memory. I’m trying to understand the
>>>> limitations here, rather than challenge them.
>>>>
>>>> So far I found nothing in my search, hence why I had to resort to some
>>>> “load testing” to see what happens when you push the table count high
>>>>
>>>> Thanks
>>>> FJ
>>>>
>>>>
>>>> On 01 Mar 2016, at 06:23, Jack Krupansky <ja...@gmail.com>
>>>> wrote:
>>>>
>>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>>
>>>> You are using the obsolete terminology of CQL2 and Thrift - column
>>>> family. With CQL3 you should be creating "tables". The practical
>>>> recommendation of an upper limit of a few hundred tables across all key
>>>> spaces remains.
>>>>
>>>> Technically you can go higher and technically you can reduce the
>>>> overhead per table (an undocumented Jira - intentionally undocumented since
>>>> it is strongly not recommended), but... it is unlikely that you will be
>>>> happy with the results.
>>>>
>>>> What is the nature of the use case?
>>>>
>>>> You basically have two choices: an additional clustering column to
>>>> distinguish categories of table, or a separate cluster for each few
>>>> hundred tables.
>>>>
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
>>>> fernando.jimenez@wealth-port.com> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> I have a use case for Cassandra that would require creating a large
>>>>> number of column families. I have found references to early versions of
>>>>> Cassandra where each column family would require a fixed amount of memory
>>>>> on all nodes, effectively imposing an upper limit on the total number of
>>>>> CFs. I have also seen rumblings that this may have been fixed in later
>>>>> versions.
>>>>>
>>>>> To put the question to rest, I have set up a DSE sandbox and created
>>>>> some code to generate column families populated with 3,000 entries each.
>>>>>
>>>>> Unfortunately I have now hit this issue:
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>>>
>>>>> So I will have to retest against Cassandra 3.0 instead
>>>>>
>>>>> However, I would like to understand the limitations regarding creation
>>>>> of column families.
>>>>>
>>>>> * Is there a practical upper limit?
>>>>> * is this a fixed limit, or does it scale as more nodes are added into
>>>>> the cluster?
>>>>> * Is there a difference between one keyspace with thousands of column
>>>>> families, vs thousands of keyspaces with only a few column families each?
>>>>>
>>>>> I haven’t found any hard evidence/documentation to help me here, but
>>>>> if you can point me in the right direction, I will oblige and RTFM away.
>>>>>
>>>>> Many thanks for your help!
>>>>>
>>>>> Cheers
>>>>> FJ
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

Re: Practical limit on number of column families

Posted by Jack Krupansky <ja...@gmail.com>.
I don't think Cassandra was "purposefully developed" for some target number
of tables - there is no evidence of any such explicit intent. Instead,
it would be fair to say that Cassandra was "not purposefully developed"
with a goal of supporting "large numbers of tables." Sometimes features and
capabilities come for free or as a side effect of the technologies used,
but usually specific features and specific capabilities (such as large
numbers of tables) require explicit intent and explicit effort.

One could indeed endeavor to design a data store (I'm not even sure it
would still be considered a database per se) that supported either large
numbers of tables or an additional level of storage model in between table
and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
not designed with that goal in mind.

Traditionally, a "table" is a defined relation over a set of data. Relation
and data are distinct concepts. And a relation name is not simply a
Java-style "object". A relation (table) name is supposed to represent an
abstraction or entity type, while essentially all of the cases I have heard
of for wanting thousands (or even hundreds) of tables are trying to use
table as more of a container for a group of rows for a specific entity
instance rather than a distinct entity type. Granted, Cassandra is not
obligated to be limited to the relational model, but Cassandra, especially
CQL, is intentionally modeled reasonably closely with the relational model
in terms of the data modeling abstractions even though the storage engine
is designed to scale across nodes.

You could file a Jira requesting such a feature improvement. And then we
would see if sentiment has shifted over the years.

The key thing is to offer up a use case that warrants support for large
numbers of tables. So far, it has usually been the case that the perceived
need for separate tables could easily be met using clustering columns of a
single table.
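
As a hypothetical sketch of that clustering-column approach (names invented for illustration, not taken from any poster's schema), the per-table split can usually be expressed as a clustering column instead:

```sql
-- Rather than a separate table per category:
CREATE TABLE measurements (
    device_id text,
    category  text,      -- distinguishes what would have been separate tables
    ts        timestamp,
    value     double,
    PRIMARY KEY ((device_id), category, ts)
);

-- Queries scoped to one category remain efficient:
SELECT * FROM measurements WHERE device_id = 'd1' AND category = 'temp';
```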

Seriously, if you guys can define a legitimate use case that can't easily
be handled by a single table, that could get the discussion started.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
fernando.jimenez@wealth-port.com> wrote:

> Hi Jack
>
> Being purposefully developed to only handle up to “a few hundred” tables
> is reason enough. I accept that, and likely a use case with many tables was
> never really considered. But I would still like to understand the design
> choices made so perhaps we gain some confidence level in this upper limit
> in the number of tables. The best estimate we have so far is “a few
> hundred” which is a bit vague.
>
> Regarding scaling, I’m not talking about scaling in terms of data volume,
> but about how the data is structured. One thousand tables with one row each is
> the same data volume as one table with one thousand rows, excluding any
> data structures required to maintain the extra tables. But whereas the
> first seems likely to bring a Cassandra cluster to its knees, the second
> will run happily on a single node cluster in a low end machine.
>
> We will design our code to use a single table to avoid having nightmares
> with this issue. But if there is any authoritative documentation on this
> characteristic of Cassandra, I would love to know more.
>
> FJ
>
>
> On 01 Mar 2016, at 14:23, Jack Krupansky <ja...@gmail.com> wrote:
>
> I don't think there are any "reasons behind it." It is simply empirical
> experience - as reported here.
>
> Cassandra scales in two dimensions - number of rows per node and number of
> nodes. If some source of information led you to believe otherwise, please
> point out the source so that we can endeavor to correct it.
>
> The exact number of rows per node and tables per node will always have to
> be evaluated empirically - a proof of concept implementation, since it all
> depends on the mix of capabilities of your hardware combined with your
> specific data model, your specific data values, your specific access
> patterns, and your specific load. And it also depends on your own personal
> tolerance for degradation of latency and throughput - some people might
> find a given set of performance metrics acceptable while others might not.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
> fernando.jimenez@wealth-port.com> wrote:
>
>> Hi Tommaso
>>
>> It’s not that I _need_ a large number of tables. This approach maps
>> easily to the problem we are trying to solve, but it’s becoming clear it’s
>> not the right approach.
>>
>> At the moment I’m trying to understand the limitations in Cassandra
>> regarding number of Tables and the reasons behind it. I’ve come to the
>> email list as my Google-fu is not giving me what I’m looking for :(
>>
>> FJ
>>
>>
>>
>> On 01 Mar 2016, at 09:36, tommaso barbugli <tb...@gmail.com> wrote:
>>
>> Hi Fernando,
>>
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was
>> a real pain in terms of operations. Repairs were terribly slow, boot of C*
>> slowed down, and in general tracking table metrics became a bit more work.
>> Why do you need this high number of tables?
>>
>> Tommaso
>>
>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
>> fernando.jimenez@wealth-port.com> wrote:
>>
>>> Hi Jack
>>>
>>> By entry I mean row
>>>
>>> Apologies for the “obsolete terminology”. When I first looked at
>>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>>> defaulted to the terms I already knew. I will bear it in mind and call them
>>> tables from now on.
>>>
>>> Is there any documentation about this limit? for example, I’d be keen to
>>> know how much memory is consumed per table, and I’m also curious about the
>>> reasons for keeping this in memory. I’m trying to understand the
>>> limitations here, rather than challenge them.
>>>
>>> So far I found nothing in my search, hence why I had to resort to some
>>> “load testing” to see what happens when you push the table count high
>>>
>>> Thanks
>>> FJ
>>>
>>>
>>> On 01 Mar 2016, at 06:23, Jack Krupansky <ja...@gmail.com>
>>> wrote:
>>>
>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>
>>> You are using the obsolete terminology of CQL2 and Thrift - column
>>> family. With CQL3 you should be creating "tables". The practical
>>> recommendation of an upper limit of a few hundred tables across all key
>>> spaces remains.
>>>
>>> Technically you can go higher and technically you can reduce the
>>> overhead per table (an undocumented Jira - intentionally undocumented since
>>> it is strongly not recommended), but... it is unlikely that you will be
>>> happy with the results.
>>>
>>> What is the nature of the use case?
>>>
>>> You basically have two choices: an additional clustering column to
>>> distinguish categories of table, or a separate cluster for each few
>>> hundred tables.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
>>> fernando.jimenez@wealth-port.com> wrote:
>>>
>>>> Hi all
>>>>
>>>> I have a use case for Cassandra that would require creating a large
>>>> number of column families. I have found references to early versions of
>>>> Cassandra where each column family would require a fixed amount of memory
>>>> on all nodes, effectively imposing an upper limit on the total number of
>>>> CFs. I have also seen rumblings that this may have been fixed in later
>>>> versions.
>>>>
>>>> To put the question to rest, I have set up a DSE sandbox and created
>>>> some code to generate column families populated with 3,000 entries each.
>>>>
>>>> Unfortunately I have now hit this issue:
>>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>>
>>>> So I will have to retest against Cassandra 3.0 instead
>>>>
>>>> However, I would like to understand the limitations regarding creation
>>>> of column families.
>>>>
>>>> * Is there a practical upper limit?
>>>> * is this a fixed limit, or does it scale as more nodes are added into
>>>> the cluster?
>>>> * Is there a difference between one keyspace with thousands of column
>>>> families, vs thousands of keyspaces with only a few column families each?
>>>>
>>>> I haven’t found any hard evidence/documentation to help me here, but if
>>>> you can point me in the right direction, I will oblige and RTFM away.
>>>>
>>>> Many thanks for your help!
>>>>
>>>> Cheers
>>>> FJ
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Practical limit on number of column families

Posted by Fernando Jimenez <fe...@wealth-port.com>.
Hi Jack

Being purposefully developed to only handle up to “a few hundred” tables is reason enough. I accept that, and likely a use case with many tables was never really considered. But I would still like to understand the design choices made so perhaps we gain some confidence level in this upper limit in the number of tables. The best estimate we have so far is “a few hundred” which is a bit vague. 

Regarding scaling, I’m not talking about scaling in terms of data volume, but about how the data is structured. One thousand tables with one row each is the same data volume as one table with one thousand rows, excluding any data structures required to maintain the extra tables. But whereas the first seems likely to bring a Cassandra cluster to its knees, the second will run happily on a single node cluster in a low end machine.

We will design our code to use a single table to avoid having nightmares with this issue. But if there is any authoritative documentation on this characteristic of Cassandra, I would love to know more.

FJ


> On 01 Mar 2016, at 14:23, Jack Krupansky <ja...@gmail.com> wrote:
> 
> I don't think there are any "reasons behind it." It is simply empirical experience - as reported here.
> 
> Cassandra scales in two dimensions - number of rows per node and number of nodes. If some source of information led you to believe otherwise, please point out the source so that we can endeavor to correct it.
> 
> The exact number of rows per node and tables per node will always have to be evaluated empirically - a proof of concept implementation, since it all depends on the mix of capabilities of your hardware combined with your specific data model, your specific data values, your specific access patterns, and your specific load. And it also depends on your own personal tolerance for degradation of latency and throughput - some people might find a given set of performance metrics acceptable while others might not.
> 
> -- Jack Krupansky
> 
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com <ma...@wealth-port.com>> wrote:
> Hi Tommaso
> 
> It’s not that I _need_ a large number of tables. This approach maps easily to the problem we are trying to solve, but it’s becoming clear it’s not the right approach.
> 
> At the moment I’m trying to understand the limitations in Cassandra regarding number of Tables and the reasons behind it. I’ve come to the email list as my Google-fu is not giving me what I’m looking for :(
> 
> FJ
> 
> 
> 
>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbugli@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Hi Fernando,
>> 
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a real pain in terms of operations. Repairs were terribly slow, boot of C* slowed down, and in general tracking table metrics became a bit more work. Why do you need this high number of tables?
>> 
>> Tommaso
>> 
>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com <ma...@wealth-port.com>> wrote:
>> Hi Jack
>> 
>> By entry I mean row
>> 
>> Apologies for the “obsolete terminology”. When I first looked at Cassandra it was still on CQL2, and now that I’m looking at it again I’ve defaulted to the terms I already knew. I will bear it in mind and call them tables from now on.
>> 
>> Is there any documentation about this limit? for example, I’d be keen to know how much memory is consumed per table, and I’m also curious about the reasons for keeping this in memory. I’m trying to understand the limitations here, rather than challenge them.
>> 
>> So far I found nothing in my search, hence why I had to resort to some “load testing” to see what happens when you push the table count high
>> 
>> Thanks
>> FJ
>> 
>> 
>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupansky@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>> 
>>> You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all key spaces remains.
>>> 
>>> Technically you can go higher and technically you can reduce the overhead per table (an undocumented Jira - intentionally undocumented since it is strongly not recommended), but... it is unlikely that you will be happy with the results.
>>> 
>>> What is the nature of the use case?
>>> 
>>> You basically have two choices: an additional clustering column to distinguish categories of table, or a separate cluster for each few hundred tables.
>>> 
>>> 
>>> -- Jack Krupansky
>>> 
>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fernando.jimenez@wealth-port.com <ma...@wealth-port.com>> wrote:
>>> Hi all
>>> 
>>> I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family would require a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.
>>> 
>>> To put the question to rest, I have set up a DSE sandbox and created some code to generate column families populated with 3,000 entries each.
>>> 
>>> Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291 <https://issues.apache.org/jira/browse/CASSANDRA-9291>
>>> 
>>> So I will have to retest against Cassandra 3.0 instead
>>> 
>>> However, I would like to understand the limitations regarding creation of column families. 
>>> 
>>> 	* Is there a practical upper limit? 
>>> 	* is this a fixed limit, or does it scale as more nodes are added into the cluster? 
>>> 	* Is there a difference between one keyspace with thousands of column families, vs thousands of keyspaces with only a few column families each?
>>> 
>>> I haven’t found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.
>>> 
>>> Many thanks for your help!
>>> 
>>> Cheers
>>> FJ
>>> 
>>> 
>>> 
>> 
>> 
> 
> 


Re: Practical limit on number of column families

Posted by Jack Krupansky <ja...@gmail.com>.
I don't think there are any "reasons behind it." It is simply empirical
experience - as reported here.

Cassandra scales in two dimensions - number of rows per node and number of
nodes. If some source of information led you to believe otherwise, please
point out the source so that we can endeavor to correct it.

The exact number of rows per node and tables per node will always have to
be evaluated empirically - a proof of concept implementation, since it all
depends on the mix of capabilities of your hardware combined with your
specific data model, your specific data values, your specific access
patterns, and your specific load. And it also depends on your own personal
tolerance for degradation of latency and throughput - some people might
find a given set of performance metrics acceptable while others might not.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
fernando.jimenez@wealth-port.com> wrote:

> Hi Tommaso
>
> It’s not that I _need_ a large number of tables. This approach maps easily
> to the problem we are trying to solve, but it’s becoming clear it’s not the
> right approach.
>
> At the moment I’m trying to understand the limitations in Cassandra
> regarding number of Tables and the reasons behind it. I’ve come to the
> email list as my Google-fu is not giving me what I’m looking for :(
>
> FJ
>
>
>
> On 01 Mar 2016, at 09:36, tommaso barbugli <tb...@gmail.com> wrote:
>
> Hi Fernando,
>
> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a
> real pain in terms of operations. Repairs were terribly slow, boot of C*
> slowed down, and in general tracking table metrics became a bit more work.
> Why do you need this high number of tables?
>
> Tommaso
>
> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
> fernando.jimenez@wealth-port.com> wrote:
>
>> Hi Jack
>>
>> By entry I mean row
>>
>> Apologies for the “obsolete terminology”. When I first looked at
>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>> defaulted to the terms I already knew. I will bear it in mind and call them
>> tables from now on.
>>
>> Is there any documentation about this limit? for example, I’d be keen to
>> know how much memory is consumed per table, and I’m also curious about the
>> reasons for keeping this in memory. I’m trying to understand the
>> limitations here, rather than challenge them.
>>
>> So far I found nothing in my search, hence why I had to resort to some
>> “load testing” to see what happens when you push the table count high
>>
>> Thanks
>> FJ
>>
>>
>> On 01 Mar 2016, at 06:23, Jack Krupansky <ja...@gmail.com>
>> wrote:
>>
>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>
>> You are using the obsolete terminology of CQL2 and Thrift - column
>> family. With CQL3 you should be creating "tables". The practical
>> recommendation of an upper limit of a few hundred tables across all key
>> spaces remains.
>>
>> Technically you can go higher and technically you can reduce the overhead
>> per table (an undocumented Jira - intentionally undocumented since it is
>> strongly not recommended), but... it is unlikely that you will be happy
>> with the results.
>>
>> What is the nature of the use case?
>>
>> You basically have two choices: an additional clustering column to
>> distinguish categories of table, or a separate cluster for each few
>> hundred tables.
>>
>>
>> -- Jack Krupansky
>>
>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
>> fernando.jimenez@wealth-port.com> wrote:
>>
>>> Hi all
>>>
>>> I have a use case for Cassandra that would require creating a large
>>> number of column families. I have found references to early versions of
>>> Cassandra where each column family would require a fixed amount of memory
>>> on all nodes, effectively imposing an upper limit on the total number of
>>> CFs. I have also seen rumblings that this may have been fixed in later
>>> versions.
>>>
>>> To put the question to rest, I have set up a DSE sandbox and created some
>>> code to generate column families populated with 3,000 entries each.
>>>
>>> Unfortunately I have now hit this issue:
>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>
>>> So I will have to retest against Cassandra 3.0 instead
>>>
>>> However, I would like to understand the limitations regarding creation
>>> of column families.
>>>
>>> * Is there a practical upper limit?
>>> * is this a fixed limit, or does it scale as more nodes are added into
>>> the cluster?
>>> * Is there a difference between one keyspace with thousands of column
>>> families, vs thousands of keyspaces with only a few column families each?
>>>
>>> I haven’t found any hard evidence/documentation to help me here, but if
>>> you can point me in the right direction, I will oblige and RTFM away.
>>>
>>> Many thanks for your help!
>>>
>>> Cheers
>>> FJ
>>>
>>>
>>>
>>
>>
>
>

Re: Practical limit on number of column families

Posted by Fernando Jimenez <fe...@wealth-port.com>.
Hi Tommaso

It’s not that I _need_ a large number of tables. This approach maps easily to the problem we are trying to solve, but it’s becoming clear it’s not the right approach.

At the moment I’m trying to understand the limitations in Cassandra regarding number of Tables and the reasons behind it. I’ve come to the email list as my Google-fu is not giving me what I’m looking for :(

FJ



> On 01 Mar 2016, at 09:36, tommaso barbugli <tb...@gmail.com> wrote:
> 
> Hi Fernando,
> 
> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a real pain in terms of operations. Repairs were terribly slow, boot of C* slowed down, and in general tracking table metrics became a bit more work. Why do you need this high number of tables?
> 
> Tommaso
> 
> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <fernando.jimenez@wealth-port.com <ma...@wealth-port.com>> wrote:
> Hi Jack
> 
> By entry I mean row
> 
> Apologies for the “obsolete terminology”. When I first looked at Cassandra it was still on CQL2, and now that I’m looking at it again I’ve defaulted to the terms I already knew. I will bear it in mind and call them tables from now on.
> 
> Is there any documentation about this limit? for example, I’d be keen to know how much memory is consumed per table, and I’m also curious about the reasons for keeping this in memory. I’m trying to understand the limitations here, rather than challenge them.
> 
> So far I found nothing in my search, hence why I had to resort to some “load testing” to see what happens when you push the table count high
> 
> Thanks
> FJ
> 
> 
>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupansky@gmail.com <ma...@gmail.com>> wrote:
>> 
>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>> 
>> You are using the obsolete terminology of CQL2 and Thrift - column family. With CQL3 you should be creating "tables". The practical recommendation of an upper limit of a few hundred tables across all key spaces remains.
>> 
>> Technically you can go higher and technically you can reduce the overhead per table (an undocumented Jira - intentionally undocumented since it is strongly not recommended), but... it is unlikely that you will be happy with the results.
>> 
>> What is the nature of the use case?
>> 
>> You basically have two choices: an additional clustering column to distinguish categories of table, or a separate cluster for each few hundred tables.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <fernando.jimenez@wealth-port.com <ma...@wealth-port.com>> wrote:
>> Hi all
>> 
>> I have a use case for Cassandra that would require creating a large number of column families. I have found references to early versions of Cassandra where each column family would require a fixed amount of memory on all nodes, effectively imposing an upper limit on the total number of CFs. I have also seen rumblings that this may have been fixed in later versions.
>> 
>> To put the question to rest, I have setup a DSE sandbox and created some code to generate column families populated with 3,000 entries each.
>> 
>> Unfortunately I have now hit this issue: https://issues.apache.org/jira/browse/CASSANDRA-9291 <https://issues.apache.org/jira/browse/CASSANDRA-9291>
>> 
>> So I will have to retest against Cassandra 3.0 instead
>> 
>> However, I would like to understand the limitations regarding creation of column families. 
>> 
>> 	* Is there a practical upper limit? 
>> 	* is this a fixed limit, or does it scale as more nodes are added into the cluster? 
>> 	* Is there a difference between one keyspace with thousands of column families, vs thousands of keyspaces with only a few column families each?
>> 
>> I haven’t found any hard evidence/documentation to help me here, but if you can point me in the right direction, I will oblige and RTFM away.
>> 
>> Many thanks for your help!
>> 
>> Cheers
>> FJ
>> 
>> 
>> 
> 
> 


Re: Practical limit on number of column families

Posted by tommaso barbugli <tb...@gmail.com>.
Hi Fernando,

I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, and it was a
real pain in terms of operations. Repairs were terribly slow, C* startup
slowed down, and in general tracking table metrics became a bit more work.
Why do you need such a high number of tables?

Tommaso

On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
fernando.jimenez@wealth-port.com> wrote:

> Hi Jack
>
> By entry I mean row
>
> Apologies for the “obsolete terminology”. When I first looked at Cassandra
> it was still on CQL2, and now that I’m looking at it again I’ve defaulted
> to the terms I already knew. I will bear it in mind and call them tables
> from now on.
>
> Is there any documentation about this limit? For example, I’d be keen to
> know how much memory is consumed per table, and I’m also curious about the
> reasons for keeping this in memory. I’m trying to understand the
> limitations here, rather than challenge them.
>
> So far my search has turned up nothing, which is why I resorted to some
> “load testing” to see what happens when you push the table count high.
>
> Thanks
> FJ

Re: Practical limit on number of column families

Posted by Fernando Jimenez <fe...@wealth-port.com>.
Hi Jack

By entry I mean row

Apologies for the “obsolete terminology”. When I first looked at Cassandra it was still on CQL2, and now that I’m looking at it again I’ve defaulted to the terms I already knew. I will bear it in mind and call them tables from now on.

Is there any documentation about this limit? For example, I’d be keen to know how much memory is consumed per table, and I’m also curious about the reasons for keeping this in memory. I’m trying to understand the limitations here, rather than challenge them.

So far my search has turned up nothing, which is why I resorted to some “load testing” to see what happens when you push the table count high.

Thanks
FJ


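Jack's suggested alternative earlier in the thread, a single table with an additional key column to distinguish categories rather than one table per category, can be sketched in CQL. The `events_by_category` schema below is purely illustrative (no table or column names were given in the thread):

```sql
-- Instead of one table per category (events_login, events_purchase, ...),
-- fold the category into the partition key of a single table:
CREATE TABLE events_by_category (
    category  text,      -- replaces the per-category table name
    event_id  timeuuid,
    payload   text,
    PRIMARY KEY ((category), event_id)
);

-- Queries remain per-category by filtering on the partition key:
-- SELECT * FROM events_by_category WHERE category = 'login';
```

Each category then maps to a set of partitions rather than to a table, which sidesteps the per-table overhead discussed above, at the cost of all categories sharing one schema.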