Posted to user@cassandra.apache.org by Ben Hood <0x...@gmail.com> on 2013/11/12 10:19:12 UTC

Modeling multi-tenanted Cassandra schema

Hi,

I've just received a requirement to make a Cassandra app
multi-tenanted, where we'll have up to 100 tenants.

Most of the tables are timestamped wide-row tables with a natural
application key as the partition key and a timestamp as the
clustering key.

So I was considering the options:

(a) Add a tenant column to each table and stick a secondary index on
that column;
(b) Add a tenant column to each table and maintain index tables that
use the tenant id as a partitioning key;
(c) Decompose the partitioning key of each table and add the tenant
as the leading component of the key;
(d) Add the tenant as a separate clustering key;
(e) Replicate the schema in separate tenant-specific keyspaces;
(f) Something I may have missed;

Option (a) seems the easiest, but I'm wary of just adding secondary
indexes without thinking about it.

Option (b) seems to have the least impact on the layout of the
storage, but at the cost of maintaining each index table, both in
code and in terms of performance.

Option (c) seems quite straightforward, but I feel it might have a
significant effect on the distribution of the rows if the cardinality
of the tenants is low.
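
To make the distribution concern concrete: with a hash-based partitioner
the token is derived from the whole partition key, so a composite
(tenant, application_key) key should spread about as well as the
application keys themselves, even with only a handful of tenants. The
Python sketch below is a rough illustration under simplifying
assumptions, not Cassandra's actual placement logic: the node count,
the "tenant:key" byte encoding, and the modulo placement are all made up
for the example (a real cluster assigns token ranges to nodes).

```python
import hashlib
from collections import Counter

NODES = 6  # hypothetical cluster size, chosen only for illustration


def owner(partition_key: bytes, nodes: int = NODES) -> int:
    """Pick a node by hashing the key (simplified: real clusters use token ranges)."""
    digest = hashlib.md5(partition_key).digest()
    return int.from_bytes(digest, "big") % nodes


def spread(keys, nodes: int = NODES):
    """Count how many of the given partition keys land on each node."""
    counts = Counter(owner(k, nodes) for k in keys)
    return [counts.get(n, 0) for n in range(nodes)]


tenants = ["tenant%02d" % i for i in range(3)]  # deliberately low cardinality
app_keys = range(10_000)

# Partitioning on the tenant alone: only one distinct key per tenant.
tenant_only = spread(t.encode() for t in tenants for _ in app_keys)

# Partitioning on (tenant, application_key): one distinct key per pair.
composite = spread(f"{t}:{k}".encode() for t in tenants for k in app_keys)

print("tenant only:", tenant_only)
print("composite:  ", composite)
```

With the tenant alone as the key, at most three nodes can receive data;
with the composite key, every node gets a share of the partitions.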

Option (d) seems simple enough, but it would mean that you couldn't
query for a range of tenants without supplying a range of natural
application keys, through which you would need to iterate (under the
assumption that you don't use an ordered partitioner).

Option (e) appears relatively straightforward, but it does mean that
the application CQL client needs to maintain separate cluster
connections for each tenant. Also, I'm not sure to what extent
keyspaces were designed to partition identically structured data.

Does anybody have any experience with running a multi-tenanted
Cassandra app, or does this just depend too much on the specifics of
the application?

Cheers,

Ben

Re: Modeling multi-tenanted Cassandra schema

Posted by Ben Hood <0x...@gmail.com>.

OK, so in the end I elected to go for option (c), which makes my table
definition look like this:

create table tenanted_foo_table (
    tenant ascii,
    application_key bigint,
    timestamp timestamp,
    -- .... other non-key columns
    PRIMARY KEY ((tenant, application_key), timestamp)
);

such that on disk the row keys are effectively tenant:application_key
concatenations.
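
One consequence of this layout is that every read has to name both the
tenant and the application key, and can then slice by timestamp within
that single partition. The toy in-memory model below sketches that
access pattern in Python. It is only an illustration: the class and
method names are hypothetical, and it does not model upserts,
replication, or anything else Cassandra does.

```python
import bisect
from collections import defaultdict


class TenantedTable:
    """Toy model: partitions keyed by (tenant, application_key),
    rows inside a partition kept sorted by timestamp."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, tenant, application_key, timestamp, value):
        # The full composite key selects the partition; the timestamp orders the row.
        rows = self._partitions[(tenant, application_key)]
        bisect.insort(rows, (timestamp, value))

    def select(self, tenant, application_key, start, end):
        """Return rows with start <= timestamp < end from one partition."""
        rows = self._partitions.get((tenant, application_key), [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return rows[lo:hi]


table = TenantedTable()
table.insert("acme", 1, 10, "a")
table.insert("acme", 1, 20, "b")
table.insert("globex", 1, 15, "c")  # same application key, different tenant

print(table.select("acme", 1, 0, 100))
print(table.select("globex", 1, 0, 100))
```

The two tenants share application key 1 but never see each other's rows,
because the tenant is part of the partition key itself.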

Thanks for your input,

Ben

Re: Modeling multi-tenanted Cassandra schema

Posted by Nate McCall <na...@thelastpickle.com>.

Astyanax and/or the DS Java client depending on your use case. (Emphasis on
the "and" - really no reason you can't use both - even on the same schema -
depending on what you are doing as they both have their strengths and
weaknesses).

To be clear, Hector is not going away. We are still accepting patches and
updates, but there is no active feature development.

For any other Hector-specific questions, please start a thread over on
hector-users@googlegroups.com


-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Modeling multi-tenanted Cassandra schema

Posted by Shahab Yunus <sh...@gmail.com>.

Nate,

(slightly OT), what client API/library is recommended now that Hector is
sunsetting? Thanks.

Regards,
Shahab


Re: Modeling multi-tenanted Cassandra schema

Posted by Nate McCall <na...@thelastpickle.com>.

You basically want option (c). Option (d) might work, but you would be
bending the paradigm a bit, IMO. Certainly do not use dedicated column
families or keyspaces per tenant. That never works. The list history will
show that with a few Google searches, and we've seen it fail badly with
several clients.

Overall, option (c) would be difficult to do in CQL without some very
well-thought-out abstractions and/or a deep hack on the Java driver (not
inelegant or impossible, just lots of moving parts to get your head
around if you are new to such). That said, depending on the size of your
project and skill of your team, this direction might be worth considering.

Usergrid (just accepted for incubation at Apache) functions this way via
the Thrift API: https://github.com/apigee/usergrid-stack

The commercial version of Usergrid has "tens of thousands" of active
tenants on a single cluster (same code base at the service layer as the
open source version). It uses Hector's built-in virtual keyspaces:
https://github.com/hector-client/hector/wiki/Virtual-Keyspaces (NOTE:
though Hector is sunsetting/in patch maintenance, the approach is certainly
legitimate - but I'd recommend you *not* start a new project on Hector).
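
The general idea behind a virtual keyspace is that the client
transparently prefixes every row key with a tenant identifier, so many
tenants can share one physical column family. The Python sketch below
shows only that idea. It is not Hector's API: the class, its methods,
and the length-prefixed encoding are hypothetical choices made for the
example, and a plain dict stands in for the shared column family.

```python
import struct


class VirtualKeyspace:
    """Hypothetical wrapper that scopes every row key to one tenant."""

    def __init__(self, store: dict, tenant: str):
        self._store = store
        name = tenant.encode("utf-8")
        # Length-prefix the tenant so no tenant's prefix can be a prefix of another's.
        self._prefix = struct.pack(">H", len(name)) + name

    def put(self, key: bytes, value) -> None:
        self._store[self._prefix + key] = value

    def get(self, key: bytes, default=None):
        return self._store.get(self._prefix + key, default)

    def keys(self):
        """Row keys visible to this tenant, with the prefix stripped again."""
        n = len(self._prefix)
        return sorted(k[n:] for k in self._store if k.startswith(self._prefix))


shared = {}  # stands in for one shared column family
acme = VirtualKeyspace(shared, "acme")
globex = VirtualKeyspace(shared, "globex")

acme.put(b"user:1", "alice")
globex.put(b"user:1", "bob")

print(acme.get(b"user:1"), globex.get(b"user:1"), len(shared))
```

Application code only ever sees unprefixed keys, which is what keeps
the tenant handling out of the rest of the code base.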

In short, Usergrid is the only project I know of that has a well-proven
tenant model that functions at scale, though I'm sure there are others
around, just not open sourced or actually running large deployments.

Astyanax can do this as well, albeit with a little more work required:
https://github.com/Netflix/astyanax/wiki/Composite-columns#how-to-use-the-prefixedserializer-but-you-really-should-use-composite-columns

Happy to clarify any of the above.
