Posted to user@cassandra.apache.org by David Boxenhorn <da...@lookin2.com> on 2011/02/21 13:28:19 UTC

Distribution Factor: part of the solution to many-CF problem?

Cassandra is both distributed and replicated. We have Replication Factor but
no Distribution Factor!

Distribution Factor would define over how many nodes a CF should be
distributed.

Say you want to support millions of multi-tenant users in clusters with
thousands of nodes, where you don't know the user's schema in advance, so
you can't have users share CFs.

In this case you wouldn't want to spread out each user's Column Families
over thousands of nodes! You would want something like RF=3, DF=10, i.e.
distribute each CF over 10 nodes and, within those nodes, replicate 3 times.

One implementation of DF would be to hash the CF name, and use the same
strategies defined for RF to choose the N nodes in DF=N.
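
A rough sketch of that idea in Python, not anything Cassandra actually
implements. It assumes a ring given as a dict of token -> node name, an
MD5-based token in the style of RandomPartitioner, and a "walk the ring
clockwise" placement like the simple RF strategy. The names token, df_nodes
and replicas are made up for illustration.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed token space, in the style of RandomPartitioner


def token(value: str) -> int:
    """Map a string to a ring position using MD5 (illustrative only)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def df_nodes(ring: dict, cf_name: str, df: int) -> list:
    """Pick the DF nodes for a CF: hash the CF name, then walk the ring."""
    tokens = sorted(ring)
    count = min(df, len(tokens))
    # First node whose token is >= the CF's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token(cf_name)) % len(tokens)
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(count)]


def replicas(ring: dict, cf_name: str, key: str, df: int, rf: int) -> list:
    """Pick RF replicas for a row key, restricted to the CF's DF nodes."""
    subset = df_nodes(ring, cf_name, df)
    start = token(key) % len(subset)
    return [subset[(start + i) % len(subset)] for i in range(min(rf, len(subset)))]


# Hypothetical 100-node cluster with evenly spaced tokens.
ring = {i * (RING_SIZE // 100): f"node{i}" for i in range(100)}
print(df_nodes(ring, "tenant42_orders", 10))
print(replicas(ring, "tenant42_orders", "row-1", df=10, rf=3))
```

With RF=3 and DF=10, every row of the CF lands on 3 of the same 10 nodes, no
matter how large the cluster is.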

Re: Distribution Factor: part of the solution to many-CF problem?

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Feb 22, 2011 at 2:49 PM, Aaron Morton <aa...@thelastpickle.com> wrote:
>> The single partitioner is "baked in"
> That was my point.
>
> You could perhaps write a partitioner that considers the CF when deciding what nodes to put data on. Off the top of my head, the partitioner is not told which CF the key is being stored in.
>
> Aaron
>
> On 23/02/2011, at 6:01 AM, Edward Capriolo <ed...@gmail.com> wrote:
>
>> On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn <da...@lookin2.com> wrote:
>>> No, that's not what I mean at all.
>>>
>>> That message is about the ability to use different partitioners for
>>> different CFs, say, RandomPartitioner for one, OPP for another.
>>>
>>> I'm talking about defining how many nodes a CF should be distributed over,
>>> which would be useful if you have a lot of nodes and a lot of small CFs
>>> (small relative to the total amount of data).
>>>
>>>
>>> On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton <aa...@thelastpickle.com>
>>> wrote:
>>>>
>>>> Sounds a bit like this idea
>>>> http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html
>>>>
>>>> Aaron
>>>>
>>>> On 22/02/2011, at 1:28 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>>>
>>>>> Cassandra is both distributed and replicated. We have Replication Factor
>>>>> but no Distribution Factor!
>>>>>
>>>>> Distribution Factor would define over how many nodes a CF should be
>>>>> distributed.
>>>>>
>>>>> Say you want to support millions of multi-tenant users in clusters with
>>>>> thousands of nodes, where you don't know the user's schema in advance, so
>>>>> you can't have users share CFs.
>>>>>
>>>>> In this case you wouldn't want to spread out each user's Column Families
>>>>> over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
>>>>> distribute each CF over 10 nodes, within those nodes replicate 3 times.
>>>>>
>>>>> One implementation of DF would be to hash the CF name, and use the same
>>>>> strategies defined for RF to choose the N nodes in DF=N.
>>>>>
>>>
>>>
>>
>> The single partitioner is "baked in"
>>
>> Here is a possible solution. Use OPP, but md5 hash your keys client side.
>>
>> This solves that, but when you have keyspaces using OPP but with
>> different key distributions this falls apart.
>


Not to say that this is a bad idea, but it breaks the #1 law of
Cassandra: "keep everything balanced". The routine that calculates
natural endpoints does not take the CF into account.

Regarding multi-tenancy, I do not think there is a line in the sand
between "running N clusters" and multi-tenancy.

"Multi-tenancy" is also ambiguous, like "real time". Does multi-tenancy
mean efficiently supporting 10-20 CFs or 20,000? I do not see the
Cassandra code base supporting a very large number of CFs, since it
was designed around a low number of CFs!

Some may have moved from an RDBMS background, where a "table"
looks/works like a "columnfamily". But that is probably not
denormalized enough. Many in fact advocate "You only need 1 CF!"

Re: Distribution Factor: part of the solution to many-CF problem?

Posted by Aaron Morton <aa...@thelastpickle.com>.
> The single partitioner is "baked in"
That was my point.

You could perhaps write a partitioner that considers the CF when deciding what nodes to put data on. Off the top of my head, the partitioner is not told which CF the key is being stored in.

Aaron
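
To make the idea concrete, here is a hypothetical sketch in Python. It is not
Cassandra's partitioner interface (which, as noted, does not receive the CF).
It assumes evenly spaced node tokens and an MD5 token space of 2**127;
md5_token and cf_aware_token are invented names. Each CF's keys are confined
to an arc of the ring that starts at the hash of the CF name and spans
roughly DF nodes.

```python
import hashlib

RING_SIZE = 2 ** 127  # assumed token space for this sketch


def md5_token(value: str) -> int:
    """Map a string to a ring position using MD5 (illustrative only)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def cf_aware_token(cf_name: str, key: str, df: int, node_count: int) -> int:
    """Token for (CF, key) that stays inside the CF's arc of the ring.

    The arc covers df/node_count of the ring, which is about df nodes
    when node tokens are evenly spaced (an assumption).
    """
    arc = RING_SIZE * min(df, node_count) // node_count
    offset = md5_token(key) % arc
    return (md5_token(cf_name) + offset) % RING_SIZE


# Hypothetical usage: 1000 nodes, each CF confined to about 10 of them.
print(cf_aware_token("tenant42_orders", "row-1", df=10, node_count=1000))
```

Replica placement would still walk the ring from that token, so with RF=3 the
replicas can spill up to two nodes past the end of the arc.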

On 23/02/2011, at 6:01 AM, Edward Capriolo <ed...@gmail.com> wrote:

> On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn <da...@lookin2.com> wrote:
>> No, that's not what I mean at all.
>> 
>> That message is about the ability to use different partitioners for
>> different CFs, say, RandomPartitioner for one, OPP for another.
>> 
>> I'm talking about defining how many nodes a CF should be distributed over,
>> which would be useful if you have a lot of nodes and a lot of small CFs
>> (small relative to the total amount of data).
>> 
>> 
>> On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton <aa...@thelastpickle.com>
>> wrote:
>>> 
>>> Sounds a bit like this idea
>>> http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html
>>> 
>>> Aaron
>>> 
>>> On 22/02/2011, at 1:28 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>> 
>>>> Cassandra is both distributed and replicated. We have Replication Factor
>>>> but no Distribution Factor!
>>>> 
>>>> Distribution Factor would define over how many nodes a CF should be
>>>> distributed.
>>>> 
>>>> Say you want to support millions of multi-tenant users in clusters with
>>>> thousands of nodes, where you don't know the user's schema in advance, so
>>>> you can't have users share CFs.
>>>> 
>>>> In this case you wouldn't want to spread out each user's Column Families
>>>> over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
>>>> distribute each CF over 10 nodes, within those nodes replicate 3 times.
>>>> 
>>>> One implementation of DF would be to hash the CF name, and use the same
>>>> strategies defined for RF to choose the N nodes in DF=N.
>>>> 
>> 
>> 
> 
> The single partitioner is "baked in"
> 
> Here is a possible solution. Use OPP, but md5 hash your keys client side.
> 
> This solves that, but when you have keyspaces using OPP but with
> different key distributions this falls apart.

Re: Distribution Factor: part of the solution to many-CF problem?

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn <da...@lookin2.com> wrote:
> No, that's not what I mean at all.
>
> That message is about the ability to use different partitioners for
> different CFs, say, RandomPartitioner for one, OPP for another.
>
> I'm talking about defining how many nodes a CF should be distributed over,
> which would be useful if you have a lot of nodes and a lot of small CFs
> (small relative to the total amount of data).
>
>
> On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton <aa...@thelastpickle.com>
> wrote:
>>
>> Sounds a bit like this idea
>> http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html
>>
>> Aaron
>>
>> On 22/02/2011, at 1:28 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>
>> > Cassandra is both distributed and replicated. We have Replication Factor
>> > but no Distribution Factor!
>> >
>> > Distribution Factor would define over how many nodes a CF should be
>> > distributed.
>> >
>> > Say you want to support millions of multi-tenant users in clusters with
>> > thousands of nodes, where you don't know the user's schema in advance, so
>> > you can't have users share CFs.
>> >
>> > In this case you wouldn't want to spread out each user's Column Families
>> > over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
>> > distribute each CF over 10 nodes, within those nodes replicate 3 times.
>> >
>> > One implementation of DF would be to hash the CF name, and use the same
>> > strategies defined for RF to choose the N nodes in DF=N.
>> >
>
>

The single partitioner is "baked in"

Here is a possible solution. Use OPP, but md5 hash your keys client side.

This solves that, but when you have keyspaces using OPP but with
different key distributions this falls apart.
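
For illustration, client-side hashing could look roughly like this in Python.
The helper name hashed_row_key and the "digest:key" layout are assumptions
for the sketch, not an existing client API; the client would use the returned
string as the row key it sends to the cluster.

```python
import hashlib


def hashed_row_key(raw_key: str) -> str:
    """Prefix the key with its MD5 hex digest.

    Under an order-preserving partitioner rows sort by key, so the
    digest prefix spreads them evenly; keeping the raw key as a suffix
    makes the original value recoverable.
    """
    digest = hashlib.md5(raw_key.encode("utf-8")).hexdigest()
    return digest + ":" + raw_key


# Hypothetical usage: compute the row key before every read and write.
print(hashed_row_key("user42"))
```

The cost is that range scans no longer follow the order of the original keys
for any CF that is hashed this way.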

Re: Distribution Factor: part of the solution to many-CF problem?

Posted by David Boxenhorn <da...@lookin2.com>.
No, that's not what I mean at all.

That message is about the ability to use different partitioners for
different CFs, say, RandomPartitioner for one, OPP for another.

I'm talking about defining how many nodes a CF should be distributed over,
which would be useful if you have a lot of nodes and a lot of small CFs
(small relative to the total amount of data).


On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton <aa...@thelastpickle.com> wrote:

> Sounds a bit like this idea
> http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html
>
> Aaron
>
> On 22/02/2011, at 1:28 AM, David Boxenhorn <da...@lookin2.com> wrote:
>
> > Cassandra is both distributed and replicated. We have Replication Factor
> > but no Distribution Factor!
> >
> > Distribution Factor would define over how many nodes a CF should be
> > distributed.
> >
> > Say you want to support millions of multi-tenant users in clusters with
> > thousands of nodes, where you don't know the user's schema in advance, so
> > you can't have users share CFs.
> >
> > In this case you wouldn't want to spread out each user's Column Families
> > over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
> > distribute each CF over 10 nodes, within those nodes replicate 3 times.
> >
> > One implementation of DF would be to hash the CF name, and use the same
> > strategies defined for RF to choose the N nodes in DF=N.
> >
>

Re: Distribution Factor: part of the solution to many-CF problem?

Posted by Aaron Morton <aa...@thelastpickle.com>.
Sounds a bit like this idea http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html

Aaron

On 22/02/2011, at 1:28 AM, David Boxenhorn <da...@lookin2.com> wrote:

> Cassandra is both distributed and replicated. We have Replication Factor but no Distribution Factor!
> 
> Distribution Factor would define over how many nodes a CF should be distributed. 
> 
> Say you want to support millions of multi-tenant users in clusters with thousands of nodes, where you don't know the user's schema in advance, so you can't have users share CFs.
> 
> In this case you wouldn't want to spread out each user's Column Families over thousands of nodes! You would want something like: RF=3, DF=10 i.e. distribute each CF over 10 nodes, within those nodes replicate 3 times.
> 
> One implementation of DF would be to hash the CF name, and use the same strategies defined for RF to choose the N nodes in DF=N.
>