Posted to user@cassandra.apache.org by "Hiller, Dean" <De...@nrel.gov> on 2012/10/03 22:10:35 UTC

1000's of CF's. PlayOrm solves the Cassandra limit on #ColFamily

Okay, so it only took me two solid days, not a week.  PlayOrm's master branch now supports virtual CF's, or virtual tables, in ONE CF, so you can have 1000's or even millions of virtual CF's in one CF now.  It works with all the Scalable-SQL, works with joins, and works with the PlayOrm command line tool.

There are two ways to do it.  If you are using the ORM half, you just annotate:

@NoSqlEntity("MyVirtualCfName")
@NoSqlVirtualCf(storedInCf="sharedCf")

So it's stored in sharedCf with the table name MyVirtualCfName (in the command line tool, use MyVirtualCfName to query the table).
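
For example, a complete entity might look roughly like this (a minimal sketch: the Customer class and its fields are made up, and the @NoSqlId annotation plus the import paths are my assumptions about PlayOrm's API rather than verified code):

// Sketch only; the import paths and @NoSqlId are assumed, not verified.
import com.alvazan.orm.api.base.anno.NoSqlEntity;
import com.alvazan.orm.api.base.anno.NoSqlId;
import com.alvazan.orm.api.base.anno.NoSqlVirtualCf;

@NoSqlEntity("MyVirtualCfName")
@NoSqlVirtualCf(storedInCf="sharedCf")
public class Customer {
    @NoSqlId
    private String id;       // row key within the virtual CF

    private String name;     // ordinary column stored in sharedCf

    public String getId() { return id; }
    public String getName() { return name; }
}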

Then, if you don't know your metadata ahead of time, you instead create DboTableMeta and DboColumnMeta objects, save them for every table you create, and use TypedRow to read and persist (which is what one of our projects is doing).
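
For readers unfamiliar with the pattern, here it is in miniature (a self-contained illustration of "one metadata object per runtime table, plus generic typed rows"; these are NOT PlayOrm's classes, and all names below are mine):

import java.util.HashMap;
import java.util.Map;

// Miniature illustration of the pattern, not PlayOrm's classes: keep a
// metadata object per runtime-created table, then read and persist rows
// generically instead of through annotated entity classes.
public class RuntimeTables {

    static class TableMeta {                      // stands in for DboTableMeta
        final String virtualCf;
        final Map<String, Class<?>> columns = new HashMap<>();  // DboColumnMeta stand-in
        TableMeta(String virtualCf) { this.virtualCf = virtualCf; }
    }

    static class TypedRowLike {                   // stands in for TypedRow
        final String rowKey;
        final Map<String, Object> values = new HashMap<>();
        TypedRowLike(String rowKey) { this.rowKey = rowKey; }
    }

    public static void main(String[] args) {
        TableMeta orders = new TableMeta("orders");    // table discovered at runtime
        orders.columns.put("price", Double.class);     // column meta saved alongside it

        TypedRowLike row = new TypedRowLike("12345");
        row.values.put("price", 5.0);                  // generic write, no entity class
        System.out.println(orders.virtualCf + ":" + row.rowKey + " -> " + row.values);
    }
}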

If you try it out, let me know.  We usually get bug fixes in pretty fast if you run into anything (more and more questions are showing up on Stack Overflow as well ;) ).

Later,
Dean



Re: 1000's of CF's.

Posted by "Hiller, Dean" <De...@nrel.gov>.
I do believe they could solve this if they wanted to.  We are now streaming 5000 virtual CF's into one CF with PlayOrm.  Our plan now is to use Storm to do the processing in place of map/reduce.  Each virtual CF can also be partitioned (you choose the column that is the partition key).

So I would love to see Cassandra have a way to create a virtual CF with a row key prefix identifying that virtual CF.  Right now, however, until Cassandra has something, we are moving forward with our solution, as it seems to work great so far.  And we don't have time to wait, either.

In PlayOrm, each index of each partition has the full list of keys, so I will probably just have Storm work off the indices of every partition in the virtual CF; that way I can map/reduce a virtual CF just fine.
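
For anyone wondering what a virtual CF boils down to under the covers, the general technique is just row-key namespacing (a plain-Java sketch of the idea, not PlayOrm's actual code):

// Sketch of the general technique, not PlayOrm's actual code: namespace
// each row key with its virtual CF's name so thousands of logical
// tables can share one physical CF.
public class VirtualKeys {

    // "orders" + "12345" -> physical row key "orders:12345"
    public static String toPhysicalKey(String virtualCf, String logicalKey) {
        return virtualCf + ":" + logicalKey;
    }

    // strip the prefix back off when reading rows out of the shared CF
    public static String toLogicalKey(String virtualCf, String physicalKey) {
        return physicalKey.substring(virtualCf.length() + 1);
    }
}

A real implementation would length-prefix the name (or work in raw bytes) rather than trust a ':' delimiter, so that names containing the delimiter can't collide, but the prefix is the whole trick.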

Later,
Dean



Re: 1000's of CF's.

Posted by Vanger <di...@gmail.com>.
The main problem is that this "sweet spot" is very narrow. We can't
have lots of CFs, we can't have long rows, and we end up with an
enormous number of huge composite row keys plus stored metadata about
those keys (keep in mind the overhead of such a scheme, though it looks
like nobody really cares about that anymore). This approach is also bad
for running Hadoop jobs (for now I'm pointing at this as the main
problem for me) and for creating secondary indices (lots of rows means
high cardinality, right?), and some per-CF option could become a
limiting factor. The worst part is that this just doesn't look
extendable: you inevitably end up with 'not-so-many' big CFs, and
that's a dead end. Maybe it wouldn't look so bad if you tried not to
associate a CF with any real entity and just called them a 'random
stuff store'.
I just hope that I'm wrong and there's some good compromise between the
three ways of storing data: long rows, many 'very-composite' rows, and
partitioning by CF. Which way is preferable for running complicated
analytics queries on top of it in a fair amount of time? How do people
handle this?
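
(To make the Hadoop complaint concrete: with everything packed into one shared CF, a job over a single logical table degenerates into a scan-and-filter over the entire CF, roughly like the sketch below. It is written against the old ColumnFamilyInputFormat mapper shape from Cassandra 1.x, and the "orders:" prefix scheme is made up for illustration.)

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: with many logical tables packed into one shared CF, a job over
// ONE of them still receives every row of the CF and must filter by key
// prefix, discarding most of its input.
public class OneTableMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

    private static final String PREFIX = "orders:";   // made-up prefix scheme

    @Override
    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                    Context context) throws IOException, InterruptedException {
        String rowKey = ByteBufferUtil.string(key.duplicate());
        if (!rowKey.startsWith(PREFIX))
            return;   // wasted I/O: this row belongs to some other virtual CF
        context.write(new Text(rowKey), new LongWritable(columns.size()));
    }
}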

--
W/ best regards,
Sergey.


Re: 1000's of CF's.

Posted by Ben Hood <0x...@gmail.com>.
I'm not a Cassandra dev, so take what I say with a lot of salt, but
AFAICT there is a certain amount of overhead in maintaining a CF, so
when you have large numbers of CFs, this adds up. From a layperson's
perspective, this observation sounds reasonable, since zero-cost CFs
would be tantamount to being able to implement secondary indexes by
just adding CFs. So instead of paying for the overhead (or the
ineffectiveness of high-cardinality secondary indexes, whichever way
you want to look at it), you are expecting a free lunch by just scaling
out in terms of new CFs. I would imagine that under the covers, the
layout of Cassandra has a sweet spot of a smallish number of CFs (i.e.,
tens), but these can practically have as many rows as you like.
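
To make "secondary indexes by just adding CFs" concrete, the classic hand-rolled index pattern looks something like this with Hector (a sketch; the cluster, keyspace, and "UsersByEmail" CF names are made up for illustration):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class EmailIndexSketch {
    public static void main(String[] args) {
        // made-up cluster/keyspace wiring, just to keep the sketch self-contained
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
        Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster);

        // index row: key = the indexed value, columns = matching row keys
        // in the data CF; every write to the data CF must also maintain
        // this row by hand
        Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
        m.addInsertion("alice@example.com", "UsersByEmail",
                HFactory.createStringColumn("user123", ""));
        m.execute();
    }
}

The upkeep of that index row on every write is exactly the overhead you end up paying one way or the other.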


Re: 1000's of CF's.

Posted by Vanger <di...@gmail.com>.
So what should the solution be for a Cassandra architecture when we need to run Hadoop M/R jobs and not be restricted by the number of CFs?
What we have now is a fair number of CFs (> 2K), and this number is slowly growing, so we are already planning to merge partitioned CFs. But our next goal is to run Hadoop tasks on those CFs. All we have is plain Hector and a custom ORM on top of it. As far as I understand, VirtualKeyspace doesn't help in our case.
Also, I don't understand why support for many CFs (or built-in partitioning) isn't implemented on the Cassandra side. Can anybody explain why this can or cannot be done in Cassandra?

Just in case:
We're using Cassandra 1.0.11 on 30 nodes (planning to upgrade to 1.1.* soon).

--
W/ best regards, 
Sergey.
