Posted to user@cassandra.apache.org by "Hiller, Dean" <De...@nrel.gov> on 2012/09/27 00:13:43 UTC

1000's of column families

We are streaming data with 1 stream per 1 CF and we have 1000's of CFs.  The tools are all geared to analyzing ONE column family at a time :(.  If I remember correctly, Cassandra supports as many CFs as you want, correct?  Even though I am going to have tons of fun with limitations on the tools, correct?

(I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such).

Thanks,
Dean

Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
PlayOrm DOES support inheritance mapping, but only supports single-table mapping right now.  In fact, DboColumnMeta.java has 4 subclasses that all map to that one ColumnFamily, so we already support and heavily use the inheritance feature.
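Single-table (single-CF) inheritance mapping of the kind described above can be sketched in a few lines of plain Python. The class and column names below are hypothetical illustrations, not PlayOrm's actual API: every subclass serializes into the same store, tagged with a discriminator column.

```python
# Minimal sketch of single-table inheritance mapping: several subclasses
# share one column family, distinguished by a "type" column.
# (Hypothetical names; not PlayOrm's real classes.)
class ColumnMeta:
    def to_row(self):
        # Every subclass serializes into the same CF, tagged with its type.
        return {"type": type(self).__name__, **self.__dict__}

class StringColumnMeta(ColumnMeta):
    def __init__(self, name):
        self.name = name

class IdColumnMeta(ColumnMeta):
    def __init__(self, name, generated=True):
        self.name = name
        self.generated = generated

# Both rows land in the same "table"; the type column tells them apart.
rows = [StringColumnMeta("city").to_row(), IdColumnMeta("id").to_row()]
assert rows[0]["type"] == "StringColumnMeta"
assert rows[1]["generated"] is True
```

The drawback Marcelo mentions shows up here too: every column any subclass uses must fit in the one shared CF.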

That said, I am more concerned with scalability.  The more you stuff into a table, the more partitions you need.  As an example, I really have a choice:

Have this in a partition
device1 datapoint1
device2 datapoint1
device1 datapoint2
device2 datapoint2
device1 datapoint3

OR have just this in a partition
device1 datapoint1
device1 datapoint2
device1 datapoint3

If I use the latter approach, I can have more points for device1 in one partition.  I could use inheritance but then I can't fit as many data points for device 1 in a partition.
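The trade-off above can be sketched in plain Python (hypothetical names, purely to illustrate the two layouts): with the same partition-size budget, the mixed partition holds fewer points per device than the per-device partition.

```python
# Two candidate partition layouts for time-series points (illustrative only).
MAX_ROWS_PER_PARTITION = 4  # pretend partition-size budget

# Layout A: one shared partition interleaving all devices.
shared = [("device1", "dp1"), ("device2", "dp1"),
          ("device1", "dp2"), ("device2", "dp2")]

# Layout B: one partition per device.
per_device = {"device1": ["dp1", "dp2", "dp3", "dp4"],
              "device2": ["dp1", "dp2", "dp3", "dp4"]}

# With the same budget, the shared layout fits only 2 points for device1...
device1_in_shared = [p for d, p in shared if d == "device1"]
assert len(device1_in_shared) == 2
# ...while the per-device layout fits 4 points for device1 alone.
assert len(per_device["device1"]) == MAX_ROWS_PER_PARTITION
```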

Does that make more sense?

Later,
Dean


From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, September 27, 2012 8:45 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: 1000's of column families

Dean,

     In the relational world, I was used to Hibernate and O/R mapping. There were times when I used 3 classes (2 inheriting from another one) and mapped all of them to 1 table. The common part was in the superclass and each subclass had its own columns. The table, however, had to have all the columns, and this design was hard because of that, as creating more subclasses would need changes in the table.
     However, if you use PlayOrm, and if PlayOrm has/had a feature to allow inheritance mapping to a CF, it would solve your problem, wouldn't it? Of course it is probably much harder than it might appear... :D

Best regards,
Marcelo Valle.

2012/9/27 Hiller, Dean <De...@nrel.gov>>
We have 1000's of different building devices and we stream data from these devices.  The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables.  Every device is unique and streams its own data.  We dynamically discover devices and register them.  Basically, one CF or table per thing really makes sense in this environment.  While we could try to find out which devices "are" similar, this would really be a pain, and some devices add some new variable into the equation.  NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups.  We dynamically create CF's on the fly as people register new datasets.

On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions.  We could create one CF, put in all the data, and have a partition per device, but then a time partition would contain "multiple" devices of data, meaning we need to shrink our time-partition size; whereas if we have a CF per device, the time partition can be larger, as it is only for that one device.

THEN, on top of that, we have a meta CF for these devices: some people want to query for streams that match criteria, which returns a CF name, and then they query that CF, so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where xxxxx. Which we can do today.
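That two-step lookup (resolve a CF name from a meta CF, then query the resolved CF) can be mimicked with plain dictionaries. The CF names, criteria string, and helper below are hypothetical stand-ins, not a real Cassandra driver API:

```python
# Simulated keyspace: a meta CF maps search criteria to a data-CF name,
# then a second query reads from that CF.  Illustrative stand-in for
# "select cfName from Meta where x = y; select * from cfName where ...".
keyspace = {
    "Meta": {"building=A,sensor=co2": "device42_stream"},
    "device42_stream": {"t1": {"co2_pct": 0.04}, "t2": {"co2_pct": 0.05}},
}

def query_stream(criteria):
    cf_name = keyspace["Meta"][criteria]   # step 1: resolve the CF name
    return keyspace[cf_name]               # step 2: query that CF

rows = query_stream("building=A,sensor=co2")
assert rows["t1"]["co2_pct"] == 0.04
```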

Dean

From: Marcelo Elias Del Valle <mv...@gmail.com>>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>>" <us...@cassandra.apache.org>>>
Date: Thursday, September 27, 2012 8:01 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>>" <us...@cassandra.apache.org>>>
Subject: Re: 1000's of column families

Out of curiosity, is it really necessary to have that amount of CFs?
I am probably still used to relational databases, where you would use a new table whenever you need to store different kinds of data. Since Cassandra can store anything in each CF, it might make sense to have a lot of CFs for your data...
But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique from the answer.

[]s

2012/9/26 Hiller, Dean <De...@nrel.gov>>>
We are streaming data with 1 stream per 1 CF and we have 1000's of CFs.  The tools are all geared to analyzing ONE column family at a time :(.  If I remember correctly, Cassandra supports as many CFs as you want, correct?  Even though I am going to have tons of fun with limitations on the tools, correct?

(I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such).

Thanks,
Dean



--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr



--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: 1000's of column families

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
Dean,

     In the relational world, I was used to Hibernate and O/R mapping.
There were times when I used 3 classes (2 inheriting from another one)
and mapped all of them to 1 table. The common part was in the
superclass and each subclass had its own columns. The table, however,
had to have all the columns, and this design was hard because of that,
as creating more subclasses would need changes in the table.
     However, if you use PlayOrm, and if PlayOrm has/had a feature to
allow inheritance mapping to a CF, it would solve your problem,
wouldn't it? Of course it is probably much harder than it might
appear... :D

Best regards,
Marcelo Valle.

2012/9/27 Hiller, Dean <De...@nrel.gov>

> We have 1000's of different building devices and we stream data from these
> devices.  The format and data from each one varies so one device has
> temperature at timeX with some other variables, another device has CO2
> percentage and other variables.  Every device is unique and streams it's
> own data.  We dynamically discover devices and register them.  Basically,
> one CF or table per thing really makes sense in this environment.  While we
> could try to find out which devices "are" similar, this would really be a
> pain and some devices add some new variable into the equation.  NOT only
> that but researchers can register new datasets and upload them as well and
> each dataset they have they do NOT want to share with other researches
> necessarily so we have security groups and each CF belongs to security
> groups.  We dynamically create CF's on the fly as people register new
> datasets.
>
> On top of that, when the data sets get too large, we probably want to
> partition a single CF into time partitions.  We could create one CF and put
> all the data and have a partition per device, but then a time partition
> will contain "multiple" devices of data meaning we need to shrink our time
> partition size where if we have CF per device, the time partition can be
> larger as it is only for that one device.
>
> THEN, on top of that, we have a meta CF for these devices so some people
> want to query for streams that match criteria AND which returns a CF name
> and they query that CF name so we almost need a query with variables like
> select cfName from Meta where x = y and then select * from cfName where
> xxxxx. Which we can do today.
>
> Dean
>
> From: Marcelo Elias Del Valle <mvallebr@gmail.com<mailto:
> mvallebr@gmail.com>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Date: Thursday, September 27, 2012 8:01 AM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Subject: Re: 1000's of column families
>
> Out of curiosity, is it really necessary to have that amount of CFs?
> I am probably still used to relational databases, where you would use a
> new table just in case you need to store different kinds of data. As
> Cassandra stores anything in each CF, it might probably make sense to have
> a lot of CFs to store your data...
> But why wouldn't you use a single CF with partitions in this case?
> Wouldn't it be the same thing? I am asking because I might learn a new
> modeling technique with the answer.
>
> []s
>
> 2012/9/26 Hiller, Dean <De...@nrel.gov>>
> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
>  When using the tools they are all geared to analyzing ONE column family at
> a time :(.  If I remember correctly, Cassandra supports as many CF's as you
> want, correct?  Even though I am going to have tons of fun with
> limitations on the tools, correct?
>
> (I may end up wrapping the node tool with my own aggregate calls if needed
> to sum up multiple column families and such).
>
> Thanks,
> Dean
>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: 1000's of column families

Posted by Robin Verlangen <ro...@us2.nl>.
I think I misunderstood your "all data in one location" note. I thought you
meant to store it all in one CF.

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E robin@us2.nl

<http://goo.gl/Lt7BC>

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/28 Hiller, Dean <De...@nrel.gov>

> I thought someone was saying each column family added to RAM on every node
> not RAM on a single node.  It adds RAM on every node???  So eventually, I
> will run out?  Was that person wrong?  This would mean adding nodes does
> not help if he is right.  Can anyone confirm this?
>
> Thanks,
> Dean
>
> From: Robin Verlangen <ro...@us2.nl>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Date: Thursday, September 27, 2012 11:52 PM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Subject: Re: 1000's of column families
>
> "so if you add up all the applications
> which would be huge and then all the tables which is large, it just keeps
> growing.  It is a very nice concept(all data in one location), though we
> will see how implementing it goes."
>
> This shouldn't be a real problem for Cassandra. Just add more nodes and
> every node contains a smaller piece of the cake (the ring).
>
> Best regards,
>
> Robin Verlangen
> Software engineer
>
> W http://www.robinverlangen.nl
> E robin@us2.nl<ma...@us2.nl>
>
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>
>
> 2012/9/27 Hiller, Dean <De...@nrel.gov>>
> Unfortunately, the security aspect is very strict.  Some make their data
> public but there are many projects where due to client contracts, they
> cannot make their data public within our company(ie. Other groups in our
> company are not allowed to see the data).
>
> Also, currently, we have researchers upload their own datasets as well.
> Ideally, they want to see this one noSQL store as the place where all data
> for the company lives…ALL of it so if you add up all the applications
> which would be huge and then all the tables which is large, it just keeps
> growing.  It is a very nice concept(all data in one location), though we
> will see how implementing it goes.
>
> How much overhead per column family in RAM?  So far we have around 4000
> Cfs with no issue that I see yet.
>
> Dean
>
> On 9/27/12 11:10 AM, "Aaron Turner" <synfinatic@gmail.com<mailto:
> synfinatic@gmail.com>> wrote:
>
> >On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean <Dean.Hiller@nrel.gov
> <ma...@nrel.gov>>
> >wrote:
> >> We have 1000's of different building devices and we stream data from
> >>these devices.  The format and data from each one varies so one device
> >>has temperature at timeX with some other variables, another device has
> >>CO2 percentage and other variables.  Every device is unique and streams
> >>it's own data.  We dynamically discover devices and register them.
> >>Basically, one CF or table per thing really makes sense in this
> >>environment.  While we could try to find out which devices "are"
> >>similar, this would really be a pain and some devices add some new
> >>variable into the equation.  NOT only that but researchers can register
> >>new datasets and upload them as well and each dataset they have they do
> >>NOT want to share with other researches necessarily so we have security
> >>groups and each CF belongs to security groups.  We dynamically create
> >>CF's on the fly as people register new datasets.
> >>
> >> On top of that, when the data sets get too large, we probably want to
> >>partition a single CF into time partitions.  We could create one CF and
> >>put all the data and have a partition per device, but then a time
> >>partition will contain "multiple" devices of data meaning we need to
> >>shrink our time partition size where if we have CF per device, the time
> >>partition can be larger as it is only for that one device.
> >>
> >> THEN, on top of that, we have a meta CF for these devices so some
> >>people want to query for streams that match criteria AND which returns a
> >>CF name and they query that CF name so we almost need a query with
> >>variables like select cfName from Meta where x = y and then select *
> >>from cfName where xxxxx. Which we can do today.
> >
> >How strict are your security requirements?  If it wasn't for that,
> >you'd be much better off storing data on a per-statistic basis than
> >per-device.  Hell, you could store everything in a single CF by using
> >a composite row key:
> >
> ><devicename>|<stat type>|<instance>
> >
> >But yeah, there isn't a hard limit for the number of CF's, but there
> >is overhead associated with each one and so I wouldn't consider your
> >design as scalable.  Generally speaking, hundreds are ok, but
> >thousands is pushing it.
> >
> >
> >
> >--
> >Aaron Turner
> >http://synfin.net/         Twitter: @synfinatic
> >http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix &
> >Windows
> >Those who would give up essential Liberty, to purchase a little temporary
> >Safety, deserve neither Liberty nor Safety.
> >    -- Benjamin Franklin
> >"carpe diem quam minimum credula postero"
>
>
>

Re: 1000's of column families

Posted by Aaron Turner <sy...@gmail.com>.
Yeah, you can't scale the number of CF's by adding new nodes to a
cluster.  You have to create multiple clusters.

Anyways, I was thinking about your problem, and the solution seems to
be to give each team/project their own CF and have them use composite
row keys as I wrote about earlier.  Yes, that may mean you store the
data for the same node multiple times, but that's pretty typical with
Cassandra, where you're de-normalizing your data to meet your query
needs, and Cassandra does seem to scale that way.  Also, if you're not
already using compression, you should.  My experience with compression
and time-series data has been pretty amazing, especially with my CFs
where I store a day's worth of data in a single column as a vector.
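Storing a day's samples as a single column value can be sketched as follows. The encoding (fixed-width doubles plus zlib) is a hypothetical illustration; Cassandra's own compression operates at the SSTable level, but the point about smooth time-series data compressing well holds either way:

```python
import struct
import zlib

# One day of float samples for a device, packed into a single binary blob.
samples = [20.0, 20.5, 21.0, 20.5] * 360  # one reading per minute, 1440 total
blob = struct.pack(f"{len(samples)}d", *samples)

compressed = zlib.compress(blob)
# Repetitive/smooth time-series data typically compresses well.
assert len(compressed) < len(blob)

# Round-trip: the whole vector is recoverable from the one column value.
restored = struct.unpack(f"{len(samples)}d", zlib.decompress(compressed))
assert list(restored) == samples
```

Reading a day then costs one column fetch instead of 1440 row reads, which is much of where the win comes from.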

That gives you the best of both worlds: You get your per-team security
and I'd assume (ha!) that would dramatically reduce the number of CF's
you have to deal with since it's per-team/project and not per-device.

On Fri, Sep 28, 2012 at 12:14 PM, Hiller, Dean <De...@nrel.gov> wrote:
> I thought someone was saying each column family added to RAM on every node not RAM on a single node.  It adds RAM on every node???  So eventually, I will run out?  Was that person wrong?  This would mean adding nodes does not help if he is right.  Can anyone confirm this?
>
> Thanks,
> Dean
>
> From: Robin Verlangen <ro...@us2.nl>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Date: Thursday, September 27, 2012 11:52 PM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Subject: Re: 1000's of column families
>
> "so if you add up all the applications
> which would be huge and then all the tables which is large, it just keeps
> growing.  It is a very nice concept(all data in one location), though we
> will see how implementing it goes."
>
> This shouldn't be a real problem for Cassandra. Just add more nodes and every node contains a smaller piece of the cake (the ring).
>
> Best regards,
>
> Robin Verlangen
> Software engineer
>
> W http://www.robinverlangen.nl
> E robin@us2.nl<ma...@us2.nl>
>
>
> Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
>
>
>
> 2012/9/27 Hiller, Dean <De...@nrel.gov>>
> Unfortunately, the security aspect is very strict.  Some make their data
> public but there are many projects where due to client contracts, they
> cannot make their data public within our company(ie. Other groups in our
> company are not allowed to see the data).
>
> Also, currently, we have researchers upload their own datasets as well.
> Ideally, they want to see this one noSQL store as the place where all data
> for the company lives…ALL of it so if you add up all the applications
> which would be huge and then all the tables which is large, it just keeps
> growing.  It is a very nice concept(all data in one location), though we
> will see how implementing it goes.
>
> How much overhead per column family in RAM?  So far we have around 4000
> Cfs with no issue that I see yet.
>
> Dean
>
> On 9/27/12 11:10 AM, "Aaron Turner" <sy...@gmail.com>> wrote:
>
>>On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean <De...@nrel.gov>>
>>wrote:
>>> We have 1000's of different building devices and we stream data from
>>>these devices.  The format and data from each one varies so one device
>>>has temperature at timeX with some other variables, another device has
>>>CO2 percentage and other variables.  Every device is unique and streams
>>>it's own data.  We dynamically discover devices and register them.
>>>Basically, one CF or table per thing really makes sense in this
>>>environment.  While we could try to find out which devices "are"
>>>similar, this would really be a pain and some devices add some new
>>>variable into the equation.  NOT only that but researchers can register
>>>new datasets and upload them as well and each dataset they have they do
>>>NOT want to share with other researches necessarily so we have security
>>>groups and each CF belongs to security groups.  We dynamically create
>>>CF's on the fly as people register new datasets.
>>>
>>> On top of that, when the data sets get too large, we probably want to
>>>partition a single CF into time partitions.  We could create one CF and
>>>put all the data and have a partition per device, but then a time
>>>partition will contain "multiple" devices of data meaning we need to
>>>shrink our time partition size where if we have CF per device, the time
>>>partition can be larger as it is only for that one device.
>>>
>>> THEN, on top of that, we have a meta CF for these devices so some
>>>people want to query for streams that match criteria AND which returns a
>>>CF name and they query that CF name so we almost need a query with
>>>variables like select cfName from Meta where x = y and then select *
>>>from cfName where xxxxx. Which we can do today.
>>
>>How strict are your security requirements?  If it wasn't for that,
>>you'd be much better off storing data on a per-statistic basis than
>>per-device.  Hell, you could store everything in a single CF by using
>>a composite row key:
>>
>><devicename>|<stat type>|<instance>
>>
>>But yeah, there isn't a hard limit for the number of CF's, but there
>>is overhead associated with each one and so I wouldn't consider your
>>design as scalable.  Generally speaking, hundreds are ok, but
>>thousands is pushing it.
>>
>>
>>
>>--
>>Aaron Turner
>>http://synfin.net/         Twitter: @synfinatic
>>http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix &
>>Windows
>>Those who would give up essential Liberty, to purchase a little temporary
>>Safety, deserve neither Liberty nor Safety.
>>    -- Benjamin Franklin
>>"carpe diem quam minimum credula postero"
>
>



-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
I thought someone was saying each column family adds RAM overhead on every node, not RAM on a single node.  It adds RAM on every node???  So eventually, I will run out?  Was that person wrong?  If he is right, this would mean adding nodes does not help.  Can anyone confirm this?

Thanks,
Dean

From: Robin Verlangen <ro...@us2.nl>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, September 27, 2012 11:52 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: 1000's of column families

"so if you add up all the applications
which would be huge and then all the tables which is large, it just keeps
growing.  It is a very nice concept(all data in one location), though we
will see how implementing it goes."

This shouldn't be a real problem for Cassandra. Just add more nodes and every node contains a smaller piece of the cake (the ring).

Best regards,

Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E robin@us2.nl<ma...@us2.nl>


Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.



2012/9/27 Hiller, Dean <De...@nrel.gov>>
Unfortunately, the security aspect is very strict.  Some make their data
public but there are many projects where due to client contracts, they
cannot make their data public within our company(ie. Other groups in our
company are not allowed to see the data).

Also, currently, we have researchers upload their own datasets as well.
Ideally, they want to see this one noSQL store as the place where all data
for the company lives…ALL of it so if you add up all the applications
which would be huge and then all the tables which is large, it just keeps
growing.  It is a very nice concept(all data in one location), though we
will see how implementing it goes.

How much overhead per column family in RAM?  So far we have around 4000
CFs with no issue that I can see yet.

Dean

On 9/27/12 11:10 AM, "Aaron Turner" <sy...@gmail.com>> wrote:

>On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean <De...@nrel.gov>>
>wrote:
>> We have 1000's of different building devices and we stream data from
>>these devices.  The format and data from each one varies so one device
>>has temperature at timeX with some other variables, another device has
>>CO2 percentage and other variables.  Every device is unique and streams
>>it's own data.  We dynamically discover devices and register them.
>>Basically, one CF or table per thing really makes sense in this
>>environment.  While we could try to find out which devices "are"
>>similar, this would really be a pain and some devices add some new
>>variable into the equation.  NOT only that but researchers can register
>>new datasets and upload them as well and each dataset they have they do
>>NOT want to share with other researches necessarily so we have security
>>groups and each CF belongs to security groups.  We dynamically create
>>CF's on the fly as people register new datasets.
>>
>> On top of that, when the data sets get too large, we probably want to
>>partition a single CF into time partitions.  We could create one CF and
>>put all the data and have a partition per device, but then a time
>>partition will contain "multiple" devices of data meaning we need to
>>shrink our time partition size where if we have CF per device, the time
>>partition can be larger as it is only for that one device.
>>
>> THEN, on top of that, we have a meta CF for these devices so some
>>people want to query for streams that match criteria AND which returns a
>>CF name and they query that CF name so we almost need a query with
>>variables like select cfName from Meta where x = y and then select *
>>from cfName where xxxxx. Which we can do today.
>
>How strict are your security requirements?  If it wasn't for that,
>you'd be much better off storing data on a per-statistic basis than
>per-device.  Hell, you could store everything in a single CF by using
>a composite row key:
>
><devicename>|<stat type>|<instance>
>
>But yeah, there isn't a hard limit for the number of CF's, but there
>is overhead associated with each one and so I wouldn't consider your
>design as scalable.  Generally speaking, hundreds are ok, but
>thousands is pushing it.
>
>
>
>--
>Aaron Turner
>http://synfin.net/         Twitter: @synfinatic
>http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix &
>Windows
>Those who would give up essential Liberty, to purchase a little temporary
>Safety, deserve neither Liberty nor Safety.
>    -- Benjamin Franklin
>"carpe diem quam minimum credula postero"



Re: 1000's of column families

Posted by Robin Verlangen <ro...@us2.nl>.
"so if you add up all the applications
which would be huge and then all the tables which is large, it just keeps
growing.  It is a very nice concept(all data in one location), though we
will see how implementing it goes."

This shouldn't be a real problem for Cassandra. Just add more nodes and
every node contains a smaller piece of the cake (the ring).

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E robin@us2.nl

<http://goo.gl/Lt7BC>

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/27 Hiller, Dean <De...@nrel.gov>

> Unfortunately, the security aspect is very strict.  Some make their data
> public but there are many projects where due to client contracts, they
> cannot make their data public within our company(ie. Other groups in our
> company are not allowed to see the data).
>
> Also, currently, we have researchers upload their own datasets as well.
> Ideally, they want to see this one noSQL store as the place where all data
> for the company lives…ALL of it so if you add up all the applications
> which would be huge and then all the tables which is large, it just keeps
> growing.  It is a very nice concept(all data in one location), though we
> will see how implementing it goes.
>
> How much overhead per column family in RAM?  So far we have around 4000
> Cfs with no issue that I see yet.
>
> Dean
>
> On 9/27/12 11:10 AM, "Aaron Turner" <sy...@gmail.com> wrote:
>
> >On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean <De...@nrel.gov>
> >wrote:
> >> We have 1000's of different building devices and we stream data from
> >>these devices.  The format and data from each one varies so one device
> >>has temperature at timeX with some other variables, another device has
> >>CO2 percentage and other variables.  Every device is unique and streams
> >>it's own data.  We dynamically discover devices and register them.
> >>Basically, one CF or table per thing really makes sense in this
> >>environment.  While we could try to find out which devices "are"
> >>similar, this would really be a pain and some devices add some new
> >>variable into the equation.  NOT only that but researchers can register
> >>new datasets and upload them as well and each dataset they have they do
> >>NOT want to share with other researches necessarily so we have security
> >>groups and each CF belongs to security groups.  We dynamically create
> >>CF's on the fly as people register new datasets.
> >>
> >> On top of that, when the data sets get too large, we probably want to
> >>partition a single CF into time partitions.  We could create one CF and
> >>put all the data and have a partition per device, but then a time
> >>partition will contain "multiple" devices of data meaning we need to
> >>shrink our time partition size where if we have CF per device, the time
> >>partition can be larger as it is only for that one device.
> >>
> >> THEN, on top of that, we have a meta CF for these devices so some
> >>people want to query for streams that match criteria AND which returns a
> >>CF name and they query that CF name so we almost need a query with
> >>variables like select cfName from Meta where x = y and then select *
> >>from cfName where xxxxx. Which we can do today.
> >
> >How strict are your security requirements?  If it wasn't for that,
> >you'd be much better off storing data on a per-statistic basis than
> >per-device.  Hell, you could store everything in a single CF by using
> >a composite row key:
> >
> ><devicename>|<stat type>|<instance>
> >
> >But yeah, there isn't a hard limit for the number of CF's, but there
> >is overhead associated with each one and so I wouldn't consider your
> >design as scalable.  Generally speaking, hundreds are ok, but
> >thousands is pushing it.
> >
> >
> >
>
>

Re: 1000's of column families

Posted by Aaron Turner <sy...@gmail.com>.
On Thu, Sep 27, 2012 at 7:35 PM, Marcelo Elias Del Valle
<mv...@gmail.com> wrote:
>
>
> 2012/9/27 Aaron Turner <sy...@gmail.com>
>>
>> How strict are your security requirements?  If it wasn't for that,
>> you'd be much better off storing data on a per-statistic basis then
>> per-device.  Hell, you could store everything in a single CF by using
>> a composite row key:
>>
>> <devicename>|<stat type>|<instance>
>>
>> But yeah, there isn't a hard limit for the number of CF's, but there
>> is overhead associated with each one and so I wouldn't consider your
>> design as scalable.  Generally speaking, hundreds are ok, but
>> thousands is pushing it.
>
>
> Aaron,
>
>     Imagine that instead of using a composite key in this case, you use a
> simple row key "instance_uuid". Then, to index data by devicename |
> start_type|instance you use another CF with this composite key or several
> CFs to index it.
>     Do you see any drawbacks in terms of performance?

Really that depends on the client side, I think.  Ideally, you'd like
the client to be able to directly access the row by name without
looking it up in some index.  Basically, if you have to look up the
instance_uuid, that's another call to some datastore, which takes more
time than generating the row key from its composites.  At least that's
my opinion...

Of course there are times where using an instance_uuid makes a lot of
sense... like if you rename a device and want all your stats to move
to the new name.  Much easier to just update the mapping record than
to read & rewrite all your rows for that device.

In my project, we use a device_uuid (just a primary key stored in an
Oracle DB... long story!), but everything else is by name in our
composite row keys.


-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"
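[Editorial note: the trade-off Aaron describes, direct composite keys versus a uuid indirection, can be sketched with a toy in-memory model. This is plain Python, not a Cassandra client; names like `stats` and `uuid_of` are illustrative only.]

```python
import uuid

# Strategy A: composite row key -- the client computes the key directly,
# no extra datastore call needed.
def composite_key(device, stat_type, instance):
    return f"{device}|{stat_type}|{instance}"

# Strategy B: an indirection table maps the composite name to a uuid.
# Reading a stat costs one extra lookup, but renaming a device only
# rewrites the mapping, not every data row.
uuid_of = {}  # (device, stat_type, instance) -> row uuid
stats = {}    # row uuid -> data

def write_stat_b(device, stat_type, instance, value):
    name = (device, stat_type, instance)
    row = uuid_of.setdefault(name, uuid.uuid4().hex)
    stats[row] = value

def rename_device_b(old, new):
    # One pass over the small mapping; the data rows are untouched.
    for (dev, st, inst) in list(uuid_of):
        if dev == old:
            uuid_of[(new, st, inst)] = uuid_of.pop((dev, st, inst))

write_stat_b("router1", "temp", "0", 21.5)
rename_device_b("router1", "core-router1")
print(stats[uuid_of[("core-router1", "temp", "0")]])  # 21.5
```

The sketch only illustrates why the indirection makes renames cheap while adding a lookup to every read.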

Re: 1000's of column families

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
2012/9/27 Aaron Turner <sy...@gmail.com>
>
> How strict are your security requirements?  If it wasn't for that,
> you'd be much better off storing data on a per-statistic basis then
> per-device.  Hell, you could store everything in a single CF by using
> a composite row key:
>
> <devicename>|<stat type>|<instance>
>
> But yeah, there isn't a hard limit for the number of CF's, but there
> is overhead associated with each one and so I wouldn't consider your
> design as scalable.  Generally speaking, hundreds are ok, but
> thousands is pushing it.


Aaron,

    Imagine that instead of using a composite key in this case, you use a
simple row key "instance_uuid". Then, to index data by
devicename|stat_type|instance, you use another CF with this composite key,
or several CFs, to index it.
    Do you see any drawbacks in terms of performance?

Best regards,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: 1000's of column families

Posted by Edward Capriolo <ed...@gmail.com>.
Hector also offers support for 'Virtual Keyspaces' which you might
want to look at.


On Thu, Sep 27, 2012 at 1:10 PM, Aaron Turner <sy...@gmail.com> wrote:
> On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean <De...@nrel.gov> wrote:
>> We have 1000's of different building devices and we stream data from these devices.  The format and data from each one varies so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables.  Every device is unique and streams it's own data.  We dynamically discover devices and register them.  Basically, one CF or table per thing really makes sense in this environment.  While we could try to find out which devices "are" similar, this would really be a pain and some devices add some new variable into the equation.  NOT only that but researchers can register new datasets and upload them as well and each dataset they have they do NOT want to share with other researches necessarily so we have security groups and each CF belongs to security groups.  We dynamically create CF's on the fly as people register new datasets.
>>
>> On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions.  We could create one CF and put all the data and have a partition per device, but then a time partition will contain "multiple" devices of data meaning we need to shrink our time partition size where if we have CF per device, the time partition can be larger as it is only for that one device.
>>
>> THEN, on top of that, we have a meta CF for these devices so some people want to query for streams that match criteria AND which returns a CF name and they query that CF name so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where xxxxx. Which we can do today.
>
> How strict are your security requirements?  If it wasn't for that,
> you'd be much better off storing data on a per-statistic basis then
> per-device.  Hell, you could store everything in a single CF by using
> a composite row key:
>
> <devicename>|<stat type>|<instance>
>
> But yeah, there isn't a hard limit for the number of CF's, but there
> is overhead associated with each one and so I wouldn't consider your
> design as scalable.  Generally speaking, hundreds are ok, but
> thousands is pushing it.
>
>
>

Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
Unfortunately, the security aspect is very strict.  Some make their data
public, but there are many projects where, due to client contracts, they
cannot make their data public within our company (i.e., other groups in
our company are not allowed to see the data).

Also, currently, we have researchers upload their own datasets as well.
Ideally, they want to see this one NoSQL store as the place where all
data for the company lives…ALL of it, so if you add up all the
applications, which would be huge, and then all the tables, which is
large, it just keeps growing.  It is a very nice concept (all data in
one location), though we will see how implementing it goes.

How much overhead is there per column family in RAM?  So far we have
around 4000 CFs with no issues that I see yet.

Dean

On 9/27/12 11:10 AM, "Aaron Turner" <sy...@gmail.com> wrote:

>On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean <De...@nrel.gov>
>wrote:
>> We have 1000's of different building devices and we stream data from
>>these devices.  The format and data from each one varies so one device
>>has temperature at timeX with some other variables, another device has
>>CO2 percentage and other variables.  Every device is unique and streams
>>it's own data.  We dynamically discover devices and register them.
>>Basically, one CF or table per thing really makes sense in this
>>environment.  While we could try to find out which devices "are"
>>similar, this would really be a pain and some devices add some new
>>variable into the equation.  NOT only that but researchers can register
>>new datasets and upload them as well and each dataset they have they do
>>NOT want to share with other researches necessarily so we have security
>>groups and each CF belongs to security groups.  We dynamically create
>>CF's on the fly as people register new datasets.
>>
>> On top of that, when the data sets get too large, we probably want to
>>partition a single CF into time partitions.  We could create one CF and
>>put all the data and have a partition per device, but then a time
>>partition will contain "multiple" devices of data meaning we need to
>>shrink our time partition size where if we have CF per device, the time
>>partition can be larger as it is only for that one device.
>>
>> THEN, on top of that, we have a meta CF for these devices so some
>>people want to query for streams that match criteria AND which returns a
>>CF name and they query that CF name so we almost need a query with
>>variables like select cfName from Meta where x = y and then select *
>>from cfName where xxxxx. Which we can do today.
>
>How strict are your security requirements?  If it wasn't for that,
>you'd be much better off storing data on a per-statistic basis then
>per-device.  Hell, you could store everything in a single CF by using
>a composite row key:
>
><devicename>|<stat type>|<instance>
>
>But yeah, there isn't a hard limit for the number of CF's, but there
>is overhead associated with each one and so I wouldn't consider your
>design as scalable.  Generally speaking, hundreds are ok, but
>thousands is pushing it.
>
>
>


Re: 1000's of column families

Posted by Aaron Turner <sy...@gmail.com>.
On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean <De...@nrel.gov> wrote:
> We have 1000's of different building devices and we stream data from these devices.  The format and data from each one varies so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables.  Every device is unique and streams it's own data.  We dynamically discover devices and register them.  Basically, one CF or table per thing really makes sense in this environment.  While we could try to find out which devices "are" similar, this would really be a pain and some devices add some new variable into the equation.  NOT only that but researchers can register new datasets and upload them as well and each dataset they have they do NOT want to share with other researches necessarily so we have security groups and each CF belongs to security groups.  We dynamically create CF's on the fly as people register new datasets.
>
> On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions.  We could create one CF and put all the data and have a partition per device, but then a time partition will contain "multiple" devices of data meaning we need to shrink our time partition size where if we have CF per device, the time partition can be larger as it is only for that one device.
>
> THEN, on top of that, we have a meta CF for these devices so some people want to query for streams that match criteria AND which returns a CF name and they query that CF name so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where xxxxx. Which we can do today.

How strict are your security requirements?  If it wasn't for that,
you'd be much better off storing data on a per-statistic basis than
per-device.  Hell, you could store everything in a single CF by using
a composite row key:

<devicename>|<stat type>|<instance>

But yeah, there isn't a hard limit for the number of CF's, but there
is overhead associated with each one and so I wouldn't consider your
design as scalable.  Generally speaking, hundreds are ok, but
thousands is pushing it.



-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"
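[Editorial note: the composite-row-key scheme Aaron suggests can be sketched in a few lines. This is a plain-Python illustration, not Cassandra API; the separator choice and field names are assumptions.]

```python
def make_row_key(device, stat_type, instance, sep="|"):
    parts = (device, stat_type, instance)
    # The separator must not occur inside any component, or parsing breaks.
    if any(sep in p for p in parts):
        raise ValueError("component contains the separator")
    return sep.join(parts)

def parse_row_key(key, sep="|"):
    device, stat_type, instance = key.split(sep)
    return device, stat_type, instance

key = make_row_key("bldg7-hvac", "co2_ppm", "sensor3")
print(key)                 # bldg7-hvac|co2_ppm|sensor3
print(parse_row_key(key))  # ('bldg7-hvac', 'co2_ppm', 'sensor3')
```

With a key like this, every device's statistics live in one CF, and the client can compute the row key directly instead of maintaining thousands of CFs.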

Re: 1000's of column families

Posted by Flavio Baronti <f....@list-group.com>.
We had some serious trouble with dynamically adding CFs, although the last time we tried we were using version 0.7, so maybe 
that's not an issue any more.
We had two problems:
- You are (were?) not supposed to add CFs concurrently. Since we had several servers talking to the same Cassandra cluster, 
we had to use distributed locks (Hazelcast) to avoid concurrent schema changes.
- You must be very careful when adding new CFs through different Cassandra nodes. If you do that fast enough, and the clocks of 
the two servers are skewed, you will severely compromise your schema (Cassandra will not understand in which order the 
updates must be applied).

As I said, this applied to version 0.7; maybe current versions have solved these problems.

Flavio
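[Editorial note: Flavio's first point, serializing dynamic CF creation behind a lock, might look roughly like this toy sketch. `threading.Lock` stands in for a distributed lock such as Hazelcast, and `cluster` is a hypothetical in-memory schema map, not a real Cassandra client.]

```python
import threading

schema_lock = threading.Lock()  # in production: a cluster-wide lock
cluster = {}                    # CF name -> definition

def create_cf_if_absent(name, definition):
    # Only one creator runs at a time, so two servers registering the
    # same dataset cannot race and diverge the schema.
    with schema_lock:
        if name not in cluster:
            cluster[name] = definition
        return cluster[name]

create_cf_if_absent("device_123", {"comparator": "UTF8Type"})
```

The pattern is check-then-create under one lock; without it, concurrent schema updates from different servers could be applied in conflicting orders.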


On 2012/09/27 16:11, Hiller, Dean wrote:
> We have 1000's of different building devices and we stream data from these devices.  The format and data from each one varies so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables.  Every device is unique and streams it's own data.  We dynamically discover devices and register them.  Basically, one CF or table per thing really makes sense in this environment.  While we could try to find out which devices "are" similar, this would really be a pain and some devices add some new variable into the equation.  NOT only that but researchers can register new datasets and upload them as well and each dataset they have they do NOT want to share with other researches necessarily so we have security groups and each CF belongs to security groups.  We dynamically create CF's on the fly as people register new datasets.
>
> On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions.  We could create one CF and put all the data and have a partition per device, but then a time partition will contain "multiple" devices of data meaning we need to shrink our time partition size where if we have CF per device, the time partition can be larger as it is only for that one device.
>
> THEN, on top of that, we have a meta CF for these devices so some people want to query for streams that match criteria AND which returns a CF name and they query that CF name so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where xxxxx. Which we can do today.
>
> Dean
>
> From: Marcelo Elias Del Valle <mv...@gmail.com>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Date: Thursday, September 27, 2012 8:01 AM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Subject: Re: 1000's of column families
>
> Out of curiosity, is it really necessary to have that amount of CFs?
> I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might probably make sense to have a lot of CFs to store your data...
> But why wouldn't you use a single CF with partitions in these case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer.
>
> []s
>
> 2012/9/26 Hiller, Dean <De...@nrel.gov>>
> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.  When using the tools they are all geared to analyzing ONE column family at a time :(.  If I remember correctly, Cassandra supports as many CF's as you want, correct?  Even though I am going to have tons of funs with limitations on the tools, correct?
>
> (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such).
>
> Thanks,
> Dean
>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>


Re: 1000's of column families

Posted by Jeremy Hanna <je...@gmail.com>.
It's always had data locality (since Hadoop support was added in 0.6).

You don't need to specify a partition, you specify the input predicate with ConfigHelper or the cassandra.input.predicate property.

On Oct 2, 2012, at 2:26 PM, "Hiller, Dean" <De...@nrel.gov> wrote:

> So you're saying that you can access the primary index with a key range, but to access the secondary index, you first need to get all keys and follow up with a multiget, which would use the secondary index to speed the lookup of the matching rows?
> 
> Yes, that is how I "believe" it works.  I am by no means an expert.
> 
> I also wanted to fire off a MR to process matching rows in the "virtual" CF ideally running on the nodes where it reads data in.  In 0.7, I thought the M/R jobs did not run locally with the data like hadoop does???  Anyone know if that is still true or does it run locally to the data now?
> 
> Thanks,
> Dean
> 
> From: Ben Hood <0x...@gmail.com>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Date: Tuesday, October 2, 2012 1:01 PM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Subject: Re: 1000's of column families
> 
> Dean,
> 
> On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:
> 
> Because the data for an index is not all together(ie. Need a multi get to get the data). It is not contiguous.
> 
> The prefix in a partition they keep the data so all data for a prefix from what I understand is contiguous.
> 
> 
> 
> 
> 
> QUESTION: What I don't get in the comment is I assume you are referring to CQL in which case we would need to specify the partition (in addition to the index)which means all that data is on one node, correct? Or did I miss something there.
> 
> Maybe my question was just silly - I wasn't referring to CQL.
> 
> As for the locality of the data, I was hoping to be able to fire off an MR job to process all matching rows in the CF - I was assuming that that this job would get executed on the same node as the data.
> 
> But I think the real confusion in my question has to do with the way the ColumnFamilyInputFormat has been implemented, since it would appear that it ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to be applied in the job rather than up front in the Cassandra query.
> 
> Cheers,
> 
> Ben
> 


Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
So you're saying that you can access the primary index with a key range, but to access the secondary index, you first need to get all keys and follow up with a multiget, which would use the secondary index to speed the lookup of the matching rows?

Yes, that is how I "believe" it works.  I am by no means an expert.

I also wanted to fire off an MR to process matching rows in the "virtual" CF, ideally running on the nodes where it reads data in.  In 0.7, I thought the M/R jobs did not run locally with the data like Hadoop does???  Anyone know if that is still true, or does it run locally to the data now?

Thanks,
Dean

From: Ben Hood <0x...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Tuesday, October 2, 2012 1:01 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: 1000's of column families

Dean,

On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:

Because the data for an index is not all together(ie. Need a multi get to get the data). It is not contiguous.

The prefix in a partition they keep the data so all data for a prefix from what I understand is contiguous.





QUESTION: What I don't get in the comment is I assume you are referring to CQL in which case we would need to specify the partition (in addition to the index)which means all that data is on one node, correct? Or did I miss something there.

Maybe my question was just silly - I wasn't referring to CQL.

As for the locality of the data, I was hoping to be able to fire off an MR job to process all matching rows in the CF - I was assuming that that this job would get executed on the same node as the data.

But I think the real confusion in my question has to do with the way the ColumnFamilyInputFormat has been implemented, since it would appear that it ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to be applied in the job rather than up front in the Cassandra query.

Cheers,

Ben


Re: 1000's of column families

Posted by Ben Hood <0x...@gmail.com>.
Dean, 


On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:

> Because the data for an index is not all together(ie. Need a multi get to get the data). It is not contiguous.
> 
> The prefix in a partition they keep the data so all data for a prefix from what I understand is contiguous.
> 

So you're saying that you can access the primary index with a key range, but to access the secondary index, you first need to get all keys and follow up with a multiget, which would use the secondary index to speed the lookup of the matching rows?

 
> 
> QUESTION: What I don't get in the comment is I assume you are referring to CQL in which case we would need to specify the partition (in addition to the index)which means all that data is on one node, correct? Or did I miss something there.
> 
> 


Maybe my question was just silly - I wasn't referring to CQL.

As for the locality of the data, I was hoping to be able to fire off an MR job to process all matching rows in the CF - I was assuming that that this job would get executed on the same node as the data.

But I think the real confusion in my question has to do with the way the ColumnFamilyInputFormat has been implemented, since it would appear that it ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to be applied in the job rather than up front in the Cassandra query.

Cheers,

Ben



Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
Because the data for an index is not all together (i.e., you need a multiget to fetch it).  It is not contiguous.

With a prefix in a partition, the data is kept together, so all data for a given prefix, from what I understand, is contiguous.

QUESTION: What I don't get in the comment is this: I assume you are referring to CQL, in which case we would need to specify the partition (in addition to the index), which means all that data is on one node, correct?  Or did I miss something there.

Thanks,
Dean
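[Editorial note: the contrast Dean describes, index-plus-multiget versus a contiguous prefix slice, can be modeled with a toy sorted map. This is illustrative only, not Cassandra internals.]

```python
# Ordered row key -> value, as a partition (or an OPP ring) would store them.
rows = {
    "dev1|t1": 10, "dev1|t2": 11, "dev1|t3": 12,
    "dev2|t1": 20, "dev2|t2": 21,
}
index = {"even": ["dev1|t1", "dev1|t3"]}  # secondary index: value -> keys

def by_index(tag):
    # Index path: collect keys, then multiget each row (scattered reads).
    return [rows[k] for k in index[tag]]

def by_prefix(prefix):
    # Prefix path: one contiguous range over the sorted keys.
    return [rows[k] for k in sorted(rows) if k.startswith(prefix)]

print(by_index("even"))    # [10, 12]
print(by_prefix("dev1|"))  # [10, 11, 12]
```

The two functions return the same kind of result, but the index path touches arbitrary positions while the prefix path reads one adjacent run.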

From: Ben Hood <0x...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Tuesday, October 2, 2012 11:18 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: 1000's of column families

Jeremy,

On Tuesday, October 2, 2012 at 17:06, Jeremy Hanna wrote:

Another option that may or may not work for you is the support in Cassandra 1.1+ to use a secondary index as an input to your mapreduce job. What you might do is add a field to the column family that represents which virtual column family that it is part of. Then when doing mapreduce jobs, you could use that field as the secondary index limiter. Secondary index mapreduce is not as efficient since you first get all of the keys and then do multigets to get the data that you need for the mapreduce job. However, it's another option for not scanning the whole column family.

Interesting. This is probably a stupid question but why shouldn't you be able to use the secondary index to go straight to the slices that belong to the attribute you are searching by? Is this something to do with the way Cassandra is exposed as an InputFormat for Hadoop or is this a general property for searching by secondary index?

Ben


Re: 1000's of column families

Posted by Ben Hood <0x...@gmail.com>.
Jeremy, 


On Tuesday, October 2, 2012 at 17:06, Jeremy Hanna wrote:

> Another option that may or may not work for you is the support in Cassandra 1.1+ to use a secondary index as an input to your mapreduce job. What you might do is add a field to the column family that represents which virtual column family that it is part of. Then when doing mapreduce jobs, you could use that field as the secondary index limiter. Secondary index mapreduce is not as efficient since you first get all of the keys and then do multigets to get the data that you need for the mapreduce job. However, it's another option for not scanning the whole column family.
> 


Interesting. This is probably a stupid question but why shouldn't you be able to use the secondary index to go straight to the slices that belong to the attribute you are searching by? Is this something to do with the way Cassandra is exposed as an InputFormat for Hadoop or is this a general property for searching by secondary index?

Ben 


Re: 1000's of column families

Posted by Jeremy Hanna <je...@gmail.com>.
Another option that may or may not work for you is the support in Cassandra 1.1+ to use a secondary index as an input to your mapreduce job.  What you might do is add a field to the column family that represents which virtual column family that it is part of.  Then when doing mapreduce jobs, you could use that field as the secondary index limiter.  Secondary index mapreduce is not as efficient since you first get all of the keys and then do multigets to get the data that you need for the mapreduce job.  However, it's another option for not scanning the whole column family.
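[Editorial note: Jeremy's suggestion can be modeled in plain Python. This simulates the index-limiter behavior; it does not use the actual ColumnFamilyInputFormat API, and the column names are made up.]

```python
# Each row carries a `vcf` column naming its virtual column family.
rows = {
    "k1": {"vcf": "temps", "val": 21},
    "k2": {"vcf": "co2",   "val": 400},
    "k3": {"vcf": "temps", "val": 23},
}

# Secondary index on the vcf column: value -> matching row keys.
vcf_index = {}
for key, cols in rows.items():
    vcf_index.setdefault(cols["vcf"], []).append(key)

def mapreduce_over_vcf(vcf, mapper):
    # First fetch the matching keys from the index, then multiget the
    # rows -- less efficient than a contiguous scan, but it avoids
    # reading the whole column family into the job.
    return [mapper(rows[k]) for k in sorted(vcf_index.get(vcf, []))]

print(mapreduce_over_vcf("temps", lambda r: r["val"]))  # [21, 23]
```

Only the rows tagged with the requested virtual CF ever reach the mapper, which is the point of using the index as the job's input limiter.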

On Oct 2, 2012, at 10:09 AM, Ben Hood <0x...@gmail.com> wrote:

> On Tue, Oct 2, 2012 at 3:37 PM, Brian O'Neill <bo...@gmail.com> wrote:
>> Exactly.
> 
> So you're back to the deliberation between using multiple CFs
> (potentially with some known working upper bound*) or feeding your map
> reduce in some other way (as you decided to do with Storm). In my
> particular scenario I'd like to be able to do a combination of some
> batch processing on top of less frequently changing data (hence why I
> was looking at Hadoop) and some real time analytics.
> 
> Cheers,
> 
> Ben
> 
> (*) Not sure whether this applies to an individual keyspace or an
> entire cluster.


Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
Ben, Brian,
   By the way, PlayOrm offers a NoSqlTypedSession that is different than
the ORM half of PlayOrm, dealing in raw data that it indexes (so you can
do scalable SQL on data that has no ORM on top of it).  That is what we
use for our 1000's of CF's, as we don't know the format of any of those
tables ahead of time (in our world, users tell us the format and wire in
streams through an API we expose, AND they tell PlayOrm which columns to
index).  That layer deals with BigInteger, BigDecimal, String and I think
byte[].

So, I am going to add virtual CF's to PlayOrm in the coming week and we
are going to feed in streams and partition the virtual CF's, which sit in
a single real CF using PlayOrm partitioning, and then we can query into
each partition.

The only issue is really knowing what partitions exist, and that is left
to the client to keep track of, but if your app knows all the partitions
(and those could be saved to some rows in the NoSQL store), then I will
probably try out Storm after that.

Later,
Dean
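[Editorial note: Dean's virtual-CF-by-prefix idea might be sketched like this. It is a toy model; the key layout and partition-tracking scheme here are assumptions, not PlayOrm's actual implementation.]

```python
real_cf = {}            # the single physical column family
known_partitions = {}   # virtual CF -> set of time-partition ids

def put(vcf, partition, key, value):
    # Prefix the row key with the virtual CF and its time partition, and
    # remember that this partition exists (the client's job, as noted).
    real_cf[f"{vcf}/{partition}/{key}"] = value
    known_partitions.setdefault(vcf, set()).add(partition)

def scan_partition(vcf, partition):
    # One prefix range covers exactly one virtual CF's partition.
    prefix = f"{vcf}/{partition}/"
    return {k[len(prefix):]: v
            for k, v in real_cf.items() if k.startswith(prefix)}

put("device42", "2012-09", "t1", 98.6)
put("device42", "2012-10", "t2", 99.1)
print(sorted(known_partitions["device42"]))   # ['2012-09', '2012-10']
print(scan_partition("device42", "2012-09"))  # {'t1': 98.6}
```

The partition registry could itself be persisted as rows in the store, as the message suggests, so any client can discover which partitions to query.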

On 10/2/12 9:09 AM, "Ben Hood" <0x...@gmail.com> wrote:

>On Tue, Oct 2, 2012 at 3:37 PM, Brian O'Neill <bo...@gmail.com> wrote:
>> Exactly.
>
>So you're back to the deliberation between using multiple CFs
>(potentially with some known working upper bound*) or feeding your map
>reduce in some other way (as you decided to do with Storm). In my
>particular scenario I'd like to be able to do a combination of some
>batch processing on top of less frequently changing data (hence why I
>was looking at Hadoop) and some real time analytics.
>
>Cheers,
>
>Ben
>
>(*) Not sure whether this applies to an individual keyspace or an
>entire cluster.


Re: 1000's of column families

Posted by Ben Hood <0x...@gmail.com>.
On Tue, Oct 2, 2012 at 3:37 PM, Brian O'Neill <bo...@gmail.com> wrote:
> Exactly.

So you're back to the deliberation between using multiple CFs
(potentially with some known working upper bound*) or feeding your map
reduce in some other way (as you decided to do with Storm). In my
particular scenario I'd like to be able to do a combination of some
batch processing on top of less frequently changing data (hence why I
was looking at Hadoop) and some real time analytics.

Cheers,

Ben

(*) Not sure whether this applies to an individual keyspace or an
entire cluster.

Re: 1000's of column families

Posted by Brian O'Neill <bo...@gmail.com>.
Exactly.

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
healthmarketscience.com

This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or
the person responsible to deliver it to the intended recipient, please
contact the sender at the email above and delete this email and any
attachments and destroy any copies thereof. Any review, retransmission,
dissemination, copying or other use of, or taking any action in reliance
upon, this information by persons or entities other than the intended
recipient is strictly prohibited.
 






On 10/2/12 9:55 AM, "Ben Hood" <0x...@gmail.com> wrote:

>Brian,
>
>On Tue, Oct 2, 2012 at 2:20 PM, Brian O'Neill <bo...@gmail.com> wrote:
>>
>> Without putting too much thought into it...
>>
>> Given the underlying architecture, I think you could/would have to write
>> your own partitioner, which would partition based on the prefix/virtual
>> keyspace.
>
>I might be barking up the wrong tree here, but looking at source of
>ColumnFamilyInputFormat, it seems that you can specify a KeyRange for
>the input, but only when you use an order preserving partitioner. So I
>presume that if you are using the RandomPartitioner, you are
>effectively doing a full CF scan (i.e. including all tenants in your
>system).
>
>Ben



Re: 1000's of column families

Posted by Ben Hood <0x...@gmail.com>.
Brian,

On Tue, Oct 2, 2012 at 2:20 PM, Brian O'Neill <bo...@gmail.com> wrote:
>
> Without putting too much thought into it...
>
> Given the underlying architecture, I think you could/would have to write
> your own partitioner, which would partition based on the prefix/virtual
> keyspace.

I might be barking up the wrong tree here, but looking at source of
ColumnFamilyInputFormat, it seems that you can specify a KeyRange for
the input, but only when you use an order preserving partitioner. So I
presume that if you are using the RandomPartitioner, you are
effectively doing a full CF scan (i.e. including all tenants in your
system).

Ben

Re: 1000's of column families

Posted by Brian O'Neill <bo...@gmail.com>.
Agreed. 

Do we know yet what the overhead is for each column family?  What is the
limit?
If you have a SINGLE keyspace w/ 20000+ CF's, what happens?  Anyone know?

-brian


---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
healthmarketscience.com

 






On 10/2/12 9:28 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

>Thanks for the idea, but…(please keep thinking on it)...
>
>That is 100% what we don't want, since partitioned data resides on the
>same node.  I want to map/reduce the column families and leverage the
>parallel disks.
>
>:( :(
>
>I am sure others would want to do the same…..We almost need a virtual
>Column Family feature, and "column family" should really be called
>ReplicationGroup or something, where replication is configured for all
>CF's in that group.
>
>ANYONE have any other ideas???
>
>Dean
>
>On 10/2/12 7:20 AM, "Brian O'Neill" <bo...@gmail.com> wrote:
>
>>
>>Without putting too much thought into it...
>>
>>Given the underlying architecture, I think you could/would have to write
>>your own partitioner, which would partition based on the prefix/virtual
>>keyspace.  
>>
>>-brian
>>
>>---
>>Brian O'Neill
>>Lead Architect, Software Development
>> 
>>Health Market Science
>>The Science of Better Results
>>2700 Horizon Drive • King of Prussia, PA • 19406
>>M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
>>healthmarketscience.com
>>
>>
>>
>>
>>
>>
>>
>>On 10/2/12 9:00 AM, "Ben Hood" <0x...@gmail.com> wrote:
>>
>>>Dean,
>>>
>>>On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean <De...@nrel.gov>
>>>wrote:
>>>> Ben,
>>>>   to address your question, read my last post but to summarize: yes,
>>>> there is less memory overhead in prefixing keys than in managing
>>>> multiple CFs, EXCEPT when doing map/reduce.  With map/reduce you will
>>>> have HUGE overhead reading a whole slew of rows you don't care about,
>>>> since you can't map/reduce a single virtual CF but must map/reduce the
>>>> whole CF, wasting TONS of resources.
>>>
>>>That's a good point that I hadn't considered beforehand, especially as
>>>I'd like to run MR jobs against these CFs.
>>>
>>>Is this limitation inherent in the way that Cassandra is modelled as
>>>input for Hadoop or could you write a custom slice query to only feed
>>>in one particular prefix into Hadoop?
>>>
>>>Cheers,
>>>
>>>Ben
>>
>>
>



Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
Thanks for the idea, but…(please keep thinking on it)...

That is 100% what we don't want, since partitioned data resides on the same
node.  I want to map/reduce the column families and leverage the parallel
disks.

:( :(

I am sure others would want to do the same…..We almost need a virtual
Column Family feature, and "column family" should really be called
ReplicationGroup or something, where replication is configured for all
CF's in that group.

ANYONE have any other ideas???

Dean

On 10/2/12 7:20 AM, "Brian O'Neill" <bo...@gmail.com> wrote:

>
>Without putting too much thought into it...
>
>Given the underlying architecture, I think you could/would have to write
>your own partitioner, which would partition based on the prefix/virtual
>keyspace.  
>
>-brian
>
>---
>Brian O'Neill
>Lead Architect, Software Development
> 
>Health Market Science
>The Science of Better Results
>2700 Horizon Drive • King of Prussia, PA • 19406
>M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
>healthmarketscience.com
>
>
>
>
>
>
>
>On 10/2/12 9:00 AM, "Ben Hood" <0x...@gmail.com> wrote:
>
>>Dean,
>>
>>On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean <De...@nrel.gov>
>>wrote:
>>> Ben,
>>>   to address your question, read my last post but to summarize: yes,
>>> there is less memory overhead in prefixing keys than in managing
>>> multiple CFs, EXCEPT when doing map/reduce.  With map/reduce you will
>>> have HUGE overhead reading a whole slew of rows you don't care about,
>>> since you can't map/reduce a single virtual CF but must map/reduce the
>>> whole CF, wasting TONS of resources.
>>
>>That's a good point that I hadn't considered beforehand, especially as
>>I'd like to run MR jobs against these CFs.
>>
>>Is this limitation inherent in the way that Cassandra is modelled as
>>input for Hadoop or could you write a custom slice query to only feed
>>in one particular prefix into Hadoop?
>>
>>Cheers,
>>
>>Ben
>
>


Re: 1000's of column families

Posted by Brian O'Neill <bo...@gmail.com>.
Without putting too much thought into it...

Given the underlying architecture, I think you could/would have to write
your own partitioner, which would partition based on the prefix/virtual
keyspace.  

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
healthmarketscience.com







On 10/2/12 9:00 AM, "Ben Hood" <0x...@gmail.com> wrote:

>Dean,
>
>On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean <De...@nrel.gov> wrote:
>> Ben,
>>   to address your question, read my last post but to summarize: yes,
>> there is less memory overhead in prefixing keys than in managing
>> multiple CFs, EXCEPT when doing map/reduce.  With map/reduce you will
>> have HUGE overhead reading a whole slew of rows you don't care about,
>> since you can't map/reduce a single virtual CF but must map/reduce the
>> whole CF, wasting TONS of resources.
>
>That's a good point that I hadn't considered beforehand, especially as
>I'd like to run MR jobs against these CFs.
>
>Is this limitation inherent in the way that Cassandra is modelled as
>input for Hadoop or could you write a custom slice query to only feed
>in one particular prefix into Hadoop?
>
>Cheers,
>
>Ben



Re: 1000's of column families

Posted by Ben Hood <0x...@gmail.com>.
Dean,

On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean <De...@nrel.gov> wrote:
> Ben,
>   to address your question, read my last post but to summarize: yes, there
> is less memory overhead in prefixing keys than in managing multiple CFs,
> EXCEPT when doing map/reduce.  With map/reduce you will have HUGE overhead
> reading a whole slew of rows you don't care about, since you can't
> map/reduce a single virtual CF but must map/reduce the whole CF, wasting
> TONS of resources.

That's a good point that I hadn't considered beforehand, especially as
I'd like to run MR jobs against these CFs.

Is this limitation inherent in the way that Cassandra is modelled as
input for Hadoop or could you write a custom slice query to only feed
in one particular prefix into Hadoop?

Cheers,

Ben

Re: 1000's of CF's. virtual CFs possible Map/Reduce SOLUTION...

Posted by Brian O'Neill <bo...@gmail.com>.
Dean,

We moved away from Hadoop and M/R, and instead we are using Storm as our
compute grid.  We queue keys in Kafka, then Storm distributes the work to
the grid.  It's working well so far, but we haven't taken it to prod yet.
Data is read from Cassandra using a Cassandra-bolt.

If you end up using Storm, let me know.  We have an unreleased version of
the bolt that you probably want to use.  (we're waiting on Nathan/Storm to
fix some classpath loading issues)

RE: a custom virtual keyspace Partitioner, point well taken

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
healthmarketscience.com







On 10/2/12 9:33 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

>Well, I think I know the direction we may follow so we can
>1. Have Virtual CF's
>2. Be able to map/reduce ONE Virtual CF
>
>Well, not map/reduce exactly, but really really close.  We use PlayOrm with
>its partitioning, so I am now thinking what we will do is have a compute
>grid where each node does a findAll query against the partitions it is
>responsible for.  In this way, I think we can have 1000's of virtual CF's
>inside ONE CF, and then PlayOrm does its query and retrieves the rows for
>that partition of one virtual CF.
>
>Anyone know of a compute grid we can dish out work to?  That would be my
>only missing piece (well, that and the PlayOrm virtual CF feature, but I
>can probably add that within a week, though I am on vacation this Thursday
>to Monday).
>
>Later,
>Dean
>
>
>On 10/2/12 6:35 AM, "Hiller, Dean" <De...@nrel.gov> wrote:
>
>>So basically, with moving towards the 1000's of CF all being put in one
>>CF, our performance is going to tank on map/reduce, correct?  I mean,
>>from
>>what I remember we could do map/reduce on a single CF, but by stuffing
>>1000's of virtual Cf's into one CF, our map/reduce will have to read in
>>all 999 virtual CF's rows that we don't want just to map/reduce the ONE
>>CF.
>>
>>Map/reduce VERY VERY SLOW when reading in 1000 times more rows :( :(.
>>
>>Is this correct?  This really sounds like highly undesirable behavior.
>>There needs to be a way for people with 1000's of CF's to also run
>>map/reduce on any one CF.  Doing Map/reduce on 1000 times the number of
>>rows will be 1000 times slower….and of course, we will most likely get up
>>to 20,000 tables from my most recent projections….our last test load, we
>>ended up with 8k+ CF's.  Since I kept two other keyspaces, cassandra
>>started getting really REALLY slow when we got up to 15k+ CF's in the
>>system though I didn't look into why.
>>
>>I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to
>>map/reduce "just" the virtual CF!!!!!  Ugh.
>>
>>Thanks,
>>Dean
>>
>>On 10/1/12 3:38 PM, "Ben Hood" <0x...@gmail.com> wrote:
>>
>>>On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <bo...@alumni.brown.edu>
>>>wrote:
>>>> It's just a convenient way of prefixing:
>>>> 
>>>>http://hector-client.github.com/hector/build/html/content/virtual_keysp
>>>>a
>>>>c
>>>>es.html
>>>
>>>So given that it is possible to use a CF per tenant, should we assume
>>>that at sufficient scale there is less overhead in prefixing keys than
>>>in managing multiple CFs?
>>>
>>>Ben
>>
>



Re: 1000's of CF's. virtual CFs possible Map/Reduce SOLUTION...

Posted by "Hiller, Dean" <De...@nrel.gov>.
Well, I think I know the direction we may follow so we can
1. Have Virtual CF's
2. Be able to map/reduce ONE Virtual CF

Well, not map/reduce exactly, but really really close.  We use PlayOrm with
its partitioning, so I am now thinking what we will do is have a compute
grid where each node does a findAll query against the partitions it is
responsible for.  In this way, I think we can have 1000's of virtual CF's
inside ONE CF, and then PlayOrm does its query and retrieves the rows for
that partition of one virtual CF.

Anyone know of a compute grid we can dish out work to?  That would be my
only missing piece (well, that and the PlayOrm virtual CF feature, but I
can probably add that within a week, though I am on vacation this Thursday
to Monday).

Later,
Dean


On 10/2/12 6:35 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

>So basically, with moving towards the 1000's of CF all being put in one
>CF, our performance is going to tank on map/reduce, correct?  I mean, from
>what I remember we could do map/reduce on a single CF, but by stuffing
>1000's of virtual Cf's into one CF, our map/reduce will have to read in
>all 999 virtual CF's rows that we don't want just to map/reduce the ONE
>CF.
>
>Map/reduce VERY VERY SLOW when reading in 1000 times more rows :( :(.
>
>Is this correct?  This really sounds like highly undesirable behavior.
>There needs to be a way for people with 1000's of CF's to also run
>map/reduce on any one CF.  Doing Map/reduce on 1000 times the number of
>rows will be 1000 times slower….and of course, we will most likely get up
>to 20,000 tables from my most recent projections….our last test load, we
>ended up with 8k+ CF's.  Since I kept two other keyspaces, cassandra
>started getting really REALLY slow when we got up to 15k+ CF's in the
>system though I didn't look into why.
>
>I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to
>map/reduce "just" the virtual CF!!!!!  Ugh.
>
>Thanks,
>Dean
>
>On 10/1/12 3:38 PM, "Ben Hood" <0x...@gmail.com> wrote:
>
>>On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <bo...@alumni.brown.edu>
>>wrote:
>>> It's just a convenient way of prefixing:
>>> 
>>>http://hector-client.github.com/hector/build/html/content/virtual_keyspa
>>>c
>>>es.html
>>
>>So given that it is possible to use a CF per tenant, should we assume
>>that at sufficient scale there is less overhead in prefixing keys than
>>in managing multiple CFs?
>>
>>Ben
>
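
A rough sketch of the idea above, with a hypothetical findAll-per-partition
call standing in for PlayOrm's partitioned query (the names and the
in-memory store are illustrative assumptions, not PlayOrm's actual API): fan
one query per partition out to a pool of workers and merge the results, so
only the target virtual CF's rows are ever read.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for Cassandra, keyed by (virtual CF, time partition).
STORE = {
    ("device1", "2012-09"): ["p1", "p2"],
    ("device1", "2012-10"): ["p3"],
    ("device2", "2012-09"): ["q1"],  # other virtual CFs are never touched
}

def find_all(virtual_cf: str, partition: str) -> list:
    # Hypothetical per-partition query; real code would hit Cassandra here.
    return STORE.get((virtual_cf, partition), [])

def grid_scan(virtual_cf: str, partitions: list) -> list:
    # Fan the per-partition queries out across the "grid" and merge results.
    with ThreadPoolExecutor(max_workers=4) as pool:
        chunks = pool.map(lambda p: find_all(virtual_cf, p), partitions)
        return [row for chunk in chunks for row in chunk]

print(grid_scan("device1", ["2012-09", "2012-10"]))  # -> ['p1', 'p2', 'p3']
```

Unlike a full-CF map/reduce, the work dispatched here scales with the size
of the one virtual CF, not with the whole physical CF.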


Re: 1000's of CF's. virtual CFs do NOT work….map/reduce

Posted by Brian O'Neill <bo...@gmail.com>.
Dean,

Great point.  I hadn't considered that either.  Per my other email, think
we would need a custom partitioner for this? (a mix of
OrderPreservingPartitioner and RandomPartitioner, OPP for the prefix)

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
healthmarketscience.com







On 10/2/12 8:35 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

>So basically, with moving towards the 1000's of CF all being put in one
>CF, our performance is going to tank on map/reduce, correct?  I mean, from
>what I remember we could do map/reduce on a single CF, but by stuffing
>1000's of virtual Cf's into one CF, our map/reduce will have to read in
>all 999 virtual CF's rows that we don't want just to map/reduce the ONE
>CF.
>
>Map/reduce VERY VERY SLOW when reading in 1000 times more rows :( :(.
>
>Is this correct?  This really sounds like highly undesirable behavior.
>There needs to be a way for people with 1000's of CF's to also run
>map/reduce on any one CF.  Doing Map/reduce on 1000 times the number of
>rows will be 1000 times slower….and of course, we will most likely get up
>to 20,000 tables from my most recent projections….our last test load, we
>ended up with 8k+ CF's.  Since I kept two other keyspaces, cassandra
>started getting really REALLY slow when we got up to 15k+ CF's in the
>system though I didn't look into why.
>
>I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to
>map/reduce "just" the virtual CF!!!!!  Ugh.
>
>Thanks,
>Dean
>
>On 10/1/12 3:38 PM, "Ben Hood" <0x...@gmail.com> wrote:
>
>>On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <bo...@alumni.brown.edu>
>>wrote:
>>> It's just a convenient way of prefixing:
>>> 
>>>http://hector-client.github.com/hector/build/html/content/virtual_keyspa
>>>c
>>>es.html
>>
>>So given that it is possible to use a CF per tenant, should we assume
>>that there at sufficient scale that there is less overhead to prefix
>>keys than there is to manage multiple CFs?
>>
>>Ben
>



1000's of CF's. virtual CFs do NOT work….map/reduce

Posted by "Hiller, Dean" <De...@nrel.gov>.
So basically, with moving towards the 1000's of CF all being put in one
CF, our performance is going to tank on map/reduce, correct?  I mean, from
what I remember we could do map/reduce on a single CF, but by stuffing
1000's of virtual Cf's into one CF, our map/reduce will have to read in
all 999 virtual CF's rows that we don't want just to map/reduce the ONE CF.

Map/reduce VERY VERY SLOW when reading in 1000 times more rows :( :(.

Is this correct?  This really sounds like highly undesirable behavior.
There needs to be a way for people with 1000's of CF's to also run
map/reduce on any one CF.  Doing Map/reduce on 1000 times the number of
rows will be 1000 times slower….and of course, we will most likely get up
to 20,000 tables from my most recent projections….our last test load, we
ended up with 8k+ CF's.  Since I kept two other keyspaces, cassandra
started getting really REALLY slow when we got up to 15k+ CF's in the
system though I didn't look into why.

I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to
map/reduce "just" the virtual CF!!!!!  Ugh.

Thanks,
Dean

On 10/1/12 3:38 PM, "Ben Hood" <0x...@gmail.com> wrote:

>On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <bo...@alumni.brown.edu>
>wrote:
>> It's just a convenient way of prefixing:
>> 
>>http://hector-client.github.com/hector/build/html/content/virtual_keyspac
>>es.html
>
>So given that it is possible to use a CF per tenant, should we assume
>that at sufficient scale there is less overhead in prefixing keys than
>in managing multiple CFs?
>
>Ben
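
To put a number on the waste Dean describes, here is an illustrative sketch
(plain Python, made-up row keys) of what a mapper over a shared physical CF
ends up doing: Hadoop hands it every row, and it must discard the ones
belonging to other virtual CFs.

```python
# Simulated full-CF scan: the rows we want are buried among rows that
# belong to other virtual CFs sharing the same physical CF.
rows = [("deviceA:1", "x"), ("deviceB:1", "y"), ("deviceB:2", "z"),
        ("deviceA:2", "w"), ("deviceC:1", "v")]

scanned = 0
wanted = []
for key, value in rows:             # the mapper is fed every row...
    scanned += 1
    if key.startswith("deviceA:"):  # ...and discards most of them
        wanted.append(value)

print(scanned, len(wanted))  # -> 5 2
```

At 1000 virtual CFs the ratio of scanned to wanted rows is roughly 1000:1,
which is exactly the slowdown complained about above.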


Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
Ben, 
  to address your question, read my last post but to summarize: yes, there
is less memory overhead in prefixing keys than in managing multiple CFs,
EXCEPT when doing map/reduce.  With map/reduce you will have HUGE overhead
reading a whole slew of rows you don't care about, since you can't
map/reduce a single virtual CF but must map/reduce the whole CF, wasting
TONS of resources.

Thanks,
Dean

On 10/1/12 3:38 PM, "Ben Hood" <0x...@gmail.com> wrote:

>On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <bo...@alumni.brown.edu>
>wrote:
>> It's just a convenient way of prefixing:
>> 
>>http://hector-client.github.com/hector/build/html/content/virtual_keyspac
>>es.html
>
>So given that it is possible to use a CF per tenant, should we assume
>that at sufficient scale there is less overhead in prefixing keys than
>in managing multiple CFs?
>
>Ben


Re: 1000's of column families

Posted by Ben Hood <0x...@gmail.com>.
On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <bo...@alumni.brown.edu> wrote:
> It's just a convenient way of prefixing:
> http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html

So given that it is possible to use a CF per tenant, should we assume
that at sufficient scale there is less overhead in prefixing keys than
in managing multiple CFs?

Ben

Re: 1000's of column families

Posted by Brian O'Neill <bo...@alumni.brown.edu>.
It's just a convenient way of prefixing:
http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html

-brian

On Mon, Oct 1, 2012 at 4:22 PM, Ben Hood <0x...@gmail.com> wrote:
> Brian,
>
> On Mon, Oct 1, 2012 at 4:22 PM, Brian O'Neill <bo...@alumni.brown.edu> wrote:
>> We haven't committed either way yet, but given Ed Anuff's presentation
>> on virtual keyspaces, we were leaning towards a single column family
>> approach:
>> http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/?
>
> Is this doing something special or is this just a convenient way of
> prefixing keys to make the storage space multi-tenanted?
>
> Cheers,
>
> Ben



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)

mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42
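
For anyone wondering what that prefixing amounts to, here is a minimal
sketch (illustrative helpers, not Hector's actual code; as I understand it
Hector handles the prefix more robustly than a plain delimiter): prepend a
virtual-keyspace prefix to every row key on write and strip it back off on
read.

```python
SEP = b":"

def to_physical(virtual_ks: bytes, key: bytes) -> bytes:
    # Prefix the row key so many virtual keyspaces share one physical CF.
    return virtual_ks + SEP + key

def from_physical(physical_key: bytes) -> tuple:
    # Split the prefix back off when reading (splits at the FIRST separator,
    # so the application key may itself contain the separator byte).
    ks, _, key = physical_key.partition(SEP)
    return ks, key

pk = to_physical(b"tenant42", b"sensor:7")
print(pk)                 # -> b'tenant42:sensor:7'
print(from_physical(pk))  # -> (b'tenant42', b'sensor:7')
```

A delimiter-free scheme (e.g. a length-prefixed composite key) avoids any
ambiguity if tenant names could contain the separator.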

Re: 1000's of column families

Posted by Ben Hood <0x...@gmail.com>.
Brian,

On Mon, Oct 1, 2012 at 4:22 PM, Brian O'Neill <bo...@alumni.brown.edu> wrote:
> We haven't committed either way yet, but given Ed Anuff's presentation
> on virtual keyspaces, we were leaning towards a single column family
> approach:
> http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/?

Is this doing something special or is this just a convenient way of
prefixing keys to make the storage space multi-tenanted?

Cheers,

Ben

Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
Well, I am now thinking of adding a virtual capability to PlayOrm, which we
currently use, to allow grouping entities into one column family.  Right
now the CF creation comes from a single entity, so this may change for
those entities that declare they are in a single CF group….This should not
be a very hard change if we decide to do that.

This makes us rely even more on PlayOrm's command line tool (instead of
cassandra-cli), as I can't stand reading hex all the time, nor do I like
switching the assumed validator from utf8 to decimal to integer just so I
can read stuff.

Later,
Dean

On 10/1/12 9:22 AM, "Brian O'Neill" <bo...@alumni.brown.edu> wrote:

>Dean,
>
>We have the same question...
>
>We have thousands of separate feeds of data as well (20,000+).  To
>date, we've been using a CF per feed strategy, but as we scale this
>thing out to accommodate all of those feeds, we're trying to figure
>out if we're going to blow out the memory.
>
>The initial documentation for heap sizing had column families in the
>equation:
>http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing
>
>But in the more recent documentation, it looks like they removed the
>column family variable with the introduction of the universal
>key_cache_size.
>http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size
>
>We haven't committed either way yet, but given Ed Anuff's presentation
>on virtual keyspaces, we were leaning towards a single column family
>approach:
>http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassand
>ra_-_apigee_under_the_hood/?
>
>Definitely let us know what you decide.
>
>-brian
>
>On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti
><f....@list-group.com> wrote:
>> We had some serious trouble with dynamically adding CFs, although last
>>time
>> we tried we were using version 0.7, so maybe
>> that's not an issue any more.
>> We had two problems:
>> - You are (were?) not supposed to add CFs concurrently. Since we had
>>more
>> servers talking to the same Cassandra cluster,
>> we had to use distributed locks (Hazelcast) to avoid concurrency.
>> - You must be very careful when adding new CFs from different Cassandra nodes.
>>If
>> you do that fast enough, and the clocks of
>> the two servers are skewed, you will severely compromise your schema
>> (Cassandra will not understand in which order the
>> updates must be applied).
>>
>> As I said, this applied to version 0.7, maybe current versions solved
>>these
>> problems.
>>
>> Flavio
>>
>>
>> On 2012/09/27 16:11, Hiller, Dean wrote:
>>> We have 1000's of different building devices and we stream data from
>>>these
>> devices.  The format and data from each one varies so one device has
>>temperature
>> at timeX with some other variables, another device has CO2 percentage
>>and other
>> variables.  Every device is unique and streams its own data.  We
>>dynamically
>> discover devices and register them.  Basically, one CF or table per
>>thing really
>> makes sense in this environment.  While we could try to find out which
>>devices
>> "are" similar, this would really be a pain and some devices add some new
>> variable into the equation.  NOT only that but researchers can register
>>new
>> datasets and upload them as well and each dataset they have they do NOT
>>want to
>> share with other researchers necessarily so we have security groups and
>>each CF
>> belongs to security groups.  We dynamically create CF's on the fly as
>>people
>> register new datasets.
>>>
>>> On top of that, when the data sets get too large, we probably want to
>> partition a single CF into time partitions.  We could create one CF and
>>put all
>> the data and have a partition per device, but then a time partition
>>will contain
>> "multiple" devices of data meaning we need to shrink our time partition
>>size
>> where if we have CF per device, the time partition can be larger as it
>>is only
>> for that one device.
>>>
>>> THEN, on top of that, we have a meta CF for these devices so some
>>>people want
>> to query for streams that match criteria AND which returns a CF name
>>and they
>> query that CF name so we almost need a query with variables like select
>>cfName
>> from Meta where x = y and then select * from cfName where xxxxx. Which
>>we can do
>> today.
>>>
>>> Dean
>>>
>>> From: Marcelo Elias Del Valle
>>><mv...@gmail.com>>
>>> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
>> <us...@cassandra.apache.org>>
>>> Date: Thursday, September 27, 2012 8:01 AM
>>> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
>> <us...@cassandra.apache.org>>
>>> Subject: Re: 1000's of column families
>>>
>>> Out of curiosity, is it really necessary to have that amount of CFs?
>>> I am probably still used to relational databases, where you would use
>>>a new
>> table just in case you need to store different kinds of data. As
>>Cassandra
>> stores anything in each CF, it might make sense to have a lot
>>of CFs to
>> store your data...
>>> But why wouldn't you use a single CF with partitions in this case?
>>>Wouldn't
>> it be the same thing? I am asking because I might learn a new modeling
>>technique
>> with the answer.
>>>
>>> []s
>>>
>>> 2012/9/26 Hiller, Dean
>>><De...@nrel.gov>>
>>> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
>>> When
>> using the tools they are all geared to analyzing ONE column family at a
>>time :(.
>> If I remember correctly, Cassandra supports as many CF's as you want,
>>correct?
>> Even though I am going to have tons of fun with limitations on the
>>tools,
>> correct?
>>>
>>> (I may end up wrapping the node tool with my own aggregate calls if
>>>needed to
>> sum up multiple column families and such).
>>>
>>> Thanks,
>>> Dean
>>>
>>>
>>>
>>> --
>>> Marcelo Elias Del Valle
>>> http://mvalle.com - @mvallebr
>>>
>>
>>
>
>
>
>-- 
>Brian ONeill
>Lead Architect, Health Market Science (http://healthmarketscience.com)
>Apache Cassandra MVP
>mobile:215.588.6024
>blog: http://brianoneill.blogspot.com/
>twitter: @boneill42
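
As best I recall the 0.7 heap-sizing guidance quoted above, the estimate was
roughly memtable throughput x 3 x the number of actively written CFs, plus
about 1GB of base overhead.  Treating that formula and the figures below as
assumptions rather than measurements, a quick back-of-the-envelope shows why
20,000 CFs looks scary:

```python
def heap_estimate_mb(memtable_throughput_mb, hot_cfs, base_mb=1024):
    # Rough 0.7-era rule of thumb: each actively written CF can hold up to
    # ~3x its memtable throughput in heap, plus a fixed base for Cassandra.
    return memtable_throughput_mb * 3 * hot_cfs + base_mb

# Even a tiny 16MB memtable per CF is untenable at 20,000 hot CFs:
print(heap_estimate_mb(16, 20000))  # -> 961024 MB, i.e. roughly 940 GB
```

Even if the newer key_cache_size tuning removes the per-CF term, the
memtable side alone suggests consolidating into far fewer physical CFs.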


Re: 1000's of column families

Posted by Brian O'Neill <bo...@alumni.brown.edu>.
Dean,

We have the same question...

We have thousands of separate feeds of data as well (20,000+).  To
date, we've been using a CF per feed strategy, but as we scale this
thing out to accommodate all of those feeds, we're trying to figure
out if we're going to blow out the memory.

The initial documentation for heap sizing had column families in the equation:
http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing

But in the more recent documentation, it looks like they removed the
column family variable with the introduction of the universal
key_cache_size.
http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size

We haven't committed either way yet, but given Ed Anuff's presentation
on virtual keyspaces, we were leaning towards a single column family
approach:
http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/?

Definitely let us know what you decide.

-brian

On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti
<f....@list-group.com> wrote:
> We had some serious trouble with dynamically adding CFs, although last time
> we tried we were using version 0.7, so maybe
> that's not an issue any more.
> We had two problems:
> - You are (were?) not supposed to add CFs concurrently. Since we had more
> servers talking to the same Cassandra cluster,
> we had to use distributed locks (Hazelcast) to avoid concurrency.
> - You must be very careful when adding new CFs from different Cassandra nodes. If
> you do that fast enough, and the clocks of
> the two servers are skewed, you will severely compromise your schema
> (Cassandra will not understand in which order the
> updates must be applied).
>
> As I said, this applied to version 0.7, maybe current versions solved these
> problems.
>
> Flavio
>
>
> On 2012/09/27 16:11, Hiller, Dean wrote:
>> We have 1000's of different building devices and we stream data from these
> devices.  The format and data from each one varies so one device has temperature
> at timeX with some other variables, another device has CO2 percentage and other
> variables.  Every device is unique and streams its own data.  We dynamically
> discover devices and register them.  Basically, one CF or table per thing really
> makes sense in this environment.  While we could try to find out which devices
> "are" similar, this would really be a pain and some devices add some new
> variable into the equation.  NOT only that but researchers can register new
> datasets and upload them as well and each dataset they have they do NOT want to
> share with other researchers necessarily so we have security groups and each CF
> belongs to security groups.  We dynamically create CF's on the fly as people
> register new datasets.
>>
>> On top of that, when the data sets get too large, we probably want to
> partition a single CF into time partitions.  We could create one CF and put all
> the data and have a partition per device, but then a time partition will contain
> "multiple" devices of data meaning we need to shrink our time partition size
> where if we have CF per device, the time partition can be larger as it is only
> for that one device.
>>
>> THEN, on top of that, we have a meta CF for these devices so some people want
> to query for streams that match criteria AND which returns a CF name and they
> query that CF name so we almost need a query with variables like select cfName
> from Meta where x = y and then select * from cfName where xxxxx, which we can do
> today.
>>
>> Dean
>>
>> From: Marcelo Elias Del Valle <mv...@gmail.com>>
>> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
> <us...@cassandra.apache.org>>
>> Date: Thursday, September 27, 2012 8:01 AM
>> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
> <us...@cassandra.apache.org>>
>> Subject: Re: 1000's of column families
>>
>> Out of curiosity, is it really necessary to have that amount of CFs?
>> I am probably still used to relational databases, where you would use a new
> table just in case you need to store different kinds of data. As Cassandra
> stores anything in each CF, it might probably make sense to have a lot of CFs to
> store your data...
>> But why wouldn't you use a single CF with partitions in this case? Wouldn't
> it be the same thing? I am asking because I might learn a new modeling technique
> with the answer.
>>
>> []s
>>
>> 2012/9/26 Hiller, Dean <De...@nrel.gov>>
>> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.  When
> using the tools they are all geared to analyzing ONE column family at a time :(.
> If I remember correctly, Cassandra supports as many CF's as you want, correct?
> Even though I am going to have tons of fun with limitations on the tools,
> correct?
>>
>> (I may end up wrapping the node tool with my own aggregate calls if needed to
> sum up multiple column families and such).
>>
>> Thanks,
>> Dean
>>
>>
>>
>> --
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
>>
>
>



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42
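Flavio's clock-skew caveat above can be illustrated with a toy sketch: in that era, schema migrations were ordered by timestamp, so two CF changes issued from nodes with skewed clocks could be merged in the wrong order. The snippet below is plain Python mimicking last-write-wins merging, not actual Cassandra code; the node names and skew are made up for illustration.

```python
# Toy illustration of why clock skew breaks timestamp-ordered schema updates.
# This is NOT Cassandra code; it just mimics last-write-wins merging.

def apply_updates(updates):
    """Apply (timestamp, cf_name, definition) updates in timestamp order."""
    schema = {}
    for ts, cf, definition in sorted(updates):
        schema[cf] = definition
    return schema

# Node A creates cf1 and stamps it 100.
# Node B updates cf1 a real-time moment LATER, but its clock is skewed
# behind, so it stamps the update 96.
updates = [
    (100, "cf1", "v1-from-node-A"),
    (96,  "cf1", "v2-from-node-B"),  # issued later, stamped earlier
]

schema = apply_updates(updates)
# The later update loses: the cluster converges on the stale definition.
print(schema["cf1"])  # -> v1-from-node-A
```

This is why serializing schema changes through one node (or a distributed lock, as Flavio did) sidesteps the problem.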

Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
We have 1000's of different building devices and we stream data from these devices.  The format and data from each one varies so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables.  Every device is unique and streams its own data.  We dynamically discover devices and register them.  Basically, one CF or table per thing really makes sense in this environment.  While we could try to find out which devices "are" similar, this would really be a pain and some devices add some new variable into the equation.  NOT only that but researchers can register new datasets and upload them as well and each dataset they have they do NOT want to share with other researchers necessarily, so we have security groups and each CF belongs to security groups.  We dynamically create CF's on the fly as people register new datasets.

On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions.  We could create one CF and put all the data and have a partition per device, but then a time partition will contain "multiple" devices of data meaning we need to shrink our time partition size where if we have CF per device, the time partition can be larger as it is only for that one device.

THEN, on top of that, we have a meta CF for these devices so some people want to query for streams that match criteria AND which returns a CF name and they query that CF name so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where xxxxx, which we can do today.

Dean

From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, September 27, 2012 8:01 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: 1000's of column families

Out of curiosity, is it really necessary to have that amount of CFs?
I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might probably make sense to have a lot of CFs to store your data...
But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer.

[]s

2012/9/26 Hiller, Dean <De...@nrel.gov>>
We are streaming data with 1 stream per 1 CF and we have 1000's of CF.  When using the tools they are all geared to analyzing ONE column family at a time :(.  If I remember correctly, Cassandra supports as many CF's as you want, correct?  Even though I am going to have tons of fun with limitations on the tools, correct?

(I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such).

Thanks,
Dean



--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
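Dean's per-device time partitioning can be sketched as a row-key scheme: because each device has its own CF, a partition only ever holds one device's points, so the time bucket can be wide. The 30-day bucket width and the key format below are illustrative assumptions, not anything specified in the thread.

```python
BUCKET_SECONDS = 30 * 24 * 3600  # assumed: one time partition per 30 days

def partition_key(device_id, epoch_seconds):
    """Row key for a device's time partition.  One device per CF means
    each bucket holds only that device's points, so the bucket can be
    wider than in a layout where all devices share partitions."""
    bucket = int(epoch_seconds) // BUCKET_SECONDS
    return f"{device_id}:{bucket}"

print(partition_key("device1", 1348704000))  # 2012-09-27 00:00 UTC -> device1:520
```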

Re: 1000's of column families

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
Out of curiosity, is it really necessary to have that amount of CFs?
I am probably still used to relational databases, where you would use a new
table just in case you need to store different kinds of data. As Cassandra
stores anything in each CF, it might probably make sense to have a lot of
CFs to store your data...
But why wouldn't you use a single CF with partitions in this case?
Wouldn't it be the same thing? I am asking because I might learn a new
modeling technique with the answer.

[]s

2012/9/26 Hiller, Dean <De...@nrel.gov>

> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
>  When using the tools they are all geared to analyzing ONE column family at
> a time :(.  If I remember correctly, Cassandra supports as many CF's as you
> want, correct?  Even though I am going to have tons of fun with
> limitations on the tools, correct?
>
> (I may end up wrapping the node tool with my own aggregate calls if needed
> to sum up multiple column families and such).
>
> Thanks,
> Dean
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
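The single-CF alternative Marcelo asks about can be sketched as one shared table whose partitions are time buckets, with each device's distinct variables carried as a free-form column map (Cassandra rows being schemaless). This is a plain-Python stand-in, not client code, and it shows the trade-off Dean raises: a shared time bucket mixes devices, so buckets must be narrower.

```python
from collections import defaultdict

# Single shared CF, partitioned by time bucket alone: every device's rows
# land in the same bucket, so buckets fill faster and must be narrower.
shared_cf = defaultdict(list)

def insert(time_bucket, device_id, columns):
    shared_cf[time_bucket].append((device_id, columns))

insert(0, "device1", {"tempC": 21.5})
insert(0, "device2", {"co2_pct": 0.04})  # different column shape, same CF

print(len(shared_cf[0]))  # -> 2: one bucket mixes devices
```

With one CF per device, the same two points would live in two independent partitions, which is why Dean can afford larger time buckets in his layout.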

Re: 1000's of column families

Posted by Robin Verlangen <ro...@us2.nl>.
Every CF adds some overhead (in memory) to each node. This is something you
should really keep in mind.

Best regards,

Robin Verlangen
*Software engineer*
W http://www.robinverlangen.nl
E robin@us2.nl

<http://goo.gl/Lt7BC>

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/27 Sylvain Lebresne <sy...@datastax.com>

> On Thu, Sep 27, 2012 at 12:13 AM, Hiller, Dean <De...@nrel.gov>
> wrote:
> > We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
>  When using the tools they are all geared to analyzing ONE column family at
> a time :(.  If I remember correctly, Cassandra supports as many CF's as you
> want, correct?  Even though I am going to have tons of fun with
> limitations on the tools, correct?
> >
> > (I may end up wrapping the node tool with my own aggregate calls if
> needed to sum up multiple column families and such).
>
> Is there a non-rhetorical question in there? Or is that maybe a feature
> request in disguise?
>
> --
> Sylvain
>

Re: 1000's of column families

Posted by "Hiller, Dean" <De...@nrel.gov>.
Is there a non rhetorical question in there? Maybe is that a feature
request in disguise?


The question was basically: is Cassandra ok with as many CF's as you want?
It sounds like it is not, based on the email saying that every CF causes a
bit more RAM to be used.  So if Cassandra is not ok with as many CF's
as you want, does anyone know what that limit would be for 16G of RAM, or
something I could calculate with?

Thanks,
Dean

On 9/27/12 2:37 AM, "Sylvain Lebresne" <sy...@datastax.com> wrote:

>On Thu, Sep 27, 2012 at 12:13 AM, Hiller, Dean <De...@nrel.gov>
>wrote:
>> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
>>When using the tools they are all geared to analyzing ONE column family
>>at a time :(.  If I remember correctly, Cassandra supports as many CF's
>>as you want, correct?  Even though I am going to have tons of fun with
>>limitations on the tools, correct?
>>
>> (I may end up wrapping the node tool with my own aggregate calls if
>>needed to sum up multiple column families and such).
>
>Is there a non-rhetorical question in there? Or is that maybe a feature
>request in disguise?
>
>--
>Sylvain
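A rough way to reason about Dean's 16G question is a back-of-envelope division. The 5 MB per-CF figure below is an ASSUMED placeholder, not a documented number: real per-CF overhead depends heavily on Cassandra version, memtable thresholds, bloom filters, and index sampling, so measure on your own cluster and substitute.

```python
# Back-of-envelope estimate of how many CFs a heap might tolerate.
heap_bytes = 16 * 1024**3        # the 16G of RAM from the question
reserved_fraction = 0.5          # assume half the heap is needed for everything else
per_cf_overhead = 5 * 1024**2    # ASSUMED 5 MB/CF (memtable + bloom filter + index)

max_cfs = int(heap_bytes * reserved_fraction / per_cf_overhead)
print(max_cfs)  # -> 1638
```

Under those (unverified) assumptions, "thousands of CFs" sits right at the edge, which matches the caution in the replies about per-CF memory overhead.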


Re: 1000's of column families

Posted by Sylvain Lebresne <sy...@datastax.com>.
On Thu, Sep 27, 2012 at 12:13 AM, Hiller, Dean <De...@nrel.gov> wrote:
> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.  When using the tools they are all geared to analyzing ONE column family at a time :(.  If I remember correctly, Cassandra supports as many CF's as you want, correct?  Even though I am going to have tons of fun with limitations on the tools, correct?
>
> (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such).

Is there a non-rhetorical question in there? Or is that maybe a feature
request in disguise?

--
Sylvain