Posted to user@cassandra.apache.org by Shalom <cs...@gmail.com> on 2011/04/11 17:01:33 UTC

Cassandra Database Modeling

I would like to save statistics on 10,000,000 (ten million) pairs of
particles: how they relate to one another at any given point in time.

So suppose that within a total experiment time of T1..T1000 (assume that T1
is when the experiment starts, and T1000 is the time when the experiment
ends) I would like, for each pair of particles, to measure the relationship
over every Tn..T(n+1) interval:

T1..T2 (this is the first interval)

T2..T3

T3..T4

......

......

T999..T1000 (this is the last interval)

For each such particle pair (there are 10,000,000 pairs) I would like to
save some figures (such as distance, angle etc.) for each interval
[ Tn..T(n+1) ].

Once saved, the query I will be using to retrieve this data is as follows:
"give me all particle pairs on time interval [ Tn..T(n+1) ] where the
distance between the two particles is smaller than X and the angle between
the two particles is greater than Y". Meaning, the query will always be
over all particle pairs within a given time interval.

How would you model this in Cassandra, so that writes/reads are
optimized? Given the database size involved, can you recommend a suitable
solution? (Both MongoDB and Cassandra have been recommended to me.)

I should mention that the data changes often -- we run many such
experiments (different particle sets / thousands of experiments) and would
need very good read/write performance.

Is Cassandra suitable for this type of work?


--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-Database-Modeling-tp6261778p6261778.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Cassandra Database Modeling

Posted by Aaron Morton <aa...@thelastpickle.com>.
Rows can have up to 2 billion columns, and the max column size is 2 GB. But less than 10 MB sounds like a sane limit for a single column.

For the serialisation it depends on what your data looks like; the point is that JSON is not space efficient. You may get away with just compressing it (gzip, lzo...), or you may need to create your own space-efficient binary format. Start with compression and use the C-accelerated simplejson package.

struct.pack is a way to encode values as raw bytes, typically to exchange with other programs.
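
For example, a minimal sketch (the field layout and sizes are
illustrative, not a fixed format):

    import json, struct, zlib

    record = {'pair_id': 123456789, 'distance': 500.25, 'angle': 42.5}

    as_json = json.dumps(record)                      # ~55 bytes, self-describing
    as_gzip = zlib.compress(as_json.encode('utf-8'))  # pays off on batches,
                                                      # not on single tiny records
    as_bin = struct.pack('>Idd', record['pair_id'],   # 20 bytes, fixed layout
                         record['distance'], record['angle'])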

Good luck.
Aaron

On 15/04/2011, at 3:59 PM, csharpplusproject <cs...@gmail.com> wrote:

> Aaron,
> 
> Thank you so much.
> 
> So, the way things appear, it is definitely possible that I could be making queries that would return all 10M particle pairs (at least, I should plan for it). What would be the best design in such a case?
> I read somewhere that the recommended maximum size of a row (meaning, including all columns) should be around 10[MB], and better not to exceed that. Is that correct?
> 
> As per packing data "efficiently", what would be the best way? would packing the data using say (in python terms) struct.pack( ... ) be at all helpful?
> 
> Thanks,
> Shalom.
> 
> -----Original Message-----
> From: aaron morton <aa...@thelastpickle.com>
> Reply-to: user@cassandra.apache.org
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Database Modeling
> Date: Thu, 14 Apr 2011 20:54:43 +1200
> 
> WRT your query, it depends on how big a slice you want to get and how time critical it is. e.g. Could you be making queries that would return all 10M pairs? Or would the queries generally want to get some small fraction of the data set? Again, depends on how the sim runs. 
> 
> If your sim has stop-the-world pauses where you have a full view of the data space, then you could grab all the points at a certain distance and efficiently pack them up. Where "efficiently" means not using JSON. 
> 
> http://wiki.apache.org/cassandra/LargeDataSetConsiderations http://wiki.apache.org/cassandra/CassandraLimitations   Aaron 
> 
> On 13 Apr 2011, at 15:48, csharpplusproject wrote: 
>> Aaron,
>> 
>> Thank you so much for your help. It is greatly appreciated!
>> 
>> Looking at the design of the particle pairs:
>>> 
>>> - key: expriement_id.time_interval 
>>> - column name: pair_id 
>>> - column value: distance, angle, other data packed together as JSON or some other format
>> 
>> You wrote that retrieving millions of columns (I will have about 10,000,000 particle pairs) would be slow. You are also right that the retrieval of millions of columns into Python won't be fast.
>> 
>> If my desired query is to get "all particle pairs on time interval [ Tn..T(n+1) ] where the distance between the two particles is smaller than X and the angle between the two particles is greater than Y".
>> 
>> In such a query (as the above), given the fact that retrieving millions of columns could be slow, would it be best to say 'concatenate' all values for all particle pairs for a given 'expriement_id.time_interval' into one column?
>> 
>> If data is stored in this way, I will be getting from Cassandra a binary string / JSON Object that I will have to 'unpack' in my application. Is this a recommended approach? are there better approaches?
>> 
>> Is there a limit to the size that can be stored in one 'cell' (by 'cell' I mean the intersection between a key and a data column)? is there a limit to the size of data of one key?  one data column?
>> 
>> Thanks in advance for any help / guidance.
>> 
>> -----Original Message-----
>> From: aaron morton <aa...@thelastpickle.com>
>> Reply-to: user@cassandra.apache.org
>> To: user@cassandra.apache.org
>> Subject: Re: Cassandra Database Modeling
>> Date: Wed, 13 Apr 2011 10:14:21 +1200
>> 
>> Yes, interactive == real-time queries. Hadoop-based techniques are for non-time-critical queries, but they do have greater analytical capabilities.  
>> 
>> particle_pairs: 1) Yes and no and sort of. Under the hood the get_slice api call will be used by your client library to pull back chunks of (ordered) columns. Most client libraries abstract away the chunking for you.  
>> 
>> 2) If you are using a packed structure like JSON then no, Cassandra will have no idea what you've put in the columns other than bytes. It really depends on how much data you have per pair, but generally it's easier to pull back more data than to try to get exactly what you need. Downside is you have to update all the data.  
>> 
>> 3) No, you would need to update all the data for the pair. I was assuming most of the data was written once, and that your simulation had something like a stop-the-world phase between time slices where state was dumped and then read to start the next interval. You could either read it first, or we can come up with something else. 
>> 
>> distance_cf 1) the query would return a list of columns, which have a name and value (as well as a timestamp and ttl). 2) depends on the client library; if using python go for https://github.com/pycassa/pycassa It will return objects. 3) returning millions of columns is going to be slow; it would also be slow using a RDBMS. Creating millions of objects in python is going to be slow. You would need to have a better idea of what queries you will actually want to run to see if it's *too* slow. If it is, one approach is to store the particles at the same distance in the same column, so you need to read fewer columns. Again depends on how your sim works. Time complexity depends on the number of columns read. Finding a row will not be O(1) as it may have to read from several files. Writes are more constant than reads. But remember, you can have a lot of io and cpu power in your cluster. 
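>> 
>> A minimal sketch of streaming one big row in chunks with pycassa (the
>> keyspace and CF names are illustrative, matching the design quoted below):
>> 
>>     import pycassa
>> 
>>     pool = pycassa.ConnectionPool('experiments', ['localhost:9160'])
>>     distance_cf = pycassa.ColumnFamily(pool, 'distance')
>> 
>>     # xget() pages through the row with repeated get_slice calls,
>>     # yielding (column_name, column_value) pairs one at a time
>>     for name, value in distance_cf.xget('2.5', buffer_size=10000):
>>         pass  # process each zero_padded_distance.pair_id column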
>> 
>> Best advice is to jump in and see if the data model works for you at a small single-node scale; most performance issues can be solved.  
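>> 
>> For a quick single-node trial, something along these lines (pycassa's
>> SystemManager; the keyspace and CF names are illustrative):
>> 
>>     from pycassa.system_manager import (SystemManager, SIMPLE_STRATEGY,
>>                                         BYTES_TYPE)
>> 
>>     sys_mgr = SystemManager('localhost:9160')
>>     sys_mgr.create_keyspace('experiments', SIMPLE_STRATEGY,
>>                             {'replication_factor': '1'})
>>     # BytesType comparator so the zero-padded column names sort byte-wise
>>     for cf_name in ('pair', 'distance', 'angle'):
>>         sys_mgr.create_column_family('experiments', cf_name,
>>                                      comparator_type=BYTES_TYPE)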
>> 
>> Aaron 
>> On 12 Apr 2011, at 15:34, csharpplusproject wrote: 
>>> Hi Aaron,
>>> 
>>> Yes, of course it helps, I am starting to get a flavor of Cassandra -- thank you very much!
>>> 
>>> First of all, by 'interactive' queries, are you referring to 'real-time' queries? (meaning, where experiment data is 'streaming', data needs to be stored, and following that, the query needs to be run in real time)?
>>> 
>>> Looking at the design of the particle pairs:
>>> 
>>> - key: expriement_id.time_interval 
>>> - column name: pair_id 
>>> - column value: distance, angle, other data packed together as JSON or some other format
>>> 
>>> A couple of questions:
>>> 
>>> (1) Will a query such as pairID[ expriement_id.time_interval ] basically return an array of all pairIDs for the experiment, where each item is a 'packed' JSON?
>>> (2) Would it be possible, rather than returning the whole JSON object per every pairID, to get (say) only the distance?
>>> (3) Would it be possible to easily update certain 'pairIDs' with new values (for example, update pairIDs = {2389, 93434} with new distance values)? 
>>> 
>>> Looking at the design of the distance CF (for example):
>>> 
>>> this is VERY INTERESTING. basically you are suggesting a design that will save the actual distance between each pair of particles, and will allow queries where we can find all pairIDs (for an experiment, on time_interval) that meet a certain distance criteria. VERY, VERY INTERESTING!
>>> 
>>> A couple of questions:
>>> 
>>> (1) Will a query such as distanceCF[ expriement_id.time_interval ] basically return an array of all 'zero_padded_distance.pair_id' elements for the experiment?
>>> (2) In such a case, I will get (presumably) a python list where every item is a string (and I will need to process it)?
>>> (3) Given the fact that we're doing a slice on millions of columns (?), any idea how fast such an operation would be?
>>> 
>>> 
>>> Just to make sure I understand, is it true that in both situations, the query complexity is basically O(1) since it's simply a HASH?
>>> 
>>> 
>>> Thank you for all of your help!
>>> 
>>> Shalom.
>>> 
>>> -----Original Message-----
>>> From: aaron morton <aa...@thelastpickle.com>
>>> Reply-to: user@cassandra.apache.org
>>> To: user@cassandra.apache.org
>>> Subject: Re: Cassandra Database Modeling
>>> Date: Tue, 12 Apr 2011 10:43:42 +1200
>>> 
>>> The tricky part here is the level of flexibility you want for the querying. In general you will want to denormalise to support the read queries.   
>>> 
>>> If your queries are not interactive you may be able to use Hadoop / Pig / Hive e.g. http://www.datastax.com/products/brisk In which case you can probably have a simpler data model where you spend less effort supporting the queries. But it sounds like you need interactive queries as part of the experiment. 
>>> 
>>> You could store the data per pair in a standard CF (let's call it the pair cf) as follows: 
>>> 
>>> - key: expriement_id.time_interval 
>>> - column name: pair_id 
>>> - column value: distance, angle, other data packed together as JSON or some other format 
>>> 
>>> This would support a basic record of what happened; for each time interval you can get the list of all pairs and read their data.  
>>> 
>>> To support your spatial queries you could use two standard CFs as follows: 
>>> 
>>> distance CF: 
>>> - key: experiment_id.time_interval 
>>> - column name: zero_padded_distance.pair_id 
>>> - column value: empty or the angle  
>>> 
>>> angle CF: 
>>> - key: experiment_id.time_interval 
>>> - column name: zero_padded_angle.pair_id 
>>> - column value: empty or the distance 
>>> 
>>> (two pairs can have the same distance and/or angle in same time slice) 
>>> 
>>> Here we are using the column name as a compound value, and I am assuming the values can be byte ordered. So for distance the column name looks something like 000500.123456789. You would then use the Byte comparator (or similar) for the columns.   
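>>> 
>>> A minimal write-path sketch (pycassa; the padding widths and the packed
>>> pair value are assumptions, not a fixed format):
>>> 
>>>     import pycassa
>>>     from pycassa.batch import Mutator
>>> 
>>>     pool = pycassa.ConnectionPool('experiments', ['localhost:9160'])
>>>     pair_cf = pycassa.ColumnFamily(pool, 'pair')
>>>     distance_cf = pycassa.ColumnFamily(pool, 'distance')
>>>     angle_cf = pycassa.ColumnFamily(pool, 'angle')
>>> 
>>>     key = '2.5'                                # experiment 2, interval t5
>>>     pair_id, dist, angle = 123456789, 500, 42
>>>     dist_col = '%06d.%09d' % (dist, pair_id)   # -> 000500.123456789
>>>     angle_col = '%03d.%09d' % (angle, pair_id) # -> 042.123456789
>>> 
>>>     b = Mutator(pool)                          # one batch across all 3 CFs
>>>     b.insert(pair_cf, key, {'%09d' % pair_id: 'packed distance/angle/...'})
>>>     b.insert(distance_cf, key, {dist_col: str(angle)})
>>>     b.insert(angle_cf, key, {angle_col: str(dist)})
>>>     b.send()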
>>> 
>>> To find all of the particles for experiment 2 at t5 where distance is < 100 you would use a get_slice (see http://wiki.apache.org/cassandra/API or your higher level client docs) against the key "2.5" with a SliceRange start at "000000.000000000" and finish at "000100.999999999". Once you have this list of columns you can either filter client side for the angle or issue another query for the particles inside the angle range. Then join the two results client side using the pair_id returned in the column names.  
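>>> 
>>> In pycassa terms that slice looks roughly like this (reusing the handles
>>> from the write sketch above; the angle threshold Y and the column_count
>>> are illustrative):
>>> 
>>>     Y = 45
>>>     cols = distance_cf.get('2.5',
>>>                            column_start='000000.000000000',
>>>                            column_finish='000100.999999999',
>>>                            column_count=100000)  # page with xget() if huge
>>>     # the column value holds the angle; filter client side, keep pair_ids
>>>     pair_ids = [name.split('.')[1] for name, value in cols.items()
>>>                 if int(value) > Y]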
>>> 
>>> By using the same key for all 3 CFs, all the data for a time slice will be stored on the same nodes. You can potentially spread this around by using slightly different keys so they hash to different areas of the cluster, e.g. expriement_id.time_interval."distance" 
>>> 
>>> Data volume is not a concern, and it's not possible to talk about performance until you have an idea of the workload and required throughput. But writes are fast, and I think your reads would be fast as well, as the row data for distance and angle will not change, so caches will be useful.    
>>> 
>>> Hope that helps.  Aaron 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 


Re: Cassandra Database Modeling

Posted by csharpplusproject <cs...@gmail.com>.
Aaron,

Thank you so much.

So, the way things appear, it is definitely possible that I could be
making queries that would return all 10M particle pairs (at least, I
should plan for it). What would be the best design in such a case?
I read somewhere that the recommended maximum size of a row (meaning,
including all columns) should be around 10 MB, and that it is better not
to exceed that. Is that correct?

As per packing data "efficiently", what would be the best way? Would
packing the data using, say (in Python terms), struct.pack( ... ) be at
all helpful?

Thanks,
Shalom.

-----Original Message-----
From: aaron morton <aa...@thelastpickle.com>
Reply-to: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Cassandra Database Modeling
Date: Thu, 14 Apr 2011 20:54:43 +1200

WRT your query, it depends on how big a slice you want to get and how
time critical it is. e.g. Could you be making queries that would return
all 10M pairs? Or would the queries generally want to get some small
fraction of the data set? Again, depends on how the sim runs.


If your sim has stop-the-world pauses where you have a full view of the
data space, then you could grab all the points at a certain distance and
efficiently pack them up. Where "efficiently" means not using JSON.


http://wiki.apache.org/cassandra/LargeDataSetConsiderations
http://wiki.apache.org/cassandra/CassandraLimitations
 
Aaron






Re: Cassandra Database Modeling

Posted by aaron morton <aa...@thelastpickle.com>.
WRT your query, it depends on how big a slice you want to get and how time critical it is. e.g. Could you be making queries that would return all 10M pairs? Or would the queries generally want to get some small fraction of the data set? Again, depends on how the sim runs.

If your sim has stop-the-world pauses where you have a full view of the data space, then you could grab all the points at a certain distance and efficiently pack them up. Where "efficiently" means not using JSON.
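
For instance, every pair id observed at one distance could be packed into
a single column value (a sketch; fixed 4-byte ids are an assumption):

    import struct

    pair_ids = [2389, 93434, 123456789]      # pairs at, say, distance 500
    blob = struct.pack('>%dI' % len(pair_ids), *pair_ids)   # 4 bytes per id
    ids = struct.unpack('>%dI' % (len(blob) // 4), blob)    # and back again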

http://wiki.apache.org/cassandra/LargeDataSetConsiderations
http://wiki.apache.org/cassandra/CassandraLimitations
 
Aaron

On 13 Apr 2011, at 15:48, csharpplusproject wrote:

> Aaron,
> 
> Thank you so much for your help. It is greatly appreciated!
> 
> Looking at the design of the particle pairs:
>> 
>> - key: expriement_id.time_interval 
>> - column name: pair_id 
>> - column value: distance, angle, other data packed together as JSON or some other format
> 
> You wrote that retrieving millions of columns (I will have about 10,000,000 particle pairs) would be slow. You are also right that the retrieval of millions of columns into Python won't be fast.
> 
> If my desired query is to get "all particle pairs on time interval [ Tn..T(n+1) ] where the distance between the two particles is smaller than X and the angle between the two particles is greater than Y".
> 
> In such a query (as the above), given the fact that retrieving millions of columns could be slow, would it be best to say 'concatenate' all values for all particle pairs for a given 'expriement_id.time_interval' into one column?
> 
> If data is stored in this way, I will be getting from Cassandra a binary string / JSON Object that I will have to 'unpack' in my application. Is this a recommended approach? are there better approaches?
> 
> Is there a limit to the size that can be stored in one 'cell' (by 'cell' I mean the intersection between a key and a data column)? is there a limit to the size of data of one key?  one data column?
> 
> Thanks in advance for any help / guidance.
> 


Re: Cassandra Database Modeling

Posted by csharpplusproject <cs...@gmail.com>.
Steven,

Thank you. 

You wrote: "The data of cassandra are partitioned by the row key;
therefore, if you want to put all pairs into the same row, you should
consider the disk size"

Can you please explain why the disk size is / might be a problem?

Thanks,
Shalom.

-----Original Message-----
From: Steven Yen-Liang Su <xp...@gmail.com>
Reply-to: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Cassandra Database Modeling
Date: Wed, 13 Apr 2011 12:16:00 +0800

        Is there a limit to the size that can be stored in one
        'cell' (by 'cell' I mean the intersection between a key and a
        data column)? is there a limit to the size of data of one key?
        one data column?
        


http://wiki.apache.org/cassandra/CassandraLimitations


The data of cassandra are partitioned by the row key; therefore, if you
want to put all pairs into the same row, you should consider the disk
size.
 
        



Re: Cassandra Database Modeling

Posted by Steven Yen-Liang Su <xp...@gmail.com>.
>
> Is there a limit to the size that can be stored in one 'cell' (by 'cell' I
> mean the intersection between a *key* and a *data column*)? is there a
> limit to the size of data of one *key*?  one *data column*?
>

http://wiki.apache.org/cassandra/CassandraLimitations

The data of Cassandra is partitioned by the row key; therefore, if you
want to put all pairs into the same row, you should consider the disk
size.
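
As a rough back-of-the-envelope (all sizes here are assumptions; Cassandra
also adds some per-column storage overhead):

    pairs = 10000000          # columns in one row (one interval)
    name_bytes = 16           # e.g. '000500.123456789'
    value_bytes = 20          # packed distance/angle
    overhead_bytes = 20       # assumed per-column overhead
    row_gb = pairs * (name_bytes + value_bytes + overhead_bytes) / 1e9
    # ~0.56 GB for a single row -- times thousands of intervals/experiments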


>
> Thanks in advance for any help / guidance.
>
> -----Original Message-----
> *From*: aaron morton <aaron@thelastpickle.com<aaron%20morton%20%3caaron@thelastpickle.com%3e>
> >
> *Reply-to*: user@cassandra.apache.org
> *To*: user@cassandra.apache.org
> *Subject*: Re: Cassandra Database Modeling
> *Date*: Wed, 13 Apr 2011 10:14:21 +1200
>
> Yes for  interactive == real time queries.  Hadoop based techniques are non
> time critical queries, but they do have greater analytical capabilities.
>
> particle_pairs: 1) Yes and no and sort of. Under the hood the get_slice api
> call will be used by your client library to pull back chunks of (ordered)
> columns. Most client libraries abstract away the chunking for you.
>
> 2) If you are using a packed structure like JSON then no, Cassandra will
> have no idea what you've put in the columns other than bytes . It really
> depends on how much data you have per pair, but generally it's easier to
> pull back more data than try to get exactly what you need. Downside is you
> have to update all the data.
>
> 3) No, you would need to update all the data for the pair. I was assuming
> most of the data was written once, and that your simulation had something
> like a stop-the-world phase between time slices where state was dumped and
> then read to start the next interval. You could either read it first, or we
> can come up with something else.
>
> distance_cf 1) the query would return an list of columns, which have a name
> and value (as well as a timestamp and ttl). 2) depends on the client
> library, if using python go for https://github.com/pycassa/pycassa It will
> return objects  3) returning millions of columns is going to be slow, would
> also be slow using a RDBMS. Creating millions objects in python is going to
> be slow. You would need to have a better idea of what queries you will
> actually want to run to see if it's *too* slow. If it is one approach is to
> store the particles at the same distance in the same column, so you need to
> read less columns. Again depends on how your sim works.     Time complexity
> depends on the number of columns read. Finding a row will not be O(1) as it
> it may have to read from several files. Writes are more constant than reads.
> But remember, you can have a lot of io and cpu power in your cluster.
>
> Best advice is to jump in and see if the data model works for you at a
> small single node scale, most performance issues can be solved.
>
> Aaron

Re: Cassandra Database Modeling

Posted by csharpplusproject <cs...@gmail.com>.
Aaron,

Thank you so much for your help. It is greatly appreciated!

Looking at the design of the particle pairs:

> 
> - key: experiment_id.time_interval 
> - column name: pair_id 
> - column value: distance, angle, other data packed together as JSON or
> some other format


You wrote that retrieving millions of columns (I will have about
10,000,000 particle pairs) would be slow. You are also right that the
retrieval of millions of columns into Python won't be fast.

If my desired query is to get "all particle pairs on time interval
[ Tn..T(n+1) ] where the distance between the two particles is smaller
than X and the angle between the two particles is greater than Y", then,
given that retrieving millions of columns could be slow, would it be
best to, say, concatenate all values for all particle pairs for a given
'experiment_id.time_interval' into one column?

If data is stored in this way, I will be getting from Cassandra a binary
string / JSON object that I will have to 'unpack' in my application. Is
this a recommended approach? Are there better approaches?
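
For illustration, here is a minimal sketch of the kind of packing and
unpacking I have in mind (the fixed-width record layout of pair_id,
distance and angle, and the zlib compression step, are just assumptions
for the example):

import struct
import zlib

# Hypothetical fixed-width record: pair_id (uint32), distance (float64),
# angle (float64).
RECORD = struct.Struct('<Idd')

def pack_pairs(pairs):
    """Pack (pair_id, distance, angle) tuples into one compressed blob."""
    raw = b''.join(RECORD.pack(pid, dist, ang) for pid, dist, ang in pairs)
    return zlib.compress(raw)

def unpack_pairs(blob):
    """Inverse of pack_pairs: yield (pair_id, distance, angle) tuples."""
    raw = zlib.decompress(blob)
    for offset in range(0, len(raw), RECORD.size):
        yield RECORD.unpack_from(raw, offset)

# One column value holding three pairs:
blob = pack_pairs([(2389, 12.5, 0.31), (93434, 99.0, 1.57), (7, 3.25, 2.0)])
assert list(unpack_pairs(blob))[0][0] == 2389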

Is there a limit to the size that can be stored in one 'cell' (by 'cell'
I mean the intersection between a key and a data column)? Is there a
limit to the size of the data for one key? For one data column?

Thanks in advance for any help / guidance.


Re: Cassandra Database Modeling

Posted by aaron morton <aa...@thelastpickle.com>.
Yes: interactive == real-time queries. Hadoop-based techniques are for non-time-critical queries, but they do have greater analytical capabilities. 

particle_pairs:
1) Yes, no, and sort of. Under the hood, the get_slice API call will be used by your client library to pull back chunks of (ordered) columns. Most client libraries abstract the chunking away for you (there is a short sketch of this below). 

2) If you are using a packed structure like JSON then no, Cassandra will have no idea what you've put in the columns other than bytes. It really depends on how much data you have per pair, but generally it's easier to pull back more data than to get exactly what you need. The downside is you have to update all the data. 

3) No, you would need to update all the data for the pair. I was assuming most of the data was written once, and that your simulation had something like a stop-the-world phase between time slices where state was dumped and then read to start the next interval. You could either read it first, or we can come up with something else.
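
As a sketch of the read-it-first option with pycassa (the keyspace and CF
names, and the JSON payload layout, are assumptions carried over from the
pair CF design):

import json
import pycassa

pool = pycassa.ConnectionPool('Experiments', ['localhost:9160'])
pairs_cf = pycassa.ColumnFamily(pool, 'ParticlePairs')

def update_distances(key, new_distances):
    """Read-modify-write the packed values for a few pairs.

    new_distances: dict mapping pair_id (int) to the new distance.
    """
    cols = pairs_cf.get(key, columns=[str(p) for p in new_distances])
    updates = {}
    for name, raw in cols.items():
        doc = json.loads(raw)              # unpack the stored JSON
        doc['distance'] = new_distances[int(name)]
        updates[name] = json.dumps(doc)    # repack with the new value
    pairs_cf.insert(key, updates)          # one insert overwrites the columns

# e.g. give pairs 2389 and 93434 new distances in experiment 2, interval t5:
update_distances('2.5', {2389: 42.0, 93434: 17.5})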

distance_cf
1) the query would return a list of columns, which have a name and value (as well as a timestamp and TTL).
2) depends on the client library; if using python go for https://github.com/pycassa/pycassa It will return objects 
3) returning millions of columns is going to be slow; it would also be slow using an RDBMS. Creating millions of objects in python is going to be slow. You would need to have a better idea of what queries you will actually want to run to see if it's *too* slow. If it is, one approach is to store the particles at the same distance in the same column, so you need to read fewer columns. Again depends on how your sim works. 

Time complexity depends on the number of columns read. Finding a row will not be O(1) as it may have to read from several files. Writes are more constant than reads. But remember, you can have a lot of IO and CPU power in your cluster.
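
To illustrate the chunking mentioned under particle_pairs 1), a minimal
pycassa sketch of pulling a big row back in pieces (the keyspace and CF
names are assumptions; pycassa's get() maps onto get_slice, and depending
on your pycassa version there may be a built-in iterator that does this
for you):

import pycassa

pool = pycassa.ConnectionPool('Experiments', ['localhost:9160'])
pairs_cf = pycassa.ColumnFamily(pool, 'ParticlePairs')

def iter_row(cf, key, chunk=1000):
    """Yield (column_name, value) for a whole row, chunk columns at a time."""
    last = None
    while True:
        cols = cf.get(key, column_start=last or '', column_count=chunk)
        names = list(cols.keys())
        for name in names:
            if name != last:          # skip the overlap with the last chunk
                yield name, cols[name]
        if len(names) < chunk:
            break                     # final, partially filled chunk
        last = names[-1]

# e.g. walk every pair in experiment 2, interval t5:
for pair_id, packed in iter_row(pairs_cf, '2.5'):
    pass  # unpack and process here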

Best advice is to jump in and see if the data model works for you at a small, single-node scale; most performance issues can be solved. 

Aaron



Re: Cassandra Database Modeling

Posted by csharpplusproject <cs...@gmail.com>.
Hi Aaron,

Yes, of course it helps, I am starting to get a flavor of Cassandra --
thank you very much!

First of all, by 'interactive' queries, are you referring to 'real-time'
queries? (meaning, where experiment data is 'streaming', needs to be
stored, and the query then needs to run in real time)?

Looking at the design of the particle pairs:

- key: experiment_id.time_interval 
- column name: pair_id 
- column value: distance, angle, other data packed together as JSON or
some other format

A couple of questions:

(1) Will a query such as pairID[ experiment_id.time_interval ] basically
return an array of all pairIDs for the experiment, where each item is a
'packed' JSON object?
(2) Would it be possible, rather than returning the whole JSON object
for every pairID, to get (say) only the distance?
(3) Would it be possible to easily update certain 'pairIDs' with new
values (for example, update pairIDs = {2389, 93434} with new distance
values)? 

Looking at the design of the distance CF (for example):

This is VERY INTERESTING. Basically you are suggesting a design that
will save the actual distance between each pair of particles, and will
allow queries where we can find all pairIDs (for an experiment, on a
time_interval) that meet a certain distance criterion. VERY, VERY
INTERESTING!

A couple of questions:

(1) Will a query such as distanceCF[ experiment_id.time_interval ]
basically return an array of all 'zero_padded_distance.pair_id' elements
for the experiment?
(2) In such a case, I will get (presumably) a python list where every
item is a string (and I will need to process it)?
(3) Given the fact that we're doing a slice on millions of columns (?),
any idea how fast such an operation would be?


Just to make sure I understand, is it true that in both situations, the
query complexity is basically O(1), since it's simply a hash lookup?


Thank you for all of your help!

Shalom.



Re: Cassandra Database Modeling

Posted by aaron morton <aa...@thelastpickle.com>.
The tricky part here is the level of flexibility you want for the querying. In general you will want to denormalise to support the read queries.  

If your queries are not interactive you may be able to use Hadoop / Pig / Hive, e.g. http://www.datastax.com/products/brisk In that case you can probably have a simpler data model where you spend less effort supporting the queries. But it sounds like you need interactive queries as part of the experiment.

You could store the data per pair in a standard CF (let's call it the pair CF) as follows:

- key: experiment_id.time_interval
- column name: pair_id
- column value: distance, angle, other data packed together as JSON or some other format

This would support a basic record of what happened: for each time interval you can get the list of all pairs and read their data. 
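
As a sketch of what the writes might look like with pycassa (the keyspace
and CF names, and the JSON payload layout, are assumptions):

import json
import pycassa

pool = pycassa.ConnectionPool('Experiments', ['localhost:9160'])
pairs_cf = pycassa.ColumnFamily(pool, 'ParticlePairs')

def save_interval(experiment_id, time_interval, measurements):
    """measurements: iterable of (pair_id, distance, angle) for one slice."""
    key = '%s.%s' % (experiment_id, time_interval)
    batch = pairs_cf.batch(queue_size=500)   # buffer mutations client side
    for pair_id, distance, angle in measurements:
        payload = json.dumps({'distance': distance, 'angle': angle})
        batch.insert(key, {str(pair_id): payload})
    batch.send()                             # flush anything still queued

# e.g. experiment 2, interval t5:
save_interval(2, 5, [(2389, 12.5, 0.31), (93434, 99.0, 1.57)])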

To support your spatial queries you could use two standard CFs as follows:

distance CF:
- key: experiment_id.time_interval
- column name: zero_padded_distance.pair_id
- column value: empty or the angle 

angle CF:
- key: experiment_id.time_interval
- column name: zero_padded_angle.pair_id
- column value: empty or the distance

(two pairs can have the same distance and/or angle in same time slice)

Here we are using the column name as a compound value, and I am assuming they can be byte ordered. So for distance the column name looks something like 000500.123456789. You would then use the Byte comparator (or similar) for the columns.  

To find all of the particles for experiment 2 at t5 where distance is < 100 you would use a get_slice (see http://wiki.apache.org/cassandra/API or your higher level client docs) against the key "2.5" with a SliceRange start at "000000.000000000" and finish at "000100.999999999". Once you have this list of columns you can either filter client side for the angle or issue another query for the particles inside the angle range. Then join the two results client side using the pair_id returned in the column names. 
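
A minimal pycassa sketch of that query (the keyspace and CF names are
assumptions, pycassa's get() maps onto get_slice, and in practice you
would page through the result in chunks rather than use one huge
column_count):

import pycassa

pool = pycassa.ConnectionPool('Experiments', ['localhost:9160'])
distance_cf = pycassa.ColumnFamily(pool, 'DistanceCF')

# All pairs in experiment 2 at t5 with distance < 100: slice the row "2.5"
# between the zero-padded bounds (both bounds are inclusive).
cols = distance_cf.get('2.5',
                       column_start='000000.000000000',
                       column_finish='000100.999999999',
                       column_count=1000000)

# Column names look like 'zero_padded_distance.pair_id'; peel off the tail.
close_pairs = set(name.rsplit('.', 1)[-1] for name in cols.keys())

# Repeat against the angle CF for the angle predicate, then intersect the
# two pair_id sets client side to finish the query.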

By using the same key for all 3 CFs, all the data for a time slice will be stored on the same nodes. You can potentially spread this around by using slightly different keys so they hash to different areas of the cluster, e.g. experiment_id.time_interval."distance"

Data volume is not a concern, and it's not possible to talk about performance until you have an idea of the workload and required throughput. But writes are fast, and I think your reads would be fast as well, since the row data for distance and angle will not change, so caches will be useful. 

Hope that helps. 
Aaron
