Posted to user@cassandra.apache.org by Даниел Симеонов <ds...@gmail.com> on 2010/05/04 15:06:36 UTC

Re: Best way to store millisecond-accurate data

Hi Miguel,
  I'd like to ask whether it is possible to have runtime sharding of rows in
Cassandra, i.e. if a row has too many new columns inserted, then create
another row (let's say the original time sharding is one day per row;
then we would have two rows for that day). Maybe batch processes could do
that.
Best regards, Daniel.

2010/4/24 Miguel Verde <mi...@gmail.com>

> TimeUUID's time component is measured in 100-nanosecond intervals. The
> library you use might calculate it with poorer accuracy or precision, but
> from a storage/comparison standpoint in Cassandra millisecond data is easily
> captured by it.
>
> One typical way of dealing with the data explosion of sampled time series
> data is to bucket/shard rows (e.g. Bob-20100423-bloodpressure) so that you
> put an upper bound on the row length.
>
>
> On Apr 23, 2010, at 7:01 PM, Andrew Nguyen <
> andrew-lists-cassandra@ucsfcti.org> wrote:
>
>  Hello,
>>
>> I am looking to store patient physiologic data in Cassandra - it's being
>> collected at rates of 1 to 125 Hz.  I'm thinking of storing the timestamps
>> as the column names and the patient/parameter combo as the row key.  For
>> example, Bob is in the ICU and is currently having his blood pressure,
>> intracranial pressure, and heart rate monitored.  I'd like to collect this
>> with the following row keys:
>>
>> Bob-bloodpressure
>> Bob-intracranialpressure
>> Bob-heartrate
>>
>> The column names would be timestamps but that's where my questions start:
>>
>> I'm not sure what the best data type and CompareWith would be.  From my
>> searching, it sounds like the TimeUUID may be suitable but isn't really
>> designed for millisecond accuracy.  My other thought is just to store them
>> as strings (2010-04-23 10:23:45.016).  While space isn't the foremost
>> concern, we will be collecting this data 24/7 so we'll be creating many
>> columns over the long-term.
>>
>> I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states
>> that the entire row must fit in memory.  Does this include the values as
>> well as the column names?
>>
>> In considering the limits of cassandra and the best way to model this, we
>> would be adding 3.9 billion rows per year (assuming 125 Hz @ 24/7).
>>  However, I can't really think of a better way to model this...  So, am I
>> thinking about this all wrong or am I on the right track?
>>
>> Thanks,
>> Andrew
>>
>
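
The layout discussed above can be sketched in Python as follows. This is a minimal illustration, assuming a day-bucketed row key of the form patient-YYYYMMDD-parameter and a version 1 (time-based) UUID as the column name. The helper names (row_key, timeuuid_from_datetime, timeuuid_to_datetime) are hypothetical and not part of any Cassandra client, and the fixed clock sequence and node values are placeholders that a real UUID library would fill in itself.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100-nanosecond intervals from this epoch.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def row_key(patient, parameter, when):
    """Day-bucketed row key, e.g. Bob-20100423-bloodpressure."""
    return f"{patient}-{when.strftime('%Y%m%d')}-{parameter}"


def timeuuid_from_datetime(when, clock_seq=0, node=0):
    """Build a version 1 UUID whose time field encodes `when`.

    clock_seq and node are placeholder values for this sketch.
    """
    delta = when - GREGORIAN_EPOCH
    ticks = (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000      # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def timeuuid_to_datetime(u):
    """Recover the timestamp; u.time is in 100-ns units, so // 10 gives microseconds."""
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


SAMPLES_PER_DAY = 125 * 60 * 60 * 24      # columns in one day-bucketed row at 125 Hz
SAMPLES_PER_YEAR = SAMPLES_PER_DAY * 365  # the "3.9 billion" figure in the original post

sample_time = datetime(2010, 4, 23, 10, 23, 45, 16000, tzinfo=timezone.utc)
print(row_key("Bob", "bloodpressure", sample_time))
print(timeuuid_to_datetime(timeuuid_from_datetime(sample_time)))
print(SAMPLES_PER_DAY, SAMPLES_PER_YEAR)
```

The round trip keeps the millisecond part of the sample time intact, and a one-day bucket at 125 Hz holds 10,800,000 columns per parameter. With placeholder clock sequence and node values, two samples that share the same timestamp would produce the same UUID, so this sketch only suits one writer per row.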

Re: Cassandra 0.6.1 - Help Required to setup Multiple Nodes/Cluster

Posted by Mohammad Mamajiwala <ma...@yahoo.com>.
Thanks for the prompt reply.
As per your reply, my configuration should be as follows:
Node 1: Configuration
<Seeds>
    <Seed>43.193.211.215</Seed>
    <Seed>43.193.213.160</Seed>
</Seeds>

Node 2: Configuration
<Seeds>
    <Seed>43.193.211.215</Seed>
    <Seed>43.193.213.160</Seed>
</Seeds>

About replication - In my case it should be 2, as I have two cluster nodes. Am I right? In Cassandra, is there any way to store data on a specific cluster/node?
About partitioner - Can you please provide more information on this? I am trying to understand the internal data flow and configuration. If you have any documents, please share them with me.
Thank you very much.
Mohammad


--- On Tue, 4/5/10, Shinpei Ohtani <sh...@gmail.com> wrote:

From: Shinpei Ohtani <sh...@gmail.com>
Subject: Re: Cassandra 0.6.1 - Help Required to setup Multiple Nodes/Cluster
To: user@cassandra.apache.org
Date: Tuesday, 4 May, 2010, 7:40

> All other parameters are identical on both servers. I have added some data from both nodes,
> but I am confused about which node stores the data. Is it stored on both nodes,
> or only on the node from which it was added? I can retrieve data from both nodes,
> but sometimes I cannot. I am not sure what is going on internally. Could you please help me with this?

First of all, you should have the same seed settings on both Node 1 and Node 2.
There is no master node in Cassandra, so every node has to know about the other nodes.

As for which node stores the data, I think it depends on your
partitioner and replication factor settings.
If you have settings in storage-conf.xml like those below, your data is stored
on only one node,
because the replication factor is 1, and it is distributed by MD5
hash (RandomPartitioner).
----
   <ReplicationFactor>1</ReplicationFactor>
   <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
----

Hope this helps.
-----
Shinpei

Re: Cassandra 0.6.1 - Help Required to setup Multiple Nodes/Cluster

Posted by Shinpei Ohtani <sh...@gmail.com>.
> All other parameters are identical on both servers. I have added some data from both nodes,
> but I am confused about which node stores the data. Is it stored on both nodes,
> or only on the node from which it was added? I can retrieve data from both nodes,
> but sometimes I cannot. I am not sure what is going on internally. Could you please help me with this?

First of all, you should have the same seed settings on both Node 1 and Node 2.
There is no master node in Cassandra, so every node has to know about the other nodes.

As for which node stores the data, I think it depends on your
partitioner and replication factor settings.
If you have settings in storage-conf.xml like those below, your data is stored
on only one node,
because the replication factor is 1, and it is distributed by MD5
hash (RandomPartitioner).
----
   <ReplicationFactor>1</ReplicationFactor>
   <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
----

Hope this helps.
-----
Shinpei
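
The placement rule described above can be modelled roughly in Python. This is a simplified sketch, not Cassandra's implementation: it assumes each key's token is the MD5 hash taken as a non-negative integer, that the first replica is the node with the smallest token at or above the key's token (wrapping around the ring), and that further replicas are simply the next nodes along the ring. The node tokens in RING are hypothetical, evenly spaced values; real placement depends on the tokens actually assigned and on the replica placement strategy.

```python
import hashlib
from bisect import bisect_left


def token_for(key):
    """MD5 of the key as a non-negative integer (simplified model of RandomPartitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def replicas_for(key, ring, replication_factor):
    """Return the nodes that would store `key` under this simplified model.

    ring is a list of (token, node) pairs.
    """
    ring = sorted(ring)
    tokens = [token for token, _ in ring]
    # Index of the first node token >= the key token; wrap to 0 past the end.
    start = bisect_left(tokens, token_for(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical, evenly spaced tokens for the two nodes in this thread.
RING = [(0, "43.193.211.215"), (2**126, "43.193.213.160")]

for key in ("Bob-bloodpressure", "Bob-heartrate", "Bob-intracranialpressure"):
    print(key, replicas_for(key, RING, 1), replicas_for(key, RING, 2))
```

With a replication factor of 1 each key lands on exactly one of the two nodes, chosen by its hash rather than by the node the client wrote through. With a replication factor of 2 on a two-node ring, both nodes hold every key.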

On Tue, May 4, 2010 at 10:23 PM, Mohammad Mamajiwala
<ma...@yahoo.com> wrote:
>
> Hi,
> I am very new to Cassandra 0.6.1. I have set up two nodes on two different servers. I would like to know how data distribution and replication work.
> Node 1 IP: 43.193.211.215
> Node 2 IP: 43.193.213.160
> Node 1: Configuration
> <Seeds>
>     <Seed>43.193.211.215</Seed>
> </Seeds>
> Node 2: Configuration
> <Seeds>
>     <Seed>43.193.213.160</Seed>
>     <Seed>43.193.211.215</Seed>
> </Seeds>
> All other parameters are identical on both servers. I have added some data from both nodes, but I am confused about which node stores the data. Is it stored on both nodes, or only on the node from which it was added? I can retrieve data from both nodes, but sometimes I cannot. I am not sure what is going on internally. Could you please help me with this?
> Thank You
> Mohammad

Re: Best way to store millisecond-accurate data

Posted by Даниел Симеонов <ds...@gmail.com>.
Hi,
    "In practice, one would want to model their data such that the 'row has
too much columns' scenario is prevented."
   I am curious how to actually prevent this. If the data is sharded with
one-day granularity, nothing stops the client from inserting an enormous
number of new columns (very often it is not possible to foresee how much data
clients will insert). Some functionality is then needed to prevent too many
columns in a row (how many is too many depends on the data), so such runtime
sharding is necessary (to split the one-day granularity into two rows). I
still wonder whether this runtime sharding is possible in Cassandra.
Best regards, Daniel.
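
One way to picture the runtime sharding asked about here is a client-side sketch in Python. Everything in it is hypothetical: the OverflowSharder class, the "#n" key suffix, and the in-process column counter are illustrative assumptions, not Cassandra features. A real deployment would need a shared, durable record of how many shards each bucket has, since these counts disappear when the process exits.

```python
from collections import defaultdict


class OverflowSharder:
    """Roll over to a new row once a bucket holds too many columns (client side only)."""

    def __init__(self, max_columns_per_row):
        self.max_columns = max_columns_per_row
        self.counts = defaultdict(int)  # base row key -> columns written so far

    def key_for_next_column(self, base_key):
        """Return the row key the next column should be written to."""
        shard = self.counts[base_key] // self.max_columns
        self.counts[base_key] += 1
        return f"{base_key}#{shard}"

    def all_keys(self, base_key):
        """Row keys a reader must fetch to see the whole bucket."""
        written = self.counts[base_key]
        shards = max(1, -(-written // self.max_columns))  # ceiling division
        return [f"{base_key}#{i}" for i in range(shards)]


sharder = OverflowSharder(max_columns_per_row=3)
for _ in range(7):
    print(sharder.key_for_next_column("Bob-20100423-bloodpressure"))
print(sharder.all_keys("Bob-20100423-bloodpressure"))
```

The cost of this approach falls on readers, who must know or discover how many shards exist for a given day before they can read the full bucket.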

2010/5/4 Miguel Verde <mi...@gmail.com>

> One would use batch processes (e.g. through Hadoop) or client-side
> aggregation, yes. In theory it would be possible to introduce runtime
> sharding across rows into the Cassandra server side, but it's not part of
> its design.
>
> In practice, one would want to model their data such that the 'row has too
> much columns' scenario is prevented.

Re: Best way to store millisecond-accurate data

Posted by Miguel Verde <mi...@gmail.com>.
One would use batch processes (e.g. through Hadoop) or client-side  
aggregation, yes. In theory it would be possible to introduce runtime  
sharding across rows into the Cassandra server side, but it's not part  
of its design.

In practice, one would want to model their data such that the 'row has  
too much columns' scenario is prevented.
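
As a rough illustration of modelling the data up front, the bucket width can be chosen from the worst-case sample rate so that no row can exceed a column budget. The sketch below is in Python; the function names, the candidate bucket widths, and the budget of 1,000,000 columns per row are assumptions made for the example, not Cassandra limits.

```python
from datetime import datetime

# Candidate bucket widths in seconds and the strftime pattern naming each bucket.
BUCKETS = [
    (86400, "%Y%m%d"),      # one row per day
    (3600, "%Y%m%d%H"),     # one row per hour
    (60, "%Y%m%d%H%M"),     # one row per minute
]


def choose_bucket(max_rate_hz, max_columns_per_row):
    """Pick the widest bucket whose worst-case column count fits the budget."""
    for seconds, pattern in BUCKETS:
        if max_rate_hz * seconds <= max_columns_per_row:
            return seconds, pattern
    raise ValueError("even one-minute rows exceed the column budget")


def bucketed_key(patient, parameter, when, pattern):
    """Row key with the time bucket embedded, e.g. Bob-2010042310-bloodpressure."""
    return f"{patient}-{when.strftime(pattern)}-{parameter}"


seconds, pattern = choose_bucket(max_rate_hz=125, max_columns_per_row=1_000_000)
print(seconds, bucketed_key("Bob", "bloodpressure",
                            datetime(2010, 4, 23, 10, 23, 45), pattern))
```

At 125 Hz a one-day row would hold 10,800,000 columns, so under this hypothetical budget the sketch falls back to hourly rows of at most 450,000 columns each.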


Cassandra 0.6.1 - Help Required to setup Multiple Nodes/Cluster

Posted by Mohammad Mamajiwala <ma...@yahoo.com>.
Hi,
I am very new to Cassandra 0.6.1. I have set up two nodes on two different servers. I would like to know how data distribution and replication work.
Node 1 IP: 43.193.211.215
Node 2 IP: 43.193.213.160
Node 1: Configuration
<Seeds>
    <Seed>43.193.211.215</Seed>
</Seeds>
Node 2: Configuration
<Seeds>
    <Seed>43.193.213.160</Seed>
    <Seed>43.193.211.215</Seed>
</Seeds>
All other parameters are identical on both servers. I have added some data from both nodes, but I am confused about which node stores the data. Is it stored on both nodes, or only on the node from which it was added? I can retrieve data from both nodes, but sometimes I cannot. I am not sure what is going on internally. Could you please help me with this?
Thank you,
Mohammad