You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Bill Speirs <bi...@gmail.com> on 2011/01/26 21:56:46 UTC

Schema Design

I'm looking to use Cassandra to store log messages from various
systems. A log message only has a message (UTF8Type) and a data/time.
My thought is to create a column family for each system. The row key
will be a TimeUUIDType. Each row will have 7 columns: year, month,
day, hour, minute, second, and message. I then have indexes setup for
each of the date/time columns.

I was hoping this would allow me to answer queries like: "What are all
the log messages that were generated between X & Y?" The problem is
that I can ONLY use the equals operator on these column values. For
example, I cannot issuing: get system_x where month > 1; gives me this
error: "No indexed columns present in index clause with operator EQ."
The equals operator works as expected though: get system_x where month
= 1;

What schema would allow me to get date ranges?

Thanks in advance...

Bill-

* ColumnFamily description *
    ColumnFamily: system_x_msg
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period: 0.0/0
      Key cache size / save period: 200000.0/3600
      Memtable thresholds: 1.1671875/249/60
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
proj_1_msg.7365636f6e64, proj_1_msg.79656172]
      Column Metadata:
        Column Name: year (year)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: month (month)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: second (second)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: minute (minute)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: hour (hour)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS
        Column Name: day (day)
          Validation Class: org.apache.cassandra.db.marshal.IntegerType
          Index Type: KEYS

Re: Schema Design

Posted by buddhasystem <po...@bnl.gov>.

I used the term "sharding" a bit frivolously. Sorry. It's just splitting
semantically homogenious data among CFs doesn't scale too well, as each CF
is allocated a piece of memory on the server.
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964326.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Schema Design

Posted by Nick Santini <ni...@kaseya.com>.

One thing you can do is create one CF, then as the row key use the
application name + timestamp, with that you can do your range query using
OOP. then store whatever you want in the row

problem would be if one app generates far more logs than the others

Nicolas Santini


On Thu, Jan 27, 2011 at 10:26 AM, Bill Speirs <bi...@gmail.com> wrote:

> I have a basic understanding of OPP... if most of my messages come
> within a single hour then a few nodes could be storing all of my
> values, right?
>
> You totally lost me on, "whether to shard data as per system..." Is my
> schema (one column family per system, and row keys as TimeUUIDType)
> sharding by system? I thought -- probably incorrectly -- that the row
> keys are used in the sharding process, not column families.
>
> Thanks...
>
> Bill-
>
> On Wed, Jan 26, 2011 at 4:17 PM, buddhasystem <po...@bnl.gov> wrote:
> >
> > Having separate columns for Year, Month etc seems redundant. It's tons
> more
> > efficient to keep say UTC time in POSIX format (basically integer). It's
> > easy to convert back and forth.
> >
> > If you want to get a range of dates, in that case you might use Order
> > Preserving Partitioner, and sort out which systems logged later in
> client.
> > Read up on consequences of using OPP.
> >
> > Whether to shard data as per system depends on how many you have. If more
> > than a few, don't do that, there are memory considerations.
> >
> > Cheers
> >
> > Maxim
> >
> > --
> > View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964227.html
> > Sent from the cassandra-user@incubator.apache.org mailing list archive
> at Nabble.com.
> >
>

Re: Schema Design

Posted by Bill Speirs <bi...@gmail.com>.

I have a basic understanding of OPP... if most of my messages come
within a single hour then a few nodes could be storing all of my
values, right?

You totally lost me on, "whether to shard data as per system..." Is my
schema (one column family per system, and row keys as TimeUUIDType)
sharding by system? I thought -- probably incorrectly -- that the row
keys are used in the sharding process, not column families.

Thanks...

Bill-

On Wed, Jan 26, 2011 at 4:17 PM, buddhasystem <po...@bnl.gov> wrote:
>
> Having separate columns for Year, Month etc seems redundant. It's tons more
> efficient to keep say UTC time in POSIX format (basically integer). It's
> easy to convert back and forth.
>
> If you want to get a range of dates, in that case you might use Order
> Preserving Partitioner, and sort out which systems logged later in client.
> Read up on consequences of using OPP.
>
> Whether to shard data as per system depends on how many you have. If more
> than a few, don't do that, there are memory considerations.
>
> Cheers
>
> Maxim
>
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964227.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
>

Re: Schema Design

Posted by buddhasystem <po...@bnl.gov>.

Having separate columns for Year, Month etc seems redundant. It's tons more
efficient to keep say UTC time in POSIX format (basically integer). It's
easy to convert back and forth.

If you want to get a range of dates, in that case you might use Order
Preserving Partitioner, and sort out which systems logged later in client.
Read up on consequences of using OPP.

Whether to shard data as per system depends on how many you have. If more
than a few, don't do that, there are memory considerations.

Cheers

Maxim

-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964227.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Schema Design

Posted by David McNelis <dm...@agentisenergy.com>.

My cli knowledge sucks so far, so I'll leave that  to others....I'm doing
most of my reading/writing through a thrift client (hector/java based)

As for the implications, as of the latest version of Cassandra there is not
theoretical limit to the number of columns that a particular row can hold.
 Over time you've got a couple of different options, if you're concerned
that you end up with too many columns to manage then you'd probably want to
start thinking about a warehousing strategy long-term for your older records
that involves expiring columns that are older than X in your Cassandra
cluster.  But for the most part you shouldn't *need* to do that.



On Wed, Jan 26, 2011 at 3:23 PM, Bill Speirs <bi...@gmail.com> wrote:

> I like this approach, but I have 2 questions:
>
> 1) what is the implications of continually adding columns to a single
> row? I'm unsure how Cassandra is able to grow. I realize you can have
> a virtually infinite number of columns, but what are the implications
> of growing the number of columns over time?
>
> 2) maybe it's just a restriction of the CLI, but how do I do issue a
> slice request? Also, what if start (or end) columns don't exist? I'm
> guessing it's smart enough to get the columns in that range.
>
> Thanks!
>
> Bill-
>
> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
> <dm...@agentisenergy.com> wrote:
> > I would say in that case you might want  to try a  single column family
> > where the key to the column is the system name.
> > Then, you could name your columns as the timestamp.  Then when retrieving
> > information from the data store you can can, in your slice request,
> specify
> > your start column as  X and end  column as Y.
> > Then you can use the stored column name to know when an event  occurred.
> >
> > On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bi...@gmail.com>
> wrote:
> >>
> >> I'm looking to use Cassandra to store log messages from various
> >> systems. A log message only has a message (UTF8Type) and a data/time.
> >> My thought is to create a column family for each system. The row key
> >> will be a TimeUUIDType. Each row will have 7 columns: year, month,
> >> day, hour, minute, second, and message. I then have indexes setup for
> >> each of the date/time columns.
> >>
> >> I was hoping this would allow me to answer queries like: "What are all
> >> the log messages that were generated between X & Y?" The problem is
> >> that I can ONLY use the equals operator on these column values. For
> >> example, I cannot issuing: get system_x where month > 1; gives me this
> >> error: "No indexed columns present in index clause with operator EQ."
> >> The equals operator works as expected though: get system_x where month
> >> = 1;
> >>
> >> What schema would allow me to get date ranges?
> >>
> >> Thanks in advance...
> >>
> >> Bill-
> >>
> >> * ColumnFamily description *
> >>    ColumnFamily: system_x_msg
> >>      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
> >>      Row cache size / save period: 0.0/0
> >>      Key cache size / save period: 200000.0/3600
> >>      Memtable thresholds: 1.1671875/249/60
> >>      GC grace seconds: 864000
> >>      Compaction min/max thresholds: 4/32
> >>      Read repair chance: 1.0
> >>      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
> >> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
> >> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
> >>      Column Metadata:
> >>        Column Name: year (year)
> >>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
> >>          Index Type: KEYS
> >>        Column Name: month (month)
> >>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
> >>          Index Type: KEYS
> >>        Column Name: second (second)
> >>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
> >>          Index Type: KEYS
> >>        Column Name: minute (minute)
> >>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
> >>          Index Type: KEYS
> >>        Column Name: hour (hour)
> >>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
> >>          Index Type: KEYS
> >>        Column Name: day (day)
> >>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
> >>          Index Type: KEYS
> >
> >
> >
> > --
> > David McNelis
> > Lead Software Engineer
> > Agentis Energy
> > www.agentisenergy.com
> > o: 630.359.6395
> > c: 219.384.5143
> > A Smart Grid technology company focused on helping consumers of energy
> > control an often under-managed resource.
> >
> >
>



-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*

Re: Schema Design

Posted by Jonathan Ellis <jb...@gmail.com>.

fwiw, https://github.com/thobbs/logsandra is more recent.

2011/1/30 aaron morton <aa...@thelastpickle.com>:
> This project may be what you are looking for, or provide some inspiration https://github.com/jbohman/logsandra
>
> Cloud Kick has an example or rolling up time series data https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
>
> The schema below sounds reasonable. If you will always bring back the entire log record, consider using a Standard CF rather than a Super CF. Then pack the log message using your favourite serialisation, e.g. JSON.
>
> Hope that helps.
> Aaron
>
> On 27 Jan 2011, at 16:26, Wangpei (Peter) wrote:
>
>> I am also working on a system store logs from hundreds system.
>> In my scenario, most query will like this: "let's look at login logs (category EQ) of that proxy (host EQ) between this Monday and Wednesday(time range)."
>> My data model like this:
>> . only 1 CF. that's enough for this scenario.
>> . group logs from each host and day to one row. Key format is "hostname.category.date"
>> . store each log entry as a super column, super olumn name is TimeUUID of the log. each attribute as a column.
>>
>> Then this query can be done as 3 GET, no need to do key range scan.
>> Then I can use RP instead of OPP. If I use OPP, I have to worry about load balance myself. I hate that.
>> However, if I need to do a time range access, I can still use column slice.
>>
>> An additional benefit is, I can clean old logs very easily. We only store logs in 1 year. Just deleting by keys can do this job well.
>>
>> I think storing all logs for a host in a single row is not a good choice. 2 reason:
>> 1, too few keys, so your data will not distributing well.
>> 2, data under a key will always increase. So Cassandra have to do more SSTable compaction.
>>
>> -----邮件原件-----
>> 发件人: William R Speirs [mailto:bill.speirs@gmail.com]
>> 发送时间: 2011年1月27日 9:15
>> 收件人: user@cassandra.apache.org
>> 主题: Re: Schema Design
>>
>> It makes sense that the single row for a system (with a growing number of
>> columns) will reside on a single machine.
>>
>> With that in mind, here is my updated schema:
>>
>> - A single column family for all the messages. The row keys will be the TimeUUID
>> of the message with the following columns: date/time (in UTC POSIX), system
>> name/id (with an index for fast/easy gets), the actual message payload.
>>
>> - A column family for each system. The row keys will be UTC POSIX time with 1
>> second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of
>> any messages that were logged during that time bucket.
>>
>> My only hesitation with this design is that buddhasystem warned that each column
>> family, "is allocated a piece of memory on the server." I'm not sure what the
>> implications of this are and/or if this would be a problem if a I had a number
>> of systems on the order of hundreds.
>>
>> Thanks...
>>
>> Bill-
>>
>> On 01/26/2011 06:51 PM, Shu Zhang wrote:
>>> Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually.
>>>
>>> More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which you means you'll definitely not be able to distribute your load very well.
>>> ________________________________________
>>> From: Bill Speirs [bill.speirs@gmail.com]
>>> Sent: Wednesday, January 26, 2011 1:23 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Schema Design
>>>
>>> I like this approach, but I have 2 questions:
>>>
>>> 1) what is the implications of continually adding columns to a single
>>> row? I'm unsure how Cassandra is able to grow. I realize you can have
>>> a virtually infinite number of columns, but what are the implications
>>> of growing the number of columns over time?
>>>
>>> 2) maybe it's just a restriction of the CLI, but how do I do issue a
>>> slice request? Also, what if start (or end) columns don't exist? I'm
>>> guessing it's smart enough to get the columns in that range.
>>>
>>> Thanks!
>>>
>>> Bill-
>>>
>>> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
>>> <dm...@agentisenergy.com>  wrote:
>>>> I would say in that case you might want  to try a  single column family
>>>> where the key to the column is the system name.
>>>> Then, you could name your columns as the timestamp.  Then when retrieving
>>>> information from the data store you can can, in your slice request, specify
>>>> your start column as  X and end  column as Y.
>>>> Then you can use the stored column name to know when an event  occurred.
>>>>
>>>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs<bi...@gmail.com>  wrote:
>>>>>
>>>>> I'm looking to use Cassandra to store log messages from various
>>>>> systems. A log message only has a message (UTF8Type) and a data/time.
>>>>> My thought is to create a column family for each system. The row key
>>>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>>>> day, hour, minute, second, and message. I then have indexes setup for
>>>>> each of the date/time columns.
>>>>>
>>>>> I was hoping this would allow me to answer queries like: "What are all
>>>>> the log messages that were generated between X&  Y?" The problem is
>>>>> that I can ONLY use the equals operator on these column values. For
>>>>> example, I cannot issuing: get system_x where month>  1; gives me this
>>>>> error: "No indexed columns present in index clause with operator EQ."
>>>>> The equals operator works as expected though: get system_x where month
>>>>> = 1;
>>>>>
>>>>> What schema would allow me to get date ranges?
>>>>>
>>>>> Thanks in advance...
>>>>>
>>>>> Bill-
>>>>>
>>>>> * ColumnFamily description *
>>>>>    ColumnFamily: system_x_msg
>>>>>      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>>>      Row cache size / save period: 0.0/0
>>>>>      Key cache size / save period: 200000.0/3600
>>>>>      Memtable thresholds: 1.1671875/249/60
>>>>>      GC grace seconds: 864000
>>>>>      Compaction min/max thresholds: 4/32
>>>>>      Read repair chance: 1.0
>>>>>      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>>>> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>>>> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>>>      Column Metadata:
>>>>>        Column Name: year (year)
>>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>          Index Type: KEYS
>>>>>        Column Name: month (month)
>>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>          Index Type: KEYS
>>>>>        Column Name: second (second)
>>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>          Index Type: KEYS
>>>>>        Column Name: minute (minute)
>>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>          Index Type: KEYS
>>>>>        Column Name: hour (hour)
>>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>          Index Type: KEYS
>>>>>        Column Name: day (day)
>>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>>          Index Type: KEYS
>>>>
>>>>
>>>>
>>>> --
>>>> David McNelis
>>>> Lead Software Engineer
>>>> Agentis Energy
>>>> www.agentisenergy.com
>>>> o: 630.359.6395
>>>> c: 219.384.5143
>>>> A Smart Grid technology company focused on helping consumers of energy
>>>> control an often under-managed resource.
>>>>
>>>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Schema Design

Posted by aaron morton <aa...@thelastpickle.com>.

This project may be what you are looking for, or provide some inspiration https://github.com/jbohman/logsandra

Cloud Kick has an example or rolling up time series data https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

The schema below sounds reasonable. If you will always bring back the entire log record, consider using a Standard CF rather than a Super CF. Then pack the log message using your favourite serialisation, e.g. JSON. 

Hope that helps.
Aaron

On 27 Jan 2011, at 16:26, Wangpei (Peter) wrote:

> I am also working on a system store logs from hundreds system.
> In my scenario, most query will like this: "let's look at login logs (category EQ) of that proxy (host EQ) between this Monday and Wednesday(time range)."
> My data model like this:
> . only 1 CF. that's enough for this scenario.
> . group logs from each host and day to one row. Key format is "hostname.category.date"
> . store each log entry as a super column, super olumn name is TimeUUID of the log. each attribute as a column.
> 
> Then this query can be done as 3 GET, no need to do key range scan.
> Then I can use RP instead of OPP. If I use OPP, I have to worry about load balance myself. I hate that.
> However, if I need to do a time range access, I can still use column slice.
> 
> An additional benefit is, I can clean old logs very easily. We only store logs in 1 year. Just deleting by keys can do this job well.
> 
> I think storing all logs for a host in a single row is not a good choice. 2 reason:
> 1, too few keys, so your data will not distributing well.
> 2, data under a key will always increase. So Cassandra have to do more SSTable compaction.
> 
> -----邮件原件-----
> 发件人: William R Speirs [mailto:bill.speirs@gmail.com] 
> 发送时间: 2011年1月27日 9:15
> 收件人: user@cassandra.apache.org
> 主题: Re: Schema Design
> 
> It makes sense that the single row for a system (with a growing number of 
> columns) will reside on a single machine.
> 
> With that in mind, here is my updated schema:
> 
> - A single column family for all the messages. The row keys will be the TimeUUID 
> of the message with the following columns: date/time (in UTC POSIX), system 
> name/id (with an index for fast/easy gets), the actual message payload.
> 
> - A column family for each system. The row keys will be UTC POSIX time with 1 
> second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of 
> any messages that were logged during that time bucket.
> 
> My only hesitation with this design is that buddhasystem warned that each column 
> family, "is allocated a piece of memory on the server." I'm not sure what the 
> implications of this are and/or if this would be a problem if a I had a number 
> of systems on the order of hundreds.
> 
> Thanks...
> 
> Bill-
> 
> On 01/26/2011 06:51 PM, Shu Zhang wrote:
>> Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually.
>> 
>> More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which you means you'll definitely not be able to distribute your load very well.
>> ________________________________________
>> From: Bill Speirs [bill.speirs@gmail.com]
>> Sent: Wednesday, January 26, 2011 1:23 PM
>> To: user@cassandra.apache.org
>> Subject: Re: Schema Design
>> 
>> I like this approach, but I have 2 questions:
>> 
>> 1) what is the implications of continually adding columns to a single
>> row? I'm unsure how Cassandra is able to grow. I realize you can have
>> a virtually infinite number of columns, but what are the implications
>> of growing the number of columns over time?
>> 
>> 2) maybe it's just a restriction of the CLI, but how do I do issue a
>> slice request? Also, what if start (or end) columns don't exist? I'm
>> guessing it's smart enough to get the columns in that range.
>> 
>> Thanks!
>> 
>> Bill-
>> 
>> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
>> <dm...@agentisenergy.com>  wrote:
>>> I would say in that case you might want  to try a  single column family
>>> where the key to the column is the system name.
>>> Then, you could name your columns as the timestamp.  Then when retrieving
>>> information from the data store you can can, in your slice request, specify
>>> your start column as  X and end  column as Y.
>>> Then you can use the stored column name to know when an event  occurred.
>>> 
>>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs<bi...@gmail.com>  wrote:
>>>> 
>>>> I'm looking to use Cassandra to store log messages from various
>>>> systems. A log message only has a message (UTF8Type) and a data/time.
>>>> My thought is to create a column family for each system. The row key
>>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>>> day, hour, minute, second, and message. I then have indexes setup for
>>>> each of the date/time columns.
>>>> 
>>>> I was hoping this would allow me to answer queries like: "What are all
>>>> the log messages that were generated between X&  Y?" The problem is
>>>> that I can ONLY use the equals operator on these column values. For
>>>> example, I cannot issuing: get system_x where month>  1; gives me this
>>>> error: "No indexed columns present in index clause with operator EQ."
>>>> The equals operator works as expected though: get system_x where month
>>>> = 1;
>>>> 
>>>> What schema would allow me to get date ranges?
>>>> 
>>>> Thanks in advance...
>>>> 
>>>> Bill-
>>>> 
>>>> * ColumnFamily description *
>>>>    ColumnFamily: system_x_msg
>>>>      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>>      Row cache size / save period: 0.0/0
>>>>      Key cache size / save period: 200000.0/3600
>>>>      Memtable thresholds: 1.1671875/249/60
>>>>      GC grace seconds: 864000
>>>>      Compaction min/max thresholds: 4/32
>>>>      Read repair chance: 1.0
>>>>      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>>> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>>> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>>      Column Metadata:
>>>>        Column Name: year (year)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: month (month)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: second (second)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: minute (minute)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: hour (hour)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>>>        Column Name: day (day)
>>>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>>          Index Type: KEYS
>>> 
>>> 
>>> 
>>> --
>>> David McNelis
>>> Lead Software Engineer
>>> Agentis Energy
>>> www.agentisenergy.com
>>> o: 630.359.6395
>>> c: 219.384.5143
>>> A Smart Grid technology company focused on helping consumers of energy
>>> control an often under-managed resource.
>>> 
>>>

Re: Schema Design

Posted by "Wangpei (Peter)" <pe...@huawei.com>.

I am also working on a system store logs from hundreds system.
In my scenario, most query will like this: "let's look at login logs (category EQ) of that proxy (host EQ) between this Monday and Wednesday(time range)."
My data model like this:
. only 1 CF. that's enough for this scenario.
. group logs from each host and day to one row. Key format is "hostname.category.date"
. store each log entry as a super column, super olumn name is TimeUUID of the log. each attribute as a column.

Then this query can be done as 3 GET, no need to do key range scan.
Then I can use RP instead of OPP. If I use OPP, I have to worry about load balance myself. I hate that.
However, if I need to do a time range access, I can still use column slice.

An additional benefit is, I can clean old logs very easily. We only store logs in 1 year. Just deleting by keys can do this job well.

I think storing all logs for a host in a single row is not a good choice. 2 reason:
1, too few keys, so your data will not distributing well.
2, data under a key will always increase. So Cassandra have to do more SSTable compaction.

-----邮件原件-----
发件人: William R Speirs [mailto:bill.speirs@gmail.com] 
发送时间: 2011年1月27日 9:15
收件人: user@cassandra.apache.org
主题: Re: Schema Design

It makes sense that the single row for a system (with a growing number of 
columns) will reside on a single machine.

With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the TimeUUID 
of the message with the following columns: date/time (in UTC POSIX), system 
name/id (with an index for fast/easy gets), the actual message payload.

- A column family for each system. The row keys will be UTC POSIX time with 1 
second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of 
any messages that were logged during that time bucket.

My only hesitation with this design is that buddhasystem warned that each column 
family, "is allocated a piece of memory on the server." I'm not sure what the 
implications of this are and/or if this would be a problem if a I had a number 
of systems on the order of hundreds.

Thanks...

Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote:
> Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually.
>
> More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which you means you'll definitely not be able to distribute your load very well.
> ________________________________________
> From: Bill Speirs [bill.speirs@gmail.com]
> Sent: Wednesday, January 26, 2011 1:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Schema Design
>
> I like this approach, but I have 2 questions:
>
> 1) what is the implications of continually adding columns to a single
> row? I'm unsure how Cassandra is able to grow. I realize you can have
> a virtually infinite number of columns, but what are the implications
> of growing the number of columns over time?
>
> 2) maybe it's just a restriction of the CLI, but how do I do issue a
> slice request? Also, what if start (or end) columns don't exist? I'm
> guessing it's smart enough to get the columns in that range.
>
> Thanks!
>
> Bill-
>
> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
> <dm...@agentisenergy.com>  wrote:
>> I would say in that case you might want  to try a  single column family
>> where the key to the column is the system name.
>> Then, you could name your columns as the timestamp.  Then when retrieving
>> information from the data store you can can, in your slice request, specify
>> your start column as  X and end  column as Y.
>> Then you can use the stored column name to know when an event  occurred.
>>
>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs<bi...@gmail.com>  wrote:
>>>
>>> I'm looking to use Cassandra to store log messages from various
>>> systems. A log message only has a message (UTF8Type) and a data/time.
>>> My thought is to create a column family for each system. The row key
>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>> day, hour, minute, second, and message. I then have indexes setup for
>>> each of the date/time columns.
>>>
>>> I was hoping this would allow me to answer queries like: "What are all
>>> the log messages that were generated between X&  Y?" The problem is
>>> that I can ONLY use the equals operator on these column values. For
>>> example, I cannot issuing: get system_x where month>  1; gives me this
>>> error: "No indexed columns present in index clause with operator EQ."
>>> The equals operator works as expected though: get system_x where month
>>> = 1;
>>>
>>> What schema would allow me to get date ranges?
>>>
>>> Thanks in advance...
>>>
>>> Bill-
>>>
>>> * ColumnFamily description *
>>>     ColumnFamily: system_x_msg
>>>       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>       Row cache size / save period: 0.0/0
>>>       Key cache size / save period: 200000.0/3600
>>>       Memtable thresholds: 1.1671875/249/60
>>>       GC grace seconds: 864000
>>>       Compaction min/max thresholds: 4/32
>>>       Read repair chance: 1.0
>>>       Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>       Column Metadata:
>>>         Column Name: year (year)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: month (month)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: second (second)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: minute (minute)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: hour (hour)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: day (day)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>
>>
>>
>> --
>> David McNelis
>> Lead Software Engineer
>> Agentis Energy
>> www.agentisenergy.com
>> o: 630.359.6395
>> c: 219.384.5143
>> A Smart Grid technology company focused on helping consumers of energy
>> control an often under-managed resource.
>>
>>

Re: Schema Design

Posted by William R Speirs <bi...@gmail.com>.

Ah, sweet... thanks for the link!

Bill-

On 01/26/2011 08:20 PM, buddhasystem wrote:
>
> Bill, it's all explained here:
>
> http://wiki.apache.org/cassandra/MemtableThresholds#JVM_Heap_Size,the
>
> Watch the number of CFs and the memtable sizes.
>
> In my experience, this all matters.

Re: Schema Design

Posted by buddhasystem <po...@bnl.gov>.

Bill, it's all explained here:

http://wiki.apache.org/cassandra/MemtableThresholds#JVM_Heap_Size,the

Watch the number of CFs and the memtable sizes.

In my experience, this all matters.
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Schema-Design-tp5964167p5964852.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Schema Design

Posted by William R Speirs <bi...@gmail.com>.

It makes sense that the single row for a system (with a growing number of 
columns) will reside on a single machine.

With that in mind, here is my updated schema:

- A single column family for all the messages. The row keys will be the TimeUUID 
of the message with the following columns: date/time (in UTC POSIX), system 
name/id (with an index for fast/easy gets), the actual message payload.

- A column family for each system. The row keys will be UTC POSIX time with 1 
second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of 
any messages that were logged during that time bucket.

My only hesitation with this design is that buddhasystem warned that each column 
family, "is allocated a piece of memory on the server." I'm not sure what the 
implications of this are and/or if this would be a problem if a I had a number 
of systems on the order of hundreds.

Thanks...

Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote:
> Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually.
>
> More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which you means you'll definitely not be able to distribute your load very well.
> ________________________________________
> From: Bill Speirs [bill.speirs@gmail.com]
> Sent: Wednesday, January 26, 2011 1:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Schema Design
>
> I like this approach, but I have 2 questions:
>
> 1) what is the implications of continually adding columns to a single
> row? I'm unsure how Cassandra is able to grow. I realize you can have
> a virtually infinite number of columns, but what are the implications
> of growing the number of columns over time?
>
> 2) maybe it's just a restriction of the CLI, but how do I do issue a
> slice request? Also, what if start (or end) columns don't exist? I'm
> guessing it's smart enough to get the columns in that range.
>
> Thanks!
>
> Bill-
>
> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
> <dm...@agentisenergy.com>  wrote:
>> I would say in that case you might want  to try a  single column family
>> where the key to the column is the system name.
>> Then, you could name your columns as the timestamp.  Then when retrieving
>> information from the data store you can can, in your slice request, specify
>> your start column as  X and end  column as Y.
>> Then you can use the stored column name to know when an event  occurred.
>>
>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs<bi...@gmail.com>  wrote:
>>>
>>> I'm looking to use Cassandra to store log messages from various
>>> systems. A log message only has a message (UTF8Type) and a data/time.
>>> My thought is to create a column family for each system. The row key
>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>> day, hour, minute, second, and message. I then have indexes setup for
>>> each of the date/time columns.
>>>
>>> I was hoping this would allow me to answer queries like: "What are all
>>> the log messages that were generated between X&  Y?" The problem is
>>> that I can ONLY use the equals operator on these column values. For
>>> example, I cannot issuing: get system_x where month>  1; gives me this
>>> error: "No indexed columns present in index clause with operator EQ."
>>> The equals operator works as expected though: get system_x where month
>>> = 1;
>>>
>>> What schema would allow me to get date ranges?
>>>
>>> Thanks in advance...
>>>
>>> Bill-
>>>
>>> * ColumnFamily description *
>>>     ColumnFamily: system_x_msg
>>>       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>       Row cache size / save period: 0.0/0
>>>       Key cache size / save period: 200000.0/3600
>>>       Memtable thresholds: 1.1671875/249/60
>>>       GC grace seconds: 864000
>>>       Compaction min/max thresholds: 4/32
>>>       Read repair chance: 1.0
>>>       Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>       Column Metadata:
>>>         Column Name: year (year)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: month (month)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: second (second)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: minute (minute)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: hour (hour)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: day (day)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>
>>
>>
>> --
>> David McNelis
>> Lead Software Engineer
>> Agentis Energy
>> www.agentisenergy.com
>> o: 630.359.6395
>> c: 219.384.5143
>> A Smart Grid technology company focused on helping consumers of energy
>> control an often under-managed resource.
>>
>>

RE: Schema Design

Posted by Shu Zhang <sz...@mediosystems.com>.

Each row can have a maximum of 2 billion columns, which a logging system will probably hit eventually.

More importantly, you'll only have 1 row per set of system logs. Every row is stored on the same machine(s), which you means you'll definitely not be able to distribute your load very well.
________________________________________
From: Bill Speirs [bill.speirs@gmail.com]
Sent: Wednesday, January 26, 2011 1:23 PM
To: user@cassandra.apache.org
Subject: Re: Schema Design

I like this approach, but I have 2 questions:

1) what is the implications of continually adding columns to a single
row? I'm unsure how Cassandra is able to grow. I realize you can have
a virtually infinite number of columns, but what are the implications
of growing the number of columns over time?

2) maybe it's just a restriction of the CLI, but how do I do issue a
slice request? Also, what if start (or end) columns don't exist? I'm
guessing it's smart enough to get the columns in that range.

Thanks!

Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
<dm...@agentisenergy.com> wrote:
> I would say in that case you might want  to try a  single column family
> where the key to the column is the system name.
> Then, you could name your columns as the timestamp.  Then when retrieving
> information from the data store you can can, in your slice request, specify
> your start column as  X and end  column as Y.
> Then you can use the stored column name to know when an event  occurred.
>
> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bi...@gmail.com> wrote:
>>
>> I'm looking to use Cassandra to store log messages from various
>> systems. A log message only has a message (UTF8Type) and a data/time.
>> My thought is to create a column family for each system. The row key
>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>> day, hour, minute, second, and message. I then have indexes setup for
>> each of the date/time columns.
>>
>> I was hoping this would allow me to answer queries like: "What are all
>> the log messages that were generated between X & Y?" The problem is
>> that I can ONLY use the equals operator on these column values. For
>> example, I cannot issuing: get system_x where month > 1; gives me this
>> error: "No indexed columns present in index clause with operator EQ."
>> The equals operator works as expected though: get system_x where month
>> = 1;
>>
>> What schema would allow me to get date ranges?
>>
>> Thanks in advance...
>>
>> Bill-
>>
>> * ColumnFamily description *
>>    ColumnFamily: system_x_msg
>>      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>      Row cache size / save period: 0.0/0
>>      Key cache size / save period: 200000.0/3600
>>      Memtable thresholds: 1.1671875/249/60
>>      GC grace seconds: 864000
>>      Compaction min/max thresholds: 4/32
>>      Read repair chance: 1.0
>>      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>      Column Metadata:
>>        Column Name: year (year)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: month (month)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: second (second)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: minute (minute)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: hour (hour)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: day (day)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>
>
>
> --
> David McNelis
> Lead Software Engineer
> Agentis Energy
> www.agentisenergy.com
> o: 630.359.6395
> c: 219.384.5143
> A Smart Grid technology company focused on helping consumers of energy
> control an often under-managed resource.
>
>

Re: Schema Design

Posted by Bill Speirs <bi...@gmail.com>.

I like this approach, but I have 2 questions:

1) what is the implications of continually adding columns to a single
row? I'm unsure how Cassandra is able to grow. I realize you can have
a virtually infinite number of columns, but what are the implications
of growing the number of columns over time?

2) maybe it's just a restriction of the CLI, but how do I do issue a
slice request? Also, what if start (or end) columns don't exist? I'm
guessing it's smart enough to get the columns in that range.

Thanks!

Bill-

On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
<dm...@agentisenergy.com> wrote:
> I would say in that case you might want  to try a  single column family
> where the key to the column is the system name.
> Then, you could name your columns as the timestamp.  Then when retrieving
> information from the data store you can can, in your slice request, specify
> your start column as  X and end  column as Y.
> Then you can use the stored column name to know when an event  occurred.
>
> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bi...@gmail.com> wrote:
>>
>> I'm looking to use Cassandra to store log messages from various
>> systems. A log message only has a message (UTF8Type) and a data/time.
>> My thought is to create a column family for each system. The row key
>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>> day, hour, minute, second, and message. I then have indexes setup for
>> each of the date/time columns.
>>
>> I was hoping this would allow me to answer queries like: "What are all
>> the log messages that were generated between X & Y?" The problem is
>> that I can ONLY use the equals operator on these column values. For
>> example, I cannot issuing: get system_x where month > 1; gives me this
>> error: "No indexed columns present in index clause with operator EQ."
>> The equals operator works as expected though: get system_x where month
>> = 1;
>>
>> What schema would allow me to get date ranges?
>>
>> Thanks in advance...
>>
>> Bill-
>>
>> * ColumnFamily description *
>>    ColumnFamily: system_x_msg
>>      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>      Row cache size / save period: 0.0/0
>>      Key cache size / save period: 200000.0/3600
>>      Memtable thresholds: 1.1671875/249/60
>>      GC grace seconds: 864000
>>      Compaction min/max thresholds: 4/32
>>      Read repair chance: 1.0
>>      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>      Column Metadata:
>>        Column Name: year (year)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: month (month)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: second (second)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: minute (minute)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: hour (hour)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>>        Column Name: day (day)
>>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>          Index Type: KEYS
>
>
>
> --
> David McNelis
> Lead Software Engineer
> Agentis Energy
> www.agentisenergy.com
> o: 630.359.6395
> c: 219.384.5143
> A Smart Grid technology company focused on helping consumers of energy
> control an often under-managed resource.
>
>

Re: Schema Design

Posted by David McNelis <dm...@agentisenergy.com>.

I would say in that case you might want  to try a  single column family
where the key to the column is the system name.

Then, you could name your columns as the timestamp.  Then when retrieving
information from the data store you can can, in your slice request, specify
your start column as  X and end  column as Y.

Then you can use the stored column name to know when an event  occurred.

On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs <bi...@gmail.com> wrote:

> I'm looking to use Cassandra to store log messages from various
> systems. A log message only has a message (UTF8Type) and a data/time.
> My thought is to create a column family for each system. The row key
> will be a TimeUUIDType. Each row will have 7 columns: year, month,
> day, hour, minute, second, and message. I then have indexes setup for
> each of the date/time columns.
>
> I was hoping this would allow me to answer queries like: "What are all
> the log messages that were generated between X & Y?" The problem is
> that I can ONLY use the equals operator on these column values. For
> example, I cannot issuing: get system_x where month > 1; gives me this
> error: "No indexed columns present in index clause with operator EQ."
> The equals operator works as expected though: get system_x where month
> = 1;
>
> What schema would allow me to get date ranges?
>
> Thanks in advance...
>
> Bill-
>
> * ColumnFamily description *
>    ColumnFamily: system_x_msg
>      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>      Row cache size / save period: 0.0/0
>      Key cache size / save period: 200000.0/3600
>      Memtable thresholds: 1.1671875/249/60
>      GC grace seconds: 864000
>      Compaction min/max thresholds: 4/32
>      Read repair chance: 1.0
>      Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>      Column Metadata:
>        Column Name: year (year)
>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>          Index Type: KEYS
>        Column Name: month (month)
>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>          Index Type: KEYS
>        Column Name: second (second)
>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>          Index Type: KEYS
>        Column Name: minute (minute)
>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>          Index Type: KEYS
>        Column Name: hour (hour)
>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>          Index Type: KEYS
>        Column Name: day (day)
>          Validation Class: org.apache.cassandra.db.marshal.IntegerType
>          Index Type: KEYS
>



-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*