Posted to user@cassandra.apache.org by Boris Solovyov <bo...@gmail.com> on 2013/02/12 11:55:28 UTC

Seeking suggestions for a use case

Hello list!

I have an application with the following characteristics:

   - data is time series: tens of millions of series at 1-second
   granularity, like stock ticker data
   - values are (timestamp, integer) pairs; the integers are uint64
   - data is append only, never updated
   - data is never written far in the past; occasionally a write arrives
   maybe 10 seconds late, but not more
   - the workload is write-mostly, roughly 99.9% writes I think
   - most reads will be of recent data, always over a range of timestamps
   - data needs to be purged after some time, e.g. 1 week

I am considering Cassandra. No other existing database (HBase, Riak, etc.)
seems well suited for this.
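
To make this concrete, here is a rough CQL 3 sketch of the kind of table I
have in mind (the names, and the idea of bucketing each series by hour, are
just my assumptions; nothing is decided):

-- One partition per series per hour keeps partitions bounded and makes
-- "drop everything older than X" a per-bucket operation.
CREATE TABLE ticks (
    series_id bigint,     -- which of the tens of millions of series
    bucket    timestamp,  -- start of the hour this sample falls in
    ts        timestamp,  -- exact 1-second sample time
    value     bigint,     -- the uint64 sample (assumes it fits in a signed
                          -- 64-bit value; varint otherwise)
    PRIMARY KEY ((series_id, bucket), ts)
);

-- Typical read: a recent range within one series
SELECT ts, value FROM ticks
 WHERE series_id = 42
   AND bucket = '2013-02-12 11:00:00'
   AND ts >= '2013-02-12 11:50:00';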

Questions:

   - Did I miss some other database that could work? Please suggest one if
   you know of it.
   - What are the benefits or drawbacks of leveled compaction for this
   workload?
   - Setting a column TTL seems like a bad choice due to the extra storage.
   Agree? Is it efficient to run a routine batch job to purge the oldest
   data instead? Are there any gotchas with that (like a full scan of
   something instead of just the oldest data, maybe)?
   - Will a column index be beneficial? If reads are scans, does it matter,
   or is it just extra work and storage space to maintain, without much
   benefit, especially since reads are rare?
   - How does gc_grace_seconds impact operations in this workload? Will
   purges of old data leave SSTables mostly obsolete, rather than sparsely
   obsolete? I think they will. So, after a purge, tombstones can be GCed
   shortly; there is no need for the default 10-day grace period. BUT, I
   read in the docs that if gc_grace_seconds is short, then nodetool repair
   needs to run quite often. Is that true? Why would that be needed in my
   use case?
   - Related question: is it sensible to set tombstone_threshold to 1.0 but
   tombstone_compaction_interval to something short, like 1 hour? (There is
   a sketch of these settings after this list.) I suppose this depends on
   whether I am correct that SSTables will be deleted entirely, instead of
   just becoming sparse.
   - Should I disable row_cache_provider? It invalidates every row on
   update, right? I will be updating rows constantly, so it seems not
   beneficial.
   - The docs say "compaction_throughput_mb_per_sec" is per "entire
   system." Does that mean per NODE, or per ENTIRE CLUSTER? Will this cause
   trouble with periodic deletions of expired columns? Do I need to make
   sure my purges of old data are trickled out over time to avoid huge
   compaction overhead? But in that case, SSTables will become sparsely
   deleted, right? And then re-compacted, which seems wasteful if the
   remaining data will soon be purged again and there will be another
   re-compaction. This is partially why I asked about tombstone_threshold
   and the compaction interval -- I think it is best if I can purge data in
   such a way that Cassandra never recompacts SSTables, but just realizes
   "oh, the whole thing is dead, I can delete it, no work needed." But I am
   not sure whether my proposed settings will have unintended consequences.
   - Finally, with the proposed workload, will there be trouble with
   flush_largest_memtables_at, reduce_cache_capacity_to, and
   reduce_cache_sizes_at? These are described as "emergency measures" in
   the docs. If my workload is an edge case that could trigger bad
   emergency-measure behavior, I hope you can tell me that :-)
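
Here is the sketch mentioned above, to make the settings questions
concrete. The values are placeholders I am asking about, not
recommendations, and the table is from my earlier sketch:

-- Per-table knobs from the gc_grace_seconds / tombstone questions above
ALTER TABLE ticks
  WITH gc_grace_seconds = 3600
   AND compaction = { 'class': 'SizeTieredCompactionStrategy',
                      'tombstone_threshold': '1.0',
                      'tombstone_compaction_interval': '3600' };

-- The batch-purge alternative to TTL: drop whole expired buckets per series
DELETE FROM ticks WHERE series_id = 42 AND bucket = '2013-02-05 11:00:00';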

Many thanks!

Boris

Re: Seeking suggestions for a use case

Posted by "Hiller, Dean" <De...@nrel.gov>.
We are open sourcing the system so I don't mind at all.

We are using patterns from this web page
https://github.com/deanhiller/playorm/wiki/Patterns-Page

Realize PlayOrm is doing a huge amount of heavy lifting for us with its virtual tables and partitioning.  This will be hard to explain, but I will give it a shot.

We have one column family called "data".  We have 60,000 virtual tables in that CF (PlayOrm prefixes every key with the table name, so short table names are good for our data).  So far, then, we simply have

Data
rowKey = CompositeKey(virtual tablename, time since epoch)

Next, we haven't done this just yet, but we are going to partition each virtual table.  A PlayOrm partition is not like a Cassandra partition (one PlayOrm partition is spread across the cluster, much like a virtual table), and for this we just add a special column, the partitioned column (in PlayOrm, we just annotate a field and it partitions it for us).

So we have

ColumnFamily="Data"
rowKey = CompositeKey(virtual tablename, time since epoch)
ColumnName="PartitionTimeKey" / ColumnValue= (value of time at beginning of the month)
….(the rest is names of the columns with data and their data)

As you can see, since we only deal in looking up row keys at this point, we are completely scalable, to infinity and beyond.  Now, behind the scenes, PlayOrm is creating some indexes for us, in that a partition can scale up to < 10 million rows.  Let's say we have 100,000 rows in the above model where all rows are in virtualTable=deansTemperature, AND let's say those same rows are all in the same partition for the month of February.  There is a single wide row (created by PlayOrm, not by me directly) like so:

ColumnFamily="IntegerIndex"
rowKey=Composite(<virtual tablename>, "PartitionTimeKey", <begin of February time>)
column1Name=Composite(<time1>, <rowKeyToData98>)
column2Name=Composite(<time2>, <rowKeyToData56>)
……This is a very wide row WITH NO VALUES…..all information is in the column names!!!!

The "PartitionTimeKey" is not necessary but playOrm allows me to partition in different directions so if I did multiple partition types, it would be needed but we don't use that.

I hope that makes sense.  I'm never sure whether I am being clear enough, as I don't know how much NoSQL you know... if you are just getting started, you need to read up on the composite column names pattern and wide rows.  The link above covers general NoSQL patterns from a PlayOrm point of view, but it tries to explain the underlying NoSQL pattern each time.
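
If it helps to see the shape without PlayOrm in the picture, a rough CQL 3
equivalent would be something like the sketch below. This is only my
illustration of the layout; it is not literally what PlayOrm creates, and
the names are made up.

-- The data rows: one narrow row per reading, keyed by (virtual table, time)
CREATE TABLE data (
    vtable         text,       -- virtual table name, e.g. 'deansTemperature'
    ts             timestamp,  -- time since epoch for this reading
    partition_time timestamp,  -- beginning of the month (the partitioned column)
    value          double,     -- ...plus whatever other columns the sensor has
    PRIMARY KEY ((vtable, ts))
);

-- The "IntegerIndex" wide row: one row per (virtual table, month); the
-- columns are (time, key of the data row) pairs and carry no values
CREATE TABLE integer_index (
    vtable         text,
    partition_time timestamp,  -- beginning of the month
    ts             timestamp,
    data_key       text,       -- points back at the row in "data"
    PRIMARY KEY ((vtable, partition_time), ts, data_key)
);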

Thanks,
Dean




From: Boris Solovyov <bo...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Tuesday, February 12, 2013 2:56 PM
To: user <us...@cassandra.apache.org>
Subject: Re: Seeking suggestions for a use case

Would you mind sharing your schema on the list? It would be useful to see how you modeled your data. Or you could email me privately if you want.

Thanks
Boris


On Tue, Feb 12, 2013 at 4:11 PM, Hiller, Dean <De...@nrel.gov>> wrote:
Yes, the limit of the width of a row is approximately in the millions, perhaps lower than 10 million.  We plan to go well above that in our use case ;).  Our widest row for indexing right now is only around 200,000 columns and we have been in production one month(At 10 years that would be about 24 million).  They want 10 years of data at the very least and they constantly have researchers working with the data sets from all times.


Re: Seeking suggestions for a use case

Posted by Boris Solovyov <bo...@gmail.com>.
Would you mind sharing your schema on the list? It would be useful to see
how you modeled your data. Or you could email me privately if you want.

Thanks
Boris


On Tue, Feb 12, 2013 at 4:11 PM, Hiller, Dean <De...@nrel.gov> wrote:

> Yes, the limit of the width of a row is approximately in the millions,
> perhaps lower than 10 million.  We plan to go well above that in our use
> case ;).  Our widest row for indexing right now is only around 200,000
> columns and we have been in production one month(At 10 years that would be
> about 24 million).  They want 10 years of data at the very least and they
> constantly have researchers working with the data sets from all times.
>
>

Re: Seeking suggestions for a use case

Posted by "Hiller, Dean" <De...@nrel.gov>.
Yes, the practical limit on the width of a row is approximately in the millions, perhaps lower than 10 million.  We plan to go well above that in our use case ;).  Our widest row for indexing right now is only around 200,000 columns, and we have been in production for one month (at 10 years that would be about 24 million).  They want 10 years of data at the very least, and they constantly have researchers working with the data sets from all time periods.

It appears they may have some even faster time series data being recorded as well, like every second or even faster (we update the system in batches anyway, so we can handle sub-microsecond resolution if they want; we just need to add more nodes).

It all depends on how big you think your data is going to grow... I just asked what the client wanted and designed for those wants.

Dean

From: Boris Solovyov <bo...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Tuesday, February 12, 2013 1:08 PM
To: user <us...@cassandra.apache.org>
Subject: Re: Seeking suggestions for a use case

Thanks. So in your use case, you actually keep parts of the same series in different rows, to keep the rows from getting too wide? I thought Cassandra worked OK with millions of columns per row. If I don't have to split a row into parts, that keep the data model simpler for me. (Otherwise, if I want to split row and reassemble in client code, I could just use RDBMS :-)




Re: Seeking suggestions for a use case

Posted by Boris Solovyov <bo...@gmail.com>.
Thanks. So in your use case, you actually keep parts of the same series in
different rows, to keep the rows from getting too wide? I thought Cassandra
worked OK with millions of columns per row. If I don't have to split a row
into parts, that keeps the data model simpler for me. (Otherwise, if I have
to split rows and reassemble them in client code, I could just use an RDBMS :-)



Re: Seeking suggestions for a use case

Posted by Boris Solovyov <bo...@gmail.com>.
Thanks for your suggestions and feedback! We will see how it goes. I am
trying to set up a first test cluster now :)



Re: Seeking suggestions for a use case

Posted by "Hiller, Dean" <De...@nrel.gov>.
We are using Cassandra for time series as well, with PlayOrm. A guess is
we will be doing equal reads and writes on all the data going back 10
years (currently in production we are write heavy). We have 60,000
virtual tables (one table per sensor we read from, and yes, we have that
many sensors). We partition with PlayOrm partitioning, one month's worth
for each of the virtual tables. This gives us a wide row index into each
partition that PlayOrm creates, and the rest of the data varies between
very narrow tables (one column) and tables with around 20 columns. It
seems to be working extremely well so far, and we run it on 6 Cassandra
nodes.

Anyway, I thought I would share, as perhaps it helps you understand your
use case.

Later,
Dean



Re: Seeking suggestions for a use case

Posted by Edward Capriolo <ed...@gmail.com>.
Your use case is 100% on the money for Cassandra. But let me take the
chance to slam the other NoSQLs (not really slam, but you know).

Riak is a key-value store. It is not a column family store where a
row key has a map of sorted values. This makes time series more
awkward, as a series has to span many rows rather than one
large row.

HBase has similar problems with time series. On one hand, if your
row keys are series you get hotspots; if your columns are time series
you run into two subtle issues. Last I checked, HBase's on-disk format
repeats the key for each column (somewhat wasteful):

key,column,value
key,column,value
key,column,value

Also, there are issues with really big rows, although they are dealt with
in a similar way to really wide rows in Cassandra: just use time as part of
the row key and the rows will not get that large.

I do not think you need leveled compaction for an append-only workload,
although it might be helpful depending on how long you want to keep these
rows. If you are not keeping them very long, leveled compaction would
possibly keep the on-disk size smaller.
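
If you do try it, it is just a table property. A CQL 3 sketch, with a
made-up table name:

-- Leveled compaction is set per column family / table
ALTER TABLE ticks WITH compaction = { 'class': 'LeveledCompactionStrategy' };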

Column TTLs in Cassandra do not require extra storage. They are a very
efficient way to do this. Otherwise you have to scan through your data
with some offline process and delete it.
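
For example, a sketch, assuming a CQL 3 table bucketed by series and hour
along the lines of the one sketched in the original mail (the names are
placeholders, not yours):

-- Every write carries its own expiry; there is nothing to purge later
INSERT INTO ticks (series_id, bucket, ts, value)
VALUES (42, '2013-02-12 11:00:00', '2013-02-12 11:55:28', 1234)
USING TTL 604800;  -- 7 days; expired cells are dropped during compaction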

Do not worry about gc_grace too much. The moral is that, because of
distributed deletes, some data lives on disk for a while after it is
deleted. All this means is that you need "some" more storage than just the
space for your live data.

Don't use the row cache with wide rows. REPEAT: don't use the row cache with wide rows.
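
If you want to be explicit about it per table, a CQL 3 sketch (again with a
made-up table name):

-- Cache keys only; keep wide rows out of the row cache
ALTER TABLE ticks WITH caching = 'keys_only';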

Compaction throughput is metered on each node (again, not a setting to
worry about).

If you are hitting flush_largest_memtables_at and
reduce_cache_capacity_to, it basically means you have over-tuned or
you do not have enough hardware. These are mostly emergency valves, and
if you are set up well they are not a factor. They are only around to
relieve memory pressure, to prevent the node from hitting a cycle where
it spends more time in GC than in serving mode.

Whew!

Anyway, nice to see that you are trying to understand the knobs before
kicking the tires.
