Posted to user@cassandra.apache.org by Ian Rose <ia...@fullstory.com> on 2014/08/10 20:19:19 UTC

clarification on 100k tombstone limit in indexes

Hi -

On this page (
http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html),
the docs state:

Do not use an index [...] On a frequently updated or deleted column


and


> *Problems using an index on a frequently updated or deleted column*
> <http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__upDatIndx>
>
> Cassandra stores tombstones in the index until the tombstone limit reaches
> 100K cells. After exceeding the tombstone limit, the query that uses the
> indexed value will fail.



I'm afraid I don't really understand this limit from its (brief)
description.  I also saw this recent thread
<http://mail-archives.apache.org/mod_mbox/cassandra-user/201403.mbox/%3CCABNXB2Bf4aeoDVpMNOxJ_e7aDez2EuZswMJx=jWfb8=Oyo47kQ@mail.gmail.com%3E>
but I'm afraid it didn't help me much...


*SHORT VERSION*

If I have tens or hundreds of thousands of rows in a keyspace, where every
row has an indexed column that is updated O(10) times during the lifetime
of each row, is that going to cause problems for me?  If that 100k limit is
*per row* then I should be fine but if that 100k limit is *per keyspace*
then I'd definitely exceed it quickly.


*FULL EXPLANATION*

In our system, items are created at a rate of ~10/sec.  Each item is
updated ~10 times over the next few minutes (although in rare cases the
number of updates, and the duration, might be several times greater).  Once
the last update is received for an item, we select it from Cassandra,
process the data, then delete the entire row.

The tricky bit is that sometimes (maybe 30-40% of the time) we don't
actually know when the last update has been received so we use a timeout:
if an item hasn't been updated for 30 minutes, then we assume it is done
and should process it as before (select, then delete).  So I am trying to
design a schema that will allow for efficient queries of the form "find me
all items that have not been updated in the past 30 minutes."  We plan to
call this query once a minute.

Here is my tentative schema:

CREATE TABLE items (
  item_id ascii,
  last_updated timestamp,
  item_data list<blob>,
  PRIMARY KEY (item_id)
)
plus an index on last_updated.
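
Concretely, I'm picturing something like the following (the index name here
is just a placeholder I made up):

CREATE INDEX items_last_updated_idx ON items (last_updated);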

So updates to an existing item would just be "lookup by item_id, append new
data to item_data, and set last_updated to now".  And queries to find items
that have timed out would use the index on last_updated: "find all items
where last_updated < [now - 30 minutes]".
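
In rough CQL, with the item id, blob, and timestamps below being made-up
placeholders, an update would look something like:

UPDATE items
   SET item_data = item_data + [0xcafe],
       last_updated = '2014-08-10 20:19:00'
 WHERE item_id = 'item-42';

and the once-a-minute timeout scan would be along the lines of:

SELECT item_id FROM items
 WHERE last_updated < '2014-08-10 19:49:00';

(whether that inequality is even allowed against a secondary index, or needs
ALLOW FILTERING, is part of what I'm unsure about).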

Assuming, that is, that the aforementioned 100k tombstone limit won't bring
this index crashing to a halt...

Any clarification on this limit and/or suggestions on a better way to
model/implement this system would be greatly appreciated!

Cheers,
Ian

Re: clarification on 100k tombstone limit in indexes

Posted by Mark Reddy <ma...@boxever.com>.
Hi Ian

> Are these tombstones ever "GCed" out of the index?  How frequently?


Yes, tombstones are removed after the time specified by gc_grace_seconds
has elapsed, which by default is 10 days and is configurable. Knowing and
understanding how Cassandra handles distributed deletes is key to designing
an efficient schema if you plan to delete often. There are lots of
resources on how deletes are handled in Cassandra, take a look at these
links for example:
http://wiki.apache.org/cassandra/DistributedDeletes
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_about_deletes_c.html
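
As a minimal sketch, assuming the "items" table from your schema, the grace
period can be tuned per table (864000 seconds is the 10-day default; keep it
comfortably longer than your repair interval):

ALTER TABLE items WITH gc_grace_seconds = 864000;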

Mark



Re: clarification on 100k tombstone limit in indexes

Posted by Ian Rose <ia...@fullstory.com>.
Makes sense - thanks again!



Re: clarification on 100k tombstone limit in indexes

Posted by DuyHai Doan <do...@gmail.com>.
Hello Ian

"So that way each index entry *will* have quite a few entries and the index
as a whole won't grow too big.  Is my thinking correct here?" --> In this
case yes. Do not forget that for each date value, there will be 1
corresponding index value + 10 updates. If you have an approximate count
for "a few entries", a quick maths should give you an idea about how
"large" the index partition is

"I had considered an approach like this but my concern is that for any
given minute *all* of the updates will be handled by a single node, right?"
--> If you time resolution is a minute, yes it will be a problem. And
depending on the insert rate, it can become a quickly a bottle neck during
this minute.

 The manual index approach suffers a lot from this bottleneck issue under
heavy workload; that's the main reason a distributed secondary index was
implemented. There is no free lunch though. What you gain in terms of
control and tuning with the manual index, you lose on the load distribution
side.





Re: clarification on 100k tombstone limit in indexes

Posted by Tyler Hobbs <ty...@datastax.com>.
On Wed, Aug 13, 2014 at 4:35 AM, DuyHai Doan <do...@gmail.com> wrote:

> "add an additional integer column to the partition key (making it a
> composite partition key if it isn't already).  When inserting, randomly
> pick a value between, say, 0 and 10 to use for this column"  --> Due to the
> low cardinality of the bucket (only 10 values), there is no guarantee that the
> partitions would be distributed evenly. But it's better than nothing.
>

It's important to think about it probabilistically, i.e. "what is the
probability that all ten partitions belong to the same node?"  If you have
a ten node cluster (assume RF=1 for simplicity), there's a 1/10^9 (one in a
billion) chance that a single node owns all of the partitions.  So
it's quite a bit better than nothing.  If you want to improve your odds,
bump the number up.  But, keep in mind that it's a balance, because reads
become more expensive.
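
(In general, with n nodes, k buckets, RF=1, and uniformly distributed
ownership, the chance that every bucket lands on the same node is
n * (1/n)^k = (1/n)^(k-1); n = k = 10 gives the 1/10^9 above.)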


>
> "Alternatively, instead of using a random number, you could hash the
> other key components and use the lowest bits for the value.  This has the
> advantage of being deterministic" --> Does it work with VNodes, where
> tokens are split into 256 ranges and shuffled across all nodes?
>

Yes, it works perfectly fine with vnodes.


-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: clarification on 100k tombstone limit in indexes

Posted by DuyHai Doan <do...@gmail.com>.
"add an additional integer column to the partition key (making it a
composite partition key if it isn't already).  When inserting, randomly
pick a value between, say, 0 and 10 to use for this column"  --> Due to the
low cardinality of the bucket (only 10 values), there is no guarantee that the
partitions would be distributed evenly. But it's better than nothing.

"Alternatively, instead of using a random number, you could hash the other
key components and use the lowest bits for the value.  This has the
advantage of being deterministic" --> Does it work with VNodes, where
tokens are split into 256 ranges and shuffled across all nodes?



Re: clarification on 100k tombstone limit in indexes

Posted by Tyler Hobbs <ty...@datastax.com>.
On Mon, Aug 11, 2014 at 4:17 PM, Ian Rose <ia...@fullstory.com> wrote:

>
> "You better off create a manuel reverse-index to track modification date,
> something like this"  --> I had considered an approach like this but my
> concern is that for any given minute *all* of the updates will be handled
> by a single node, right?  For example, if the minute_bucket is 2739 then
> for that one minute, every single item update will flow to the node at
> HASH(2739).  Assuming I am thinking about that right, that seemed like a
> potential scaling bottleneck, which scared me off that approach.
>

If you're concerned about bottlenecking on one node (or set of replicas)
during the minute, add an additional integer column to the partition key
(making it a composite partition key if it isn't already).  When inserting,
randomly pick a value between, say, 0 and 10 to use for this column.  When
reading, read all 10 partitions and merge them.  (Alternatively, instead of
using a random number, you could hash the other key components and use the
lowest bits for the value.  This has the advantage of being deterministic.)
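
As a minimal sketch of what that could look like on top of DuyHai's
last_updated_item table (the bucket_shard column name and the 0-9 range are
just placeholders, not anything prescribed):

CREATE TABLE last_updated_item (
    minute_bucket int,       // format YYYYMMDDHHmm
    bucket_shard int,        // random, or hash(item_id) mod 10, in [0, 9]
    last_update_date timestamp,
    item_id ascii,
    PRIMARY KEY ((minute_bucket, bucket_shard), last_update_date)
);

Each write picks one shard; the once-a-minute scan then issues one query per
shard for the bucket(s) older than 30 minutes and merges the results
client-side, e.g.

SELECT item_id, last_update_date FROM last_updated_item
 WHERE minute_bucket = 2739 AND bucket_shard = 0
   AND last_update_date < '2014-08-10 19:49:00';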


-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: clarification on 100k tombstone limit in indexes

Posted by Ian Rose <ia...@fullstory.com>.
Hi DuyHai,

Thanks for the detailed response!  A few responses below:

"On a side node, your usage of secondary index is not the best one. Indeed,
indexing the update date will lead to a situation where for one date,
you'll mostly have one or a few matching items (assuming that the update
date resolution is small enough and update rate is not intense)." --> I
should have mentioned this originally (it slipped my mind) but to deal
specifically with this problem I had planned to use a timestamp with a
resolution of 1 minute (like your minute_bucket).  So that way each index
entry *will* have quite a few entries and the index as a whole won't grow
too big.  Is my thinking correct here?

"You better off create a manuel reverse-index to track modification date,
something like this"  --> I had considered an approach like this but my
concern is that for any given minute *all* of the updates will be handled
by a single node, right?  For example, if the minute_bucket is 2739 then
for that one minute, every single item update will flow to the node at
HASH(2739).  Assuming I am thinking about that right, that seemed like a
potential scaling bottleneck, which scared me off that approach.

Cheers,
Ian





Re: clarification on 100k tombstone limit in indexes

Posted by DuyHai Doan <do...@gmail.com>.
Hello Ian

"It sounds like this 100k limit is, indeed, a "global" limit as opposed to
a per-row limit" -->The threshold applies to each "REQUEST", not partition
or globally.

The threshold does not apply to a partition (physical row) simply because
in one request you can fetch data from many partitions (multi get slice).
There was a JIRA about this here:
https://issues.apache.org/jira/browse/CASSANDRA-6865

"Are these tombstones ever "GCed" out of the index?" --> Yes they are,
during compactions of the index column family.

"How frequently?" --> That's the real pain. Indeed you do not have any
control on the tuning of secondary index CF compaction. As far as I know,
the compaction settings (strategy, min/max thresholds...) inherits from the
one of the base table

Now, from a quick look at your data model, it seems that you have a skinny
partition pattern. Since you mentioned that the date is updated only 10
times max, you should not run into the tombstone threshold issue.

On a side note, your usage of secondary index is not the best one. Indeed,
indexing the update date will lead to a situation where for one date,
you'll mostly have one or a few matching items (assuming that the update
date resolution is small enough and update rate is not intense). It is the
high-cardinality scenario to be avoided (
http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
Plus, the query on the index (find all items where last_updated < [now - 30
minutes]) makes things worse since it is not an exact match but an
inequality.

 You'd be better off creating a manual reverse-index to track modification
date, something like this:

CREATE TABLE last_updated_item (
    minute_bucket int, // format YYYYMMDDHHmm
    last_update_date timestamp,
    item_id ascii,
    PRIMARY KEY(minute_bucket, last_update_date)
);

 The last_update_date column is quite self-explanatory. The minute_bucket
is trickier. The idea is to split time into 30-minute buckets: 00:00 to
00:30 is bucket 1, 00:30 to 01:00 is bucket 2, and so on. For a whole day,
you'd have 48 buckets. We need to put data into buckets to avoid ultra wide
rows, since you mentioned that there are 10 items (so 10 updates) / sec. Of
course, 30 mins is just an example; you can tune it down to a window of 5
minutes or 1 minute, depending on the insertion rate.
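
A rough sketch of how it could be used (the bucket value and timestamps
below are only placeholders): each item update inserts a new entry, and the
once-a-minute job reads back the buckets that are now entirely older than
30 minutes, e.g.

INSERT INTO last_updated_item (minute_bucket, last_update_date, item_id)
VALUES (2739, '2014-08-10 19:48:00', 'item-42');

SELECT item_id, last_update_date
  FROM last_updated_item
 WHERE minute_bucket = 2739
   AND last_update_date < '2014-08-10 19:49:00';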






Re: clarification on 100k tombstone limit in indexes

Posted by Ian Rose <ia...@fullstory.com>.
Hi Mark -

Thanks for the clarification but as I'm not too familiar with the nuts &
bolts of Cassandra I'm not sure how to apply that info to my current
situation.  It sounds like this 100k limit is, indeed, a "global" limit as
opposed to a per-row limit.  Are these tombstones ever "GCed" out of the
index?  How frequently?  If not, then it seems like *any* index is at risk
of reaching this tipping point; it's just that indexes on frequently
updated columns will reach this point faster than indexes on rarely
updated columns.

Basically I'm trying to get some kind of sense for what "frequently updated
<http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html>"
means quantitatively.  As written, the docs make it sound dangerous to
create an index on a column that is *ever* deleted or updated since there
is no sense of how frequent is "too frequent".

Cheers,
Ian







On Sun, Aug 10, 2014 at 3:02 PM, Mark Reddy <ma...@boxever.com> wrote:

> Hi Ian,
>
> The issue here, which relates to both normal and index column families, is
> that scanning over a large number of tombstones can cause Cassandra to
> fall over due to increased GC pressure. This pressure arises because
> tombstones create DeletedColumn objects which consume heap. Also, these
> DeletedColumn objects have to be serialized and sent back to the
> coordinator, thus increasing your response times. Take for example a row
> that sees many deletes and you query it with a limit of 100. In a worst
> case scenario you could end up reading say 50k tombstones to reach the 100
> 'live' column limit, all of which have to be put on heap and then sent
> over the wire to the coordinator. This would be considered a Cassandra
> anti-pattern.[1]
>
> With that in mind there was a debug warning added to 1.2 to inform the
> user when they were querying a row with 1000 tombstones [2]. Then in 2.0
> the action was taken to drop requests reaching 100k tombstones[3] rather
> than just printing out a warning. This is a safety measure, as it is not
> advised to perform such a query and is a result of most people 'doing it
> wrong'.
>
> For those people who understand the risk of scanning over large numbers of
> tombstones there is a configuration option in the cassandra.yaml to
> increase this threshold, tombstone_failure_threshold.[4]
>
>
> Mark
>
> [1]
> http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
> [2] https://issues.apache.org/jira/browse/CASSANDRA-6042
> [3] https://issues.apache.org/jira/browse/CASSANDRA-6117
> [4]
> https://github.com/jbellis/cassandra/blob/4ac18ae805d28d8f4cb44b42e2244bfa6d2875e1/conf/cassandra.yaml#L407-L417
>
>

Re: clarification on 100k tombstone limit in indexes

Posted by Mark Reddy <ma...@boxever.com>.
Hi Ian,

The issue here, which relates to both normal and index column families, is
that scanning over a large number of tombstones can cause Cassandra to fall
over due to increased GC pressure. This pressure arises because tombstones
create DeletedColumn objects, which consume heap. These DeletedColumn
objects also have to be serialized and sent back to the coordinator, which
increases your response times. Take, for example, a row that has seen many
deletes and that you query with a limit of 100. In a worst-case scenario
you could end up reading, say, 50k tombstones just to reach the 100 'live'
column limit, all of which has to be put on heap and then sent over the
wire to the coordinator. This would be considered a Cassandra
anti-pattern.[1]
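
As a rough sketch of that queue-like shape (the table and values here are
purely illustrative, not taken from your schema):

CREATE TABLE queue (
  queue_id ascii,
  enqueued_at timeuuid,
  payload blob,
  PRIMARY KEY (queue_id, enqueued_at)
);

-- Consumers delete each entry once it has been processed (one such DELETE
-- per consumed entry; the ? is a bind marker):
DELETE FROM queue WHERE queue_id = 'q1' AND enqueued_at = ?;

-- ...so a later read must walk all of those tombstones at the head of the
-- partition before it can find 100 live cells to return:
SELECT * FROM queue WHERE queue_id = 'q1' LIMIT 100;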

With that in mind, a debug warning was added in 1.2 to inform the user when
they query a row with more than 1,000 tombstones [2]. Then, in 2.0, the
action was taken to drop requests that reach 100k tombstones [3] rather
than just printing a warning. This is a safety measure, as it is not
advisable to perform such a query and doing so is usually the result of
'doing it wrong'.

For those who understand the risk of scanning over large numbers of
tombstones, there is a configuration option in cassandra.yaml to raise this
threshold: tombstone_failure_threshold.[4]
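
For reference, the relevant settings in cassandra.yaml look like this
(these should be the 2.0 defaults; see the file linked in [4]):

tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000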


Mark

[1]
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
[2] https://issues.apache.org/jira/browse/CASSANDRA-6042
[3] https://issues.apache.org/jira/browse/CASSANDRA-6117
[4]
https://github.com/jbellis/cassandra/blob/4ac18ae805d28d8f4cb44b42e2244bfa6d2875e1/conf/cassandra.yaml#L407-L417



On Sun, Aug 10, 2014 at 7:19 PM, Ian Rose <ia...@fullstory.com> wrote:

> Hi -
>
> On this page (
> http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html),
> the docs state:
>
> Do not use an index [...] On a frequently updated or deleted column
>
>
> and
>
>
>> *Problems using an index on a frequently updated or deleted column*¶
>> <http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__upDatIndx>
>
> Cassandra stores tombstones in the index until the tombstone limit reaches
>> 100K cells. After exceeding the tombstone limit, the query that uses the
>> indexed value will fail.
>
>
>
> I'm afraid I don't really understand this limit from its (brief)
> description.  I also saw this recent thread
> <http://mail-archives.apache.org/mod_mbox/cassandra-user/201403.mbox/%3CCABNXB2Bf4aeoDVpMNOxJ_e7aDez2EuZswMJx=jWfb8=Oyo47kQ@mail.gmail.com%3E> but
> I'm afraid it didn't help me much...
>
>
> *SHORT VERSION*
>
> If I have tens or hundreds of thousands of rows in a keyspace, where every
> row has an indexed column that is updated O(10) times during the lifetime
> of each row, is that going to cause problems for me?  If that 100k limit is *per
> row* then I should be fine but if that 100k limit is *per keyspace* then
> I'd definitely exceed it quickly.
>
>
> *FULL EXPLANATION*
>
> In our system, items are created at a rate of ~10/sec.  Each item is
> updated ~10 times over the next few minutes (although in rare cases the
> number of updates, and the duration, might be several times as long).  Once
> the last update is received for an item, we select it from Cassandra,
> process the data, then delete the entire row.
>
> The tricky bit is that sometimes (maybe 30-40% of the time) we don't
> actually know when the last update has been received so we use a timeout:
> if an item hasn't been updated for 30 minutes, then we assume it is done
> and should process it as before (select, then delete).  So I am trying to
> design a schema that will allow for efficient queries of the form "find me
> all items that have not been updated in the past 30 minutes."  We plan to
> call this query once a minute.
>
> Here is my tentative schema:
>
> CREATE TABLE items (
>   item_id ascii,
>   last_updated timestamp,
>   item_data list<blob>,
>   PRIMARY KEY (item_id)
> )
> plus an index on last_updated.
>
> So updates to an existing item would just be "lookup by item_id, append
> new data to item_data, and set last_updated to now".  And queries to find
> items that have timed out would use the index on last_updated: "find all
> items where last_updated < [now - 30 minutes]".
>
> Assuming, that is, that the aforementioned 100k tombstone limit won't
> bring this index crashing to a halt...
>
> Any clarification on this limit and/or suggestions on a better way to
> model/implement this system would be greatly appreciated!
>
> Cheers,
> Ian
>
>