Posted to user@cassandra.apache.org by Koert Kuipers <Ko...@diamondnotch.com> on 2010/10/14 20:55:12 UTC

deletion

Hello All,

I am testing Cassandra 0.7 with the Avro API on a single machine as a financial time series server, so my setup looks something like this:
keyspace = timeseries, column family = tickdata, key = ticker, super column = field (price, volume, high, low), column = timestamp.

So a single value, say a price of 140.72 for IBM today at 14:00, would be stored as
tickdata["IBM"]["price"]["2010-10-14 14:00"] = 140.72 (well of course everything needs to be encoded properly but you get the point).

My subcomparator type is TimeUUIDType so that I can do queries over time ranges. Inserting and querying all work reasonably well so far.
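
For reference, a typical range query looks roughly like the following, sketched against the equivalent 0.7 Thrift API rather than our actual Avro code (the class name, the helper, the hard-coded ticker/field and the consistency level are illustrative only):

import java.nio.ByteBuffer;
import java.util.List;
import java.util.UUID;
import org.apache.cassandra.thrift.*;

public class TickRangeQuery {
    // Serialize a java.util.UUID into the 16-byte form TimeUUIDType expects.
    static ByteBuffer toBytes(UUID u) {
        ByteBuffer b = ByteBuffer.allocate(16);
        b.putLong(u.getMostSignificantBits());
        b.putLong(u.getLeastSignificantBits());
        b.flip();
        return b;
    }

    // Up to 1000 price ticks for IBM between two TimeUUID bounds.
    static List<ColumnOrSuperColumn> priceTicks(Cassandra.Client client,
                                                UUID start, UUID finish) throws Exception {
        ColumnParent parent = new ColumnParent("tickdata");
        parent.setSuper_column(ByteBuffer.wrap("price".getBytes("UTF-8")));
        SlicePredicate pred = new SlicePredicate();
        pred.setSlice_range(new SliceRange(toBytes(start), toBytes(finish), false, 1000));
        return client.get_slice(ByteBuffer.wrap("IBM".getBytes("UTF-8")),
                                parent, pred, ConsistencyLevel.ONE);
    }
}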

But sometimes I have a need to wipe out all the data for a whole day. To be more precise: I need to delete the stored values for all keys (tickers) and all super-columns (fields) for a given time period (a condition on the column). How would I go about doing that? First a multiget_slice and then a remove command for each value? Or am I missing an easier way?

Is slice deletion within batch_mutate still scheduled to be implemented?

Thanks for your help,
Koert


Re: RE: deletion

Posted by Aaron Morton <aa...@thelastpickle.com>.
Ah, I see the code in thrift/ThriftValidation.java:
throw new InvalidRequestException("Deletion does not yet support SliceRange predicates.");

Sorry about that, I did not fully understand what you were saying. I've done something similar where I did a get_slice, then sent a single batch_mutate to delete all the columns the get_slice returned. This was done in a loop to process, say, 1,000 columns at a time.
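
Concretely, the loop was something like this (a sketch against the 0.7 Thrift API, not my actual code; the CF name "tickdata", the page size and the consistency level are assumptions):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.cassandra.thrift.*;

public class PagedDelete {
    // Read a page of columns, delete exactly what was read, repeat.
    // Assumes the writes used microsecond timestamps, so the deletion
    // timestamp below shadows them.
    static void deleteSlice(Cassandra.Client client, ByteBuffer rowKey,
                            ByteBuffer superCol, ByteBuffer start,
                            ByteBuffer finish) throws Exception {
        ColumnParent parent = new ColumnParent("tickdata");
        parent.setSuper_column(superCol);
        while (true) {
            SlicePredicate page = new SlicePredicate();
            page.setSlice_range(new SliceRange(start, finish, false, 1000));
            List<ColumnOrSuperColumn> cols =
                client.get_slice(rowKey, parent, page, ConsistencyLevel.QUORUM);
            if (cols.isEmpty()) break;

            List<ByteBuffer> names = new ArrayList<ByteBuffer>();
            for (ColumnOrSuperColumn c : cols) names.add(c.column.name);

            SlicePredicate toDelete = new SlicePredicate();
            toDelete.setColumn_names(names); // named columns, not a SliceRange
            Deletion d = new Deletion();
            d.setTimestamp(System.currentTimeMillis() * 1000);
            d.setSuper_column(superCol);
            d.setPredicate(toDelete);
            Mutation m = new Mutation();
            m.setDeletion(d);

            Map<String, List<Mutation>> byCf = new HashMap<String, List<Mutation>>();
            byCf.put("tickdata", Collections.singletonList(m));
            Map<ByteBuffer, Map<String, List<Mutation>>> byKey =
                new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
            byKey.put(rowKey, byCf);
            client.batch_mutate(byKey, ConsistencyLevel.QUORUM);

            if (cols.size() < 1000) break;
            start = cols.get(cols.size() - 1).column.name; // resume past this page
        }
    }
}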

If deleting data is important to your app, and you *generally* do it over the same time span such as a whole day, it may be good to support that in the model and break the rows into days. If it's not critical, run it as a background task perhaps.

One warning: deleting a lot of columns from a row can cause get_slice on that row to slow down when a start_column is not specified or does not exist, until a major compaction removes the tombstones. See http://www.mail-archive.com/user@cassandra.apache.org/msg05938.html

The TimeUUID approach makes sense. I guess you can set the second part to zero when you want to get all the values after a certain time.
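
Something like this, I'd guess (a sketch; the epoch-offset constant is the standard version-1 UUID one, the class and method names are made up):

import java.util.UUID;

public class TimeUUIDBounds {
    // 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    // Smallest version-1 UUID for a given millisecond: every UUID generated
    // at or after this instant sorts after it under TimeUUIDType.
    static UUID lowerBound(long unixMillis) {
        long ts = unixMillis * 10000 + UUID_EPOCH_OFFSET;
        long msb = (ts << 32)                      // time_low
                 | ((ts & 0xFFFF00000000L) >>> 16) // time_mid
                 | 0x1000L                         // version 1
                 | ((ts >>> 48) & 0x0FFFL);        // time_hi
        return new UUID(msb, 0L);                  // "second part" zeroed
    }
}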

Good Luck 
Aaron


 

RE: deletion

Posted by Koert Kuipers <Ko...@diamondnotch.com>.
Aaron, thanks for your response.

I use a custom UUID generator so that the second part is randomly generated (no MAC address). I actually want this to be random, since I could potentially have multiple values for the same ticker, measure, and time, and I do not want one to overwrite another.
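
For the curious, the generator boils down to roughly this (a sketch of the idea, not our production code; the class and method names are made up):

import java.security.SecureRandom;
import java.util.UUID;

public class RandomTimeUUID {
    static final SecureRandom RNG = new SecureRandom();
    // 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    // Version-1 time bits in the high half, random clock-seq/node in the low
    // half, so two values at the same instant still get distinct column names.
    static UUID make(long unixMillis) {
        long ts = unixMillis * 10000 + UUID_EPOCH_OFFSET;
        long msb = (ts << 32)                      // time_low
                 | ((ts & 0xFFFF00000000L) >>> 16) // time_mid
                 | 0x1000L                         // version 1
                 | ((ts >>> 48) & 0x0FFFL);        // time_hi
        long lsb = (RNG.nextLong() & 0x3FFFFFFFFFFFFFFFL)
                 | 0x8000000000000000L;            // RFC 4122 variant bits
        return new UUID(msb, lsb);
    }
}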

I didn't realize that supercolumns had that limitation. ticker:measure fields indeed seem to make sense. That's a relatively easy switch.

I could indeed add the day to the field (so ticker:measure:day) to enable easy deletion of days. However, this doesn't feel very clean. I would prefer to keep using columns for time and use a slice for deletion. However, the last time I tried this I got an error (something about slice deletion not yet being supported with batch_mutate). CASSANDRA-494 seems to indicate this is still in the works, but I am not sure if it actually is.
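
Concretely, the call that was rejected looked roughly like this (a Thrift sketch of what I am attempting; our real code goes through Avro, and the two bounds are assumed to be 16-byte serialized TimeUUIDs):

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.*;

public class SliceDeletion {
    // What I'd like to work: one mutation deleting a whole time range.
    // As of 0.7 the server rejects it with "Deletion does not yet support
    // SliceRange predicates." (CASSANDRA-494).
    static Mutation sliceDeletion(ByteBuffer startUuid, ByteBuffer endUuid) {
        SlicePredicate range = new SlicePredicate();
        range.setSlice_range(new SliceRange(startUuid, endUuid, false, Integer.MAX_VALUE));
        Deletion d = new Deletion();
        d.setTimestamp(System.currentTimeMillis() * 1000);
        d.setPredicate(range); // <-- the unsupported SliceRange predicate
        Mutation m = new Mutation();
        m.setDeletion(d);
        return m;
    }
}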

Thanks again. Koert



Re: deletion

Posted by Aaron Morton <aa...@thelastpickle.com>.
I would recommend using epoch time for your timestamp and comparing as LongType. The version 1 UUID includes the MAC address of the machine that generated it, so two different machines will create different UUIDs for the same time. They are meant to be unique, after all: http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Version_1_.28MAC_address.29

You may also want to adjust your model; see the discussion of supercolumn limitations here: http://wiki.apache.org/cassandra/CassandraLimitations . Your current model is going to create very big super columns, which will degrade in performance over time. Perhaps use a standard CF with "ticker:measure" as the row key; then you can add 2 billion (I think) columns on there, one per time. You may still want to break the rows up further depending on your use case, e.g. ticker:measure:day, and then perhaps pull back the entire row to get every value for the day, or delete the entire day easily.
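
As a rough illustration of that layout (a sketch; the CF name "ticks", the key format and the value encoding are just examples):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import org.apache.cassandra.thrift.*;

public class StandardCfTicks {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Standard CF (here called "ticks") with comparator LongType: one row per
    // ticker:measure:day, one column per epoch-millis timestamp.
    static void insertTick(Cassandra.Client client, String ticker, String measure,
                           String day, long epochMillis, double value) throws Exception {
        ByteBuffer rowKey =
            ByteBuffer.wrap((ticker + ":" + measure + ":" + day).getBytes(UTF8));
        ByteBuffer colName = ByteBuffer.allocate(8);
        colName.putLong(epochMillis);
        colName.flip();
        Column col = new Column();
        col.setName(colName);
        col.setValue(ByteBuffer.wrap(Double.toString(value).getBytes(UTF8)));
        col.setTimestamp(System.currentTimeMillis() * 1000); // microseconds
        client.insert(rowKey, new ColumnParent("ticks"), col, ConsistencyLevel.QUORUM);
    }
}

With that layout, a day's slice is just two 8-byte LongType bounds, and deleting a day is a single row deletion.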

For your deletion issue, batch_mutate is your friend. The Deletion struct lets you delete (sketched after the list):
- a row, by excluding both the predicate and super_column
- a super_column, by including super_column but not a predicate
- individual columns, by including a predicate that names them
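
Roughly, the three shapes look like this (a sketch; the timestamps, super column and column names are placeholders):

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;
import org.apache.cassandra.thrift.*;

public class DeletionShapes {
    static final Charset UTF8 = Charset.forName("UTF-8");

    static void examples() {
        long now = System.currentTimeMillis() * 1000; // microseconds
        ByteBuffer price = ByteBuffer.wrap("price".getBytes(UTF8));

        // 1. Whole row: no super_column, no predicate.
        Deletion wholeRow = new Deletion();
        wholeRow.setTimestamp(now);

        // 2. One super column and everything under it: super_column, no predicate.
        Deletion superColumn = new Deletion();
        superColumn.setTimestamp(now);
        superColumn.setSuper_column(price);

        // 3. Named columns: a predicate listing them (plus super_column here,
        //    since the columns live under one).
        SlicePredicate names = new SlicePredicate();
        names.setColumn_names(Arrays.asList(
            ByteBuffer.wrap("a".getBytes(UTF8)),
            ByteBuffer.wrap("b".getBytes(UTF8))));
        Deletion namedColumns = new Deletion();
        namedColumns.setTimestamp(now);
        namedColumns.setSuper_column(price);
        namedColumns.setPredicate(names);
    }
}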

Some of the cases that were not implemented were fixed in 0.6.4, I think. Anyway, they all work AFAIK.

Hope that helps. 
Aaron

