Posted to user@cassandra.apache.org by Lawrence Turcotte <la...@gmail.com> on 2013/11/16 20:45:47 UTC

DESIGN QUESTION: Need to update only older data in cassandra

The data consists of an account id with a timestamp column that indicates
when the account was updated. This is not to be confused with the row
insertion/update timestamp maintained by Cassandra for conflict resolution
within the Cassandra nodes. Furthermore, the account has about 200 columns
and updates occur nightly in batch mode, where roughly 300-400 million
updates are sent. The problem occurs during the day, when updates can be
sent that possibly contain older data than the nightly batch update. As
such, the requirement is to first look at the account update timestamp in
the database and compare it with the proposed update timestamp to determine
whether to update or not.

The concern is that a read before update in Cassandra is generally not a
good idea. To alleviate this problem I was thinking of either maintaining a
separate Cassandra db with only two columns, account id and update
timestamp, and using it as a lookup before updating, or setting up a stored
procedure within the main database to do the read and update only if the
data within the database is older.

UPDATE Account SET some columns WHERE lastUpdateTimeStamp <
proposedUpdateTimeStamp.

I am leaning towards the separate database or keyspace as a simple
lookup to determine whether to update the data in the main Cassandra
database, that is, the database that contains the 200 columns of account
data. If this is the best choice then I would like to explore the pros and
cons of creating a separate Cassandra cluster for the lookup of account
update timestamps vs. just adding another keyspace within the main
Cassandra database, in terms of performance implications. In this account
and timestamp only database I would also need to update the timestamp
whenever the main database is updated.

Any thoughts are welcome

Lawrence

Re: DESIGN QUESTION: Need to update only older data in cassandra

Posted by Aaron Morton <aa...@thelastpickle.com>.

>  The problem occurs during the day, when updates can be sent that possibly contain older data than the nightly batch update.
If you have an application-level sequence for updates (I used that term to avoid saying timestamp) you could use it as the Cassandra timestamp. As long as you know it only ever increases, it’s fine. You can specify the timestamp for a column via either Thrift or CQL3.

When updates come in during the day with an older timestamp, just send the write and it will be ignored: for each column, Cassandra keeps the value with the highest timestamp.
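
To illustrate why no read before write is needed, here is a toy sketch of per-column last-write-wins reconciliation. It is only a model of the idea, not Cassandra's actual code: the names Cell and apply_write, the column names, and the timestamp values are all made up for the example, and tie handling is simplified. In CQL3 the write timestamp is supplied with the USING TIMESTAMP clause of an UPDATE or INSERT.

```python
# Toy model of per-column last-write-wins reconciliation.
# Cell, apply_write and the column names are hypothetical, for illustration only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # writer-supplied, e.g. the account's own update time


def apply_write(row, updates, timestamp):
    """Merge a write into row, keeping the newest cell for each column."""
    for column, value in updates.items():
        current = row.get(column)
        # Simplification: on a tie the existing cell is kept here;
        # Cassandra applies its own tie-break rule.
        if current is None or timestamp > current.timestamp:
            row[column] = Cell(value, timestamp)
    return row


row = {}
# Nightly batch write carrying application timestamp 2000.
apply_write(row, {"balance": "100", "status": "open"}, timestamp=2000)
# Stale daytime update (1500 < 2000): sent blindly, loses on every column.
apply_write(row, {"balance": "75"}, timestamp=1500)
# Newer daytime update (2500 > 2000): wins for the column it touches.
apply_write(row, {"status": "closed"}, timestamp=2500)
```

Note that the comparison happens per column, not per row, so a row can end up holding columns that come from writes with different timestamps.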

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
