Posted to dev@cassandra.apache.org by Steve <sj...@shic.co.uk> on 2010/04/06 13:00:01 UTC

A question of 'referential integrity'...

First, I apologise for sending this to the 'dev' mailing list - I couldn't
find one for Cassandra users - and also for the basic nature of my
questions...

I'm trying to get my head around the possibility of using Cassandra as
the back-end to a project... and while, in most respects, Cassandra
looks absolutely ideal... I'm finding it difficult to ascertain an
appropriate strategy to ensure consistency (which would come 'for free'
with a traditional, transactional, back end.)

As a sufficient abstraction of my storage requirements, imagine two
(application oriented) universes of SHA-512 hashes - SRC and DST (each
will, in practice, be tagged with other application data).  I need to
support a remote API to manage a one-to-many mapping from SRC to DST, and a
consistent (efficiently addressable) one-to-one mapping from DST to SRC.  I
need to envisage millions of clients and tens of billions of mappings
with billions of updates and lookups daily...

newAssoc(s:SRC,d:DST)
listAssoc(s:SRC) => List<d:DST>
delAssoc(d:DST)
lookupAssoc(d:DST) => s:SRC

If I were using BDB, I'd have two maps - the first with s:SRC as key and
d:DST as value - the second with (d:DST,s:SRC) as key with no values....
and update these maps in a transaction.
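
To make that concrete, here is a rough sketch of the two-map model in
Python - plain in-memory dicts standing in for the two BDB maps, and the
function names are just my own rendering of the API above:

# Illustrative only: two in-memory maps standing in for the BDB databases.
src_to_dst = {}   # SRC -> set of DST   (the one-to-many map)
dst_to_src = {}   # DST -> SRC          (the one-to-one map)

def new_assoc(s, d):
    # A second SRC asking to associate with an existing DST must fail,
    # preserving the one-to-one mapping from DST to SRC.
    if d in dst_to_src and dst_to_src[d] != s:
        raise ValueError("DST already associated with a different SRC")
    dst_to_src[d] = s
    src_to_dst.setdefault(s, set()).add(d)

def list_assoc(s):
    return sorted(src_to_dst.get(s, set()))

def del_assoc(d):
    s = dst_to_src.pop(d, None)
    if s is not None:
        src_to_dst.get(s, set()).discard(d)

def lookup_assoc(d):
    return dst_to_src[d]

With BDB, the two updates inside new_assoc (and del_assoc) would sit in
one transaction; getting the same effect across two Cassandra column
families is exactly the problem.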

If I were in SQL land, I'd need a table a bit like this:

create table Assoc( src binary(64) , dst binary(64) unique, primary key
(src,dst) )

The implementations of the API calls would be trivial insert, delete and
select operations - and consistency between the primary key and the
implicit (unique constraint) index would arise as a consequence of
transactions.  I realise that, with Cassandra, I need a different
approach - since I don't have the same notion of transactions on which
to rely... and, in any case, given a desire for scalability, relying
upon such fine grained transactions would definitely prove a
bottleneck.  That said, the uniqueness of DST values is systemically
critical - so, even while I do not anticipate duplicate hashes in
practice, I need uniqueness to be verified - and for a second SRC value
asking to associate with an existing DST value to fail without
violating the one->one mapping from DST to SRC... and for this failure
to be notified ASAP.
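
The failure mode I want to rule out, as a little hedged sketch (two
threads standing in for two clients, and a sleep standing in for
replication/visibility lag - not real Cassandra code):

import threading, time

dst_to_src = {}                      # stands in for the DST -> SRC map
outcomes = []

def racy_new_assoc(s, d):
    # Naive check-then-write: without a transaction or lock, both clients
    # can pass the check before either write is visible to the other.
    if d not in dst_to_src:
        time.sleep(0.1)              # stands in for replication/read lag
        dst_to_src[d] = s            # the later writer silently wins
        outcomes.append(s)

d = b"same-dst-hash"
t1 = threading.Thread(target=racy_new_assoc, args=(b"src-A", d))
t2 = threading.Thread(target=racy_new_assoc, args=(b"src-B", d))
t1.start(); t2.start(); t1.join(); t2.join()
print(outcomes)        # both calls "succeed", e.g. [b'src-A', b'src-B']
print(dst_to_src[d])   # only one mapping survives; the loser is never told

That 'never told' part is what I mean by needing the failure to be
notified ASAP.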

It strikes me that a plausible design might be one that writes a log of
'insert/delete' with pairs of hashes which some background process
eventually indexes in a big batch... before clearing the transaction
log.  If this is "The Cassandra Way" - I'm surprised not to have found
any examples... am I looking in the wrong places for them?  Is my log of
'insert' and 'delete' operations something I'd implement myself using
ad-hoc techniques, or is there explicit support for this in Cassandra? 
Do I need to develop my own process (from scratch) to merge updates with
on-disk data - or is there a neat way to get Cassandra to do that for me?
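
Roughly what I have in mind, as a hedged sketch with plain Python
structures standing in for column families - an append-only 'pending' log
plus a background pass that folds it into the two maps and then clears it
(the names are mine, not anything Cassandra provides):

from collections import defaultdict

pending_log = []                  # list of ("insert"|"delete", src, dst)
src_to_dst = defaultdict(set)     # SRC -> set of DST
dst_to_src = {}                   # DST -> SRC

def log_insert(s, d):
    pending_log.append(("insert", s, d))   # cheap single append at write time

def log_delete(d):
    pending_log.append(("delete", None, d))

def apply_pending():
    # Background batch pass: fold the log into both maps, then clear it.
    # Re-applying an entry leaves the maps unchanged, so a crash mid-pass
    # can simply be retried from the surviving log.
    for op, s, d in list(pending_log):
        if op == "insert":
            if d not in dst_to_src:        # DST uniqueness enforced here
                dst_to_src[d] = s
                src_to_dst[s].add(d)
        else:
            old = dst_to_src.pop(d, None)
            if old is not None:
                src_to_dst[old].discard(d)
    pending_log.clear()

Whether that pending log should itself live in Cassandra, and whether
anything like this batch merge already exists, is what I'm asking.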

Another issue I'm considering is if I should map from SRC to a list of
DST as my low-level representation with Cassandra... or should I map
individually.  A potential problem is that one SRC value can map to
arbitrarily many DST values.   At the level of the RPC API, I can
address this by returning an object resembling a scrollable cursor
instead of a static list - but, I guess, I'd need to be concerned about
resource limitations (memory, etc.) for the on-disk representation?  I
presume that there's a significant advantage to storing the one-to-many
map explicitly (locality of reference, for example) - as well as
minimising the size of the encoded data... I'm guessing that there is no
prefix-compression for keys?  Key compression would likely lead to the
opposite architectural decisions from a resource-use perspective... and
would eliminate concerns about maps from single SRC values to very large
numbers of DST values.
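
For what it's worth, the two on-disk shapes I'm weighing, sketched with
plain dicts (a hypothetical layout only, not real Cassandra schema
definitions):

# Option A: one "wide row" per SRC, each associated DST stored as a column
# name with an empty value.  Good locality for listAssoc(s), but a single
# row grows without bound if one SRC maps to very many DSTs.
wide_rows = {
    b"src-1": {b"dst-a": b"", b"dst-b": b""},
}

# Option B: one tiny row per (SRC, DST) pair.  Rows stay bounded, but every
# key repeats the full 64-byte SRC hash - hence my interest in key prefix
# compression - and listAssoc(s) turns into a scan for keys sharing the
# SRC prefix.
pair_rows = {
    b"src-1" + b"dst-a": b"",
    b"src-1" + b"dst-b": b"",
}

def list_assoc_wide(s):
    return sorted(wide_rows.get(s, {}))

def list_assoc_pairs(s):
    return sorted(k[len(s):] for k in pair_rows if k.startswith(s))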

Any hints, tips, comments, pointers to relevant documentation etc. would
be much appreciated...  I'm guessing many others have tackled a problem
similar to this?


Re: A question of 'referential integrity'...

Posted by Tatu Saloranta <ts...@gmail.com>.
On Tue, Apr 6, 2010 at 2:12 PM, Steve <sj...@shic.co.uk> wrote:
...
> Should I assume that it isn't common practice to write updates
> atomically in-real time, and batch process them 'off-line' to increase
> the atomic granularity?  It seems an obvious strategy... possibly one
> for which an implementation might use "MapReduce" or something similar?
> I don't want to re-invent the wheel, of course.

I think this is indeed a common and common sense approach.

Traditionally this has actually been done by even simpler systems,
like writing log files on hosts, beaming them up to a central
location, and processing aggregates that way. But key/value stores can
be used to reduce the latency that is the problematic part of simple
log-based aggregation.

-+ Tatu +-

Re: A question of 'referential integrity'...

Posted by Steve <sj...@shic.co.uk>.
On 06/04/2010 21:40, Benjamin Black wrote:
> I suggest the reasons you list (which are certainly great reasons!)
> are also the reasons there is no referential integrity or transaction
> support.  
Quite.  I'm not trying to make recommendations for how Cassandra should
be changed to be more like a traditional RDBMS... I just have a
requirement, at the logical level, that would be trivial with
traditional technology - so the analogy seemed an ideal way to
illustrate the issue.
> It seems the common practice of using a system like
> Zookeeper for the synchronization parts alongside Cassandra would be
> applicable here.  Have you investigated that?
>   
I started looking at Zookeeper when it was mentioned in an earlier
reply.  I've discovered it supports something called "Ledgers" - but I'm
still unclear if they'd be useful to me - I've only uncovered a very
high-level overview so far.  I'm concerned that Zookeeper looks as if it
might become a problematic bottleneck if all the updates must be routed
through it.  I don't see Zookeeper mutexes as being especially
helpful... because my problem isn't really about two incompatible
requests in quick succession - but, rather, about needing to ensure that
"referential integrity" is eventually established between two, otherwise
independent, keysets.  I need to eliminate the possibility that I end up
with 'dangling' inaccessible data should a hash-value become recorded in
the range of the first map but not the domain of the second (or vice-versa.)

Should I assume that it isn't common practice to write updates
atomically in real time, and batch-process them 'off-line' to increase
the atomic granularity?  It seems an obvious strategy... possibly one
for which an implementation might use "MapReduce" or something similar? 
I don't want to re-invent the wheel, of course.


Re: A question of 'referential integrity'...

Posted by Benjamin Black <b...@b3k.us>.
I suggest the reasons you list (which are certainly great reasons!)
are also the reasons there is no referential integrity or transaction
support.  It seems the common practice of using a system like
Zookeeper for the synchronization parts alongside Cassandra would be
applicable here.  Have you investigated that?


b

On Tue, Apr 6, 2010 at 11:47 AM, Steve <sj...@shic.co.uk> wrote:
> On 06/04/2010 18:50, Benjamin Black wrote:
>
> I'm finding this exchange very confusing.  What exactly about
> Cassandra 'looks absolutely ideal' to you for your project?  The write
> performance, the symmetric, peer to peer architecture, etc?
>
>
> Reasons I like Cassandra for this project:
>
> * Columnar rather than tabular data structures with an extensible
>   'schemata' - permitting evolution of back-end data structures to support
>   new features without down-time.
> * Decentralised architecture with fault tolerance/redundancy permitting
>   high availability on shoestring budget hardware in an easily scalable
>   pool - in spite of needing to track rapidly changing data that precludes
>   meaningful backup.
> * Easy to establish that data will be efficiently sharded - allowing many
>   concurrent reads and writes - i.e. systemic IO bandwidth is scalable -
>   both for reading and writing.
> * Lightweight, free and open-source physical data model that minimises
>   risk of vendor lock-in or insurmountable problems with glitches in
>   commercial closed-source libraries.
>
> A shorter answer might be that, in all ways other than depending upon
> 'referential integrity' between two 'maps' of hash-values, the data for the
> rest of my application looks remarkably like that of large sites that we
> know already use Cassandra.
>
> I'm trying to establish the most effective Cassandra approach to achieve the
> logical 'referential integrity' while minimising resource (memory/disk/CPU)
> use in order to minimise hardware costs for any given deployment scale - all
> the while, retaining the above advantages.
>
>

Re: A question of 'referential integrity'...

Posted by Steve <sj...@shic.co.uk>.
On 06/04/2010 18:50, Benjamin Black wrote:
> I'm finding this exchange very confusing.  What exactly about
> Cassandra 'looks absolutely ideal' to you for your project?  The write
> performance, the symmetric, peer to peer architecture, etc?
>   

Reasons I like Cassandra for this project:

    * Columnar rather than tabular data structures with an extensible
      'schemata' - permitting evolution of back-end data structures to
      support new features without down-time.
    * Decentralised architecture with fault tolerance/redundancy
      permitting high availability on shoestring budget hardware in an
      easily scalable pool - in spite of needing to track rapidly
      changing data that precludes meaningful backup.
    * Easy to establish that data will be efficiently sharded - allowing
      many concurrent reads and writes - i.e. systemic IO bandwidth is
      scalable - both for reading and writing.
    * Lightweight, free and open-source physical data model that
      minimises risk of vendor lock-in or insurmountable problems with
      glitches in commercial closed-source libraries.

A shorter answer might be that, in all ways other than depending upon
'referential integrity' between two 'maps' of hash-values, the data for
the rest of my application looks remarkably like that of large sites
that we know already use Cassandra.

I'm trying to establish the most effective Cassandra approach to achieve
the logical 'referential integrity' while minimising resource
(memory/disk/CPU) use in order to minimise hardware costs for any given
deployment scale - all the while, retaining the above advantages.


Re: A question of 'referential integrity'...

Posted by Benjamin Black <b...@b3k.us>.
I'm finding this exchange very confusing.  What exactly about
Cassandra 'looks absolutely ideal' to you for your project?  The write
performance, the symmetric, peer to peer architecture, etc?


b

Re: A question of 'referential integrity'...

Posted by Steve <sj...@shic.co.uk>.
On 06/04/2010 18:53, Tatu Saloranta wrote:
>> I've read all about QUORUM, and it is generally useful, but as far as I
>> can tell, it can't give me a transaction...
>>     
> Correct. Only individual operations are atomic, and ordering of
> insertions is not guaranteed.
>   
As I thought.
> I think there were some logged Jira issues to allow grouping of
> operations into what seems to amount to transactions, which could help
> a lot here... but I can't find it now (or maybe it has only been
> discussed so far?).
> If I understand this correctly, it would just mean that you could send
> a sequence of operations, to be completed as a unit (first into
> journal, then into memtable etc).
>   

I think we're on the same page.  I need an atomic 'transaction'
affecting multiple keys - so I write a tuple of all the updates
(inserts/deletes) as a single value into a 'merge-pending' keyset... and
(somehow - perhaps with memtable) I modify data read from  other keysets
to be as-if this 'merge-pending' data had already been been applied to
the independent keysets to which it relates.  A process/thread on each
node would continuously attempt to apply the multiple updates from the
merge-pending data before deleting it and dropping the associated
merge-data from the in-memory transformations. Latency should be very
low (like with a log-based file-system) and throughput should be
reasonably high because there should be a lot of flexibility in batch
processing the 'merge-pending' data.

This way, if there's a failure during merging, there's sufficient
durable record to complete the merge before serving any more remote
requests.  To the remote client, it appears indistinguishable from an
atomic transaction that affected more than one key.
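
A hedged sketch of that read path (plain Python stand-ins again; in
reality the 'merge-pending' rows would be read from, and deleted in,
Cassandra):

# Durable state: the base map plus a keyset of pending multi-key updates,
# each written as a single value so the write itself is atomic.
dst_to_src = {b"dst-old": b"src-1"}
merge_pending = [
    {"insert": [(b"src-2", b"dst-new")], "delete": [b"dst-old"]},
]

def lookup_assoc(d):
    # Answer as if every pending merge had already been applied.
    value = dst_to_src.get(d)
    for txn in merge_pending:
        if d in txn["delete"]:
            value = None
        for s, d_new in txn["insert"]:
            if d_new == d:
                value = s
    return value

def merge_worker():
    # Fold pending updates into the base map, deleting each entry only
    # after it has been fully applied; replaying after a crash just redoes
    # work that is already reflected in the base map.
    while merge_pending:
        txn = merge_pending[0]
        for d in txn["delete"]:
            dst_to_src.pop(d, None)
        for s, d in txn["insert"]:
            dst_to_src.setdefault(d, s)
        merge_pending.pop(0)

Here lookup_assoc(b"dst-new") answers b"src-2" and lookup_assoc(b"dst-old")
answers None both before and after merge_worker() runs - which is the
'indistinguishable from an atomic transaction' property I'm after.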


Re: A question of 'referential integrity'...

Posted by Tatu Saloranta <ts...@gmail.com>.
On Tue, Apr 6, 2010 at 10:12 AM, Steve <sj...@shic.co.uk> wrote:
> On 06/04/2010 15:26, Eric Evans wrote:
...
> I've read all about QUORUM, and it is generally useful, but as far as I
> can tell, it can't give me a transaction...

Correct. Only individual operations are atomic, and ordering of
insertions is not guaranteed.

I think there were some logged Jira issues to allow grouping of
operations into what seems to amount to transactions, which could help
a lot here... but I can't find it now (or maybe it has only been
discussed so far?).
If I understand this correctly, it would just mean that you could send
a sequence of operations, to be completed as a unit (first into
journal, then into memtable etc).

-+ Tatu +-

Re: A question of 'referential integrity'...

Posted by Steve <sj...@shic.co.uk>.
On 06/04/2010 15:26, Eric Evans wrote:
> On Tue, 2010-04-06 at 12:00 +0100, Steve wrote:
>   
>> First, I apologise sending this to the 'dev' mailing list - I couldn't
>> find one for Cassandra users - and also for the basic nature of my
>> questions...
>>     
> user@cassandra.apache.org, (follow-ups there).
>   
Thanks...  I'm now subscribed.
>> I'm trying to get my head around the possibility of using Cassandra as
>> the back-end to a project... and while, in most respects, Cassandra
>> looks absolutely ideal... I'm finding it difficult to ascertain an
>> appropriate strategy to ensure consistency (which would come 'for free'
>> with a traditional, transactional, back end.)
>>
>> As a sufficient abstraction of my storage requirements, imagine two
>> (application oriented) universes of SHA-512 hashes - SRC and DST (each
>> will, in practice, be tagged with other application data).  I need to
>> support a remote API to manage a one-many mapping from SRC to DST, and a
>> consistent (efficiently addressable) one-one mapping from DST to SRC.  I
>> need to envisage millions of clients and tens of billions of mappings
>> with billions of updates and lookups daily...
>>
>> newAssoc(s:SRC,d:DST)
>> listAssoc(s:SRC) => List<d:DST>
>> delAssoc(d:DST)
>> lookupAssoc(d:DST) => s:SRC
>>
>> If I were using BDB, I'd have two maps - the first with s:SRC as key and
>> d:DST as value - the second with (d:DST,s:SRC) as key with no values....
>> and update these maps in a transaction.
>>     
> You could model it like this with Cassandra as well.
>
> It sounds like the real question though is, how can you structure this
> to work given Cassandra's eventual consistency and record-level
> atomicity?
>   
Yes, that's another way to ask the same question.

Based upon my understanding of Cassandra to date, it seems to me that
using two maps (i.e. the same data structure as I suggested for BDB) I'd
not have the same consistency... even eventually... as a consequence of
the record-level atomicity.  I have to assume a fail-stop between
writing the first and second keys... under those circumstances the
stored data will be inconsistent with the logical model's requirements.
Sure, I could write a program to verify that, for each SRC value, every
associated DST value is mapped back (assuming the SRC->DST mapping is
written first) but that would likely lead to prohibitively slow start-up
times and/or a significant overhead for every access.
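
The verification pass I mean would look roughly like this (a hedged
sketch, with plain dicts standing in for the two column families):

def find_dangling(src_to_dst, dst_to_src):
    # Report pairs whose forward and reverse mappings disagree.  Assumes
    # the SRC->DST entry is written first, so a fail-stop between the two
    # writes leaves a DST listed under some SRC but missing from the
    # reverse map.
    dangling = []
    for s, dsts in src_to_dst.items():
        for d in dsts:
            if dst_to_src.get(d) != s:
                dangling.append((s, d))
    return dangling

# Example: the second write never happened for dst-b.
print(find_dangling({b"src-1": {b"dst-a", b"dst-b"}},
                    {b"dst-a": b"src-1"}))
# -> [(b'src-1', b'dst-b')]

Running that over tens of billions of mappings at start-up is exactly the
cost I want to avoid.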

> For example, QUORUM consistency for reads and writes are enough to
> ensure that your SRC->DST mappings remain unique and consistent from the
> perspective of the client. And, if you can't make your application
> resilient to inconsistencies in the inverted index (detect, repair,
> etc), you could always use a Zookeeper-based mutex.
>   
I think it looks impractical for me to make the client resilient to
inconsistent data in this context (though, I agree, that would be the
ideal solution if it were viable... and will be the preferred solution
elsewhere in the project.)

I've read all about QUORUM, and it is generally useful, but as far as I
can tell, it can't give me a transaction... if my replicated servers are
all supplied by a common mains electricity source, it's very likely that
all the replicated nodes will fail-stop at exactly the same time...
hence providing me with no protection against one map being updated and
not the other.  While I've only recently started looking at Zookeeper, I
suspect (at least partly because I understand mutexes) that it won't
address this issue either.  I guess it would be helpful if I needed to
be sure that two identical DST hashes are not mapped to different SRC
hashes at (almost exactly) the same time - but given that no collisions
are expected... I can probably back out of that error condition
safely enough without needing to introduce the bottleneck of running all
my updates through a single Zookeeper process.

> If you haven't already I'd suggest reading the Amazon whitepaper on
> Dynamo[1] to understand eventual consistency, and Cassandra's API[2]
> docs for how to apply it here.
>
> [1]:
> http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
> [2]: http://wiki.apache.org/cassandra/API#ConsistencyLevel
>   
I'd already read the API - the dynamo document is interesting, too...
though I didn't see anything especially relevant in it. Perhaps I'm
overlooking what you wanted to point out?

>> If I were in SQL land, I'd need a table a bit like this:
>>
>> create table Assoc( src binary(64) , dst binary(64) unique, primary key
>> (src,dst) )
>>
>> The implementations of the API calls would be trivial insert, delete and
>> select operations - and consistency between the primary key and the
>> implicit (unique constraint) index would arise as a consequence of
>> transactions.  I realise that, with Cassandra, I need a different
>> approach - since I don't have the same notion of transactions on which
>> to rely... and, in any case, given a desire for scalability, relying
>> upon such fine grained transactions would definitely prove a
>> bottleneck.  That said, the uniqueness of DST values is systemically
>> critical - so, even while I do not anticipate duplicate hashes in
>> practice, I need uniqueness to be verified - and for the second SRC
>> values asking to associate with an existing DST value to fail without
>> violating the one->one mapping from DST to SRC... and for this failure
>> to be notified ASAP.
>>
>> It strikes me that a plausible design might be one that writes a log of
>> 'insert/delete' with pairs of hashes which some background process
>> eventually indexes in a big batch... before clearing the transaction
>> log.  If this is "The Cassandra Way" - I'm surprised not to have found
>> any examples... am I looking in the wrong places for them?  Is my log of
>> 'insert' and 'delete' operations something I'd implement myself using
>> ad-hoc techniques, or is there explicit support for this in Cassandra? 
>> Do I need to develop my own process (from scratch) to merge updates with
>> on-disk data - or is there a neat way to get Cassandra to do that for me?
>>
>> Another issue I'm considering is if I should map from SRC to a list of
>> DST as my low-level representation with Cassandra... or should I map
>> individually.  A potential problem is that one SRC value can map to
>> arbitrarily many DST values.   At the level of the RPC API, I can
>> address this by returning an object resembling a scrollable cursor
>> instead of a static list - but, I guess, I'd need to be concerned about
>> resource limitations (memory, etc.) for the on-disk representation?  I
>> presume that there's a significant advantage to storing the one-to-many
>> map explicitly (locality of reference, for example) - as well as
>> minimising the size of the encoded data... I'm guessing that there is no
>> prefix-compression for keys?  Key compression would likely lead to the
>> opposite architectural decisions from a resource-use perspective... and
>> would eliminate concerns about maps from single SRC values to very large
>> numbers of DST values.
>>     
I'm still interested in whether or not Cassandra does (or will) support
key prefix compression...  That could greatly influence architectural
choices.

I'm also interested to establish whether others effectively gain ACID
transactions by first writing application-oriented transactional data to
a log (which can be applied on the fly to data read from the main
keymap) while a background process applies the transitions in an
idempotent style - thus securing the appearance of
traditional transactions.  Is this something not recommended because,
where Cassandra has been typically deployed, it hasn't been necessary -
or is there a deeper reason?
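
By 'idempotent style' I mean something like the following hedged
fragment - applying a logged transition twice leaves the same state, so
the background process can crash after applying but before deleting the
log entry and simply be re-run:

def apply_op(state, op):
    kind, src, dst = op
    if kind == "insert":
        state.setdefault(dst, src)    # a second application changes nothing
    elif kind == "delete":
        state.pop(dst, None)          # deleting an absent key is fine
    return state

state = {}
op = ("insert", b"src-1", b"dst-a")
apply_op(state, op)
apply_op(state, op)                   # replay after a crash: same state
assert state == {b"dst-a": b"src-1"}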


Re: A question of 'referential integrity'...

Posted by Eric Evans <ee...@rackspace.com>.
On Tue, 2010-04-06 at 12:00 +0100, Steve wrote:
> First, I apologise sending this to the 'dev' mailing list - I couldn't
> find one for Cassandra users - and also for the basic nature of my
> questions...

user@cassandra.apache.org, (follow-ups there).

> I'm trying to get my head around the possibility of using Cassandra as
> the back-end to a project... and while, in most respects, Cassandra
> looks absolutely ideal... I'm finding it difficult to ascertain an
> appropriate strategy to ensure consistency (which would come 'for free'
> with a traditional, transactional, back end.)
> 
> As a sufficient abstraction of my storage requirements, imagine two
> (application oriented) universes of SHA-512 hashes - SRC and DST (each
> will, in practice, be tagged with other application data).  I need to
> support a remote API to manage a one-many mapping from SRC to DST, and a
> consistent (efficiently addressable) one-one mapping from DST to SRC.  I
> need to envisage millions of clients and tens of billions of mappings
> with billions of updates and lookups daily...
> 
> newAssoc(s:SRC,d:DST)
> listAssoc(s:SRC) => List<d:DST>
> delAssoc(d:DST)
> lookupAssoc(d:DST) => s:SRC
> 
> If I were using BDB, I'd have two maps - the first with s:SRC as key and
> d:DST as value - the second with (d:DST,s:SRC) as key with no values....
> and update these maps in a transaction.

You could model it like this with Cassandra as well.

It sounds like the real question though is, how can you structure this
to work given Cassandra's eventual consistency and record-level
atomicity?

For example, QUORUM consistency for reads and writes is enough to
ensure that your SRC->DST mappings remain unique and consistent from the
perspective of the client. And, if you can't make your application
resilient to inconsistencies in the inverted index (detect, repair,
etc), you could always use a Zookeeper-based mutex.
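
Roughly, the arithmetic behind that (a sketch; N stands for the
replication factor):

# With N replicas, QUORUM is N//2 + 1 for both reads and writes, so
# R + W > N and any read quorum shares at least one replica with any
# completed write quorum - that overlapping replica has the latest write.
def quorum(n):
    return n // 2 + 1

for n in (3, 5):
    r = w = quorum(n)
    print(f"N={n}: W={w}, R={r}, guaranteed overlap={r + w - n}")
# N=3: W=2, R=2, guaranteed overlap=1
# N=5: W=3, R=3, guaranteed overlap=1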

If you haven't already I'd suggest reading the Amazon whitepaper on
Dynamo[1] to understand eventual consistency, and Cassandra's API[2]
docs for how to apply it here.

[1]:
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
[2]: http://wiki.apache.org/cassandra/API#ConsistencyLevel

> If I were in SQL land, I'd need a table a bit like this:
> 
> create table Assoc( src binary(64) , dst binary(64) unique, primary key
> (src,dst) )
> 
> The implementations of the API calls would be trivial insert, delete and
> select operations - and consistency between the primary key and the
> implicit (unique constraint) index would arise as a consequence of
> transactions.  I realise that, with Cassandra, I need a different
> approach - since I don't have the same notion of transactions on which
> to rely... and, in any case, given a desire for scalability, relying
> upon such fine grained transactions would definitely prove a
> bottleneck.  That said, the uniqueness of DST values is systemically
> critical - so, even while I do not anticipate duplicate hashes in
> practice, I need uniqueness to be verified - and for the second SRC
> values asking to associate with an existing DST value to fail without
> violating the one->one mapping from DST to SRC... and for this failure
> to be notified ASAP.
> 
> It strikes me that a plausible design might be one that writes a log of
> 'insert/delete' with pairs of hashes which some background process
> eventually indexes in a big batch... before clearing the transaction
> log.  If this is "The Cassandra Way" - I'm surprised not to have found
> any examples... am I looking in the wrong places for them?  Is my log of
> 'insert' and 'delete' operations something I'd implement myself using
> ad-hoc techniques, or is there explicit support for this in Cassandra? 
> Do I need to develop my own process (from scratch) to merge updates with
> on-disk data - or is there a neat way to get Cassandra to do that for me?
> 
> Another issue I'm considering is if I should map from SRC to a list of
> DST as my low-level representation with Cassandra... or should I map
> individually.  A potential problem is that one SRC value can map to
> arbitrarily many DST values.   At the level of the RPC API, I can
> address this by returning an object resembling a scrollable cursor
> instead of a static list - but, I guess, I'd need to be concerned about
> resource limitations (memory, etc.) for the on-disk representation?  I
> presume that there's a significant advantage to storing the one-to-many
> map explicitly (locality of reference, for example) - as well as
> minimising the size of the encoded data... I'm guessing that there is no
> prefix-compression for keys?  Key compression would likely lead to the
> opposite architectural decisions from a resource-use perspective... and
> would eliminate concerns about maps from single SRC values to very large
> numbers of DST values.
> 
> Any hints, tips, comments, pointers to relevant documentation etc. would
> be much appreciated...  I'm guessing many others have tackled a problem
> similar to this?
> 


-- 
Eric Evans
eevans@rackspace.com

