You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by David King <dk...@ketralnis.com> on 2010/04/17 00:01:27 UTC
Maintaining secondary indices on mutable columns

I'm having trouble visualising how to maintain a secondary index on a mutable column. For instance, given some objects and the number that we have in inventory:

widget1: { count = 0, colour = blue }
widget2: { count = 5, colour = red }
widget3: { count = 8, colour = green }
widget4: { count = 8, colour = purple }
widget5: { count = 9, colour = green }
widget6: { count = 2, colour = red }
widget7: { count = 1, colour = blue }

If i want a list of widgets ordered by the number that we have, I might construct an ordered columnfamily that looks like:

count009_widget5: { id = widget5 }
count008_widget4: { id = widget4 }
count008_widget3: { id = widget3 }
count005_widget2: { id = widget2 }
count002_widget6: { id = widget6 }
count001_widget7: { id = widget7 }
count000_widget1: { id = widget1 }

And do slices against that CF. (The additional _widget5 bit of the key is just so that duplicates don't clobber each other.) This works great if the inventory-count is immutable with respect to the ID (like a timestamp, as in the Twissandra example, which does exactly this), but if it's not then whenever we update the count, we have to also update the secondary index CF. Let's say we want to increment the inventory count of widget2. This presents some problems:

1. We're going to have to remove the old count005_widget2 row (action A) and add a count006_widget2 row (action B). If we do A first, then B, then for some period widget2 doesn't show up in our inventory report at all. If we do B first then A, then it shows it twice for a bit. If this is a heavily-trafficed CF, then at any given time there are likely to be many such inconsistencies adding up and querying a slice is unlikely to be useful information at all.

2. How do we find the old column name "count005_widget2" to remove it? If it's a trafficked column we can't use the current count before the increment because that's rife for race-conditions, and if it's a big CF we can't scan the whole thing looking for that ID. We could devise a hack by guessing that the inventory is unlikely to change by more than N% at a time, and pull the slice count004*.. count010* and only scan that slice for the deletions, but that imposes a new race-condition where we have simultaneous agents incrementing the count, decrementing it, and incrementing it (that is, the decrement may delete the row that the second increment created, and now none of them are in the index).

Has anybody devised a toolkit or guidelines for working with this situation?