Posted to user@cassandra.apache.org by Peter Chang <pe...@gmail.com> on 2010/03/11 07:54:17 UTC

Strategies for storing lexically ordered data in supercolumns

I'm wondering about good strategies for picking keys that I want to be
lexically sorted in a super column family. For example, my data looks like
this:

[user1_uuid][connections][some_key_for_user2] = ""
[user1_uuid][connections][some_key_for_user3] = ""

I was thinking that I wanted some_key_for_user2 to be sorted by a user's
name. So I was thinking I set the subcolumn compareWith to UTF8Type or
BytesType and construct a key

[user's lastname + user's firstname + user's uuid]

This would result in sorted subcolumns and a sorted user list. That's fine.
But I wonder what would happen if, say, a user changes their last name. It
happens rarely, but I imagine people getting married and modifying their
name. Now the sort is no longer correct. There seem to be some bad
consequences to creating keys based on data that can change.
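
A minimal sketch of the composite subcolumn name described above, in Python. The hyphen separator and the function name connection_column_name are assumptions for illustration, and a plain sorted() call stands in for the ordering that a UTF8Type or BytesType comparator would apply on the server.

```python
def connection_column_name(last_name: str, first_name: str, user_uuid: str) -> str:
    """Build a subcolumn name that sorts by last name, then first name, then UUID."""
    # The "-" separator is an assumption; the original only says to concatenate the parts.
    return f"{last_name}-{first_name}-{user_uuid}"


# Insert order is deliberately unsorted; sorted() stands in for the comparator.
names = [
    connection_column_name("Jones", "Jim", "1a6dd756b0-2ca1-11df-b937-005056c00001"),
    connection_column_name("Hacker", "Alyssa", "1ab54760-2ca8-11df-aabd-005056c00001"),
]
ordered = sorted(names)

for name in ordered:
    print(name)
```

Because the UUID is the last component, two users with the same first and last name still get distinct column names.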

So what is the general (elegant, easy to maintain) strategy here? Always
sort in your server-side code and don't bother trying to have the data
sorted?

I'm a cassandra noob with all my experience in relational DBMS.

TIA
Pete

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Peter Chang <pe...@gmail.com>.

To be more explicit:

['500c9280-2cdd-11df-869b-005056c00001'] ['connections']
['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
['500c9280-2cdd-11df-869b-005056c00001'] ['connections']
['Jones-Jim-1a6dd756b0-2ca1-11df-b937-005056c00001']

But Alyssa gets married and changes her name to Zamboni. The next time I
read these subcolumns the users will not be sorted.




On Fri, Mar 12, 2010 at 5:21 PM, Peter Chang <pe...@gmail.com> wrote:

> My original post is probably confusing. I was originally talking about
> columns and I don't see what the solution is.
>
> "So I was thinking I set the subcolumn compareWith to UTF8Type or
> BytesType and construct a key [for the subcolumn, not a row key]
>
> [user's lastname + user's firstname + user's uuid]
>
> This would result in sorted subcolumn and user list."
>
> Nevertheless, I still don't see/understand the solution. Let's say the
> person's name changes. The sort is no longer valid. That column value would
> need to be changed in order for the sort to be correct.
>
>
> On Fri, Mar 12, 2010 at 5:10 PM, Brandon Williams <dr...@gmail.com> wrote:
>
>> On Fri, Mar 12, 2010 at 7:07 PM, Peter Chang <pe...@gmail.com> wrote:
>>
>>> But wouldn't name + UUID be considered volatile? That was the crux of my
>>> questions.
>>
>>
>> It would, but the distinction here is that it is now a column, not a row
>> key.
>>
>>  -Brandon
>>
>
>

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Peter Chang <pe...@gmail.com>.

Just a follow-up on this discussion in case anybody else comes across it in
their search.

http://www.25hoursaday.com/weblog/2009/09/10/BuildingScalableDatabasesDenormalizationTheNoSQLMovementAndDigg.aspx

*"Fixing data inconsistency is now the job of the application. Let's say
each user has a list of the user names of all of their friends. What happens
when one of these users changes their user name? In a normalized database
that is a simple UPDATE query to change a single piece of data and then it
will be current everywhere it is shown on the site. In a denormalized
database, there now has to be a mechanism for fixing up this name in all of
the dozens, hundreds or thousands of places it appears. Most services that
create denormalized databases have "fixup" jobs that are constantly running
on the database to fix such inconsistencies."*
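
As a rough illustration of such a fixup job, here is a Python sketch over in-memory stand-ins. It assumes each user's current name is stored once in a map called profiles and that every subcolumn name ends with a 36-character UUID; the names profiles, rows, and fix_up are all hypothetical.

```python
from typing import Dict, Tuple

UUID_LENGTH = 36  # length of a UUID in its usual hyphenated text form

ALYSSA = "1ab54760-2ca8-11df-aabd-005056c00001"

# Assumed source of truth: uuid -> (last name, first name), already updated.
profiles: Dict[str, Tuple[str, str]] = {ALYSSA: ("Zamboni", "Alyssa")}

# Denormalized copies: owner row key -> {subcolumn name: value}, still stale.
rows: Dict[str, Dict[str, str]] = {
    "Jones-Bob": {f"Hacker-Alyssa-{ALYSSA}": ""},
    "Crabtree-Sam": {f"Hacker-Alyssa-{ALYSSA}": ""},
    "Rice-Brown": {f"Hacker-Alyssa-{ALYSSA}": ""},
}


def fix_up(rows: Dict[str, Dict[str, str]],
           profiles: Dict[str, Tuple[str, str]]) -> int:
    """Scan every row and rewrite subcolumn names that embed an outdated name."""
    repaired = 0
    for subcolumns in rows.values():
        for name in list(subcolumns):  # copy the keys because the dict is mutated
            user_uuid = name[-UUID_LENGTH:]
            if user_uuid not in profiles:
                continue  # unknown user, leave the column alone
            last, first = profiles[user_uuid]
            expected = f"{last}-{first}-{user_uuid}"
            if name != expected:
                subcolumns[expected] = subcolumns.pop(name)
                repaired += 1
    return repaired


first_pass = fix_up(rows, profiles)
second_pass = fix_up(rows, profiles)  # nothing left to repair
```

A full scan like this needs no bookkeeping about where each name appears, at the cost of reading every row on each pass.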

On Fri, Mar 12, 2010 at 5:50 PM, Brandon Williams <dr...@gmail.com> wrote:

> On Fri, Mar 12, 2010 at 7:46 PM, Peter Chang <pe...@gmail.com> wrote:
>
>> Yes, I can update that one entry. But what if that subcolumn key is used
>> across many different places?
>>
>> ['Jones-Bob']['connections']
>> ['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
>> ['Crabtree-Sam']['connections']
>> ['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
>> ['Rice-Brown']['connections']
>> ['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
>> ...
>>
>> I can update every single entry but now I need to keep track of them
>> (which I guess I'm doing anyway). I was wondering if there was a more
>> elegant solution but it seems unlikely based on the given constraints.
>>
>
> You have to update them all and track them, correct.  What you're looking
> for sounds like transaction support, which Cassandra does not have.  On the
> bright side, writes are cheap.
>
> -Brandon
>

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Brandon Williams <dr...@gmail.com>.

On Fri, Mar 12, 2010 at 7:46 PM, Peter Chang <pe...@gmail.com> wrote:

> Yes, I can update that one entry. But what if that subcolumn key is used
> across many different places?
>
> ['Jones-Bob']['connections']
> ['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
> ['Crabtree-Sam']['connections']
> ['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
> ['Rice-Brown']['connections']
> ['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
> ...
>
> I can update every single entry but now I need to keep track of them (which
> I guess I'm doing anyway). I was wondering if there was a more elegant
> solution but it seems unlikely based on the given constraints.
>

You have to update them all and track them, correct.  What you're looking
for sounds like transaction support, which Cassandra does not have.  On the
bright side, writes are cheap.

-Brandon
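
A sketch of the "update them all and track them" approach, using Python dicts as stand-ins for the column family. The reverse map referenced_by and the helper names are assumptions for illustration and are not part of any Cassandra API. Since there is no transaction support, a real implementation would issue one write per affected row and would have to tolerate a partially applied rename.

```python
from collections import defaultdict
from typing import Dict, Set

# Stand-in for the data: owner row key -> {subcolumn name: value}.
rows: Dict[str, Dict[str, str]] = defaultdict(dict)
# Tracking structure: connected user's UUID -> row keys holding a column for them.
referenced_by: Dict[str, Set[str]] = defaultdict(set)


def column_name(last: str, first: str, user_uuid: str) -> str:
    return f"{last}-{first}-{user_uuid}"


def add_connection(owner_key: str, last: str, first: str, user_uuid: str) -> None:
    """Store the connection and remember which row now references this user."""
    rows[owner_key][column_name(last, first, user_uuid)] = ""
    referenced_by[user_uuid].add(owner_key)


def rename_everywhere(user_uuid: str, old_last: str, new_last: str, first: str) -> int:
    """Delete the old column and insert the new one in every tracked row."""
    old = column_name(old_last, first, user_uuid)
    new = column_name(new_last, first, user_uuid)
    updated = 0
    for owner_key in referenced_by[user_uuid]:
        subcolumns = rows[owner_key]
        if old in subcolumns:
            subcolumns[new] = subcolumns.pop(old)
            updated += 1
    return updated


alyssa = "1ab54760-2ca8-11df-aabd-005056c00001"
owners = ("Jones-Bob", "Crabtree-Sam", "Rice-Brown")
for owner in owners:
    add_connection(owner, "Hacker", "Alyssa", alyssa)

updated = rename_everywhere(alyssa, "Hacker", "Zamboni", "Alyssa")
```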

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Peter Chang <pe...@gmail.com>.

Yes, I can update that one entry. But what if that subcolumn key is used
across many different places?

['Jones-Bob']['connections']
['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
['Crabtree-Sam']['connections']
['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
['Rice-Brown']['connections']
['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001']
...

I can update every single entry but now I need to keep track of them (which
I guess I'm doing anyway). I was wondering if there was a more elegant
solution but it seems unlikely based on the given constraints.


On Fri, Mar 12, 2010 at 5:26 PM, Brandon Williams <dr...@gmail.com> wrote:

> On Fri, Mar 12, 2010 at 7:21 PM, Peter Chang <pe...@gmail.com> wrote:
>
>> My original post is probably confusing. I was originally talking about
>> columns and I don't see what the solution is.
>
>
> Sorry, I misunderstood.
>
>> "So I was thinking I set the subcolumn compareWith to UTF8Type or
>> BytesType and construct a key [for the subcolumn, not a row key]
>>
>> [user's lastname + user's firstname + user's uuid]
>>
>> This would result in sorted subcolumn and user list."
>>
>> Nevertheless, I still don't see/understand the solution. Let's say the
>> person's name changes. The sort is no longer valid. That column value would
>> need to be changed in order for the sort to be correct.
>>
>
> When their name changes, you delete the existing column and insert a new
> one with the correct name, which will then sort correctly.
>
> -Brandon
>

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Brandon Williams <dr...@gmail.com>.

On Fri, Mar 12, 2010 at 7:21 PM, Peter Chang <pe...@gmail.com> wrote:

> My original post is probably confusing. I was originally talking about
> columns and I don't see what the solution is.


Sorry, I misunderstood.

> "So I was thinking I set the subcolumn compareWith to UTF8Type or
> BytesType and construct a key [for the subcolumn, not a row key]
>
> [user's lastname + user's firstname + user's uuid]
>
> This would result in sorted subcolumn and user list."
>
> Nevertheless, I still don't see/understand the solution. Let's say the
> person's name changes. The sort is no longer valid. That column value would
> need to be changed in order for the sort to be correct.
>

When their name changes, you delete the existing column and insert a new one
with the correct name, which will then sort correctly.

-Brandon
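
The delete-and-insert step can be sketched as follows. This is an in-memory Python stand-in rather than a Cassandra client call: the dict named connections and the helper rename_connection are hypothetical, and sorted() plays the role of the column comparator. Against a real cluster these would be two separate writes.

```python
from typing import Dict

# Hypothetical stand-in for one 'connections' super column: subcolumn name -> value.
connections: Dict[str, str] = {
    "Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001": "",
    "Jones-Jim-1a6dd756b0-2ca1-11df-b937-005056c00001": "",
}


def rename_connection(subcolumns: Dict[str, str], old_name: str, new_name: str) -> None:
    """Replace a subcolumn whose name embeds a user's name."""
    value = subcolumns.pop(old_name)  # delete the existing column
    subcolumns[new_name] = value      # insert a new one under the corrected name


old_column = "Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001"
new_column = "Zamboni-Alyssa-1ab54760-2ca8-11df-aabd-005056c00001"
rename_connection(connections, old_column, new_column)

ordered = sorted(connections)  # Jones now sorts before Zamboni
```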

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Peter Chang <pe...@gmail.com>.

My original post is probably confusing. I was originally talking about
columns and I don't see what the solution is.

"So I was thinking I set the subcolumn compareWith to UTF8Type or
BytesType and construct a key [for the subcolumn, not a row key]

[user's lastname + user's firstname + user's uuid]

This would result in sorted subcolumn and user list."

Nevertheless, I still don't see/understand the solution. Let's say the
person's name changes. The sort is no longer valid. That column name would
need to be changed in order for the sort to be correct.

On Fri, Mar 12, 2010 at 5:10 PM, Brandon Williams <dr...@gmail.com> wrote:

> On Fri, Mar 12, 2010 at 7:07 PM, Peter Chang <pe...@gmail.com> wrote:
>
>> But wouldn't name + UUID be considered volatile? That was the crux of my
>> questions.
>
>
> It would, but the distinction here is that it is now a column, not a row
> key.
>
>  -Brandon
>

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Brandon Williams <dr...@gmail.com>.

On Fri, Mar 12, 2010 at 7:07 PM, Peter Chang <pe...@gmail.com> wrote:

> But wouldn't name + UUID be considered volatile? That was the crux of my
> questions.


It would, but the distinction here is that it is now a column, not a row
key.

-Brandon

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Peter Chang <pe...@gmail.com>.

But wouldn't name + UUID be considered volatile? That was the crux of my
questions.

On Fri, Mar 12, 2010 at 1:07 PM, Brandon Williams <dr...@gmail.com> wrote:

> On Thu, Mar 11, 2010 at 12:54 AM, Peter Chang <pe...@gmail.com> wrote:
>
>> I'm wondering about good strategies for picking keys that I want to be
>> lexically sorted in a super column family. For example, my data looks like
>> this:
>>
>> [user1_uuid][connections][some_key_for_user2] = ""
>> [user1_uuid][connections][some_key_for_user3] = ""
>>
>> I was thinking that I wanted some_key_for_user2 to be sorted by a user's
>> name. So I was thinking I set the subcolumn compareWith to UTF8Type or
>> BytesType and construct a key
>>
>> [user's lastname + user's firstname + user's uuid]
>>
>> This would result in sorted subcolumn and user list. That's fine. But I
>> wonder what would happen if, say, a user changes their last name. Happens
>> rarely but I imagine people getting married and modifying their name. Now
>> the sort is no longer correct. There seems to be some bad consequences to
>> creating keys based on data that can change.
>>
>> So what is the general (elegant, easy to maintain) strategy here? Always
>> sort in your server-side code and don't bother trying to have the data
>> sorted?
>>
>
> Having row keys based on something potentially volatile is something I
> would avoid since that determines which machine the row belongs to and
> moving data between machines isn't a cheap operation.
>
> What you'll probably want to do is make the key something unique (like a
> uuid), store the user's name as a column on the row (thus making it easy to
> update) and maintain a secondary index to get the name-based sorting you
> want.  If you're expecting a few million users, maintaining the index in a
> special row will work fine (e.g., the row name is "NAMEINDEX" and the columns
> are the name+uuid similar to what you described.)  If you have billions of
> users, you'll need to get a bit fancier (partition based on letter of the
> last name, for example.)
>
> -Brandon
>

Re: Strategies for storing lexically ordered data in supercolumns

Posted by Brandon Williams <dr...@gmail.com>.

On Thu, Mar 11, 2010 at 12:54 AM, Peter Chang <pe...@gmail.com> wrote:

> I'm wondering about good strategies for picking keys that I want to be
> lexically sorted in a super column family. For example, my data looks like
> this:
>
> [user1_uuid][connections][some_key_for_user2] = ""
> [user1_uuid][connections][some_key_for_user3] = ""
>
> I was thinking that I wanted some_key_for_user2 to be sorted by a user's
> name. So I was thinking I set the subcolumn compareWith to UTF8Type or
> BytesType and construct a key
>
> [user's lastname + user's firstname + user's uuid]
>
> This would result in sorted subcolumn and user list. That's fine. But I
> wonder what would happen if, say, a user changes their last name. Happens
> rarely but I imagine people getting married and modifying their name. Now
> the sort is no longer correct. There seems to be some bad consequences to
> creating keys based on data that can change.
>
> So what is the general (elegant, easy to maintain) strategy here? Always
> sort in your server-side code and don't bother trying to have the data
> sorted?
>

Having row keys based on something potentially volatile is something I would
avoid since that determines which machine the row belongs to and moving data
between machines isn't a cheap operation.

What you'll probably want to do is make the key something unique (like a
uuid), store the user's name as a column on the row (thus making it easy to
update) and maintain a secondary index to get the name-based sorting you
want.  If you're expecting a few million users, maintaining the index in a
special row will work fine (e.g., the row name is "NAMEINDEX" and the columns
are the name+uuid similar to what you described.)  If you have billions of
users, you'll need to get a bit fancier (partition based on letter of the
last name, for example.)

-Brandon
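
A sketch of that index layout in Python, with a dict standing in for the index column family. The row name "NAMEINDEX" comes from the message above; the "NAMEINDEX_" plus first-letter naming for the partitioned case, the hyphen separator, and all function names are assumptions for illustration.

```python
from typing import Dict

# Stand-in for an index column family: row key -> {column name: value}.
index_rows: Dict[str, Dict[str, str]] = {}


def index_row_key(last_name: str, partitioned: bool = False) -> str:
    """Use a single 'NAMEINDEX' row, or one row per first letter of the last name."""
    if not partitioned:
        return "NAMEINDEX"
    return "NAMEINDEX_" + last_name[:1].upper()


def index_user(last: str, first: str, user_uuid: str, partitioned: bool = False) -> None:
    row = index_rows.setdefault(index_row_key(last, partitioned), {})
    row[f"{last}-{first}-{user_uuid}"] = ""


def reindex_user(old_last: str, new_last: str, first: str, user_uuid: str,
                 partitioned: bool = False) -> None:
    """On a name change, drop the old index column and add the new one."""
    old_row = index_rows.get(index_row_key(old_last, partitioned), {})
    old_row.pop(f"{old_last}-{first}-{user_uuid}", None)
    index_user(new_last, first, user_uuid, partitioned)


alyssa = "1ab54760-2ca8-11df-aabd-005056c00001"
jim = "1a6dd756b0-2ca1-11df-b937-005056c00001"
index_user("Hacker", "Alyssa", alyssa, partitioned=True)
index_user("Jones", "Jim", jim, partitioned=True)
reindex_user("Hacker", "Zamboni", "Alyssa", alyssa, partitioned=True)
```

The user's own row stays keyed by UUID, so a name change touches only a name column on that row and the index entry, never the row key.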