You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Thomas Stets <th...@gmail.com> on 2013/11/04 12:20:48 UTC

Managing index tables

What is the best way to manage index tables on update/deletion of the
indexed data?

I have a table containing all kinds of data fora user, i.e. name, address,
contact data, company data etc. Key to this table is the user ID.

I also maintain about a dozen index tables matching my queries, like name,
email address, company D.U.N.S number, permissions the user has, etc. These
index tables contain the user IDs matching the search key as column names,
with the column values left empty.

Whenever a user is deleted or updated I have to make sure to update the
index tables, i.e. if the permissions of a user changes I have to remove
the user ID from the rows matching the permission he no longer has.

My problem is to find all matching entries, especially for data I no longer
have.

My solution so far is to keep a separate table to keep track of all index
tables and keys the user can be found in. In the case mentioned I look up
the keys for the permissions table, remove the user ID from there, then
remove the entry in the keys table.

This works so far (in production for more than a year and a half), and it
also allows me to clean up after something has gone wrong.

But still, all this additional level of meta information adds a lot of
complexity. I was wondering wether there is some kind of pattern that
addresses my problem. I found lots of information saying that creating the
index tables is the way to go, but nobody ever mentions maintaining the
index tables.

tia, Thomas

Re: Managing index tables

Posted by Thomas Stets <th...@gmail.com>.
Hi Tom,

thanks, I take your answer as "nobody else has found an elegant solution,
either" :-)

I guess I could use a secondary index for some cases, but there are several
reasons I can't use them in most cases. Especially the permissions are
problematic.

A user may have dozens of permission for different solutions, and the
permissions may have boundary values that limit their applicability. I
query for users with any roles for  a specific solution, for users with a
specific role within a specific solution, for users with a specific role,
solution and boundary value. I have index tables for all of these cases,
and since a lot of queries only need users from a specific company I also
have index tables where the key also contains the company. Plus, I need to
support wild cards in the stored boundary values as well as in the query
terms.

With several hundred thousand users in our DB all of this works fine, but
maintaining all the index tables is no fun...

We are still on 1.1.8, preparing to switch to 1.2. As far as I know, 2.0 is
not yet recommended for production. And while having transaction would
definitely be nice, most cases of inconsistent data result from bugs in the
code or changed logic.

The data we keep in Cassandra is just a denormalized copy of part of our
data in an Oracle DB, so theoretically we could just throw everything away
and rebuild. But we can't afford to be down that long.
I have a program that scans the data in our Cassandra DB for
inconsistencies, and spits our CQL code to fix it. One complete check alone
takes about 2 days.

Anyway, thanks for your help. :-)

Thomas


On Tue, Nov 5, 2013 at 11:00 PM, Tom van den Berge <to...@drillster.com>wrote:

> Hi Thomas,
>
> I understand your concerns about ensuring the integrity of your data when
> having to maintain the indexes yourself.
>
> In some situations, using Cassandra's built in secondary indexes is more
> efficient -- when many rows contained the indexed value. Maybe your
> permissions fall in this category? Obviously, the advantage is that
> Cassandra will do the maintenance on the index for you.
>
> For situations where secondary indexes are not recommended, you make your
> life a lot easier if all modifications of the indexed entity (like your
> user) is executed by one single piece of code, which is then also
> responsible for maintaining all associated indexes. And write tests to
> ensure that it works in all possible ways.
>
> I understood that Cassandra 2.0 supports transactions. I haven't looked at
> it yet, but this could also help maintaining your data integrity, when a
> failed update of one of your indexes results in a rollback of the entire
> transaction.
>
> I hope this is helpful to you.
> Tom
>
>
> On Mon, Nov 4, 2013 at 12:20 PM, Thomas Stets <th...@gmail.com>wrote:
>
>> What is the best way to manage index tables on update/deletion of the
>> indexed data?
>>
>> I have a table containing all kinds of data fora user, i.e. name,
>> address, contact data, company data etc. Key to this table is the user ID.
>>
>> I also maintain about a dozen index tables matching my queries, like
>> name, email address, company D.U.N.S number, permissions the user has, etc.
>> These index tables contain the user IDs matching the search key as column
>> names, with the column values left empty.
>>
>> Whenever a user is deleted or updated I have to make sure to update the
>> index tables, i.e. if the permissions of a user changes I have to remove
>> the user ID from the rows matching the permission he no longer has.
>>
>> My problem is to find all matching entries, especially for data I no
>> longer have.
>>
>> My solution so far is to keep a separate table to keep track of all index
>> tables and keys the user can be found in. In the case mentioned I look up
>> the keys for the permissions table, remove the user ID from there, then
>> remove the entry in the keys table.
>>
>> This works so far (in production for more than a year and a half), and it
>> also allows me to clean up after something has gone wrong.
>>
>> But still, all this additional level of meta information adds a lot of
>> complexity. I was wondering wether there is some kind of pattern that
>> addresses my problem. I found lots of information saying that creating the
>> index tables is the way to go, but nobody ever mentions maintaining the
>> index tables.
>>
>> tia, Thomas
>>
>
>
>
> --
>
> Drillster BV
> Middenburcht 136
> 3452MT Vleuten
> Netherlands
>
> +31 30 755 5330
>
> Open your free account at www.drillster.com
>

Re: Managing index tables

Posted by Tom van den Berge <to...@drillster.com>.
Hi Thomas,

I understand your concerns about ensuring the integrity of your data when
having to maintain the indexes yourself.

In some situations, using Cassandra's built in secondary indexes is more
efficient -- when many rows contained the indexed value. Maybe your
permissions fall in this category? Obviously, the advantage is that
Cassandra will do the maintenance on the index for you.

For situations where secondary indexes are not recommended, you make your
life a lot easier if all modifications of the indexed entity (like your
user) is executed by one single piece of code, which is then also
responsible for maintaining all associated indexes. And write tests to
ensure that it works in all possible ways.

I understood that Cassandra 2.0 supports transactions. I haven't looked at
it yet, but this could also help maintaining your data integrity, when a
failed update of one of your indexes results in a rollback of the entire
transaction.

I hope this is helpful to you.
Tom


On Mon, Nov 4, 2013 at 12:20 PM, Thomas Stets <th...@gmail.com>wrote:

> What is the best way to manage index tables on update/deletion of the
> indexed data?
>
> I have a table containing all kinds of data fora user, i.e. name, address,
> contact data, company data etc. Key to this table is the user ID.
>
> I also maintain about a dozen index tables matching my queries, like name,
> email address, company D.U.N.S number, permissions the user has, etc. These
> index tables contain the user IDs matching the search key as column names,
> with the column values left empty.
>
> Whenever a user is deleted or updated I have to make sure to update the
> index tables, i.e. if the permissions of a user changes I have to remove
> the user ID from the rows matching the permission he no longer has.
>
> My problem is to find all matching entries, especially for data I no
> longer have.
>
> My solution so far is to keep a separate table to keep track of all index
> tables and keys the user can be found in. In the case mentioned I look up
> the keys for the permissions table, remove the user ID from there, then
> remove the entry in the keys table.
>
> This works so far (in production for more than a year and a half), and it
> also allows me to clean up after something has gone wrong.
>
> But still, all this additional level of meta information adds a lot of
> complexity. I was wondering wether there is some kind of pattern that
> addresses my problem. I found lots of information saying that creating the
> index tables is the way to go, but nobody ever mentions maintaining the
> index tables.
>
> tia, Thomas
>



-- 

Drillster BV
Middenburcht 136
3452MT Vleuten
Netherlands

+31 30 755 5330

Open your free account at www.drillster.com