You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by tsuraan <ts...@gmail.com> on 2010/04/19 19:03:39 UTC

Modelling assets and user permissions

Suppose I have a CF that holds some sort of assets that some users of
my program have access to, and that some do not.  In SQL-ish terms it
would look something like this:

TABLE Assets (
  asset_id serial primary key,
  ...
);

TABLE Users (
  user_id serial primary key,
  user_name text
);

TABLE Permissions (
  asset_id integer references(Assets),
  user_id integer references(Users)
)

Now, I can generate UUIDs for my asset keys without any trouble, so
the serial that I have in my pseudo-SQL Assets table isn't a problem.
My problem is that I can't see a good way to model the relationship
between user ids and assets.  I see one way to do this, which has
problems, and I think I sort of see a second way.

The obvious way to do it is have the Assets CF have a SuperColumn that
somehow enumerates the users allowed to see it, so when retrieving a
specific Asset I can retrieve the users list and ensure that the user
doing the request is allowed to see it.  This has quite a few
problems.  The foremost is that Cassandra doesn't appear to have much
for conflict resolution (at least I can't find any docs on it), so if
two processes try to add permissions to the same Asset, it looks like
one process will win and I have no idea what happens to the loser.
Another problem is that Cassandra's SuperColumns don't appear to be
ideal for storing lists of things; they store maps, which isn't a
terrible problem, but it feels like a bit of a mismatch in my design.
A SuperColumn mapping from user_ids to an empty byte array seems like
it should work pretty efficiently for checking whether a user has
permissions on an Asset, but it also seems pretty evil.

The other idea that I have is a seperate CF for AssetPermissions that
somehow stores pairs of asset_ids and user_names.  I don't know what
I'd use for a key in that situation, so I haven't really gotten too
far in seeing what else is broken with that idea.  I think it would
get around the race condition, but I don't know how to do it, and I'm
not sure how efficient it could be.

What do people normally use in this situation?  I assume it's a pretty
common problem, but I haven't see it in the various data modelling
examples on the Wiki.

Re: Modelling assets and user permissions

Posted by Jonathan Ellis <jb...@gmail.com>.

if you want to look up "what permissions does user X have on asset Y"
then i would model that as a row keyed by userid, containing
supercolumns named by asset ids, and containing subcolumns of the
permissions granted.

On Mon, Apr 19, 2010 at 12:03 PM, tsuraan <ts...@gmail.com> wrote:
> Suppose I have a CF that holds some sort of assets that some users of
> my program have access to, and that some do not.  In SQL-ish terms it
> would look something like this:
>
> TABLE Assets (
>  asset_id serial primary key,
>  ...
> );
>
> TABLE Users (
>  user_id serial primary key,
>  user_name text
> );
>
> TABLE Permissions (
>  asset_id integer references(Assets),
>  user_id integer references(Users)
> )
>
> Now, I can generate UUIDs for my asset keys without any trouble, so
> the serial that I have in my pseudo-SQL Assets table isn't a problem.
> My problem is that I can't see a good way to model the relationship
> between user ids and assets.  I see one way to do this, which has
> problems, and I think I sort of see a second way.
>
> The obvious way to do it is have the Assets CF have a SuperColumn that
> somehow enumerates the users allowed to see it, so when retrieving a
> specific Asset I can retrieve the users list and ensure that the user
> doing the request is allowed to see it.  This has quite a few
> problems.  The foremost is that Cassandra doesn't appear to have much
> for conflict resolution (at least I can't find any docs on it), so if
> two processes try to add permissions to the same Asset, it looks like
> one process will win and I have no idea what happens to the loser.
> Another problem is that Cassandra's SuperColumns don't appear to be
> ideal for storing lists of things; they store maps, which isn't a
> terrible problem, but it feels like a bit of a mismatch in my design.
> A SuperColumn mapping from user_ids to an empty byte array seems like
> it should work pretty efficiently for checking whether a user has
> permissions on an Asset, but it also seems pretty evil.
>
> The other idea that I have is a seperate CF for AssetPermissions that
> somehow stores pairs of asset_ids and user_names.  I don't know what
> I'd use for a key in that situation, so I haven't really gotten too
> far in seeing what else is broken with that idea.  I think it would
> get around the race condition, but I don't know how to do it, and I'm
> not sure how efficient it could be.
>
> What do people normally use in this situation?  I assume it's a pretty
> common problem, but I haven't see it in the various data modelling
> examples on the Wiki.
>

Re: Re: Modelling assets and user permissions

Posted by tsuraan <ts...@gmail.com>.

> It seems to me you might get by with putting the actual assets into
> cassandra (possibly breaking them up into chunks depending on how big
> they are) and storing the pointers to them in Postgres along with all
> the other metadata.  If it were me, I'd split each file into a fixed
> chunksize and store it using its SHA1 checksum, and keep an ordered
> list of chunks that make up a file, then never delete a chunk.  Given
> billions of documents you just may end up with some savings due to
> file chunks that are identical.

The retrieval of documents is pretty key (people like getting their
files), so we store them on disk and use our http server's static file
serving to send them out.  I'm not sure what the best way to serve
files stored in cassandra would be, but the free replication offered
is interesting.  Is cassandra a sane way to store huge amounts (many
TB) of raw data?  I saw in the limitations page that people are using
cassandra to store files, but is it considered a good idea?

> You could partition the postgres tables and replicate the data to a
> handful of read-only nodes that could handle quite a bit of the work.
> I suppose it depends on your write-frequency how that might pan out as
> a scalability option.

Our system is pretty write-heavy; we currently do a bit under a
million files a day (which translates to about 5x number of db records
stored), but we're going for a few million per day.

Here's a quick question that should be answerable:  If I have a CF
with SuperColumns where one of the SuperColumns has keys that are
users allowed to see an asset, is it guaranteed to be safe to add keys
to that SuperColumn?  I noticed that each column has its own
timestamp, so it doesn't look like I actually need to write a full row
(which would introduce overwriting race-condition concerns).  It looks
like I can just use batch_mutate to add the keys that I want to the
permissions SuperColumn.  Is that correct, and would that avoid races?

Re: Re: Modelling assets and user permissions

Posted by Vick Khera <vi...@khera.org>.

On Tue, Apr 20, 2010 at 1:37 PM, tsuraan <ts...@gmail.com> wrote:
> The assets are binary files on a document tracking system.  Our
> current platform is postgres-backed; the entire system we've written
> is fairly easily distributed across multiple computers, but postgres
> isn't.  There are reliable databases that do scale out, but they tend
> to be a little on the pricey side...  Our current system works well in
> the tens to hundreds of millions of documents with hundreds of users,
> but we're hitting the billions of documents with thousands of users,
> so cassandra's scaling properties are pretty appealing there.

It seems to me you might get by with putting the actual assets into
cassandra (possibly breaking them up into chunks depending on how big
they are) and storing the pointers to them in Postgres along with all
the other metadata.  If it were me, I'd split each file into a fixed
chunksize and store it using its SHA1 checksum, and keep an ordered
list of chunks that make up a file, then never delete a chunk.  Given
billions of documents you just may end up with some savings due to
file chunks that are identical.

You could partition the postgres tables and replicate the data to a
handful of read-only nodes that could handle quite a bit of the work.
I suppose it depends on your write-frequency how that might pan out as
a scalability option.

Re: Re: Modelling assets and user permissions

Posted by tsuraan <ts...@gmail.com>.

> I'm curious as to how you would have so many asset / user permissions that
> you couldn't use a standard relational database to model them. Is this some
> sort of multi-tenant system where you're providing some generalized asset
> check-out mechanism to many, many customers? Even so, I'm not sure the
> eventually consistent model wouldn't open you up to check-out collisions, as
> you mention yourself.

The assets are binary files on a document tracking system.  Our
current platform is postgres-backed; the entire system we've written
is fairly easily distributed across multiple computers, but postgres
isn't.  There are reliable databases that do scale out, but they tend
to be a little on the pricey side...  Our current system works well in
the tens to hundreds of millions of documents with hundreds of users,
but we're hitting the billions of documents with thousands of users,
so cassandra's scaling properties are pretty appealing there.

I don't think eventual consistency would be a terrible problem; so
long as our system lives in a rack, or at least in a single data
center I think the database would become consistent before the
documents would be visible by any users of the system.

> Am I missing something about your example?

Just the scale, I think.  I like relational databases, but I'm really
interested in trying out cassandra's way, if I can come up with a sane
way to model my system in it.

Re: Re: Modelling assets and user permissions

Posted by ch...@gmail.com.

The short answer as to what people normally do is that they use a  
relational database for something like this.

I'm curious as to how you would have so many asset / user permissions that  
you couldn't use a standard relational database to model them. Is this some  
sort of multi-tenant system where you're providing some generalized asset  
check-out mechanism to many, many customers? Even so, I'm not sure the  
eventually consistent model wouldn't open you up to check-out collisions,  
as you mention yourself.

Am I missing something about your example?

On Apr 20, 2010 9:47am, tsuraan <ts...@gmail.com> wrote:
> > Suppose I have a CF that holds some sort of assets that some users of

> > my program have access to, and that some do not. In SQL-ish terms it

> > would look something like this:

> >

> > TABLE Assets (

> > asset_id serial primary key,

> > ...

> > );

> >

> > TABLE Users (

> > user_id serial primary key,

> > user_name text

> > );

> >

> > TABLE Permissions (

> > asset_id integer references(Assets),

> > user_id integer references(Users)

> > )

> >

> > Now, I can generate UUIDs for my asset keys without any trouble, so

> > the serial that I have in my pseudo-SQL Assets table isn'ta problem.

> > My problem is that I can't see a good way to model the relationship

> > between user ids and assets. I see one way to do this, which has

> > problems, and I think I sort of see a second way.

> >

> > The obvious way to do it is have the Assets CF have a SuperColumn that

> > somehow enumerates the users allowed to see it, so when retrieving a

> > specific Asset I can retrieve the users list and ensure that the user

> > doing the request is allowed to see it. This has quite a few

> > problems. The foremost is that Cassandra doesn't appear to have much

> > for conflict resolution (at least I can't find any docs on it), so if

> > two processes try to add permissions to the same Asset, it looks like

> > one process will win and I have no idea what happens to the loser.

> > Another problem is that Cassandra's SuperColumns don't appear to be

> > ideal for storing lists of things; they store maps, which isn'ta

> > terrible problem, but it feels like a bit of a mismatch in my design.

> > A SuperColumn mapping from user_ids to an empty byte array seems like

> > it should work pretty efficiently for checking whether a user has

> > permissions on an Asset, but it also seems pretty evil.

> >

> > The other idea that I have is a seperate CF for AssetPermissions that

> > somehow stores pairs of asset_ids and user_names. I don't know what

> > I'd use for a key in that situation, so I haven't really gotten too

> > far in seeing what else is broken with that idea. I think it would

> > get around the race condition, but I don't know how to do it, and I'm

> > not sure how efficient it could be.

> >

> > What do people normally use in this situation? I assume it's a pretty

> > common problem, but I haven't see it in the various data modelling

> > examples on the Wiki.



> I'm wondering, is my question too vague, too specific, off topic for

> this list, or answered in the docs somewhere that I missed?

Re: Modelling assets and user permissions

Posted by tsuraan <ts...@gmail.com>.

> Suppose I have a CF that holds some sort of assets that some users of
> my program have access to, and that some do not.  In SQL-ish terms it
> would look something like this:
>
> TABLE Assets (
>  asset_id serial primary key,
>  ...
> );
>
> TABLE Users (
>  user_id serial primary key,
>  user_name text
> );
>
> TABLE Permissions (
>  asset_id integer references(Assets),
>  user_id integer references(Users)
> )
>
> Now, I can generate UUIDs for my asset keys without any trouble, so
> the serial that I have in my pseudo-SQL Assets table isn't a problem.
> My problem is that I can't see a good way to model the relationship
> between user ids and assets.  I see one way to do this, which has
> problems, and I think I sort of see a second way.
>
> The obvious way to do it is have the Assets CF have a SuperColumn that
> somehow enumerates the users allowed to see it, so when retrieving a
> specific Asset I can retrieve the users list and ensure that the user
> doing the request is allowed to see it.  This has quite a few
> problems.  The foremost is that Cassandra doesn't appear to have much
> for conflict resolution (at least I can't find any docs on it), so if
> two processes try to add permissions to the same Asset, it looks like
> one process will win and I have no idea what happens to the loser.
> Another problem is that Cassandra's SuperColumns don't appear to be
> ideal for storing lists of things; they store maps, which isn't a
> terrible problem, but it feels like a bit of a mismatch in my design.
> A SuperColumn mapping from user_ids to an empty byte array seems like
> it should work pretty efficiently for checking whether a user has
> permissions on an Asset, but it also seems pretty evil.
>
> The other idea that I have is a seperate CF for AssetPermissions that
> somehow stores pairs of asset_ids and user_names.  I don't know what
> I'd use for a key in that situation, so I haven't really gotten too
> far in seeing what else is broken with that idea.  I think it would
> get around the race condition, but I don't know how to do it, and I'm
> not sure how efficient it could be.
>
> What do people normally use in this situation?  I assume it's a pretty
> common problem, but I haven't see it in the various data modelling
> examples on the Wiki.

I'm wondering, is my question too vague, too specific, off topic for
this list, or answered in the docs somewhere that I missed?