You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cassandra.apache.org by Jun Rao <ju...@almaden.ibm.com> on 2009/03/24 18:21:33 UTC

secondary index support in Cassandra


We have an application that has groups and entities. A group has many
entities and an entity has a bunch of (attribute, value) pairs. A common
access pattern is to select some number of entities within a group with
attribute X equals to x and ordered by attribute Y. For efficiency, we want
to build a secondary index for each group and collocate a group and its
secondary index on the same node. Our current approach is to map a group to
a row in Cassandra and each entity to a column in a column family (CF).
Within the same row, we use a separate CF (ordered by name) to implement a
secondary index, say on attribute X and Y. In this family, each column name
has the form of X:x:Y:y:entityID. We extended the get_slice() function so
that it can get a slice of columns starting from a given column. The
extended function uses the column-level index to locate the starting column
quickly. (We'd be happy to contribute this extension back to Cassandra if
people find this useful). Using the extended get_slice(), we were able to
access the entities through the simulated secondary index.

We see a couple of problems with the current approach. First, our
application has to maintain the index. This is inefficient and could leave
the index inconsistent when failure occurs.  Second, mapping each entity to
a column may not be a good idea. Often, there is some sort of locking for
each row access. Putting many entities per row limits concurrency. Today,
in Cassandra, a full row is deserialized into memory during compaction.
This limits the number of entities that can be put in a single row. Also,
intuitively, an entity is more naturally represented as a row with
attributes stored as columns.

To address the above problems, we are thinking of the following new
implementation. Each entity is mapped to a row in Cassandra and uses a
two-part key (groupID, entityID). We use the groupID to hash an entity to a
node. This way, all entities for a group will be collocated in the same
node. We then define a special CF to serve as the secondary index. In the
definition, we specify what entity attributes need to be indexed  and in
what order. Within a node, this special CF will index all rows stored
locally. Every time we insert a new entity, the server automatically
extracts the index key based on the index definition (for example, the
index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey)
and add the index entry to the special CF. We can then access the entities
using an extended version of the query language in Cassandra. For example,
if we issue the following query and there is an index defined by
(attributeX, attributeY), the query can be evaluated using the index in the
special CF. (Note that AppEngine supports this flavor of queries.)

select attributeZ
from ROWS(HASH = hash(groupID))
where attributeX="x"
order by attributeY desc
limit 50

We are in the middle of prototyping this approach. We'd like to hear if
other people are interested in this too or if people think there are better
alternatives.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com

Re: secondary index support in Cassandra

Posted by Avinash Lakshman <av...@gmail.com>.
I think Prashant brought up some very good points. The response would be
very helpful to understand the best way to do this.
Avinash

On Tue, Mar 24, 2009 at 6:33 PM, Jun Rao <ju...@almaden.ibm.com> wrote:

> Jonathan,
>
> Thanks for the comments.
>
> I agree with your first point. It will be useful to plug in a user-defined
> index analyzer. The analyzer takes a row with the indexed columns and can
> extract whatever index keys that it likes. This way, an application can
> choose what to index for different data types.
>
> As for queries vs. low-level api, we can make both available to the
> application developer. In general, what can be done in a single query may
> have to be translated to multiple low-level api calls. Some apps may prefer
> the former for efficiency.
>
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> junrao@almaden.ibm.com
>
> [image: Inactive hide details for Jonathan Ellis <jb...@gmail.com>]
> Jonathan Ellis <jb...@gmail.com>
>
>
>
>     *Jonathan Ellis <jb...@gmail.com>*
>
>             03/24/2009 10:48 AM
>             Please respond to
>             cassandra-dev@incubator.apache.org
>
>
>
> To
>
> cassandra-dev@incubator.apache.org
> cc
>
>
> Subject
>
> Re: secondary index support in Cassandra
>
>
> This adds a lot of complexity but I definitely see people wanting easy
> indexing out of the box.  So +1 in principle.
>
> A few high-level comments:
>
> First, for maximum flexibility, you probably want to allow indexes to
> be defined in code.  That is, you'd define something like
>
>  <ColumnFamily name="foo">
>    <Index generator="com.ibm.cassandra.indexGenerator"/>
>  </ColumnFamily>
>
> and allow index generators to be loaded at runtime.  Nobody else is
> going to need the specific case of
> hash(rowkey):attribute1:attribute2:rowkey so abstract that out and
> make it pluggable for whatever weird-ass requirements people have.
>
> Second, I'm not a fan of queries by parsing strings.  The whole rdbms
> world has been moving _away_ from SQL and towards OO interfaces for
> the last 10 years.  I like the thrift API for this reason.  (It is a
> little clunky in Java, but _everything_ is a little clunky in Java.
> Much better in Python/Ruby/etc.)
>
> Finally, as an implementation detail, Cassandra already does too much
> in-memory when writing and merging sstables.  Don't make it worse. :)
>
> -Jonathan
>
> P.S. the partitioner abstraction layer in CASSANDRA-3 will allow you
> to do the per-node grouping you want without weird contortions.
>
> On Tue, Mar 24, 2009 at 11:21 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
> > To address the above problems, we are thinking of the following new
> > implementation. Each entity is mapped to a row in Cassandra and uses a
> > two-part key (groupID, entityID). We use the groupID to hash an entity to
> a
> > node. This way, all entities for a group will be collocated in the same
> > node. We then define a special CF to serve as the secondary index. In the
> > definition, we specify what entity attributes need to be indexed  and in
> > what order. Within a node, this special CF will index all rows stored
> > locally. Every time we insert a new entity, the server automatically
> > extracts the index key based on the index definition (for example, the
> > index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey)
> > and add the index entry to the special CF. We can then access the
> entities
> > using an extended version of the query language in Cassandra. For
> example,
> > if we issue the following query and there is an index defined by
> > (attributeX, attributeY), the query can be evaluated using the index in
> the
> > special CF. (Note that AppEngine supports this flavor of queries.)
> >
> > select attributeZ
> > from ROWS(HASH = hash(groupID))
> > where attributeX="x"
> > order by attributeY desc
> > limit 50
> >
> > We are in the middle of prototyping this approach. We'd like to hear if
> > other people are interested in this too or if people think there are
> better
> > alternatives.
> >
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com
>
>

Re: secondary index support in Cassandra

Posted by Brian McCallister <br...@skife.org>.
On Tue, Mar 24, 2009 at 9:39 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Tue, Mar 24, 2009 at 8:33 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
>> As for queries vs. low-level api, we can make both available to the
>> application developer.
>
> "Just do both" is almost always a cop-out. :)

Define a query AST and then don't care how the AST is generated :-)

-Brian

Re: secondary index support in Cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Mar 24, 2009 at 8:33 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
> As for queries vs. low-level api, we can make both available to the
> application developer.

"Just do both" is almost always a cop-out. :)

-Jonathan

Re: secondary index support in Cassandra

Posted by Jun Rao <ju...@almaden.ibm.com>.
Jonathan,

Thanks for the comments.

I agree with your first point. It will be useful to plug in a user-defined
index analyzer. The analyzer takes a row with the indexed columns and can
extract whatever index keys that it likes. This way, an application can
choose what to index for different data types.

As for queries vs. low-level api, we can make both available to the
application developer. In general, what can be done in a single query may
have to be translated to multiple low-level api calls. Some apps may prefer
the former for efficiency.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com



                                                                           
             Jonathan Ellis                                                
             <jbellis@gmail.co                                             
             m>                                                         To 
                                       cassandra-dev@incubator.apache.org  
             03/24/2009 10:48                                           cc 
             AM                                                            
                                                                   Subject 
                                       Re: secondary index support in      
             Please respond to         Cassandra                           
             cassandra-dev@inc                                             
             ubator.apache.org                                             
                                                                           
                                                                           
                                                                           
                                                                           





This adds a lot of complexity but I definitely see people wanting easy
indexing out of the box.  So +1 in principle.

A few high-level comments:

First, for maximum flexibility, you probably want to allow indexes to
be defined in code.  That is, you'd define something like

  <ColumnFamily name="foo">
    <Index generator="com.ibm.cassandra.indexGenerator"/>
  </ColumnFamily>

and allow index generators to be loaded at runtime.  Nobody else is
going to need the specific case of
hash(rowkey):attribute1:attribute2:rowkey so abstract that out and
make it pluggable for whatever weird-ass requirements people have.

Second, I'm not a fan of queries by parsing strings.  The whole rdbms
world has been moving _away_ from SQL and towards OO interfaces for
the last 10 years.  I like the thrift API for this reason.  (It is a
little clunky in Java, but _everything_ is a little clunky in Java.
Much better in Python/Ruby/etc.)

Finally, as an implementation detail, Cassandra already does too much
in-memory when writing and merging sstables.  Don't make it worse. :)

-Jonathan

P.S. the partitioner abstraction layer in CASSANDRA-3 will allow you
to do the per-node grouping you want without weird contortions.

On Tue, Mar 24, 2009 at 11:21 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
> To address the above problems, we are thinking of the following new
> implementation. Each entity is mapped to a row in Cassandra and uses a
> two-part key (groupID, entityID). We use the groupID to hash an entity to
a
> node. This way, all entities for a group will be collocated in the same
> node. We then define a special CF to serve as the secondary index. In the
> definition, we specify what entity attributes need to be indexed  and in
> what order. Within a node, this special CF will index all rows stored
> locally. Every time we insert a new entity, the server automatically
> extracts the index key based on the index definition (for example, the
> index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey)
> and add the index entry to the special CF. We can then access the
entities
> using an extended version of the query language in Cassandra. For
example,
> if we issue the following query and there is an index defined by
> (attributeX, attributeY), the query can be evaluated using the index in
the
> special CF. (Note that AppEngine supports this flavor of queries.)
>
> select attributeZ
> from ROWS(HASH = hash(groupID))
> where attributeX="x"
> order by attributeY desc
> limit 50
>
> We are in the middle of prototyping this approach. We'd like to hear if
> other people are interested in this too or if people think there are
better
> alternatives.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com

Re: secondary index support in Cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
This adds a lot of complexity but I definitely see people wanting easy
indexing out of the box.  So +1 in principle.

A few high-level comments:

First, for maximum flexibility, you probably want to allow indexes to
be defined in code.  That is, you'd define something like

  <ColumnFamily name="foo">
    <Index generator="com.ibm.cassandra.indexGenerator"/>
  </ColumnFamily>

and allow index generators to be loaded at runtime.  Nobody else is
going to need the specific case of
hash(rowkey):attribute1:attribute2:rowkey so abstract that out and
make it pluggable for whatever weird-ass requirements people have.

Second, I'm not a fan of queries by parsing strings.  The whole rdbms
world has been moving _away_ from SQL and towards OO interfaces for
the last 10 years.  I like the thrift API for this reason.  (It is a
little clunky in Java, but _everything_ is a little clunky in Java.
Much better in Python/Ruby/etc.)

Finally, as an implementation detail, Cassandra already does too much
in-memory when writing and merging sstables.  Don't make it worse. :)

-Jonathan

P.S. the partitioner abstraction layer in CASSANDRA-3 will allow you
to do the per-node grouping you want without weird contortions.

On Tue, Mar 24, 2009 at 11:21 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
> To address the above problems, we are thinking of the following new
> implementation. Each entity is mapped to a row in Cassandra and uses a
> two-part key (groupID, entityID). We use the groupID to hash an entity to a
> node. This way, all entities for a group will be collocated in the same
> node. We then define a special CF to serve as the secondary index. In the
> definition, we specify what entity attributes need to be indexed  and in
> what order. Within a node, this special CF will index all rows stored
> locally. Every time we insert a new entity, the server automatically
> extracts the index key based on the index definition (for example, the
> index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey)
> and add the index entry to the special CF. We can then access the entities
> using an extended version of the query language in Cassandra. For example,
> if we issue the following query and there is an index defined by
> (attributeX, attributeY), the query can be evaluated using the index in the
> special CF. (Note that AppEngine supports this flavor of queries.)
>
> select attributeZ
> from ROWS(HASH = hash(groupID))
> where attributeX="x"
> order by attributeY desc
> limit 50
>
> We are in the middle of prototyping this approach. We'd like to hear if
> other people are interested in this too or if people think there are better
> alternatives.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com

Re: secondary index support in Cassandra

Posted by Avinash Lakshman <av...@gmail.com>.
Prashant had many other points out there whereby you don't need your second
approach. I guess that is what I was referring to.
Avinash

On Tue, Mar 24, 2009 at 8:19 PM, Jun Rao <ju...@almaden.ibm.com> wrote:

> Prashant,
>
> I forgot about another point you mentioned.
>
> In the new approach, carving out a chunk of data by hash (needed for node
> removal/addition) may not be efficient. In the worse case, we have to make a
> full scan of the data. It is possible to make it more efficient by following
> the strategy that you guys implemented when using the random hash function:
> prefixing each index key with the hash value. On the other hand, I am
> wondering if we really need to worry about the performance on
> adding/removing nodes. These are infrequent events and are also
> non-blocking.
>
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> junrao@almaden.ibm.com
>
> [image: Inactive hide details for Prashant Malik <pm...@gmail.com>]
> Prashant Malik <pm...@gmail.com>
>
>
>
>     *Prashant Malik <pm...@gmail.com>*
>
>             03/24/2009 11:34 AM
>              Please respond to
>             cassandra-dev@incubator.apache.org
>
>
>
> To
>
> cassandra-dev@incubator.apache.org
> cc
>
>
> Subject
>
> Re: secondary index support in Cassandra
> Some questions Iline
>
> On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
>
> >
> >
> > We have an application that has groups and entities. A group has many
> > entities and an entity has a bunch of (attribute, value) pairs. A common
> > access pattern is to select some number of entities within a group with
> > attribute X equals to x and ordered by attribute Y. For efficiency, we
> want
> > to build a secondary index for each group and collocate a group and its
> > secondary index on the same node. Our current approach is to map a group
> to
> > a row in Cassandra and each entity to a column in a column family (CF).
> > Within the same row, we use a separate CF (ordered by name) to implement
> a
> > secondary index, say on attribute X and Y. In this family, each column
> name
> > has the form of X:x:Y:y:entityID. We extended the get_slice() function so
> > that it can get a slice of columns starting from a given column. The
> > extended function uses the column-level index to locate the starting
> column
> > quickly. (We'd be happy to contribute this extension back to Cassandra if
> > people find this useful). Using the extended get_slice(), we were able to
> > access the entities through the simulated secondary index.
> >
> > We see a couple of problems with the current approach. First, our
> > application has to maintain the index. This is inefficient and could
> leave
> > the index inconsistent when failure occurs.  Second, mapping each entity
> to
> > a column may not be a good idea. Often, there is some sort of locking for
> > each row access. Putting many entities per row limits concurrency. Today,
> > in Cassandra, a full row is deserialized into memory during compaction.
> > This limits the number of entities that can be put in a single row. Also,
> > intuitively, an entity is more naturally represented as a row with
> > attributes stored as columns.
>
>
> Prashant
>
>   1. Application can send the index  and the entityId update in the same
> write , a write per row is always atomic given that teh index and the data
> have teh same key in the above
>       case the index will not be out of sync.
>    2. The maintainance of index by app can be moved into cassandra and I
> agree with fact that you can add support for it by a built in special CF
> which you have to do in either of the approaches you are taking.
>        Infact in the first approach that you are taking it will be easier
> to move the indexes in case of adding nodes to the cluster and when files
> are split and data is sent over. In the second approach this
>        process could get comlicated.
>    3. There is no locking for row access in cassandra.
>    4. Compactions can be opmtimized for name sorted columns , this is one
> of the workitems we have where we do not deserialize the entire row in
> compactiopn but only do it in slices , this is easily achievable for
>        name sorted columns.
>    5. The entity can still be represented naturally as a supercolumn where
> each of the super column name is the entity Id and each of the columns in
> the supercolumn are attribute value pairs.
>    6.  How many entities per  groupid are we talking about here ? why do
> you feel concurrency is limited by entities per row ?
>
>
> >
> > To address the above problems, we are thinking of the following new
> > implementation. Each entity is mapped to a row in Cassandra and uses a
> > two-part key (groupID, entityID). We use the groupID to hash an entity to
> a
> > node. This way, all entities for a group will be collocated in the same
> > node. We then define a special CF to serve as the secondary index. In the
> > definition, we specify what entity attributes need to be indexed  and in
> > what order. Within a node, this special CF will index all rows stored
> > locally. Every time we insert a new entity, the server automatically
> > extracts the index key based on the index definition (for example, the
> > index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey)
> > and add the index entry to the special CF. We can then access the
> entities
> > using an extended version of the query language in Cassandra. For
> example,
> > if we issue the following query and there is an index defined by
> > (attributeX, attributeY), the query can be evaluated using the index in
> the
> > special CF. (Note that AppEngine supports this flavor of queries.)
> >
> > select attributeZ
> > from ROWS(HASH = hash(groupID))
> > where attributeX="x"
> > order by attributeY desc
> > limit 50
> >
> > We are in the middle of prototyping this approach. We'd like to hear if
> > other people are interested in this too or if people think there are
> better
> > alternatives.
> >
>
>
> Prashant
>
>   1. In this approach I see a number of problems during addition , removal
> and moving of  new nodes. These problems are not impossible to solve but
> just harder and inefficient with the above approach.
>   2. What are the wins of this approach over the other is not clear to me.
> could you please highlight those.
>
>
> >
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com
>
>

Re: secondary index support in Cassandra

Posted by Sandeep Tata <sa...@gmail.com>.
The compaction optimization that Prashant mentioned is likely to solve
many of the problems that Jun brings up.

We were thinking of tackling this problem ... I've opened a ticket in
JIRA (https://issues.apache.org/jira/browse/CASSANDRA-16)

Avinash, Prashant -- If you guys are already working on it, feel free
to assign it to yourself. Otherwise we'll sketch out a plan and send
it out, if the community agrees on the idea, we can start hacking
away.

Sandeep


On Wed, Mar 25, 2009 at 10:50 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
>
> Some comments inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
>
>
> Avinash Lakshman <av...@gmail.com> wrote on 03/24/2009 10:08:45
> PM:
>
>> Comments inline.
>>
>> On Tue, Mar 24, 2009 at 6:53 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
>>
>> >
>> > Prashant,
>> >
>> > Thanks for the comments. They are quite useful. Let me try to address
> some
>> > of the points that you made.
>> >
>> > 1. It is true that in our current implementation, we can glue the
> changes
>> > on both the data and the index in one batch_update() call. This way,
> the
>> > data and the index will be maintained synchronously. However,
> maintaining
>> > the index on the server is likely more efficient since there is less
>> > communication overhead. You seem to agree with this.
>>
>>
>> [Avinash] You can update multiple column families for a single key in one
>> mutation.
>>
>> >
>> >
>> > 2. Cassandra currently doesn't acquire row-lock for row accesses.
> However,
>> > the implication is that a reader may see partial updates of a row. For
>> > example, suppose that a writer updates two columns in different CFs.
> Then,
>> > it is possible for a concurrent reader to see the update on one column,
> but
>> > not the other one. For some applications, row-level consistency could
> be
>> > important. It's probably for this reason, in HBase, a region server
>> > acquires a row lock for every read and write.
>>
>>
>> [Avinash] Updates to a single row within a machine are atomic. Which
> means
>> what you are stating will not happen. Writes and reads will be serialized
> at
>> the Memtable.
>
> This problem doesn't show up in Cassandra today because there is no method
> that can read columns from different CFs in a row. If there were such a
> method, it would be hard to enforce that a reader always sees a complete
> update (updating multiple CFs) without some sort of row locks.
>
>>
>> >
>> >
>> > 3. For our current application, the size of all entities in a group is
> not
>> > too large and likely fits within the capacity of a single node.
> However,
>> > for other applications, being able to scale a group to more than a node
>> > could be useful. Storing a group within a single row will prevent
> scaling
>> > out the group.
>>
>> [Avinash] I guess the question is how many entities do you envision in a
>> group. What do you mean by fitting into one node?
>>
>
> A large group may not fit in memory, but should fit in a commodity disk.
> The compaction optimization Prashant mentioned will definitely make our
> current approach more feasible.
>
> However, in general, I am a bit concerned about putting too much stuff
> within a row. A row is a unit that has finite capacity and a user shouldn't
> expect to put an infinite number of columns within a row. I actually like
> the current assumption in Cassandra that a row has to fit in memory since
> it simplifies the implementation. On the other hand, a table can have
> arbitrary capacity (one just need to provision enough nodes in the cluster)
> and it can have as many rows as you want.
>
>> >
>> >
>> > Jun
>> > IBM Almaden Research Center
>> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>> >
>> > junrao@almaden.ibm.com
>> >
>> >
>> > Prashant Malik <pm...@gmail.com> wrote on 03/24/2009 11:34:51 AM:
>> >
>> > > Some questions Iline
>> > >
>> > > On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <ju...@almaden.ibm.com>
>> > wrote:
>> > >
>> > > >
>> > > >
>> > > > We have an application that has groups and entities. A group has
> many
>> > > > entities and an entity has a bunch of (attribute, value) pairs. A
>> > common
>> > > > access pattern is to select some number of entities within a group
> with
>> > > > attribute X equals to x and ordered by attribute Y. For efficiency,
> we
>> > want
>> > > > to build a secondary index for each group and collocate a group and
> its
>> > > > secondary index on the same node. Our current approach is to map a
>> > group to
>> > > > a row in Cassandra and each entity to a column in a column family
> (CF).
>> > > > Within the same row, we use a separate CF (ordered by name) to
>> > implement a
>> > > > secondary index, say on attribute X and Y. In this family, each
> column
>> > name
>> > > > has the form of X:x:Y:y:entityID. We extended the get_slice()
> function
>> > so
>> > > > that it can get a slice of columns starting from a given column.
> The
>> > > > extended function uses the column-level index to locate the
> starting
>> > column
>> > > > quickly. (We'd be happy to contribute this extension back to
> Cassandra
>> > if
>> > > > people find this useful). Using the extended get_slice(), we were
> able
>> > to
>> > > > access the entities through the simulated secondary index.
>> > > >
>> > > > We see a couple of problems with the current approach. First, our
>> > > > application has to maintain the index. This is inefficient and
> could
>> > leave
>> > > > the index inconsistent when failure occurs.  Second, mapping each
>> > entity to
>> > > > a column may not be a good idea. Often, there is some sort of
> locking
>> > for
>> > > > each row access. Putting many entities per row limits concurrency.
>> > Today,
>> > > > in Cassandra, a full row is deserialized into memory during
> compaction.
>> > > > This limits the number of entities that can be put in a single row.
>> > Also,
>> > > > intuitively, an entity is more naturally represented as a row with
>> > > > attributes stored as columns.
>> > >
>> > >
>> > > Prashant
>> > >
>> > >    1. Application can send the index  and the entityId update in the
> same
>> > > write , a write per row is always atomic given that teh index and the
>> > data
>> > > have teh same key in the above
>> > >        case the index will not be out of sync.
>> > >     2. The maintainance of index by app can be moved into cassandra
> and I
>> > > agree with fact that you can add support for it by a built in special
> CF
>> > > which you have to do in either of the approaches you are taking.
>> > >         Infact in the first approach that you are taking it will be
>> > easier
>> > > to move the indexes in case of adding nodes to the cluster and when
> files
>> > > are split and data is sent over. In the second approach this
>> > >         process could get comlicated.
>> > >     3. There is no locking for row access in cassandra.
>> > >     4. Compactions can be opmtimized for name sorted columns , this
> is
>> > one
>> > > of the workitems we have where we do not deserialize the entire row
> in
>> > > compactiopn but only do it in slices , this is easily achievable for
>> > >         name sorted columns.
>> > >     5. The entity can still be represented naturally as a supercolumn
>> > where
>> > > each of the super column name is the entity Id and each of the
> columns in
>> > > the supercolumn are attribute value pairs.
>> > >     6.  How many entities per  groupid are we talking about here ?
> why do
>> > > you feel concurrency is limited by entities per row ?
>> > >
>> > >
>> > > >
>> > > > To address the above problems, we are thinking of the following new
>> > > > implementation. Each entity is mapped to a row in Cassandra and
> uses a
>> > > > two-part key (groupID, entityID). We use the groupID to hash an
> entity
>> > to a
>> > > > node. This way, all entities for a group will be collocated in the
> same
>> > > > node. We then define a special CF to serve as the secondary index.
> In
>> > the
>> > > > definition, we specify what entity attributes need to be indexed
> and
>> > in
>> > > > what order. Within a node, this special CF will index all rows
> stored
>> > > > locally. Every time we insert a new entity, the server
> automatically
>> > > > extracts the index key based on the index definition (for example,
> the
>> > > > index key can be of the form "hash
>> > (rowkey):attribute1:attribute2:rowkey)
>> > > > and add the index entry to the special CF. We can then access the
>> > entities
>> > > > using an extended version of the query language in Cassandra. For
>> > example,
>> > > > if we issue the following query and there is an index defined by
>> > > > (attributeX, attributeY), the query can be evaluated using the
> index in
>> > the
>> > > > special CF. (Note that AppEngine supports this flavor of queries.)
>> > > >
>> > > > select attributeZ
>> > > > from ROWS(HASH = hash(groupID))
>> > > > where attributeX="x"
>> > > > order by attributeY desc
>> > > > limit 50
>> > > >
>> > > > We are in the middle of prototyping this approach. We'd like to
> hear if
>> > > > other people are interested in this too or if people think there
> are
>> > better
>> > > > alternatives.
>> > > >
>> > >
>> > >
>> > > Prashant
>> > >
>> > >    1. In this approach I see a number of problems during addition ,
>> > removal
>> > > and moving of  new nodes. These problems are not impossible to solve
> but
>> > > just harder and inefficient with the above approach.
>> > >    2. What are the wins of this approach over the other is not clear
> to
>> > me.
>> > > could you please highlight those.
>> > >
>> > >
>> > > >
>> > > > Jun
>> > > > IBM Almaden Research Center
>> > > > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>> > > >
>> > > > junrao@almaden.ibm.com
>> >

Re: secondary index support in Cassandra

Posted by Avinash Lakshman <av...@gmail.com>.
That would depend on the app. What you are describing is serializability?
Which was never one of the design goals. Locking business will not help over
here. You could write to a replica and read from another to which the write
has not yet propagated and you will see the same issues. Locking will help
in a single machine but when you go across machines. Client app should be
able to handle this and this was the motivation. No strict guarantees
Avinash

On Wed, Mar 25, 2009 at 10:50 AM, Jun Rao <ju...@almaden.ibm.com> wrote:

>
> Some comments inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
>
>
> Avinash Lakshman <av...@gmail.com> wrote on 03/24/2009 10:08:45
> PM:
>
> > Comments inline.
> >
> > On Tue, Mar 24, 2009 at 6:53 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
> >
> > >
> > > Prashant,
> > >
> > > Thanks for the comments. They are quite useful. Let me try to address
> some
> > > of the points that you made.
> > >
> > > 1. It is true that in our current implementation, we can glue the
> changes
> > > on both the data and the index in one batch_update() call. This way,
> the
> > > data and the index will be maintained synchronously. However,
> maintaining
> > > the index on the server is likely more efficient since there is less
> > > communication overhead. You seem to agree with this.
> >
> >
> > [Avinash] You can update multiple column families for a single key in one
> > mutation.
> >
> > >
> > >
> > > 2. Cassandra currently doesn't acquire row-lock for row accesses.
> However,
> > > the implication is that a reader may see partial updates of a row. For
> > > example, suppose that a writer updates two columns in different CFs.
> Then,
> > > it is possible for a concurrent reader to see the update on one column,
> but
> > > not the other one. For some applications, row-level consistency could
> be
> > > important. It's probably for this reason, in HBase, a region server
> > > acquires a row lock for every read and write.
> >
> >
> > [Avinash] Updates to a single row within a machine are atomic. Which
> means
> > what you are stating will not happen. Writes and reads will be serialized
> at
> > the Memtable.
>
> This problem doesn't show up in Cassandra today because there is no method
> that can read columns from different CFs in a row. If there were such a
> method, it would be hard to enforce that a reader always sees a complete
> update (updating multiple CFs) without some sort of row locks.
>
> >
> > >
> > >
> > > 3. For our current application, the size of all entities in a group is
> not
> > > too large and likely fits within the capacity of a single node.
> However,
> > > for other applications, being able to scale a group to more than a node
> > > could be useful. Storing a group within a single row will prevent
> scaling
> > > out the group.
> >
> > [Avinash] I guess the question is how many entities do you envision in a
> > group. What do you mean by fitting into one node?
> >
>
> A large group may not fit in memory, but should fit in a commodity disk.
> The compaction optimization Prashant mentioned will definitely make our
> current approach more feasible.
>
> However, in general, I am a bit concerned about putting too much stuff
> within a row. A row is a unit that has finite capacity and a user shouldn't
> expect to put an infinite number of columns within a row. I actually like
> the current assumption in Cassandra that a row has to fit in memory since
> it simplifies the implementation. On the other hand, a table can have
> arbitrary capacity (one just need to provision enough nodes in the cluster)
> and it can have as many rows as you want.
>
> > >
> > >
> > > Jun
> > > IBM Almaden Research Center
> > > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> > >
> > > junrao@almaden.ibm.com
> > >
> > >
> > > Prashant Malik <pm...@gmail.com> wrote on 03/24/2009 11:34:51 AM:
> > >
> > > > Some questions Iline
> > > >
> > > > On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <ju...@almaden.ibm.com>
> > > wrote:
> > > >
> > > > >
> > > > >
> > > > > We have an application that has groups and entities. A group has
> many
> > > > > entities and an entity has a bunch of (attribute, value) pairs. A
> > > common
> > > > > access pattern is to select some number of entities within a group
> with
> > > > > attribute X equals to x and ordered by attribute Y. For efficiency,
> we
> > > want
> > > > > to build a secondary index for each group and collocate a group and
> its
> > > > > secondary index on the same node. Our current approach is to map a
> > > group to
> > > > > a row in Cassandra and each entity to a column in a column family
> (CF).
> > > > > Within the same row, we use a separate CF (ordered by name) to
> > > implement a
> > > > > secondary index, say on attribute X and Y. In this family, each
> column
> > > name
> > > > > has the form of X:x:Y:y:entityID. We extended the get_slice()
> function
> > > so
> > > > > that it can get a slice of columns starting from a given column.
> The
> > > > > extended function uses the column-level index to locate the
> starting
> > > column
> > > > > quickly. (We'd be happy to contribute this extension back to
> Cassandra
> > > if
> > > > > people find this useful). Using the extended get_slice(), we were
> able
> > > to
> > > > > access the entities through the simulated secondary index.
> > > > >
> > > > > We see a couple of problems with the current approach. First, our
> > > > > application has to maintain the index. This is inefficient and
> could
> > > leave
> > > > > the index inconsistent when failure occurs.  Second, mapping each
> > > entity to
> > > > > a column may not be a good idea. Often, there is some sort of
> locking
> > > for
> > > > > each row access. Putting many entities per row limits concurrency.
> > > Today,
> > > > > in Cassandra, a full row is deserialized into memory during
> compaction.
> > > > > This limits the number of entities that can be put in a single row.
> > > Also,
> > > > > intuitively, an entity is more naturally represented as a row with
> > > > > attributes stored as columns.
> > > >
> > > >
> > > > Prashant
> > > >
> > > >    1. Application can send the index  and the entityId update in the
> same
> > > > write , a write per row is always atomic given that teh index and the
> > > data
> > > > have teh same key in the above
> > > >        case the index will not be out of sync.
> > > >     2. The maintainance of index by app can be moved into cassandra
> and I
> > > > agree with fact that you can add support for it by a built in special
> CF
> > > > which you have to do in either of the approaches you are taking.
> > > >         Infact in the first approach that you are taking it will be
> > > easier
> > > > to move the indexes in case of adding nodes to the cluster and when
> files
> > > > are split and data is sent over. In the second approach this
> > > >         process could get comlicated.
> > > >     3. There is no locking for row access in cassandra.
> > > >     4. Compactions can be opmtimized for name sorted columns , this
> is
> > > one
> > > > of the workitems we have where we do not deserialize the entire row
> in
> > > > compactiopn but only do it in slices , this is easily achievable for
> > > >         name sorted columns.
> > > >     5. The entity can still be represented naturally as a supercolumn
> > > where
> > > > each of the super column name is the entity Id and each of the
> columns in
> > > > the supercolumn are attribute value pairs.
> > > >     6.  How many entities per  groupid are we talking about here ?
> why do
> > > > you feel concurrency is limited by entities per row ?
> > > >
> > > >
> > > > >
> > > > > To address the above problems, we are thinking of the following new
> > > > > implementation. Each entity is mapped to a row in Cassandra and
> uses a
> > > > > two-part key (groupID, entityID). We use the groupID to hash an
> entity
> > > to a
> > > > > node. This way, all entities for a group will be collocated in the
> same
> > > > > node. We then define a special CF to serve as the secondary index.
> In
> > > the
> > > > > definition, we specify what entity attributes need to be indexed
> and
> > > in
> > > > > what order. Within a node, this special CF will index all rows
> stored
> > > > > locally. Every time we insert a new entity, the server
> automatically
> > > > > extracts the index key based on the index definition (for example,
> the
> > > > > index key can be of the form "hash
> > > (rowkey):attribute1:attribute2:rowkey)
> > > > > and add the index entry to the special CF. We can then access the
> > > entities
> > > > > using an extended version of the query language in Cassandra. For
> > > example,
> > > > > if we issue the following query and there is an index defined by
> > > > > (attributeX, attributeY), the query can be evaluated using the
> index in
> > > the
> > > > > special CF. (Note that AppEngine supports this flavor of queries.)
> > > > >
> > > > > select attributeZ
> > > > > from ROWS(HASH = hash(groupID))
> > > > > where attributeX="x"
> > > > > order by attributeY desc
> > > > > limit 50
> > > > >
> > > > > We are in the middle of prototyping this approach. We'd like to
> hear if
> > > > > other people are interested in this too or if people think there
> are
> > > better
> > > > > alternatives.
> > > > >
> > > >
> > > >
> > > > Prashant
> > > >
> > > >    1. In this approach I see a number of problems during addition ,
> > > removal
> > > > and moving of  new nodes. These problems are not impossible to solve
> but
> > > > just harder and inefficient with the above approach.
> > > >    2. What are the wins of this approach over the other is not clear
> to
> > > me.
> > > > could you please highlight those.
> > > >
> > > >
> > > > >
> > > > > Jun
> > > > > IBM Almaden Research Center
> > > > > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> > > > >
> > > > > junrao@almaden.ibm.com
> > >
>

Re: secondary index support in Cassandra

Posted by Jun Rao <ju...@almaden.ibm.com>.
Some comments inlined below.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com


Avinash Lakshman <av...@gmail.com> wrote on 03/24/2009 10:08:45
PM:

> Comments inline.
>
> On Tue, Mar 24, 2009 at 6:53 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
>
> >
> > Prashant,
> >
> > Thanks for the comments. They are quite useful. Let me try to address
some
> > of the points that you made.
> >
> > 1. It is true that in our current implementation, we can glue the
changes
> > on both the data and the index in one batch_update() call. This way,
the
> > data and the index will be maintained synchronously. However,
maintaining
> > the index on the server is likely more efficient since there is less
> > communication overhead. You seem to agree with this.
>
>
> [Avinash] You can update multiple column families for a single key in one
> mutation.
>
> >
> >
> > 2. Cassandra currently doesn't acquire row-lock for row accesses.
However,
> > the implication is that a reader may see partial updates of a row. For
> > example, suppose that a writer updates two columns in different CFs.
Then,
> > it is possible for a concurrent reader to see the update on one column,
but
> > not the other one. For some applications, row-level consistency could
be
> > important. It's probably for this reason, in HBase, a region server
> > acquires a row lock for every read and write.
>
>
> [Avinash] Updates to a single row within a machine are atomic. Which
means
> what you are stating will not happen. Writes and reads will be serialized
at
> the Memtable.

This problem doesn't show up in Cassandra today because there is no method
that can read columns from different CFs in a row. If there were such a
method, it would be hard to enforce that a reader always sees a complete
update (updating multiple CFs) without some sort of row locks.

>
> >
> >
> > 3. For our current application, the size of all entities in a group is
not
> > too large and likely fits within the capacity of a single node.
However,
> > for other applications, being able to scale a group to more than a node
> > could be useful. Storing a group within a single row will prevent
scaling
> > out the group.
>
> [Avinash] I guess the question is how many entities do you envision in a
> group. What do you mean by fitting into one node?
>

A large group may not fit in memory, but should fit in a commodity disk.
The compaction optimization Prashant mentioned will definitely make our
current approach more feasible.

However, in general, I am a bit concerned about putting too much stuff
within a row. A row is a unit that has finite capacity and a user shouldn't
expect to put an infinite number of columns within a row. I actually like
the current assumption in Cassandra that a row has to fit in memory since
it simplifies the implementation. On the other hand, a table can have
arbitrary capacity (one just need to provision enough nodes in the cluster)
and it can have as many rows as you want.

> >
> >
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com
> >
> >
> > Prashant Malik <pm...@gmail.com> wrote on 03/24/2009 11:34:51 AM:
> >
> > > Some questions Iline
> > >
> > > On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <ju...@almaden.ibm.com>
> > wrote:
> > >
> > > >
> > > >
> > > > We have an application that has groups and entities. A group has
many
> > > > entities and an entity has a bunch of (attribute, value) pairs. A
> > common
> > > > access pattern is to select some number of entities within a group
with
> > > > attribute X equals to x and ordered by attribute Y. For efficiency,
we
> > want
> > > > to build a secondary index for each group and collocate a group and
its
> > > > secondary index on the same node. Our current approach is to map a
> > group to
> > > > a row in Cassandra and each entity to a column in a column family
(CF).
> > > > Within the same row, we use a separate CF (ordered by name) to
> > implement a
> > > > secondary index, say on attribute X and Y. In this family, each
column
> > name
> > > > has the form of X:x:Y:y:entityID. We extended the get_slice()
function
> > so
> > > > that it can get a slice of columns starting from a given column.
The
> > > > extended function uses the column-level index to locate the
starting
> > column
> > > > quickly. (We'd be happy to contribute this extension back to
Cassandra
> > if
> > > > people find this useful). Using the extended get_slice(), we were
able
> > to
> > > > access the entities through the simulated secondary index.
> > > >
> > > > We see a couple of problems with the current approach. First, our
> > > > application has to maintain the index. This is inefficient and
could
> > leave
> > > > the index inconsistent when failure occurs.  Second, mapping each
> > entity to
> > > > a column may not be a good idea. Often, there is some sort of
locking
> > for
> > > > each row access. Putting many entities per row limits concurrency.
> > Today,
> > > > in Cassandra, a full row is deserialized into memory during
compaction.
> > > > This limits the number of entities that can be put in a single row.
> > Also,
> > > > intuitively, an entity is more naturally represented as a row with
> > > > attributes stored as columns.
> > >
> > >
> > > Prashant
> > >
> > >    1. Application can send the index  and the entityId update in the
same
> > > write , a write per row is always atomic given that teh index and the
> > data
> > > have teh same key in the above
> > >        case the index will not be out of sync.
> > >     2. The maintainance of index by app can be moved into cassandra
and I
> > > agree with fact that you can add support for it by a built in special
CF
> > > which you have to do in either of the approaches you are taking.
> > >         Infact in the first approach that you are taking it will be
> > easier
> > > to move the indexes in case of adding nodes to the cluster and when
files
> > > are split and data is sent over. In the second approach this
> > >         process could get comlicated.
> > >     3. There is no locking for row access in cassandra.
> > >     4. Compactions can be opmtimized for name sorted columns , this
is
> > one
> > > of the workitems we have where we do not deserialize the entire row
in
> > > compactiopn but only do it in slices , this is easily achievable for
> > >         name sorted columns.
> > >     5. The entity can still be represented naturally as a supercolumn
> > where
> > > each of the super column name is the entity Id and each of the
columns in
> > > the supercolumn are attribute value pairs.
> > >     6.  How many entities per  groupid are we talking about here ?
why do
> > > you feel concurrency is limited by entities per row ?
> > >
> > >
> > > >
> > > > To address the above problems, we are thinking of the following new
> > > > implementation. Each entity is mapped to a row in Cassandra and
uses a
> > > > two-part key (groupID, entityID). We use the groupID to hash an
entity
> > to a
> > > > node. This way, all entities for a group will be collocated in the
same
> > > > node. We then define a special CF to serve as the secondary index.
In
> > the
> > > > definition, we specify what entity attributes need to be indexed
and
> > in
> > > > what order. Within a node, this special CF will index all rows
stored
> > > > locally. Every time we insert a new entity, the server
automatically
> > > > extracts the index key based on the index definition (for example,
the
> > > > index key can be of the form "hash
> > (rowkey):attribute1:attribute2:rowkey)
> > > > and add the index entry to the special CF. We can then access the
> > entities
> > > > using an extended version of the query language in Cassandra. For
> > example,
> > > > if we issue the following query and there is an index defined by
> > > > (attributeX, attributeY), the query can be evaluated using the
index in
> > the
> > > > special CF. (Note that AppEngine supports this flavor of queries.)
> > > >
> > > > select attributeZ
> > > > from ROWS(HASH = hash(groupID))
> > > > where attributeX="x"
> > > > order by attributeY desc
> > > > limit 50
> > > >
> > > > We are in the middle of prototyping this approach. We'd like to
hear if
> > > > other people are interested in this too or if people think there
are
> > better
> > > > alternatives.
> > > >
> > >
> > >
> > > Prashant
> > >
> > >    1. In this approach I see a number of problems during addition ,
> > removal
> > > and moving of  new nodes. These problems are not impossible to solve
but
> > > just harder and inefficient with the above approach.
> > >    2. What are the wins of this approach over the other is not clear
to
> > me.
> > > could you please highlight those.
> > >
> > >
> > > >
> > > > Jun
> > > > IBM Almaden Research Center
> > > > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> > > >
> > > > junrao@almaden.ibm.com
> >

Re: secondary index support in Cassandra

Posted by Avinash Lakshman <av...@gmail.com>.
Comments inline.

On Tue, Mar 24, 2009 at 6:53 PM, Jun Rao <ju...@almaden.ibm.com> wrote:

>
> Prashant,
>
> Thanks for the comments. They are quite useful. Let me try to address some
> of the points that you made.
>
> 1. It is true that in our current implementation, we can glue the changes
> on both the data and the index in one batch_update() call. This way, the
> data and the index will be maintained synchronously. However, maintaining
> the index on the server is likely more efficient since there is less
> communication overhead. You seem to agree with this.


[Avinash] You can update multiple column families for a single key in one
mutation.

>
>
> 2. Cassandra currently doesn't acquire row-lock for row accesses. However,
> the implication is that a reader may see partial updates of a row. For
> example, suppose that a writer updates two columns in different CFs. Then,
> it is possible for a concurrent reader to see the update on one column, but
> not the other one. For some applications, row-level consistency could be
> important. It's probably for this reason, in HBase, a region server
> acquires a row lock for every read and write.


[Avinash] Updates to a single row within a machine are atomic. Which means
what you are stating will not happen. Writes and reads will be serialized at
the Memtable.

>
>
> 3. For our current application, the size of all entities in a group is not
> too large and likely fits within the capacity of a single node. However,
> for other applications, being able to scale a group to more than a node
> could be useful. Storing a group within a single row will prevent scaling
> out the group.

[Avinash] I guess the question is how many entities do you envision in a
group. What do you mean by fitting into one node?

>
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
>
>
> Prashant Malik <pm...@gmail.com> wrote on 03/24/2009 11:34:51 AM:
>
> > Some questions Iline
> >
> > On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <ju...@almaden.ibm.com>
> wrote:
> >
> > >
> > >
> > > We have an application that has groups and entities. A group has many
> > > entities and an entity has a bunch of (attribute, value) pairs. A
> common
> > > access pattern is to select some number of entities within a group with
> > > attribute X equals to x and ordered by attribute Y. For efficiency, we
> want
> > > to build a secondary index for each group and collocate a group and its
> > > secondary index on the same node. Our current approach is to map a
> group to
> > > a row in Cassandra and each entity to a column in a column family (CF).
> > > Within the same row, we use a separate CF (ordered by name) to
> implement a
> > > secondary index, say on attribute X and Y. In this family, each column
> name
> > > has the form of X:x:Y:y:entityID. We extended the get_slice() function
> so
> > > that it can get a slice of columns starting from a given column. The
> > > extended function uses the column-level index to locate the starting
> column
> > > quickly. (We'd be happy to contribute this extension back to Cassandra
> if
> > > people find this useful). Using the extended get_slice(), we were able
> to
> > > access the entities through the simulated secondary index.
> > >
> > > We see a couple of problems with the current approach. First, our
> > > application has to maintain the index. This is inefficient and could
> leave
> > > the index inconsistent when failure occurs.  Second, mapping each
> entity to
> > > a column may not be a good idea. Often, there is some sort of locking
> for
> > > each row access. Putting many entities per row limits concurrency.
> Today,
> > > in Cassandra, a full row is deserialized into memory during compaction.
> > > This limits the number of entities that can be put in a single row.
> Also,
> > > intuitively, an entity is more naturally represented as a row with
> > > attributes stored as columns.
> >
> >
> > Prashant
> >
> >    1. Application can send the index  and the entityId update in the same
> > write , a write per row is always atomic given that teh index and the
> data
> > have teh same key in the above
> >        case the index will not be out of sync.
> >     2. The maintainance of index by app can be moved into cassandra and I
> > agree with fact that you can add support for it by a built in special CF
> > which you have to do in either of the approaches you are taking.
> >         Infact in the first approach that you are taking it will be
> easier
> > to move the indexes in case of adding nodes to the cluster and when files
> > are split and data is sent over. In the second approach this
> >         process could get comlicated.
> >     3. There is no locking for row access in cassandra.
> >     4. Compactions can be opmtimized for name sorted columns , this is
> one
> > of the workitems we have where we do not deserialize the entire row in
> > compactiopn but only do it in slices , this is easily achievable for
> >         name sorted columns.
> >     5. The entity can still be represented naturally as a supercolumn
> where
> > each of the super column name is the entity Id and each of the columns in
> > the supercolumn are attribute value pairs.
> >     6.  How many entities per  groupid are we talking about here ? why do
> > you feel concurrency is limited by entities per row ?
> >
> >
> > >
> > > To address the above problems, we are thinking of the following new
> > > implementation. Each entity is mapped to a row in Cassandra and uses a
> > > two-part key (groupID, entityID). We use the groupID to hash an entity
> to a
> > > node. This way, all entities for a group will be collocated in the same
> > > node. We then define a special CF to serve as the secondary index. In
> the
> > > definition, we specify what entity attributes need to be indexed  and
> in
> > > what order. Within a node, this special CF will index all rows stored
> > > locally. Every time we insert a new entity, the server automatically
> > > extracts the index key based on the index definition (for example, the
> > > index key can be of the form "hash
> (rowkey):attribute1:attribute2:rowkey)
> > > and add the index entry to the special CF. We can then access the
> entities
> > > using an extended version of the query language in Cassandra. For
> example,
> > > if we issue the following query and there is an index defined by
> > > (attributeX, attributeY), the query can be evaluated using the index in
> the
> > > special CF. (Note that AppEngine supports this flavor of queries.)
> > >
> > > select attributeZ
> > > from ROWS(HASH = hash(groupID))
> > > where attributeX="x"
> > > order by attributeY desc
> > > limit 50
> > >
> > > We are in the middle of prototyping this approach. We'd like to hear if
> > > other people are interested in this too or if people think there are
> better
> > > alternatives.
> > >
> >
> >
> > Prashant
> >
> >    1. In this approach I see a number of problems during addition ,
> removal
> > and moving of  new nodes. These problems are not impossible to solve but
> > just harder and inefficient with the above approach.
> >    2. What are the wins of this approach over the other is not clear to
> me.
> > could you please highlight those.
> >
> >
> > >
> > > Jun
> > > IBM Almaden Research Center
> > > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> > >
> > > junrao@almaden.ibm.com
>

Re: secondary index support in Cassandra

Posted by Jun Rao <ju...@almaden.ibm.com>.
Prashant,

I forgot about another point you mentioned.

In the new approach, carving out a chunk of data by hash (needed for node
removal/addition) may not be efficient. In the worse case, we have to make
a full scan of the data. It is possible to make it more efficient by
following the strategy that you guys implemented when using the random hash
function: prefixing each index key with the hash value. On the other hand,
I am wondering if we really need to worry about the performance on
adding/removing nodes. These are infrequent events and are also
non-blocking.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com



                                                                           
             Prashant Malik                                                
             <pmalik@gmail.com                                             
             >                                                          To 
                                       cassandra-dev@incubator.apache.org  
             03/24/2009 11:34                                           cc 
             AM                                                            
                                                                   Subject 
                                       Re: secondary index support in      
             Please respond to         Cassandra                           
             cassandra-dev@inc                                             
             ubator.apache.org                                             
                                                                           
                                                                           
                                                                           
                                                                           




Some questions Iline

On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <ju...@almaden.ibm.com> wrote:

>
>
> We have an application that has groups and entities. A group has many
> entities and an entity has a bunch of (attribute, value) pairs. A common
> access pattern is to select some number of entities within a group with
> attribute X equals to x and ordered by attribute Y. For efficiency, we
want
> to build a secondary index for each group and collocate a group and its
> secondary index on the same node. Our current approach is to map a group
to
> a row in Cassandra and each entity to a column in a column family (CF).
> Within the same row, we use a separate CF (ordered by name) to implement
a
> secondary index, say on attribute X and Y. In this family, each column
name
> has the form of X:x:Y:y:entityID. We extended the get_slice() function so
> that it can get a slice of columns starting from a given column. The
> extended function uses the column-level index to locate the starting
column
> quickly. (We'd be happy to contribute this extension back to Cassandra if
> people find this useful). Using the extended get_slice(), we were able to
> access the entities through the simulated secondary index.
>
> We see a couple of problems with the current approach. First, our
> application has to maintain the index. This is inefficient and could
leave
> the index inconsistent when failure occurs.  Second, mapping each entity
to
> a column may not be a good idea. Often, there is some sort of locking for
> each row access. Putting many entities per row limits concurrency. Today,
> in Cassandra, a full row is deserialized into memory during compaction.
> This limits the number of entities that can be put in a single row. Also,
> intuitively, an entity is more naturally represented as a row with
> attributes stored as columns.


Prashant

   1. Application can send the index  and the entityId update in the same
write , a write per row is always atomic given that teh index and the data
have teh same key in the above
       case the index will not be out of sync.
    2. The maintainance of index by app can be moved into cassandra and I
agree with fact that you can add support for it by a built in special CF
which you have to do in either of the approaches you are taking.
        Infact in the first approach that you are taking it will be easier
to move the indexes in case of adding nodes to the cluster and when files
are split and data is sent over. In the second approach this
        process could get comlicated.
    3. There is no locking for row access in cassandra.
    4. Compactions can be opmtimized for name sorted columns , this is one
of the workitems we have where we do not deserialize the entire row in
compactiopn but only do it in slices , this is easily achievable for
        name sorted columns.
    5. The entity can still be represented naturally as a supercolumn where
each of the super column name is the entity Id and each of the columns in
the supercolumn are attribute value pairs.
    6.  How many entities per  groupid are we talking about here ? why do
you feel concurrency is limited by entities per row ?


>
> To address the above problems, we are thinking of the following new
> implementation. Each entity is mapped to a row in Cassandra and uses a
> two-part key (groupID, entityID). We use the groupID to hash an entity to
a
> node. This way, all entities for a group will be collocated in the same
> node. We then define a special CF to serve as the secondary index. In the
> definition, we specify what entity attributes need to be indexed  and in
> what order. Within a node, this special CF will index all rows stored
> locally. Every time we insert a new entity, the server automatically
> extracts the index key based on the index definition (for example, the
> index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey)
> and add the index entry to the special CF. We can then access the
entities
> using an extended version of the query language in Cassandra. For
example,
> if we issue the following query and there is an index defined by
> (attributeX, attributeY), the query can be evaluated using the index in
the
> special CF. (Note that AppEngine supports this flavor of queries.)
>
> select attributeZ
> from ROWS(HASH = hash(groupID))
> where attributeX="x"
> order by attributeY desc
> limit 50
>
> We are in the middle of prototyping this approach. We'd like to hear if
> other people are interested in this too or if people think there are
better
> alternatives.
>


Prashant

   1. In this approach I see a number of problems during addition , removal
and moving of  new nodes. These problems are not impossible to solve but
just harder and inefficient with the above approach.
   2. What are the wins of this approach over the other is not clear to me.
could you please highlight those.


>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com

Re: secondary index support in Cassandra

Posted by Jun Rao <ju...@almaden.ibm.com>.
Prashant,

Thanks for the comments. They are quite useful. Let me try to address some
of the points that you made.

1. It is true that in our current implementation, we can glue the changes
on both the data and the index in one batch_update() call. This way, the
data and the index will be maintained synchronously. However, maintaining
the index on the server is likely more efficient since there is less
communication overhead. You seem to agree with this.

2. Cassandra currently doesn't acquire row-lock for row accesses. However,
the implication is that a reader may see partial updates of a row. For
example, suppose that a writer updates two columns in different CFs. Then,
it is possible for a concurrent reader to see the update on one column, but
not the other one. For some applications, row-level consistency could be
important. It's probably for this reason, in HBase, a region server
acquires a row lock for every read and write.

3. For our current application, the size of all entities in a group is not
too large and likely fits within the capacity of a single node. However,
for other applications, being able to scale a group to more than a node
could be useful. Storing a group within a single row will prevent scaling
out the group.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com


Prashant Malik <pm...@gmail.com> wrote on 03/24/2009 11:34:51 AM:

> Some questions Iline
>
> On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <ju...@almaden.ibm.com> wrote:
>
> >
> >
> > We have an application that has groups and entities. A group has many
> > entities and an entity has a bunch of (attribute, value) pairs. A
common
> > access pattern is to select some number of entities within a group with
> > attribute X equals to x and ordered by attribute Y. For efficiency, we
want
> > to build a secondary index for each group and collocate a group and its
> > secondary index on the same node. Our current approach is to map a
group to
> > a row in Cassandra and each entity to a column in a column family (CF).
> > Within the same row, we use a separate CF (ordered by name) to
implement a
> > secondary index, say on attribute X and Y. In this family, each column
name
> > has the form of X:x:Y:y:entityID. We extended the get_slice() function
so
> > that it can get a slice of columns starting from a given column. The
> > extended function uses the column-level index to locate the starting
column
> > quickly. (We'd be happy to contribute this extension back to Cassandra
if
> > people find this useful). Using the extended get_slice(), we were able
to
> > access the entities through the simulated secondary index.
> >
> > We see a couple of problems with the current approach. First, our
> > application has to maintain the index. This is inefficient and could
leave
> > the index inconsistent when failure occurs.  Second, mapping each
entity to
> > a column may not be a good idea. Often, there is some sort of locking
for
> > each row access. Putting many entities per row limits concurrency.
Today,
> > in Cassandra, a full row is deserialized into memory during compaction.
> > This limits the number of entities that can be put in a single row.
Also,
> > intuitively, an entity is more naturally represented as a row with
> > attributes stored as columns.
>
>
> Prashant
>
>    1. Application can send the index  and the entityId update in the same
> write , a write per row is always atomic given that teh index and the
data
> have teh same key in the above
>        case the index will not be out of sync.
>     2. The maintainance of index by app can be moved into cassandra and I
> agree with fact that you can add support for it by a built in special CF
> which you have to do in either of the approaches you are taking.
>         Infact in the first approach that you are taking it will be
easier
> to move the indexes in case of adding nodes to the cluster and when files
> are split and data is sent over. In the second approach this
>         process could get comlicated.
>     3. There is no locking for row access in cassandra.
>     4. Compactions can be opmtimized for name sorted columns , this is
one
> of the workitems we have where we do not deserialize the entire row in
> compactiopn but only do it in slices , this is easily achievable for
>         name sorted columns.
>     5. The entity can still be represented naturally as a supercolumn
where
> each of the super column name is the entity Id and each of the columns in
> the supercolumn are attribute value pairs.
>     6.  How many entities per  groupid are we talking about here ? why do
> you feel concurrency is limited by entities per row ?
>
>
> >
> > To address the above problems, we are thinking of the following new
> > implementation. Each entity is mapped to a row in Cassandra and uses a
> > two-part key (groupID, entityID). We use the groupID to hash an entity
to a
> > node. This way, all entities for a group will be collocated in the same
> > node. We then define a special CF to serve as the secondary index. In
the
> > definition, we specify what entity attributes need to be indexed  and
in
> > what order. Within a node, this special CF will index all rows stored
> > locally. Every time we insert a new entity, the server automatically
> > extracts the index key based on the index definition (for example, the
> > index key can be of the form "hash
(rowkey):attribute1:attribute2:rowkey)
> > and add the index entry to the special CF. We can then access the
entities
> > using an extended version of the query language in Cassandra. For
example,
> > if we issue the following query and there is an index defined by
> > (attributeX, attributeY), the query can be evaluated using the index in
the
> > special CF. (Note that AppEngine supports this flavor of queries.)
> >
> > select attributeZ
> > from ROWS(HASH = hash(groupID))
> > where attributeX="x"
> > order by attributeY desc
> > limit 50
> >
> > We are in the middle of prototyping this approach. We'd like to hear if
> > other people are interested in this too or if people think there are
better
> > alternatives.
> >
>
>
> Prashant
>
>    1. In this approach I see a number of problems during addition ,
removal
> and moving of  new nodes. These problems are not impossible to solve but
> just harder and inefficient with the above approach.
>    2. What are the wins of this approach over the other is not clear to
me.
> could you please highlight those.
>
>
> >
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com

Re: secondary index support in Cassandra

Posted by Prashant Malik <pm...@gmail.com>.
Some questions Iline

On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao <ju...@almaden.ibm.com> wrote:

>
>
> We have an application that has groups and entities. A group has many
> entities and an entity has a bunch of (attribute, value) pairs. A common
> access pattern is to select some number of entities within a group with
> attribute X equals to x and ordered by attribute Y. For efficiency, we want
> to build a secondary index for each group and collocate a group and its
> secondary index on the same node. Our current approach is to map a group to
> a row in Cassandra and each entity to a column in a column family (CF).
> Within the same row, we use a separate CF (ordered by name) to implement a
> secondary index, say on attribute X and Y. In this family, each column name
> has the form of X:x:Y:y:entityID. We extended the get_slice() function so
> that it can get a slice of columns starting from a given column. The
> extended function uses the column-level index to locate the starting column
> quickly. (We'd be happy to contribute this extension back to Cassandra if
> people find this useful). Using the extended get_slice(), we were able to
> access the entities through the simulated secondary index.
>
> We see a couple of problems with the current approach. First, our
> application has to maintain the index. This is inefficient and could leave
> the index inconsistent when failure occurs.  Second, mapping each entity to
> a column may not be a good idea. Often, there is some sort of locking for
> each row access. Putting many entities per row limits concurrency. Today,
> in Cassandra, a full row is deserialized into memory during compaction.
> This limits the number of entities that can be put in a single row. Also,
> intuitively, an entity is more naturally represented as a row with
> attributes stored as columns.


Prashant

   1. Application can send the index  and the entityId update in the same
write , a write per row is always atomic given that teh index and the data
have teh same key in the above
       case the index will not be out of sync.
    2. The maintainance of index by app can be moved into cassandra and I
agree with fact that you can add support for it by a built in special CF
which you have to do in either of the approaches you are taking.
        Infact in the first approach that you are taking it will be easier
to move the indexes in case of adding nodes to the cluster and when files
are split and data is sent over. In the second approach this
        process could get comlicated.
    3. There is no locking for row access in cassandra.
    4. Compactions can be opmtimized for name sorted columns , this is one
of the workitems we have where we do not deserialize the entire row in
compactiopn but only do it in slices , this is easily achievable for
        name sorted columns.
    5. The entity can still be represented naturally as a supercolumn where
each of the super column name is the entity Id and each of the columns in
the supercolumn are attribute value pairs.
    6.  How many entities per  groupid are we talking about here ? why do
you feel concurrency is limited by entities per row ?


>
> To address the above problems, we are thinking of the following new
> implementation. Each entity is mapped to a row in Cassandra and uses a
> two-part key (groupID, entityID). We use the groupID to hash an entity to a
> node. This way, all entities for a group will be collocated in the same
> node. We then define a special CF to serve as the secondary index. In the
> definition, we specify what entity attributes need to be indexed  and in
> what order. Within a node, this special CF will index all rows stored
> locally. Every time we insert a new entity, the server automatically
> extracts the index key based on the index definition (for example, the
> index key can be of the form "hash(rowkey):attribute1:attribute2:rowkey)
> and add the index entry to the special CF. We can then access the entities
> using an extended version of the query language in Cassandra. For example,
> if we issue the following query and there is an index defined by
> (attributeX, attributeY), the query can be evaluated using the index in the
> special CF. (Note that AppEngine supports this flavor of queries.)
>
> select attributeZ
> from ROWS(HASH = hash(groupID))
> where attributeX="x"
> order by attributeY desc
> limit 50
>
> We are in the middle of prototyping this approach. We'd like to hear if
> other people are interested in this too or if people think there are better
> alternatives.
>


Prashant

   1. In this approach I see a number of problems during addition , removal
and moving of  new nodes. These problems are not impossible to solve but
just harder and inefficient with the above approach.
   2. What are the wins of this approach over the other is not clear to me.
could you please highlight those.


>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com