Posted to user@cassandra.apache.org by Henrik Schröder <sk...@gmail.com> on 2010/03/25 14:33:48 UTC

Range scan performance in 0.6.0 beta2

Hi everyone,

We're trying to implement a virtual datastore for our users where they can
set up "tables" and "indexes" to store objects and have them indexed on
arbitrary properties. We did a test implementation on Cassandra in the
following way:

Objects are stored in one columnfamily, each key is made up of tableid +
"object key", and each row has one column where the value is the serialized
object. This part is super-simple, we're just using Cassandra as a
key-value store, and it performs really well.

The indexes are a bit trickier, but basically for each index and each object
that is stored, we compute a fixed-length bytearray based on the object that
makes up the indexvalue. We then store these bytearray indexvalues in another
columnfamily, with the indexid as the row key, the indexvalue as the column
name, and the object key as the column value.

The idea is then that to perform a range query on an "index" in this virtual
datastore, we do a get_slice to get the range of indexvalues and their
corresponding object keys, and we can then multi_get the actual objects from
the other column family.
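
To make the read path concrete, here is a rough sketch against the 0.6
Thrift Java interface (our actual client is the Thrift-generated C# one,
and the keyspace/column family names and key values below are placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    public class IndexedRangeRead {
        public static void main(String[] args) throws Exception {
            TSocket socket = new TSocket("localhost", 9160);
            Cassandra.Client client =
                new Cassandra.Client(new TBinaryProtocol(socket));
            socket.open();

            String keyspace = "Keyspace1";    // placeholder
            String indexRowKey = "index-17";  // row key = indexid
            byte[] from = new byte[8];        // start indexvalue (placeholder)
            byte[] to = new byte[8];          // end indexvalue (placeholder)
            java.util.Arrays.fill(to, (byte) 0x7f);

            // Step 1: slice the index row. Column names are indexvalues,
            // column values are the object keys.
            SlicePredicate indexPredicate = new SlicePredicate();
            indexPredicate.setSlice_range(new SliceRange(from, to, false, 1000));
            List<ColumnOrSuperColumn> indexColumns = client.get_slice(
                keyspace, indexRowKey, new ColumnParent("Indexes"),
                indexPredicate, ConsistencyLevel.ONE);

            // Step 2: multiget the objects by the keys found in step 1.
            // (In the real schema the Objects key is tableid + object key.)
            List<String> objectKeys = new ArrayList<String>();
            for (ColumnOrSuperColumn cosc : indexColumns)
                objectKeys.add(new String(cosc.column.value, "UTF-8"));

            SlicePredicate onlyColumn = new SlicePredicate();
            onlyColumn.setSlice_range(
                new SliceRange(new byte[0], new byte[0], false, 1));
            Map<String, List<ColumnOrSuperColumn>> objects = client.multiget_slice(
                keyspace, objectKeys, new ColumnParent("Objects"),
                onlyColumn, ConsistencyLevel.ONE);

            socket.close();
        }
    }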

Since these virtual tables and indexes will be created by our users, the
whole system has to be very dynamic, and we can't make any assumptions about
the actual objects they will store or the distribution of these. We do know
that it must be able to scale well, and this is what attracted us to
Cassandra in the first place. We do, however, have some performance targets
we want to hit: we have one use-case where there will be about 30 million
records in a "table", and knowing that it can go up to 100 million records
would be nice. As for speed, we would like to get thousands of writes and
range reads per second.

Given these requirements and our design, we will then have rows in Cassandra
with millions of columns, from which we want to fetch large column slices.
We set it all up on a single developer machine (MacPro, QuadCore 2.66GHz)
running Windows, and we used the thrift compiler to generate a C# client
library. We tested just the "index" part of our design, and these are the
numbers we got:
inserts (15 threads, batches of 10): 4000/second
get_slices (10 threads, random range sizes, count 1000): 50/second at start,
dies at about 6 million columns inserted (OutOfMemoryException)
get_slices (10 threads, random range sizes, count 10): 200/second at start,
slows down the more columns there are.


When we saw that the above results were bad, we tried a different approach,
storing the indexvalues in the keys instead, using the
OrderPreservingPartitioner and get_range_slice to get ranges of rows, but we
got even worse results:
inserts (15 threads, in batches of 10): 4000/second
get_range_slice (10 threads, random key ranges, count 1000): 20/second at
start, 5/second with 30 million rows


Finally, we did a similar test using MySQL instead, and we got these
numbers:
inserts (15 threads, in batches of 10): 4000/second
select (10 threads, limit 1000): 500/second

So for us, the MySQL version delivers the speed that we want, but none of
the scaling that Cassandra gives us. We set up our columnfamilies like this:

<ColumnFamily CompareWith="BytesType" Name="Objects" RowsCached="0"
KeysCached="0"/>
<ColumnFamily CompareWith="BytesType" Name="Indexes" RowsCached="0"
KeysCached="0"/>

And we now have these questions:
a) Is there a better way of structuring our data and building the virtual
indexes?
b) Are our Cassandra numbers too low? Or is this the expected performance?
c) Did we miss changing some important setting (in the conf XML or Java
config) given that our rows are this large?
d) Can we avoid hitting the OutOfMemory exception?


/Henrik

Re: Range scan performance in 0.6.0 beta2

Posted by Nathan McCall <na...@vervewireless.com>.
I noticed you turned key caching off in your ColumnFamily declaration.
Have you tried turning it on and playing with the key caching
configuration? Also, have you looked at the JMX output for what
commands are pending execution? That is always helpful to me in
hunting down bottlenecks.
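
For example, something like this in storage-conf.xml (cache sizes here are
made up, and if I remember right KeysCached takes either an absolute key
count or a percentage; tune to taste):

    <ColumnFamily CompareWith="BytesType" Name="Indexes"
                  RowsCached="0" KeysCached="50%"/>

For the pending commands, the org.apache.cassandra.concurrent MBeans in
jconsole expose a PendingTasks attribute per stage, or from the command
line (if your build has it):

    nodetool -host localhost tpstats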

-Nate

On Thu, Mar 25, 2010 at 9:31 AM, Henrik Schröder <sk...@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne <sy...@yakaz.com> wrote:
>>
>> I don't know If that could play any role, but if ever you have
>> disabled the assertions
>> when running cassandra (that is, you removed the -ea line in
>> cassandra.in.sh), there
>> was a bug in 0.6beta2 that will make read in row with lots of columns
>> quite slow.
>
> We tried it with beta3 and got the same results, so that didn't do anything.
>
>>
>> Another problem you may have is if you have the commitLog directory on the
>> same
>> hard drive than the data directory. If that's the case and you read
>> and write at the
>> same time, that may be a reason for poor read performances (and write
>> too).
>
> We also tested doing only reads, and got about the same read speeds
>
>>
>> As for the row with 30 millions columns, you have to be aware that right
>> now,
>> cassandra will deserialize whole rows during compaction
>> (http://wiki.apache.org/cassandra/CassandraLimitations).
>> So depending on the size of what you store in you column, you could
>> very well hit
>> that limitation (that could be why you OOM). In which case, I see two
>> choices:
>> 1) add more RAM to the machine or 2) change your data structure to
>> avoid that (maybe
>> can you split rows with too many columns somehow ?).
>
> Splitting the rows would be an option if we got anything near decent speed
> for small rows, but even if we only have a few hundred thousand columns in
> one row, the read speed is still slow.
>
> What kind of numbers are common for this type of operation? Say that you
> have a row with 500000 columns whose names range from 0x0 to 0x7A120, and
> you do get_slice operations on that with ranges of random numbers in the
> interval but with a fixed count of 1000, and that you multithread it with
> ~10 of threads, can't you get more than 50 reads/s?
>
> When we've been reading up on Cassandra we've seen posts that billions of
> columns in a row shouldn't be a problem, and sure enough, writing all that
> data goes pretty fast, but as soon as you want to retrieve it, it is really
> slow. We also tried doing counts on the number of columns in a row, and that
> was really, really slow, it took half a minute to count the columns in a row
> with 500000 columns, and when doing the same on a row with millions, it just
> crashed with an OOM exception after a few minutes.
>
>
> /Henrik
>

Re: Range scan performance in 0.6.0 beta2

Posted by Sylvain Lebresne <sy...@yakaz.com>.
On Thu, Mar 25, 2010 at 5:31 PM, Henrik Schröder <sk...@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne <sy...@yakaz.com> wrote:
>>
>> I don't know If that could play any role, but if ever you have
>> disabled the assertions
>> when running cassandra (that is, you removed the -ea line in
>> cassandra.in.sh), there
>> was a bug in 0.6beta2 that will make read in row with lots of columns
>> quite slow.
>
> We tried it with beta3 and got the same results, so that didn't do anything.

I'm not sure the patch made it into beta3. If you haven't removed the
assertions, then it's not your problem. If you have, I can only suggest you
try the svn branch for 0.6 (svn checkout
https://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6).
Just saying.


>
>>
>> Another problem you may have is if you have the commitLog directory on the
>> same
>> hard drive than the data directory. If that's the case and you read
>> and write at the
>> same time, that may be a reason for poor read performances (and write
>> too).
>
> We also tested doing only reads, and got about the same read speeds
>
>>
>> As for the row with 30 millions columns, you have to be aware that right
>> now,
>> cassandra will deserialize whole rows during compaction
>> (http://wiki.apache.org/cassandra/CassandraLimitations).
>> So depending on the size of what you store in you column, you could
>> very well hit
>> that limitation (that could be why you OOM). In which case, I see two
>> choices:
>> 1) add more RAM to the machine or 2) change your data structure to
>> avoid that (maybe
>> can you split rows with too many columns somehow ?).
>
> Splitting the rows would be an option if we got anything near decent speed
> for small rows, but even if we only have a few hundred thousand columns in
> one row, the read speed is still slow.
>
> What kind of numbers are common for this type of operation? Say that you
> have a row with 500000 columns whose names range from 0x0 to 0x7A120, and
> you do get_slice operations on that with ranges of random numbers in the
> interval but with a fixed count of 1000, and that you multithread it with
> ~10 of threads, can't you get more than 50 reads/s?
>
> When we've been reading up on Cassandra we've seen posts that billions of
> columns in a row shouldn't be a problem, and sure enough, writing all that
> data goes pretty fast, but as soon as you want to retrieve it, it is really
> slow. We also tried doing counts on the number of columns in a row, and that
> was really, really slow, it took half a minute to count the columns in a row
> with 500000 columns, and when doing the same on a row with millions, it just
> crashed with an OOM exception after a few minutes.
>
>
> /Henrik
>

Re: Range scan performance in 0.6.0 beta2

Posted by Henrik Schröder <sk...@gmail.com>.
On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne <sy...@yakaz.com> wrote:

> I don't know If that could play any role, but if ever you have
> disabled the assertions
> when running cassandra (that is, you removed the -ea line in
> cassandra.in.sh), there
> was a bug in 0.6beta2 that will make read in row with lots of columns
> quite slow.
>

We tried it with beta3 and got the same results, so that didn't do anything.


> Another problem you may have is if you have the commitLog directory on the
> same
> hard drive than the data directory. If that's the case and you read
> and write at the
> same time, that may be a reason for poor read performances (and write too).
>

We also tested doing only reads, and got about the same read speeds.


> As for the row with 30 millions columns, you have to be aware that right
> now,
> cassandra will deserialize whole rows during compaction
> (http://wiki.apache.org/cassandra/CassandraLimitations).
> So depending on the size of what you store in you column, you could
> very well hit
> that limitation (that could be why you OOM). In which case, I see two
> choices:
> 1) add more RAM to the machine or 2) change your data structure to
> avoid that (maybe
> can you split rows with too many columns somehow ?).
>

Splitting the rows would be an option if we got anything near decent speed
for small rows, but even if we only have a few hundred thousand columns in
one row, the read speed is still slow.

What kind of numbers are common for this type of operation? Say that you
have a row with 500000 columns whose names range from 0x0 to 0x7A120, and
you do get_slice operations on that with ranges of random numbers in the
interval but with a fixed count of 1000, and that you multithread it with
~10 threads, can't you get more than 50 reads/s?

When we've been reading up on Cassandra we've seen posts saying that
billions of columns in a row shouldn't be a problem, and sure enough,
writing all that data goes pretty fast, but as soon as you want to retrieve
it, it is really slow. We also tried doing counts on the number of columns
in a row, and that was really, really slow: it took half a minute to count
the columns in a row with 500000 columns, and doing the same on a row with
millions just crashed with an OOM exception after a few minutes.
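
For reference, the count we do is just the plain Thrift call, roughly this
against the 0.6 Java interface (signature from memory, keyspace/key names
made up; our real client is the generated C# one). As far as we understand
it, the server still has to walk the columns to produce the number, which
would explain why it costs about as much as reading the whole row:

    import org.apache.cassandra.thrift.*;

    public class CountColumns {
        // Counts the columns of one row; client is a connected Thrift client.
        static int countRow(Cassandra.Client client) throws Exception {
            return client.get_count("Keyspace1", "index-17",
                                    new ColumnParent("Indexes"),
                                    ConsistencyLevel.ONE);
        }
    }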


/Henrik

Re: Range scan performance in 0.6.0 beta2

Posted by Sylvain Lebresne <sy...@yakaz.com>.
I don't know if that plays any role, but if you have disabled assertions
when running Cassandra (that is, you removed the -ea line in
cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows with
lots of columns quite slow.

Another problem you may have is if the commitLog directory is on the same
hard drive as the data directory. If that's the case and you read and write
at the same time, that may be a reason for poor read performance (and write
performance too).
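
Separating them is just a matter of pointing the two storage-conf.xml
elements at different physical drives, for example (paths made up):

    <CommitLogDirectory>/disk2/cassandra/commitlog</CommitLogDirectory>
    <DataFileDirectories>
        <DataFileDirectory>/disk1/cassandra/data</DataFileDirectory>
    </DataFileDirectories>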

As for the row with 30 million columns, you have to be aware that right now
Cassandra will deserialize whole rows during compaction
(http://wiki.apache.org/cassandra/CassandraLimitations).
So depending on the size of what you store in your columns, you could very
well hit that limitation (that could be why you OOM). In which case, I see
two choices: 1) add more RAM to the machine, or 2) change your data
structure to avoid it (maybe you can split rows that have too many columns
somehow?).

--
Sylvain

On Thu, Mar 25, 2010 at 2:33 PM, Henrik Schröder <sk...@gmail.com> wrote:
> Hi everyone,
>
> We're trying to implement a virtual datastore for our users where they can
> set up "tables" and "indexes" to store objects and have them indexed on
> arbitrary properties. And we did a test implementation for Cassandra in the
> following way:
>
> Objects are stored in one columnfamily, each key is made up of tableid +
> "object key", and each row has one column where the value is the serialized
> object. This part is super-simple, we're just using Cassandra as a
> key-value-store, and this part performs really well.
>
> The indexes are a bit tricker, but basically for each index and each object
> that is stored, we compute a fixed-length bytearray based on the object that
> make up the indexvalue. We then store these bytearray indexvalues in another
> columnfamily, with the indexid as row key, the indexvalue as the column
> name, and the object key as the column value.
>
> The idea is then that to perform a range query on an "index" in this virtual
> datastore, we do a get_slice to get the range of indexvalues and their
> corresponding object keys, and we can then multi_get the actual objects from
> the other column family.
>
> Since these virtual tables and indexes will be created by our users the
> whole system has to be very dynamic, and we can't make any assumptions about
> the actual objects they will store and the distribution of these. We do know
> that it must be able to scale well, and this is what attracted us to
> Cassandra in the first place. We do however have some performance targets we
> want to hit, we have one use-case where there will be about 30 million
> records in a "table", and knowing that it can go up to 100 million records
> would be nice. As for speed, would like to get thousands of writes and range
> reads per second.
>
> Given these requirements and our design, we will then have rows in Cassandra
> with millions of columns, from which we want to fetch large column slices.
> We set it all up on a single developer machine (MacPro, QuadCore 2.66ghz)
> running Windows, and we used the thrift compiler to generate a C# client
> library. We tested just the "index" part of our design, and these are the
> numbers we got:
> inserts (15 threads, batches of 10): 4000/second
> get_slices (10 threads, random range sizes, count 1000): 50/second at start,
> dies at about 6 million columns inserted. (OutOfMemoryException)
> get_slices (10 threads, random range sizes, count 10): 200/s at start, slows
> down the more columns there are.
>
>
> When we saw that the above results were bad, we tried a different approach
> storing the indexvalues in the key instead, using the
> OrderPreservingPartitioner and using get_range_slice to get ranges of rows,
> but we got even worse results:
> inserts (15 threads, in batches of 10): 4000/second
> get_range_slice (10 threads, random key ranges, count 1000): 20/second at
> start, 5/second with 30 million rows
>
>
> Finally, we did a similar test using MySQL instead and then we got these
> numbers:
> inserts (15 threads, in batches of 10): 4000/second
> select (10 threads, limit 1000): 500/second
>
> So for us, the MySQL version delivers the speed that we want, but none of
> the scaling that Cassandra gives us. We set up our columnfamilies like this:
>
> <ColumnFamily CompareWith="BytesType" Name="Objects" RowsCached="0"
> KeysCached="0"/>
> <ColumnFamily CompareWith="BytesType" Name="Indexes" RowsCached="0"
> KeysCached="0"/>
>
> And we now have these questions:
> a) Is there a better way of structuring our data and building the virtual
> indexes?
> b) Are our Cassandra numbers too low? Or is this the expected performance?
> c) Did we miss to change some important setting (in the conf xml or java
> config) since our rows are this large?
> d) Can we avoid hitting the Out of memory exception?
>
>
> /Henrik
>

Re: Range scan performance in 0.6.0 beta2

Posted by Jonathan Ellis <jb...@gmail.com>.
I see what you mean -- you have understood correctly.

On Mon, Mar 29, 2010 at 8:13 AM, Henrik Schröder <sk...@gmail.com> wrote:
> On Mon, Mar 29, 2010 at 14:15, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> On Mon, Mar 29, 2010 at 4:06 AM, Henrik Schröder <sk...@gmail.com>
>> wrote:
>> > On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis <jb...@gmail.com> wrote:
>> >> It's a unique index then?  And you're trying to read things ordered by
>> >> the index, not just "give me keys with that have a column with this
>> >> value?"
>> >
>> > Yes, because if we have more than one column per row, there's no way of
>> > (easily) limiting the result.
>>
>> That's exactly what the count parameter of SliceRange is for... ?
>
> I thought that only limited the number of columns per key?
>
> We're using the get_range_slices method, which takes both a SlicePredicate
> (which contains a range, which contains a count) and a KeyRange (which also
> contains a count). Say that we have a bunch of keys that each contain 10
> columns, and we do a get_range_slices over those, how do we get the first 25
> columns? If we put it in the SliceRange count, we'll get all matching rows,
> and the 25 first columns of each, right? And if we put it in the KeyRange
> count, we'll get the 25 first rows that match, and all their columns, right?
>
> But if we have only one column per row, then we can limit the results the
> way we want to. Or have we misunderstood the api somehow?
>
>
> /Henrik
>

Re: Range scan performance in 0.6.0 beta2

Posted by Mike Malone <mi...@simplegeo.com>.
On Mon, Mar 29, 2010 at 7:13 AM, Henrik Schröder <sk...@gmail.com> wrote:

> On Mon, Mar 29, 2010 at 14:15, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> On Mon, Mar 29, 2010 at 4:06 AM, Henrik Schröder <sk...@gmail.com>
>> wrote:
>> > On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis <jb...@gmail.com>
>> wrote:
>> >> It's a unique index then?  And you're trying to read things ordered by
>> >> the index, not just "give me keys with that have a column with this
>> >> value?"
>> >
>> > Yes, because if we have more than one column per row, there's no way of
>> > (easily) limiting the result.
>>
>> That's exactly what the count parameter of SliceRange is for... ?
>>
>
> I thought that only limited the number of columns per key?
>
> We're using the get_range_slices method, which takes both a SlicePredicate
> (which contains a range, which contains a count) and a KeyRange (which also
> contains a count). Say that we have a bunch of keys that each contain 10
> columns, and we do a get_range_slices over those, how do we get the first 25
> columns? If we put it in the SliceRange count, we'll get all matching rows,
> and the 25 first columns of each, right? And if we put it in the KeyRange
> count, we'll get the 25 first rows that match, and all their columns, right?
>
> But if we have only one column per row, then we can limit the results the
> way we want to. Or have we misunderstood the api somehow?
>

We've run into the same issue and have a patch that limits the _total_
number of columns returned instead of limiting the number of rows / number
of columns per row. This makes it convenient to do a two-dimensional index:
the first key is the row key, the second is the column name, and the column
value is the thing you're indexing. Then you do a get_range_slice on the
two keys, limiting on total columns returned.

We haven't run any real performance metrics yet. I don't think this query is
particularly performant, but it's certainly faster than doing the same
operation on the client side.

Another thing to keep in mind is that rows must fit in memory because
they're serialized / deserialized into memory from time to time. I believe
this happens during SSTable serialization. Feel free to verify/correct me on
this.

If people are interested I can probably get that patch pushed back upstream
soon. We're in crunch mode right now for launch though so, unfortunately,
it'll probably be a week or so before we can finish it up and properly vet
it.

Mike

Re: Range scan performance in 0.6.0 beta2

Posted by Henrik Schröder <sk...@gmail.com>.
On Mon, Mar 29, 2010 at 14:15, Jonathan Ellis <jb...@gmail.com> wrote:

> On Mon, Mar 29, 2010 at 4:06 AM, Henrik Schröder <sk...@gmail.com>
> wrote:
> > On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis <jb...@gmail.com> wrote:
> >> It's a unique index then?  And you're trying to read things ordered by
> >> the index, not just "give me keys with that have a column with this
> >> value?"
> >
> > Yes, because if we have more than one column per row, there's no way of
> > (easily) limiting the result.
>
> That's exactly what the count parameter of SliceRange is for... ?
>

I thought that only limited the number of columns per key?

We're using the get_range_slices method, which takes both a SlicePredicate
(which contains a range, which contains a count) and a KeyRange (which also
contains a count). Say that we have a bunch of keys that each contain 10
columns, and we do a get_range_slices over those, how do we get the first 25
columns? If we put it in the SliceRange count, we'll get all matching rows,
and the first 25 columns of each, right? And if we put it in the KeyRange
count, we'll get the first 25 rows that match, and all their columns, right?

But if we have only one column per row, then we can limit the results the
way we want to. Or have we misunderstood the API somehow?
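
In code terms, this is roughly what we mean (a sketch against the 0.6
Thrift Java interface; keyspace, column family, and key values are made up):

    import java.util.List;
    import org.apache.cassandra.thrift.*;

    public class TwoCounts {
        // SliceRange.count caps the columns returned *per row*, while
        // KeyRange.count caps the number of *rows*. Neither caps the total
        // number of columns across all rows.
        static List<KeySlice> firstRows(Cassandra.Client client) throws Exception {
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(
                new SliceRange(new byte[0], new byte[0], false, 25)); // <= 25 columns per row

            KeyRange keyRange = new KeyRange();
            keyRange.setStart_key("17:0000");
            keyRange.setEnd_key("17:ffff");
            keyRange.setCount(1000);                                  // <= 1000 rows

            return client.get_range_slices(
                "Keyspace1", new ColumnParent("Indexes"),
                predicate, keyRange, ConsistencyLevel.ONE);
        }
    }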


/Henrik

Re: Range scan performance in 0.6.0 beta2

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Mar 29, 2010 at 4:06 AM, Henrik Schröder <sk...@gmail.com> wrote:
> On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis <jb...@gmail.com> wrote:
>> It's a unique index then?  And you're trying to read things ordered by
>> the index, not just "give me keys with that have a column with this
>> value?"
>
> Yes, because if we have more than one column per row, there's no way of
> (easily) limiting the result.

That's exactly what the count parameter of SliceRange is for... ?

-Jonathan

Re: Range scan performance in 0.6.0 beta2

Posted by Henrik Schröder <sk...@gmail.com>.
On Fri, Mar 26, 2010 at 14:47, Jonathan Ellis <jb...@gmail.com> wrote:

> On Fri, Mar 26, 2010 at 7:40 AM, Henrik Schröder <sk...@gmail.com>
> wrote:
> > For each indexvalue we insert a row where the key is indexid + ":" +
> > indexvalue encoded as hex string, and the row contains only one column,
> > where the name is the object key encoded as a bytearray, and the value is
> > empty.
>
> It's a unique index then?  And you're trying to read things ordered by
> the index, not just "give me keys with that have a column with this
> value?"
>

Yes, because if we have more than one column per row, there's no way of
(easily) limiting the result. As it is now, we rarely want all the object
keys associated with a range of indexvalues. However, this means we will
have a lot of rows if we do it in Cassandra.


/Henrik

Re: Range scan performance in 0.6.0 beta2

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Mar 26, 2010 at 7:40 AM, Henrik Schröder <sk...@gmail.com> wrote:
> For each indexvalue we insert a row where the key is indexid + ":" +
> indexvalue encoded as hex string, and the row contains only one column,
> where the name is the object key encoded as a bytearray, and the value is
> empty.

It's a unique index then?  And you're trying to read things ordered by
the index, not just "give me keys that have a column with this
value"?

> These numbers are slightly better than our previous OPP tries, but nothing
> significant. For what it's worth, if we're only doing writes, the machine
> bottlenecks on disk I/O as expected, but whenever we do reads, it
> bottlenecks on CPU usage instead. Is this expected?

Yes.

> Also, how would dynamic column families help us?

You don't have to mess with key prefixes, since each CF contains only
one type of index.

-Jonathan

Re: Range scan performance in 0.6.0 beta2

Posted by Henrik Schröder <sk...@gmail.com>.
>
> So all the values for an entire index will be in one row?  That
> doesn't sound good.
>
> You really want to put each index [and each table] in its own CF, but
> until we can do that dynamically (0.7) you could at least make the
> index row keys a tuple of (indexid, indexvalue) and the column names
> in each row the object keys (empty column values).
>
> This works pretty well for a lot of users, including Digg.
>

We tested your suggestions like this:
We're using the OrderPreservingPartitioner.
We set the keycache and rowcache to 40%.
We're using the same machine as before, but we switched to a 64-bit JVM and
gave it 5GB of memory.
For each indexvalue we insert a row where the key is indexid + ":" +
indexvalue encoded as hex string, and the row contains only one column,
where the name is the object key encoded as a bytearray, and the value is
empty.
When reading, we do a get_range_slice with an empty slice_range (start and
finish are 0-length byte-arrays), and randomly generated start_key and
finish_key where we know they both have been inserted, and finally a
row_count of 1000.
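
In sketch form, one such insert looks roughly like this (0.6 Thrift Java
interface rather than our C# client; the keyspace name and the timestamp
convention are placeholders):

    import org.apache.cassandra.thrift.*;

    public class IndexEntryWrite {
        // One row per index entry: row key = indexid + ":" + hex(indexvalue),
        // with a single column named by the object key and an empty value.
        static void writeEntry(Cassandra.Client client, int indexId,
                               byte[] indexValue, byte[] objectKey) throws Exception {
            StringBuilder key = new StringBuilder(indexId + ":");
            for (byte b : indexValue)
                key.append(String.format("%02x", b));

            ColumnPath path = new ColumnPath("Indexes");
            path.setColumn(objectKey);                 // column name = object key
            client.insert("Keyspace1", key.toString(), path,
                          new byte[0],                 // empty column value
                          System.currentTimeMillis(), ConsistencyLevel.ONE);
        }
    }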

These are the numbers we got this time:
inserts (15 threads, batches of 10): 4000/second
get_range_slices (10 threads, row_count 1000): 50/second at start, down to
10/second at 250k inserts.

These numbers are slightly better than our previous OPP tries, but nothing
significant. For what it's worth, if we're only doing writes, the machine
bottlenecks on disk I/O as expected, but whenever we do reads, it
bottlenecks on CPU usage instead. Is this expected?


Also, how would dynamic column families help us? In our tests, we only
tested a single "index", so even if we had one column family per "index", we
would still only write to one of them and then get the exact same results as
above, right?

We're really grateful for any help with both how to tune Cassandra and how
to design our data model. The designs we've tested so far are the best we
could come up with ourselves; all we really need is a way to store groups
of indexvalue->objectkey mappings, and to be able to get a range of
objectkeys back given a group and a start and stop indexvalue.


/Henrik

Re: Range scan performance in 0.6.0 beta2

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Mar 25, 2010 at 8:33 AM, Henrik Schröder <sk...@gmail.com> wrote:
> Hi everyone,
>
> We're trying to implement a virtual datastore for our users where they can
> set up "tables" and "indexes" to store objects and have them indexed on
> arbitrary properties. And we did a test implementation for Cassandra in the
> following way:
>
> Objects are stored in one columnfamily, each key is made up of tableid +
> "object key", and each row has one column where the value is the serialized
> object. This part is super-simple, we're just using Cassandra as a
> key-value-store, and this part performs really well.
>
> The indexes are a bit tricker, but basically for each index and each object
> that is stored, we compute a fixed-length bytearray based on the object that
> make up the indexvalue. We then store these bytearray indexvalues in another
> columnfamily, with the indexid as row key, the indexvalue as the column
> name, and the object key as the column value.

So all the values for an entire index will be in one row?  That
doesn't sound good.

You really want to put each index [and each table] in its own CF, but
until we can do that dynamically (0.7) you could at least make the
index row keys a tuple of (indexid, indexvalue) and the column names
in each row the object keys (empty column values).

This works pretty well for a lot of users, including Digg.
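
Concretely, something like this (the key shown as a simple string
concatenation; the exact encoding is up to you):

    row key: "<indexid>:<indexvalue>"    e.g. "17:00000000a1b2c3d4"
      column name  = object key
      column value = (empty)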

> We tested just the "index" part of our design, and these are the
> numbers we got:
> inserts (15 threads, batches of 10): 4000/second
> get_slices (10 threads, random range sizes, count 1000): 50/second at start,
> dies at about 6 million columns inserted. (OutOfMemoryException)
> get_slices (10 threads, random range sizes, count 10): 200/s at start, slows
> down the more columns there are.

Those are really low read numbers, but I'd make the schema change
above before digging deeper there.

Also, if you are OOMing, you're probably getting really crappy
performance for some time before that, as the JVM tries desperately to
collect enough space to keep going.  The easiest solution is to just
let it use more memory, assuming you can do so.
http://wiki.apache.org/cassandra/RunningCassandra
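
In 0.6 the heap cap lives in bin/cassandra.in.sh (or the JAVA_OPTS line in
bin/cassandra.bat on Windows); the stock setting is around -Xmx1G, so on a
box with RAM to spare you can raise it, e.g.:

    # in the JVM_OPTS block of bin/cassandra.in.sh (example values)
    -Xms2G \
    -Xmx4G \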

-Jonathan