Posted to user@cassandra.apache.org by Ron Siemens <rs...@greatergood.com> on 2011/02/25 00:27:33 UTC

Homebrew CF-indexing vs secondary indexing

I am doing some experimenting with indexing.  My data CF has about 25000 rows around 1KB each.  I set up a special column of boolean value to use as the secondary index.  I also created my own index in a separate CF where each index is one row and the column names are the data keys.

The implementation is in Hector 0.7.0-27, and run options are -Xms64m -Xmx256m

Below are two sample runs.  The first uses the secondary index with IndexedSlicesQuery.  The second uses my homebrew CF index: createSliceQuery for the index, followed by createMultigetSliceQuery for the data.  The timing output is from result.getExecutionTimeMicro(), but it looks like ms.  I'm not sure if its purpose is as I'm assuming and using here.  By the way, THS is just the name of the index, which covers a subset of 7293 rows out of the roughly 25000.
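
For readers who want to picture the homebrew layout, here is a minimal in-memory sketch of the two-step lookup described above.  It is not Hector or Cassandra code: the class name, method names, and sample keys are all hypothetical, and plain Java maps stand in for the two column families.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for the two column families; NOT Hector or Cassandra code.
class HomebrewIndexSketch {
    // Data CF: row key -> (column name -> value).
    final Map<String, Map<String, String>> dataCf = new TreeMap<>();
    // Index CF: one row per index; the column names are the data row keys.
    final Map<String, TreeSet<String>> indexCf = new TreeMap<>();

    // Write a data row and register its key in every index it belongs to.
    void put(String rowKey, Map<String, String> columns, String... indexNames) {
        dataCf.put(rowKey, columns);
        for (String name : indexNames) {
            indexCf.computeIfAbsent(name, k -> new TreeSet<>()).add(rowKey);
        }
    }

    // Step 1: slice the single index row to get the matching data keys.
    List<String> readIndexRow(String indexName) {
        return new ArrayList<>(indexCf.getOrDefault(indexName, new TreeSet<>()));
    }

    // Step 2: multiget the data rows for those keys.
    Map<String, Map<String, String>> multiget(List<String> keys) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (String key : keys) {
            Map<String, String> row = dataCf.get(key);
            if (row != null) {
                rows.put(key, row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        HomebrewIndexSketch store = new HomebrewIndexSketch();
        store.put("item-1", Map.of("title", "A"), "THS");
        store.put("item-2", Map.of("title", "B"));
        store.put("item-3", Map.of("title", "C"), "THS", "TRS");
        List<String> keys = store.readIndexRow("THS");
        System.out.println(keys + " -> " + store.multiget(keys).size() + " rows");
    }
}
```

The point of the layout is that step 1 touches exactly one row, however many data keys it lists.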

Anyway, it looks like the custom index does significantly better.  Is this expected?  Why?  I expected them to be about the same, having read the secondary index also uses a column family internally.  But more disconcerting, the secondary index implementation runs out of space, while the custom one runs along with only a few notable slow downs.  Both implementations are using the same column-processing/deserialization code so that doesn't seem to be to blame.  What gives?
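
As a rough sanity check on the heap question, the raw payload can be estimated from the figures in this post alone (7293 rows at about 1KB each, -Xmx256m).  This is only back-of-the-envelope arithmetic; the class name is made up, and it says nothing about per-object overhead in the client.

```java
// Back-of-the-envelope sizing using only the numbers quoted in the post.
class PayloadEstimate {
    // Raw payload in bytes, ignoring all object and serialization overhead.
    static long rawBytes(long rows, long bytesPerRow) {
        return rows * bytesPerRow;
    }

    public static void main(String[] args) {
        long raw = rawBytes(7293, 1024);   // 7293 rows at roughly 1KB each
        long heap = 256L * 1024 * 1024;    // -Xmx256m
        System.out.printf("raw payload: %.1f MiB, heap is about %.0fx the payload%n",
                raw / (1024.0 * 1024.0), (double) heap / raw);
    }
}
```

The raw bytes come to only a few percent of the heap, so the payload size by itself does not account for the OutOfMemoryError.  The stack trace fails while Hector is decoding String columns, so deserialized-object overhead is one plausible contributor, though nothing here confirms it.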

Ron


Sample run: Secondary index.

DEBUG Retrieved THS / 7293 rows, in 2012 ms
DEBUG Retrieved THS / 7293 rows, in 1956 ms
DEBUG Retrieved THS / 7293 rows, in 1843 ms
DEBUG Retrieved THS / 7293 rows, in 2295 ms
DEBUG Retrieved THS / 7293 rows, in 1828 ms
DEBUG Retrieved THS / 7293 rows, in 1740 ms
DEBUG Retrieved THS / 7293 rows, in 1899 ms
DEBUG Retrieved THS / 7293 rows, in 2266 ms
DEBUG Retrieved THS / 7293 rows, in 2310 ms
DEBUG Retrieved THS / 7293 rows, in 2395 ms
DEBUG Retrieved THS / 7293 rows, in 2829 ms
DEBUG Retrieved THS / 7293 rows, in 2725 ms
DEBUG Retrieved THS / 7293 rows, in 3752
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.nio.CharBuffer.wrap(CharBuffer.java:350)
	at java.nio.CharBuffer.wrap(CharBuffer.java:373)
	at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
	at java.lang.StringCoding.decode(StringCoding.java:173)
	at java.lang.String.<init>(String.java:443)
	at me.prettyprint.cassandra.serializers.StringSerializer.fromByteBuffer(StringSerializer.java:40)
	at me.prettyprint.cassandra.serializers.StringSerializer.fromByteBuffer(StringSerializer.java:13)
	at me.prettyprint.cassandra.serializers.AbstractSerializer.fromBytes(AbstractSerializer.java:38)
	at me.prettyprint.cassandra.model.HColumnImpl.<init>(HColumnImpl.java:48)
	at me.prettyprint.cassandra.model.ColumnSliceImpl.<init>(ColumnSliceImpl.java:27)
	at me.prettyprint.cassandra.model.RowImpl.<init>(RowImpl.java:32)
	at me.prettyprint.cassandra.model.RowsImpl.<init>(RowsImpl.java:33)
	at me.prettyprint.cassandra.model.OrderedRowsImpl.<init>(OrderedRowsImpl.java:30)
	at me.prettyprint.cassandra.model.IndexedSlicesQuery$1.doInKeyspace(IndexedSlicesQuery.java:143)
	at me.prettyprint.cassandra.model.IndexedSlicesQuery$1.doInKeyspace(IndexedSlicesQuery.java:131)
	at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
	at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
	at me.prettyprint.cassandra.model.IndexedSlicesQuery.execute(IndexedSlicesQuery.java:130)



Sample run: Homebrew CF-indexing

DEBUG CFIndex THS / 7293 read in 262 ms
DEBUG Retrieved THS / 7293 rows, in 1579 ms
DEBUG CFIndex THS / 7293 read in 44 ms
DEBUG Retrieved THS / 7293 rows, in 1771 ms
DEBUG CFIndex THS / 7293 read in 38 ms
DEBUG Retrieved THS / 7293 rows, in 1275 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1364 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1590 ms
DEBUG CFIndex THS / 7293 read in 22 ms
DEBUG Retrieved THS / 7293 rows, in 1118 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1280 ms
DEBUG CFIndex THS / 7293 read in 21 ms
DEBUG Retrieved THS / 7293 rows, in 1466 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1589 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1772 ms
DEBUG CFIndex THS / 7293 read in 20 ms
DEBUG Retrieved THS / 7293 rows, in 1660 ms
DEBUG CFIndex THS / 7293 read in 20 ms
DEBUG Retrieved THS / 7293 rows, in 1931 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1626 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1750 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1557 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 9409 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1709 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1750 ms
DEBUG CFIndex THS / 7293 read in 45 ms
DEBUG Retrieved THS / 7293 rows, in 1629 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1596 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1879 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1597 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1662 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9362 ms
DEBUG CFIndex THS / 7293 read in 26 ms
DEBUG Retrieved THS / 7293 rows, in 1900 ms
DEBUG CFIndex THS / 7293 read in 22 ms
DEBUG Retrieved THS / 7293 rows, in 1972 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1631 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1579 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1606 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1582 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1784 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9522 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1628 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1551 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1627 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1539 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1563 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1623 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1804 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9010 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1444 ms
DEBUG CFIndex THS / 7293 read in 41 ms
DEBUG Retrieved THS / 7293 rows, in 1528 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1451 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1558 ms
DEBUG CFIndex THS / 7293 read in 16 ms
DEBUG Retrieved THS / 7293 rows, in 1585 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1659 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1708 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9195 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1590 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1572 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1582 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1568 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1689 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1810 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1556 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 8922 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1549 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1782 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1824 ms
DEBUG CFIndex THS / 7293 read in 16 ms
DEBUG Retrieved THS / 7293 rows, in 1579 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1531 ms
DEBUG CFIndex THS / 7293 read in 22 ms
DEBUG Retrieved THS / 7293 rows, in 1576 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1533 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9533 ms
DEBUG CFIndex THS / 7293 read in 48 ms
DEBUG Retrieved THS / 7293 rows, in 1544 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1467 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1557 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1714 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1888 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1588 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1612 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9529 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1653 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1813 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1650 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1572 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1646 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1566 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1727 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9480 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1577 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1529 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1566 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1555 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1570 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1550 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 1455 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 10318 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1566 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1576 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1572 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1654 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1578 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 1571 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1710 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9903 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1571 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1596 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1556 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1607 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1655 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1882 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 1535 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 8502 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1538 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1578 ms
DEBUG CFIndex THS / 7293 read in 24 ms
DEBUG Retrieved THS / 7293 rows, in 1540 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1561 ms
DEBUG CFIndex THS / 7293 read in 56 ms
DEBUG Retrieved THS / 7293 rows, in 1745 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1454 ms

Re: Homebrew CF-indexing vs secondary indexing

Posted by Ed Anuff <ed...@anuff.com>.

At the risk of recapitulating a conversation that seems to happen with some
frequency on this list, the answer is going to boil down to "depends on your
data model", but using rows as indexes is one of the core usage patterns of
Cassandra, whether to store the list of keys to rows in another column
family as column names or to build inverted indexes.  That's why columns are
sorted and can be easily retrieved by sort range, so you can do things like
that.  If you're building test instances, then you're going to find out the
answer of what's best for your particular application pretty quickly.  I
think the best advice I've ever seen on this list about how to do something
with Cassandra has been "do a test with both and see what happens", and of
course, share what you find with the rest of us :)
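
To make the "sorted columns, retrieved by sort range" point concrete, here is a small in-memory sketch of one index row.  It is only a model: a TreeMap stands in for the sorted columns, and the class, the slice signature, and the "term:rowkey" column naming are assumptions for illustration, not Cassandra's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of one index row whose column names are kept sorted.
class SortedRowSketch {
    // column name -> column value (here: "term:rowkey" -> rowkey)
    private final NavigableMap<String, String> columns = new TreeMap<>();

    void insert(String columnName, String value) {
        columns.put(columnName, value);
    }

    // Range slice: values of all columns with start <= name <= finish,
    // in column-name order, capped at count.
    List<String> slice(String start, String finish, int count) {
        List<String> out = new ArrayList<>();
        for (String value : columns.subMap(start, true, finish, true).values()) {
            if (out.size() >= count) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        SortedRowSketch row = new SortedRowSketch();
        row.insert("anuff:user7", "user7");
        row.insert("siemens:user2", "user2");
        row.insert("smith:user9", "user9");
        // Everything whose term starts with "s": slice from "s" up to "s\uffff".
        System.out.println(row.slice("s", "s\uffff", 100));
    }
}
```

Because the names are sorted, a prefix or range query is one contiguous read of the row.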


On Fri, Feb 25, 2011 at 12:10 PM, Mohit Anchlia <mo...@gmail.com> wrote:

> Does it mean that we should design data model such that row keys
> actually become columns (and create secondary index) so that the data
> retrieval is faster. I am soon setting up big test instances to test
> all this.
>
> On Fri, Feb 25, 2011 at 11:18 AM, Ed Anuff <ed...@anuff.com> wrote:
> > It's nice to see some testing in this regard, however, it's worth
> pointing
> > out something that gets lost in CF index vs secondary index discussions.
> > What you're really proving is that get_slice (across columns) is faster
> than
> > get_indexed_slices (across keys).  For up to a certain size (and it would
> be
> > nice if there were some empirical testing to determine what that size
> is),
> > get_slice should be one of the most performant operations Cassandra can
> do.
> > CF index approaches are basically all about getting your data into a
> > situation where you can use get_slice to quickly perform the search.  The
> > reasons for using Cassandra's built in secondary index support, IMHO, is
> > that (1) it's easy to use whereas CF indexes are managed by the client
> and
> > (2) there's concern about how large an index you'd be able to effectively
> > store in a CF index row.  The first point is more about Cassandra being
> > easier for newcomers, the latter point is something I'd like to see some
> > more data around.  Maybe you want to run your tests up to much larger
> sizes
> > and see if there's a point where the results change?  FWIW, I recently
> > switched back to CF-based indexes from secondary indexes, largely for the
> > flexibility in the types of queries that became possible, but it's nice
> to
> > see there's some performance benefit.  The other thing would be good to
> look
> > at is timing the overhead of what it takes to update your index as you
> > change the values that are being indexed.
> >
> >
> >
> > On Fri, Feb 25, 2011 at 10:23 AM, Ron Siemens <rs...@greatergood.com>
> > wrote:
> >>
> >> I updated the cassandra version in the hector package from 0.7.0 to 0.7.2.
> >>  The occasional slow-down in the CF-index went away.  I then upped the
> heap
> >> to 512MB, and the secondary-indexing then works.  Seems awfully memory
> >> hungry for my small dataset.  Even the CF-index was faster with more
> heap.
> >>  These are the times with Cassandra-0.7.2 and 512M heap.  Slightly
> different
> >> testing: I'm varying the index used which give different data size
> results.
> >>  It still surprises me that the CF index does substantially better.
> >>
> >> Secondary Index
> >>
> >> DEBUG Retrieved THS / 7293 rows, in 1051 ms
> >> DEBUG Retrieved TRS / 7289 rows, in 1448 ms
> >> DEBUG Retrieved BCS / 7788 rows, in 1553 ms
> >> DEBUG Retrieved ARS / 7426 rows, in 1479 ms
> >> DEBUG Retrieved CHS / 7290 rows, in 1575 ms
> >> DEBUG Retrieved MS / 4523 rows, in 766 ms
> >> DEBUG Retrieved PRS / 562 rows, in 40 ms
> >> DEBUG Retrieved GGF / 1162 rows, in 122 ms
> >> DEBUG Retrieved VET / 7313 rows, in 1193 ms
> >> DEBUG Retrieved AUT / 7287 rows, in 1746 ms
> >> DEBUG Retrieved LIT / 7291 rows, in 1331 ms
> >>
> >> CF Index
> >>
> >> DEBUG Retrieved THS / 7293 rows, in 17 + 759 ms
> >> DEBUG Retrieved TRS / 7289 rows, in 19 + 734 ms
> >> DEBUG Retrieved BCS / 7788 rows, in 23 + 736 ms
> >> DEBUG Retrieved ARS / 7426 rows, in 23 + 1448 ms
> >> DEBUG Retrieved CHS / 7290 rows, in 18 + 638 ms
> >> DEBUG Retrieved MS / 4523 rows, in 32 + 622 ms
> >> DEBUG Retrieved PRS / 562 rows, in 2 + 50 ms
> >> DEBUG Retrieved GGF / 1162 rows, in 3 + 79 ms
> >> DEBUG Retrieved VET / 7313 rows, in 17 + 686 ms
> >> DEBUG Retrieved AUT / 7287 rows, in 17 + 758 ms
> >> DEBUG Retrieved LIT / 7291 rows, in 17 + 745 ms
> >>
> >> On Feb 24, 2011, at 3:39 PM, Ron Siemens wrote:
> >>
> >> >
> >> > I failed to mention: this is just doing repeated data retrievals using
> >> > the index.
> >> >
> >> >> ...
> >> >>
> >> >> Sample run: Secondary index.
> >> >>
> >> >> DEBUG Retrieved THS / 7293 rows, in 2012 ms
> >> >> DEBUG Retrieved THS / 7293 rows, in 1956 ms
> >> >> DEBUG Retrieved THS / 7293 rows, in 1843 ms
> >> > ...
> >> >
> >>
> >
> >
>

Re: Homebrew CF-indexing vs secondary indexing

Posted by Mohit Anchlia <mo...@gmail.com>.

Does that mean we should design the data model so that row keys
actually become columns (and create a secondary index) so that data
retrieval is faster?  I am soon setting up big test instances to test
all this.

On Fri, Feb 25, 2011 at 11:18 AM, Ed Anuff <ed...@anuff.com> wrote:
> It's nice to see some testing in this regard, however, it's worth pointing
> out something that gets lost in CF index vs secondary index discussions.
> What you're really proving is that get_slice (across columns) is faster than
> get_indexed_slices (across keys).  For up to a certain size (and it would be
> nice if there were some empirical testing to determine what that size is),
> get_slice should be one of the most performant operations Cassandra can do.
> CF index approaches are basically all about getting your data into a
> situation where you can use get_slice to quickly perform the search.  The
> reasons for using Cassandra's built in secondary index support, IMHO, is
> that (1) it's easy to use whereas CF indexes are managed by the client  and
> (2) there's concern about how large an index you'd be able to effectively
> store in a CF index row.  The first point is more about Cassandra being
> easier for newcomers, the latter point is something I'd like to see some
> more data around.  Maybe you want to run your tests up to much larger sizes
> and see if there's a point where the results change?  FWIW, I recently
> switched back to CF-based indexes from secondary indexes, largely for the
> flexibility in the types of queries that became possible, but it's nice to
> see there's some performance benefit.  The other thing would be good to look
> at is timing the overhead of what it takes to update your index as you
> change the values that are being indexed.
>
>
>
> On Fri, Feb 25, 2011 at 10:23 AM, Ron Siemens <rs...@greatergood.com>
> wrote:
>>
>> I updated the cassandra version in the hector package from 0.7.0 to 0.7.2.
>>  The occasional slow-down in the CF-index went away.  I then upped the heap
>> to 512MB, and the secondary-indexing then works.  Seems awfully memory
>> hungry for my small dataset.  Even the CF-index was faster with more heap.
>>  These are the times with Cassandra-0.7.2 and 512M heap.  Slightly different
>> testing: I'm varying the index used which give different data size results.
>>  It still surprises me that the CF index does substantially better.
>>
>> Secondary Index
>>
>> DEBUG Retrieved THS / 7293 rows, in 1051 ms
>> DEBUG Retrieved TRS / 7289 rows, in 1448 ms
>> DEBUG Retrieved BCS / 7788 rows, in 1553 ms
>> DEBUG Retrieved ARS / 7426 rows, in 1479 ms
>> DEBUG Retrieved CHS / 7290 rows, in 1575 ms
>> DEBUG Retrieved MS / 4523 rows, in 766 ms
>> DEBUG Retrieved PRS / 562 rows, in 40 ms
>> DEBUG Retrieved GGF / 1162 rows, in 122 ms
>> DEBUG Retrieved VET / 7313 rows, in 1193 ms
>> DEBUG Retrieved AUT / 7287 rows, in 1746 ms
>> DEBUG Retrieved LIT / 7291 rows, in 1331 ms
>>
>> CF Index
>>
>> DEBUG Retrieved THS / 7293 rows, in 17 + 759 ms
>> DEBUG Retrieved TRS / 7289 rows, in 19 + 734 ms
>> DEBUG Retrieved BCS / 7788 rows, in 23 + 736 ms
>> DEBUG Retrieved ARS / 7426 rows, in 23 + 1448 ms
>> DEBUG Retrieved CHS / 7290 rows, in 18 + 638 ms
>> DEBUG Retrieved MS / 4523 rows, in 32 + 622 ms
>> DEBUG Retrieved PRS / 562 rows, in 2 + 50 ms
>> DEBUG Retrieved GGF / 1162 rows, in 3 + 79 ms
>> DEBUG Retrieved VET / 7313 rows, in 17 + 686 ms
>> DEBUG Retrieved AUT / 7287 rows, in 17 + 758 ms
>> DEBUG Retrieved LIT / 7291 rows, in 17 + 745 ms
>>
>> On Feb 24, 2011, at 3:39 PM, Ron Siemens wrote:
>>
>> >
>> > I failed to mention: this is just doing repeated data retrievals using
>> > the index.
>> >
>> >> ...
>> >>
>> >> Sample run: Secondary index.
>> >>
>> >> DEBUG Retrieved THS / 7293 rows, in 2012 ms
>> >> DEBUG Retrieved THS / 7293 rows, in 1956 ms
>> >> DEBUG Retrieved THS / 7293 rows, in 1843 ms
>> > ...
>> >
>>
>
>

Re: Homebrew CF-indexing vs secondary indexing

Posted by Ed Anuff <ed...@anuff.com>.

It's nice to see some testing in this regard, however, it's worth pointing
out something that gets lost in CF index vs secondary index discussions.
What you're really proving is that get_slice (across columns) is faster than
get_indexed_slices (across keys).  For up to a certain size (and it would be
nice if there were some empirical testing to determine what that size is),
get_slice should be one of the most performant operations Cassandra can do.
CF index approaches are basically all about getting your data into a
situation where you can use get_slice to quickly perform the search.  The
reasons for using Cassandra's built-in secondary index support, IMHO, are
that (1) it's easy to use whereas CF indexes are managed by the client  and
(2) there's concern about how large an index you'd be able to effectively
store in a CF index row.  The first point is more about Cassandra being
easier for newcomers, the latter point is something I'd like to see some
more data around.  Maybe you want to run your tests up to much larger sizes
and see if there's a point where the results change?  FWIW, I recently
switched back to CF-based indexes from secondary indexes, largely for the
flexibility in the types of queries that became possible, but it's nice to
see there's some performance benefit.  The other thing that would be good to look
at is timing the overhead of what it takes to update your index as you
change the values that are being indexed.
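
To illustrate the update overhead, here is a minimal in-memory sketch of what a client-managed index has to do when an indexed value changes: remove the key from the old index row and add it to the new one.  Everything in it (class name, fields, the write counter) is hypothetical, and it models only the bookkeeping, not Cassandra mutations or their latency.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory model of client-side index maintenance; NOT Cassandra code.
class IndexMaintenanceSketch {
    // row key -> currently indexed value
    private final Map<String, String> currentValue = new HashMap<>();
    // indexed value -> row keys (one "index row" per value)
    private final Map<String, TreeSet<String>> indexRows = new TreeMap<>();
    // number of index-row writes performed so far (deletes and inserts)
    int indexWrites = 0;

    void setValue(String rowKey, String newValue) {
        // The client must know the old value to find the stale index entry.
        String old = currentValue.put(rowKey, newValue);
        if (newValue.equals(old)) {
            return; // value unchanged, index already correct
        }
        if (old != null) {
            TreeSet<String> oldRow = indexRows.get(old);
            oldRow.remove(rowKey);
            indexWrites++;
            if (oldRow.isEmpty()) {
                indexRows.remove(old);
            }
        }
        indexRows.computeIfAbsent(newValue, k -> new TreeSet<>()).add(rowKey);
        indexWrites++;
    }

    Set<String> lookup(String value) {
        TreeSet<String> row = indexRows.get(value);
        return row == null ? new TreeSet<String>() : new TreeSet<String>(row);
    }

    public static void main(String[] args) {
        IndexMaintenanceSketch idx = new IndexMaintenanceSketch();
        long start = System.nanoTime();
        for (int i = 0; i < 7293; i++) idx.setValue("row-" + i, "THS");
        for (int i = 0; i < 7293; i += 2) idx.setValue("row-" + i, "TRS");
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(idx.indexWrites + " index writes in " + micros
                + " us; THS now has " + idx.lookup("THS").size() + " keys");
    }
}
```

Each change to an already-indexed value costs two index writes plus knowledge of the old value, which is the part worth timing against a real cluster.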

On Fri, Feb 25, 2011 at 10:23 AM, Ron Siemens <rs...@greatergood.com> wrote:

>
> I updated the cassandra version in the hector package from 0.7.0 to 0.7.2.  The
> occasional slow-down in the CF-index went away.  I then upped the heap to
> 512MB, and the secondary-indexing then works.  Seems awfully memory hungry
> for my small dataset.  Even the CF-index was faster with more heap.  These
> are the times with Cassandra-0.7.2 and 512M heap.  Slightly different
> testing: I'm varying the index used which give different data size results.
>  It still surprises me that the CF index does substantially better.
>
> Secondary Index
>
> DEBUG Retrieved THS / 7293 rows, in 1051 ms
> DEBUG Retrieved TRS / 7289 rows, in 1448 ms
> DEBUG Retrieved BCS / 7788 rows, in 1553 ms
> DEBUG Retrieved ARS / 7426 rows, in 1479 ms
> DEBUG Retrieved CHS / 7290 rows, in 1575 ms
> DEBUG Retrieved MS / 4523 rows, in 766 ms
> DEBUG Retrieved PRS / 562 rows, in 40 ms
> DEBUG Retrieved GGF / 1162 rows, in 122 ms
> DEBUG Retrieved VET / 7313 rows, in 1193 ms
> DEBUG Retrieved AUT / 7287 rows, in 1746 ms
> DEBUG Retrieved LIT / 7291 rows, in 1331 ms
>
> CF Index
>
> DEBUG Retrieved THS / 7293 rows, in 17 + 759 ms
> DEBUG Retrieved TRS / 7289 rows, in 19 + 734 ms
> DEBUG Retrieved BCS / 7788 rows, in 23 + 736 ms
> DEBUG Retrieved ARS / 7426 rows, in 23 + 1448 ms
> DEBUG Retrieved CHS / 7290 rows, in 18 + 638 ms
> DEBUG Retrieved MS / 4523 rows, in 32 + 622 ms
> DEBUG Retrieved PRS / 562 rows, in 2 + 50 ms
> DEBUG Retrieved GGF / 1162 rows, in 3 + 79 ms
> DEBUG Retrieved VET / 7313 rows, in 17 + 686 ms
> DEBUG Retrieved AUT / 7287 rows, in 17 + 758 ms
> DEBUG Retrieved LIT / 7291 rows, in 17 + 745 ms
>
> On Feb 24, 2011, at 3:39 PM, Ron Siemens wrote:
>
> >
> > I failed to mention: this is just doing repeated data retrievals using
> the index.
> >
> >> ...
> >>
> >> Sample run: Secondary index.
> >>
> >> DEBUG Retrieved THS / 7293 rows, in 2012 ms
> >> DEBUG Retrieved THS / 7293 rows, in 1956 ms
> >> DEBUG Retrieved THS / 7293 rows, in 1843 ms
> > ...
> >
>
>

Re: Homebrew CF-indexing vs secondary indexing

Posted by Ron Siemens <rs...@greatergood.com>.

I updated the cassandra version in the hector package from 0.7.0 to 0.7.2.  The occasional slow-down in the CF-index went away.  I then upped the heap to 512MB, and the secondary-indexing then works.  Seems awfully memory hungry for my small dataset.  Even the CF-index was faster with more heap.  These are the times with Cassandra-0.7.2 and 512M heap.  Slightly different testing: I'm varying the index used, which gives different data size results.  It still surprises me that the CF index does substantially better.

Secondary Index

DEBUG Retrieved THS / 7293 rows, in 1051 ms
DEBUG Retrieved TRS / 7289 rows, in 1448 ms
DEBUG Retrieved BCS / 7788 rows, in 1553 ms
DEBUG Retrieved ARS / 7426 rows, in 1479 ms
DEBUG Retrieved CHS / 7290 rows, in 1575 ms
DEBUG Retrieved MS / 4523 rows, in 766 ms
DEBUG Retrieved PRS / 562 rows, in 40 ms
DEBUG Retrieved GGF / 1162 rows, in 122 ms
DEBUG Retrieved VET / 7313 rows, in 1193 ms
DEBUG Retrieved AUT / 7287 rows, in 1746 ms
DEBUG Retrieved LIT / 7291 rows, in 1331 ms

CF Index

DEBUG Retrieved THS / 7293 rows, in 17 + 759 ms
DEBUG Retrieved TRS / 7289 rows, in 19 + 734 ms
DEBUG Retrieved BCS / 7788 rows, in 23 + 736 ms
DEBUG Retrieved ARS / 7426 rows, in 23 + 1448 ms
DEBUG Retrieved CHS / 7290 rows, in 18 + 638 ms
DEBUG Retrieved MS / 4523 rows, in 32 + 622 ms
DEBUG Retrieved PRS / 562 rows, in 2 + 50 ms
DEBUG Retrieved GGF / 1162 rows, in 3 + 79 ms
DEBUG Retrieved VET / 7313 rows, in 17 + 686 ms
DEBUG Retrieved AUT / 7287 rows, in 17 + 758 ms
DEBUG Retrieved LIT / 7291 rows, in 17 + 745 ms
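
The two tables above can be totalled with a few lines of Java.  The numbers are copied from the logs; the class name is made up, and reading "a + b" as index read time plus data fetch time is an assumption based on the log format in the first post.

```java
// Summarizes the timings posted above (Cassandra 0.7.2, 512M heap).
class TimingSummary {
    static final String[] INDEX =
            {"THS", "TRS", "BCS", "ARS", "CHS", "MS", "PRS", "GGF", "VET", "AUT", "LIT"};
    static final int[] SECONDARY_MS =
            {1051, 1448, 1553, 1479, 1575, 766, 40, 122, 1193, 1746, 1331};
    // CF index runs are logged as "a + b"; assumed: a = index row read, b = data multiget.
    static final int[] CF_INDEX_MS = {17, 19, 23, 23, 18, 32, 2, 3, 17, 17, 17};
    static final int[] CF_DATA_MS = {759, 734, 736, 1448, 638, 622, 50, 79, 686, 758, 745};

    static int total(int[] values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Number of indexes where the CF approach (index + data) beat the secondary index.
    static int cfFasterCount() {
        int n = 0;
        for (int i = 0; i < INDEX.length; i++) {
            if (CF_INDEX_MS[i] + CF_DATA_MS[i] < SECONDARY_MS[i]) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int secondary = total(SECONDARY_MS);
        int cf = total(CF_INDEX_MS) + total(CF_DATA_MS);
        for (int i = 0; i < INDEX.length; i++) {
            System.out.printf("%-4s secondary=%4d ms  cf=%4d ms%n",
                    INDEX[i], SECONDARY_MS[i], CF_INDEX_MS[i] + CF_DATA_MS[i]);
        }
        System.out.printf("total: secondary=%d ms, cf=%d ms, ratio=%.2f, cf faster in %d of %d%n",
                secondary, cf, (double) secondary / cf, cfFasterCount(), INDEX.length);
    }
}
```

On these single runs the CF index totals 7443 ms against 12304 ms for the secondary index, and it is faster for every index except PRS (52 ms vs 40 ms).  ARS is nearly a tie at 1471 ms vs 1479 ms.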

On Feb 24, 2011, at 3:39 PM, Ron Siemens wrote:

> 
> I failed to mention: this is just doing repeated data retrievals using the index.
> 
>> ...
>> 
>> Sample run: Secondary index.
>> 
>> DEBUG Retrieved THS / 7293 rows, in 2012 ms
>> DEBUG Retrieved THS / 7293 rows, in 1956 ms
>> DEBUG Retrieved THS / 7293 rows, in 1843 ms
> ...
>

Re: Homebrew CF-indexing vs secondary indexing

Posted by Ron Siemens <rs...@greatergood.com>.

I failed to mention: this is just doing repeated data retrievals using the index.

> ...
> 
> Sample run: Secondary index.
> 
> DEBUG Retrieved THS / 7293 rows, in 2012 ms
> DEBUG Retrieved THS / 7293 rows, in 1956 ms
> DEBUG Retrieved THS / 7293 rows, in 1843 ms
...

Re: Homebrew CF-indexing vs secondary indexing

Posted by buddhasystem <po...@bnl.gov>.

FWIW, for me the advantage of homebrew indexes is that they can be a lot more
sophisticated than the standard -- I can hash combinations of column values
to whatever I want. I also put counters on column values in the index, so
there is lots of functionality. Of course, I can do it because my data
becomes read-only, I know it's a luxury.
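
As a rough illustration of that idea, here is an in-memory sketch of an index keyed on a combination of column values, with a counter kept per value.  All names are hypothetical, and a delimiter-joined string stands in for whatever hash is actually used.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory model of a combination-keyed index with per-value counters.
class CompositeIndexSketch {
    // combination of column values -> row keys
    private final Map<String, TreeSet<String>> index = new TreeMap<>();
    // single column value -> number of rows carrying it
    private final Map<String, Integer> counters = new TreeMap<>();

    // Stand-in for a hash: join the values with a delimiter (assumes "|" never occurs in a value).
    static String comboKey(String... values) {
        return String.join("|", values);
    }

    // Index one row under its value combination; suits data that is written once.
    void add(String rowKey, String... values) {
        index.computeIfAbsent(comboKey(values), k -> new TreeSet<>()).add(rowKey);
        for (String v : values) {
            counters.merge(v, 1, Integer::sum);
        }
    }

    Set<String> find(String... values) {
        TreeSet<String> rows = index.get(comboKey(values));
        return rows == null ? new TreeSet<String>() : new TreeSet<String>(rows);
    }

    int count(String value) {
        return counters.getOrDefault(value, 0);
    }

    public static void main(String[] args) {
        CompositeIndexSketch idx = new CompositeIndexSketch();
        idx.add("row-1", "red", "large");
        idx.add("row-2", "red", "small");
        idx.add("row-3", "red", "large");
        System.out.println(idx.find("red", "large") + ", red count = " + idx.count("red"));
    }
}
```

With read-only data there is no stale entry to clean up, so the counters never need decrementing.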
