Posted to user@cassandra.apache.org by Eldad Yamin <el...@gmail.com> on 2011/07/09 21:15:00 UTC

Cassandra Secondary index/Twissandra

Hi,
I have a few questions:

*Secondary index*

   1. Is there a limit on the number of columns in a single column family
   that serve as secondary indexes?
   2. Does performance decrease (significantly) if the uniqueness of the
   column’s values is high?


*Twissandra*

   1. Why, in the source (and in every tutorial I've read), do the
   "Userline"/"Timeline" CFs have a comparator of "LONG_TYPE" and not
   TimeUUID?

   https://github.com/twissandra/twissandra/blob/master/tweets/management/commands/sync_cassandra.py
   2. Does performance decrease (significantly) if the uniqueness of the
   column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has
   lots of columns?


Thanks!
Eldad

Re: Cassandra Secondary index/Twissandra

Posted by Eldad Yamin <el...@gmail.com>.

Hi Aaron,
Thank you again for your response.

I've read the article but I didn't understand everything. It would be great
if the benchmark included the actual CLI/Python commands (that way it
would be easier to understand the query), along with an explanation of
row pages: what are they?

Anyway, for a sense of scale, take the average Facebook/Twitter user,
who can have 100K columns (Userline).
What I need is to take the first 50 columns (ordered by TimeUUID), then
columns 51 to 100, 101 to 150, etc.
Any idea how fast that will be? How would you recommend configuring
Cassandra, or is there a different way of achieving that goal?

Thanks,
Eldad.

On Sun, Jul 10, 2011 at 8:31 PM, aaron morton <aa...@thelastpickle.com>wrote:

> Can you recommend on a better way of doing that or a way to tune Cassandra
> to support those 2 CF?
>
> A select with no start or finish column name, a column count and not in
> reversed order is about the fastest read query.
>
> You will need to do a reversed query, which will be a little slower. But
> may still be plenty fast enough, depending on scale and throughput and all
> those other things. see
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com

Re: Cassandra Secondary index/Twissandra

Posted by aaron morton <aa...@thelastpickle.com>.

> Can you recommend on a better way of doing that or a way to tune Cassandra to support those 2 CF?
A select with a column count, no start or finish column name, and not in reversed order is about the fastest read query.

You will need to do a reversed query, which will be a little slower, but it may still be plenty fast depending on scale, throughput, and all those other things. See http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
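
To make the paging pattern concrete, here is a minimal pure-Python sketch of a reversed slice with a column count, paged 50 columns at a time. It does not talk to Cassandra: the row is modelled as a name-sorted list, and get_slice, pages and reversed_ are hypothetical names that only mimic the start/count/reversed parameters of a slice query. It assumes the start column is inclusive, so each later page asks for one extra column and drops the one already seen.

```python
import bisect


def get_slice(columns, start=None, count=100, reversed_=False):
    """Return up to `count` (name, value) pairs from a name-sorted column list.

    `start` is inclusive; None means "from the beginning" (or from the end
    when reversed_ is True). Rebuilding `names` on every call is wasteful but
    keeps the sketch short.
    """
    names = [name for name, _ in columns]
    if not reversed_:
        lo = 0 if start is None else bisect.bisect_left(names, start)
        return columns[lo:lo + count]
    hi = len(names) if start is None else bisect.bisect_right(names, start)
    return columns[max(0, hi - count):hi][::-1]


def pages(columns, page_size):
    """Yield pages newest-first, restarting each page at the last name seen."""
    start = None
    while True:
        # The start column is inclusive, so every page after the first would
        # begin with a column already returned: fetch one extra and skip it.
        extra = 0 if start is None else 1
        chunk = get_slice(columns, start, page_size + extra, reversed_=True)
        chunk = chunk[extra:]
        if not chunk:
            return
        yield chunk
        start = chunk[-1][0]
```

With a row of 120 columns and a page size of 50, this yields pages of 50, 50 and 20 columns, newest first. Only the first page is a plain "no start, count only" reversed slice; later pages carry a start column.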

Cheers


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 10 Jul 2011, at 00:14, Eldad Yamin wrote:

> Aaron - Thank you for the fast response!
> 
>> Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns?
> 
> >Depends on what sort of operations you are doing. Some read operations have to pay a constant cost to decode the row level column index, this can be tuned though. AFAIK the comparator type has very little to do with the performance. 
> 
> In Twissandra, the columns are used as "alternative" index for the Userline/Timeline. therefore the operation I'm going to do is slice_range.
> I'm going to get (for example) the first 50  columns (using comparator of TimeUUID/LONG).
> Can you recommend on a better way of doing that or a way to tune Cassandra to support those 2 CF?
> 
> 
> Thanks!

Re: Cassandra Secondary index/Twissandra

Posted by Eldad Yamin <el...@gmail.com>.

Aaron - Thank you for the fast response!


   1. Does performance decrease (significantly) if the uniqueness of the
   column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has
   lots of columns?

> Depends on what sort of operations you are doing. Some read operations have
> to pay a constant cost to decode the row level column index, this can be
> tuned though. AFAIK the comparator type has very little to do with the
> performance.

In Twissandra, the columns are used as an "alternative" index for the
Userline/Timeline. Therefore, the operation I'm going to do is slice_range:
I'm going to get (for example) the first 50 columns (using a comparator of
TimeUUID/LONG).
Can you recommend a better way of doing that, or a way to tune Cassandra
to support those 2 CFs?


Thanks!

On Sun, Jul 10, 2011 at 3:26 AM, aaron morton <aa...@thelastpickle.com>wrote:

>
>    1. Is there a limit on the number of columns in a single column family
>    that serve as secondary indexes?
>
> AFAIK there is no coded limit, however every index is implemented as
> another (hidden) Column Family that inherits the settings of the parent CF.
> So under 0.7 you may run out of memory, under 0.8 you may flush  a lot.
> Also, when an indexed column is updated there are potentially 3 operations
> that have to happen: read the old value, delete the old value, write the new
> value. More indexes == more index updating, just like any other database.
>
>
>    1. Does performance decrease (significantly) if the uniqueness of the
>    column’s values is high?
>
> Low cardinality is recommended
>
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Secondary-indices-Why-low-cardinality-td6160509.html
>
>
>    1. The CF for "Userline"/"Timeline" - have comparator of "LONG_TYPE"
>    and not TimeUUID?
>
> Probably just to make the demo easier. It's used to order tweets in the
> user and public timelines by the current time
> https://github.com/twissandra/twissandra/blob/master/cass.py#L204
>
>
>    1. Does performance decrease (significantly) if the uniqueness of the
>    column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has
>    lots of columns?
>
> Depends on what sort of operations you are doing. Some read operations have
> to pay a constant cost to decode the row level column index, this can be
> tuned though. AFAIK the comparator type has very little to do with the
> performance.
>
> Hope that helps.
>
> -----------------
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com

Re: Cassandra Secondary index/Twissandra

Posted by aaron morton <aa...@thelastpickle.com>.

> Is there a limit on the number of columns in a single column family that serve as secondary indexes? 
AFAIK there is no coded limit. However, every index is implemented as another (hidden) Column Family that inherits the settings of the parent CF, so under 0.7 you may run out of memory and under 0.8 you may flush a lot. Also, when an indexed column is updated there are potentially 3 operations that have to happen: read the old value, delete the old value, write the new value. More indexes == more index updating, just like any other database.
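
As a rough illustration of the three index operations, here is a toy in-memory model in Python. It is only a sketch: IndexedColumnFamily, insert, get_indexed and index_ops are hypothetical names, the "hidden" index is a plain dict from value to row keys, and nothing here reflects how Cassandra actually stores an index.

```python
from collections import defaultdict


class IndexedColumnFamily:
    """Toy model of a column family with one secondary index per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = {}  # row key -> {column name: value}
        # One "hidden" index per indexed column: value -> set of row keys.
        self.indexes = {name: defaultdict(set) for name in indexed_columns}
        self.index_ops = 0  # read/delete/write steps spent on index upkeep

    def insert(self, key, column, value):
        row = self.rows.setdefault(key, {})
        if column in self.indexes:
            index = self.indexes[column]
            old = row.get(column)        # 1. read the old value
            self.index_ops += 1
            if old is not None:
                index[old].discard(key)  # 2. delete the old index entry
                self.index_ops += 1
            index[value].add(key)        # 3. write the new index entry
            self.index_ops += 1
        row[column] = value

    def get_indexed(self, column, value):
        """Return the row keys whose indexed column currently holds value."""
        return sorted(self.indexes[column][value])
```

In this model a first write to an indexed column costs two index steps and an overwrite costs three, while a write to a non-indexed column costs none. Each extra indexed column repeats that work on every write that touches it.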
> Does performance decrease (significantly) if the uniqueness of the column’s values is high?
Low cardinality is recommended
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Secondary-indices-Why-low-cardinality-td6160509.html

> The CF for "Userline"/"Timeline" - have comparator of "LONG_TYPE" and not TimeUUID?
Probably just to make the demo easier. It's used to order tweets in the user and public timelines by the current time 
https://github.com/twissandra/twissandra/blob/master/cass.py#L204
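
For comparison, here is a small Python sketch of the two kinds of column name, using only the standard library. It assumes the long is a microsecond Unix timestamp (check cass.py for what Twissandra really does); long_column_name and timeuuid_to_unix_seconds are hypothetical helper names. One practical difference it shows: two writes to the same row in the same microsecond collide on a long name, while version 1 UUIDs stay distinct.

```python
import time
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01),
# in 100-nanosecond intervals.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def long_column_name(now=None):
    """Column name as a long: microseconds since the Unix epoch (assumed unit)."""
    now = time.time() if now is None else now
    return int(now * 1e6)


def timeuuid_to_unix_seconds(u):
    """Recover the Unix timestamp embedded in a version 1 UUID."""
    return (u.time - UUID_EPOCH_OFFSET) / 1e7


# Two tweets written to the same row at the same instant get the same long
# column name, so the second overwrites the first.
same_instant = 1310238900.0
row_long = {}
row_long[long_column_name(same_instant)] = "tweet-1"
row_long[long_column_name(same_instant)] = "tweet-2"

# Version 1 UUIDs generated back to back are distinct, so both tweets survive.
row_uuid = {uuid.uuid1(): "tweet-1", uuid.uuid1(): "tweet-2"}

# A TimeUUID comparator orders by the embedded timestamp; sorting on
# UUID.time approximates that ordering here.
ordered = sorted(row_uuid, key=lambda u: u.time)
```

Either type keeps columns in time order, which is all the demo needs.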

> Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns?
Depends on what sort of operations you are doing. Some read operations have to pay a constant cost to decode the row-level column index, though this can be tuned. AFAIK the comparator type has very little to do with the performance.
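
To give a feel for what a row-level column index does, here is a deliberately simplified Python model. It is not Cassandra's on-disk format: build_column_index and read_column are hypothetical names, and block_size_bytes stands in for the tunable block size (the column_index_size_in_kb setting in cassandra.yaml, as far as I know). The index keeps the first column name of each block, so a read binary-searches the index and then scans one block.

```python
import bisect


def build_column_index(columns, block_size_bytes):
    """Group name-sorted (name, value) pairs into blocks of about
    block_size_bytes and record the first column name of each block."""
    first_names, blocks = [], []
    current, current_bytes = [], 0
    for name, value in columns:
        # Close the current block once it has reached the target size.
        if current and current_bytes >= block_size_bytes:
            blocks.append(current)
            current, current_bytes = [], 0
        if not current:
            first_names.append(name)
        current.append((name, value))
        current_bytes += len(value)
    if current:
        blocks.append(current)
    return first_names, blocks


def read_column(first_names, blocks, name):
    """Find one column: binary-search the index, then scan a single block."""
    i = bisect.bisect_right(first_names, name) - 1
    if i < 0:
        return None
    for col_name, value in blocks[i]:
        if col_name == name:
            return value
    return None
```

In this model a smaller block size means more index entries to load but fewer columns to scan per read. The comparator only decides how names are ordered, and the lookup works the same way for longs or TimeUUIDs.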

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 9 Jul 2011, at 12:15, Eldad Yamin wrote:

> Hi,
> I have few questions:
> 
> Secondary index
> Is there a limit on the number of columns in a single column family that serve as secondary indexes? 
> Does performance decrease (significantly) if the uniqueness of the column’s values is high?
> 
> Twissandra
> Why in the source (or any tutorial I've read):
> The CF for "Userline"/"Timeline" - have comparator of "LONG_TYPE" and not TimeUUID?
> https://github.com/twissandra/twissandra/blob/master/tweets/management/commands/sync_cassandra.py
> Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns?
> 
> Thanks!
> Eldad
