You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Kevin Irwig <ke...@yahoo.com> on 2010/12/24 05:20:52 UTC

complexity

Hi,

I've seen a similar question has been asked in this forum in Sept, but not 
answered.

What is the complexity of get(row) and get(row, column-name) operations, and 
insert(row, column)? What about accessing or inserting a column within a 
SuperColumn by name?

In Arin Sarkissian's "WTF is a Supercolumn" he stresses that columns are stored 
sorted by name. Are they simply stored in a sorted list, requiring a search when 
inserting and accessing (I forget the worst case complexity of 
searching/inserting into/deleting from a sorted list), or in some data structure 
with faster access? Even if you don't know the big-O complexity, a description 
of the implementation data structure(s), and discussion of what is fast and what 
is not, would help.

Thanks for your help,
Kevin.

Re: complexity

Posted by Peter Schuller <pe...@infidyne.com>.

>> When the row is stored on disk as SSTable, the complexity of getting a row
>> is constant, as it always know where to get the row by in-memory indices.
>
> BTW: not the whole indices are kept in memory, just part of them are. This
> is controlled by "IndexInterval".  That is, 1/IndexInterval of whole indices
> are kept in memory. The default value is 128.

And the reason why the cost is constant, is that there is a single
seek (logically) done to the index, followed by a single seek to the
data.

The index sampling controls the granularity of the seek in the index.
With the default value of 128, Cassandra may need to read and
deserialize up to 128 index entries before finding the one being
looked for, but that data is read sequentially from disk.

(Whether or not this translates into exactly one seek in reality will
be dependent on several factors such as read-ahead logic. But it
scales constantly with respect to data size - at the cost of some
memory (the index sampling))

-- 
/ Peter Schuller

Re: complexity

Posted by Zhu Han <sc...@gmail.com>.

On Fri, Dec 24, 2010 at 6:42 PM, Zhu Han <sc...@gmail.com> wrote:

> When the row is stored on disk as SSTable, the complexity of getting a row
> is constant, as it always know where to get the row by in-memory indices.
>

BTW: not the whole indices are kept in memory, just part of them are. This
is controlled by "IndexInterval".  That is, 1/IndexInterval of whole indices
are kept in memory. The default value is 128.


>
> When the row is stored in memory as memtable,  it is stored as skip
> list[1]. The complexity is O(logN).  N is the total number of rows in the
> skip list.
>
> To get a column inside a row, the complexity is similar. When there are
> many columns in a single row, cassandra will build indices for these columns
> when write the row to SStable.  This controlled by "ColumnIndexSizeInKB".
>
> If I made any mistake, please correct me. Thank you!
>
> [1] http://en.wikipedia.org/wiki/Skip_list
>
> best regards,
> hanzhu
>
>
>
> On Fri, Dec 24, 2010 at 12:20 PM, Kevin Irwig <ke...@yahoo.com>wrote:
>
>> Hi,
>>
>> I've seen a similar question has been asked in this forum in Sept, but not
>> answered.
>>
>> What is the complexity of get(row) and get(row, column-name) operations,
>> and
>> insert(row, column)? What about accessing or inserting a column within a
>> SuperColumn by name?
>>
>> In Arin Sarkissian's "WTF is a Supercolumn" he stresses that columns are
>> stored
>> sorted by name. Are they simply stored in a sorted list, requiring a
>> search when
>> inserting and accessing (I forget the worst case complexity of
>> searching/inserting into/deleting from a sorted list), or in some data
>> structure
>> with faster access? Even if you don't know the big-O complexity, a
>> description
>> of the implementation data structure(s), and discussion of what is fast
>> and what
>> is not, would help.
>>
>> Thanks for your help,
>> Kevin.
>>
>>
>>
>>
>>
>

Re: complexity

Posted by Brandon Williams <dr...@gmail.com>.

On Fri, Dec 24, 2010 at 10:35 PM, Kevin Irwig <ke...@yahoo.com> wrote:

> thanks all - and just to clarify the cost of getting a column (and a column
> given a SuperColumn) is also O(log N) ?
>

 Since subcolumns are not indexed (
https://issues.apache.org/jira/browse/CASSANDRA-598) retrieving them from a
SuperColumn is O(N).

-Brandon

Re: complexity

Posted by Kevin Irwig <ke...@yahoo.com>.

thanks all - and just to clarify the cost of getting a column (and a column 
given a SuperColumn) is also O(log N) ?




________________________________
From: Zhu Han <sc...@gmail.com>
To: user@cassandra.apache.org
Sent: Sat, 25 December, 2010 3:16:08 AM
Subject: Re: complexity

Yep. I forgot about the binary search part.

Thank you!

regards,
hanzhu



On Fri, Dec 24, 2010 at 9:35 PM, Jonathan Ellis <jb...@gmail.com> wrote:

On Fri, Dec 24, 2010 at 4:42 AM, Zhu Han <sc...@gmail.com> wrote:
>
>> When the row is stored on disk as SSTable, the complexity of getting a row
>> is constant, as it always know where to get the row by in-memory indices.
>
>Technically, it's O(log N) because of the binary search on the in-memory index.
>
>--
>Jonathan Ellis
>Project Chair, Apache Cassandra
>co-founder of Riptano, the source for professional Cassandra support
>http://riptano.com
>

Re: complexity

Posted by Zhu Han <sc...@gmail.com>.

Yep. I forgot about the binary search part.

Thank you!

regards,
hanzhu


On Fri, Dec 24, 2010 at 9:35 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Fri, Dec 24, 2010 at 4:42 AM, Zhu Han <sc...@gmail.com> wrote:
> > When the row is stored on disk as SSTable, the complexity of getting a
> row
> > is constant, as it always know where to get the row by in-memory indices.
>
> Technically, it's O(log N) because of the binary search on the in-memory
> index.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: complexity

Posted by Jonathan Ellis <jb...@gmail.com>.

On Fri, Dec 24, 2010 at 4:42 AM, Zhu Han <sc...@gmail.com> wrote:
> When the row is stored on disk as SSTable, the complexity of getting a row
> is constant, as it always know where to get the row by in-memory indices.

Technically, it's O(log N) because of the binary search on the in-memory index.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: complexity

Posted by Zhu Han <sc...@gmail.com>.

When the row is stored on disk as SSTable, the complexity of getting a row
is constant, as it always know where to get the row by in-memory indices.

When the row is stored in memory as memtable,  it is stored as skip list[1].
The complexity is O(logN).  N is the total number of rows in the skip list.

To get a column inside a row, the complexity is similar. When there are many
columns in a single row, cassandra will build indices for these columns when
write the row to SStable.  This controlled by "ColumnIndexSizeInKB".

If I made any mistake, please correct me. Thank you!

[1] http://en.wikipedia.org/wiki/Skip_list

best regards,
hanzhu

On Fri, Dec 24, 2010 at 12:20 PM, Kevin Irwig <ke...@yahoo.com> wrote:

> Hi,
>
> I've seen a similar question has been asked in this forum in Sept, but not
> answered.
>
> What is the complexity of get(row) and get(row, column-name) operations,
> and
> insert(row, column)? What about accessing or inserting a column within a
> SuperColumn by name?
>
> In Arin Sarkissian's "WTF is a Supercolumn" he stresses that columns are
> stored
> sorted by name. Are they simply stored in a sorted list, requiring a search
> when
> inserting and accessing (I forget the worst case complexity of
> searching/inserting into/deleting from a sorted list), or in some data
> structure
> with faster access? Even if you don't know the big-O complexity, a
> description
> of the implementation data structure(s), and discussion of what is fast and
> what
> is not, would help.
>
> Thanks for your help,
> Kevin.
>
>
>
>
>