Posted to user@cassandra.apache.org by yangfeng <ye...@gmail.com> on 2010/04/20 14:59:05 UTC

How to increase cassandra's performance in read?

I get 10 column families by key, and one column family has 30 columns.
I use multigetSlice once to get the 10 column families, but the performance is
very poor.
Does anyone have other ideas for improving the performance?

Re: How to increase cassandra's performance in read?

Posted by Benjamin Black <b...@b3k.us>.

On Tue, Apr 20, 2010 at 11:54 AM, Mark Jones <MJ...@imagehawk.com> wrote:
> When I look at this arrangement, I see one lookup by key for the user, followed by a large read for all the "email indexes"  (these are all columns in the same row, right?)
>
> Then one lookup by key for each email....  Seems very seek intensive.
>

Do you need to grab every single email every single time?  Seems to me
you only need the recent ones or a page full.  A single multiget would
do it, and the load is spread across the cluster.
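
A minimal sketch of the "page full plus one multiget" idea, assuming one index row per user and one row per email. Plain Python dicts stand in for the two column families; the names `user_emails`, `emails`, `recent_email_keys`, and `multiget` are hypothetical and are not part of any Cassandra client API.

```python
# Toy in-memory stand-ins for two column families (not a real Cassandra client).
# user_emails: user id -> {timestamp: email row key}
# emails:      email row key -> columns
user_emails = {
    "user42": {1001: "email:a", 1002: "email:b", 1003: "email:c", 1004: "email:d"},
}
emails = {
    "email:a": {"subject": "hello"},
    "email:b": {"subject": "lunch?"},
    "email:c": {"subject": "re: lunch?"},
    "email:d": {"subject": "status"},
}

def recent_email_keys(user_id, page_size):
    """Slice only the newest `page_size` index columns from the user's row."""
    index_row = user_emails.get(user_id, {})
    newest_first = sorted(index_row, reverse=True)[:page_size]
    return [index_row[ts] for ts in newest_first]

def multiget(keys):
    """One batched lookup for all keys, standing in for a single multiget call."""
    return {k: emails[k] for k in keys if k in emails}

page_keys = recent_email_keys("user42", page_size=2)
page = multiget(page_keys)
```

Only the keys for one page are fetched, so the number of row lookups is bounded by the page size rather than by the size of the inbox.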

>...
>
>
> OK, so if I do it this way, the # of keys rapidly goes into the billions. Does that not cause other problems?

Not generally.  Cassandra is built to handle enormous numbers of rows
efficiently.

>Seems like many more data/index files....
>

Only if you aren't compacting for some reason.

b

RE: How to increase cassandra's performance in read?

Posted by Mark Jones <MJ...@imagehawk.com>.

When I look at this arrangement, I see one lookup by key for the user, followed by a large read for all the "email indexes"  (these are all columns in the same row, right?)

Then one lookup by key for each email....  Seems very seek intensive.


Would a better way be to index each email with a key of

UserID:ConvoID:Time

And then use the Order Preserving Partitioner?

That way I could at least use a get_range and the inbox is clustered together which should greatly shorten the amount of time seeking for keys.
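
A rough sketch of why ordered composite keys make the inbox a contiguous range. It assumes keys of the form UserID:ConvoID:Time are stored in sorted order, as an order-preserving partitioner would keep them; the list `row_keys` and the helper `range_scan` are hypothetical illustrations, not Cassandra calls.

```python
import bisect

# Row keys kept in sorted order. Times are zero-padded so that string order
# matches numeric order.
row_keys = sorted([
    "user1:convo1:0000001000",
    "user1:convo2:0000001005",
    "user2:convo1:0000000990",
    "user2:convo1:0000001010",
    "user2:convo3:0000001020",
    "user3:convo1:0000001001",
])

def range_scan(prefix):
    """Return every key starting with `prefix` using two binary searches."""
    start = bisect.bisect_left(row_keys, prefix)
    # "\uffff" sorts after any character used in these keys.
    end = bisect.bisect_right(row_keys, prefix + "\uffff")
    return row_keys[start:end]

inbox = range_scan("user2:")
one_conversation = range_scan("user2:convo1:")
```

All keys for one user sit next to each other, so the scan touches one contiguous slice instead of seeking once per email.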

However if I rolled all the inbox details into each column  (subject/date/sender/flags), I would only have to seek when I want to display the entire message.....

Hmmm, definitely presents a different way to think of things.

OK, so if I do it this way, the # of keys rapidly goes into the billions. Does that not cause other problems?  Seems like many more data/index files....


-----Original Message-----
From: Benjamin Black [mailto:b@b3k.us]
Sent: Tuesday, April 20, 2010 1:00 PM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

I can't answer for its sanity, but I would not do it that way.  I'd
have a CF for Emails, with 1 email per row, and another CF for
UserEmails with per-user index rows referencing the Emails rows.


b


Re: How to increase cassandra's performance in read?

Posted by Benjamin Black <b...@b3k.us>.

I can't answer for its sanity, but I would not do it that way.  I'd
have a CF for Emails, with 1 email per row, and another CF for
UserEmails with per-user index rows referencing the Emails rows.
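
A small sketch of the write path for this two-column-family layout, using Python dicts as stand-ins. The column names (`subject`, `body`, `sent_at`) and the helper `store_email` are assumptions made for illustration only.

```python
emails = {}        # Emails CF: one row per email, keyed by email id
user_emails = {}   # UserEmails CF: one index row per user, column name = send time

def store_email(email_id, recipient, sent_at, subject, body):
    """Write the email row, then add a reference to it in the user's index row."""
    emails[email_id] = {"subject": subject, "body": body, "sent_at": sent_at}
    user_emails.setdefault(recipient, {})[sent_at] = email_id

store_email("e1", "alice", 100, "hi", "first message")
store_email("e2", "alice", 200, "re: hi", "second message")
store_email("e3", "bob", 150, "note", "for bob")
```

The index row stays small because it holds only references, while the bulky message bodies live in their own rows.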


b

On Tue, Apr 20, 2010 at 9:44 AM, Mark Jones <MJ...@imagehawk.com> wrote:
> To make sure I'm clear on what you are saying:
>
>  Are the "Individual Emails" in the example below, Supercolumns and the {body, header, tags...} the subcolumns?
>
> Is that a sane data layout for an email system?  Where the Supercolumn identifier is the "conversation label"
>
> Sorry to be so daft, but the way columns and rows are bandied about in NoSQL is a bit confusing when you are coming from a SQL background.  I can't see why you would want multiple emails in the same row since they each have the same "columns" of information and therefore make good logical entities as outlined below.

RE: How to increase cassandra's performance in read?

Posted by Mark Jones <MJ...@imagehawk.com>.

To make sure I'm clear on what you are saying:

  Are the "Individual Emails" in the example below, Supercolumns and the {body, header, tags...} the subcolumns?

Is that a sane data layout for an email system?  Where the Supercolumn identifier is the "conversation label"

Sorry to be so daft, but the way columns and rows are bandied about in NoSQL is a bit confusing when you are coming from a SQL background.  I can't see why you would want multiple emails in the same row since they each have the same "columns" of information and therefore make good logical entities as outlined below.

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com]
Sent: Tuesday, April 20, 2010 11:16 AM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

Not all the data associated w/ the key is brought into memory, just
all the data associated w/ the supercolumns being queried.

Supercolumns are so you can update a smallish number of subcolumns
independently (e.g. when denormalizing an entire narrow row, usually
with a finite set of columns).  If you want lots of subcolumns you
need to turn that supercolumn into a new row.


Re: How to increase cassandra's performance in read?

Posted by Jonathan Ellis <jb...@gmail.com>.

Not all the data associated w/ the key is brought into memory, just
all the data associated w/ the supercolumns being queried.

Supercolumns are so you can update a smallish number of subcolumns
independently (e.g. when denormalizing an entire narrow row, usually
with a finite set of columns).  If you want lots of subcolumns you
need to turn that supercolumn into a new row.
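
One way to picture "turn that supercolumn into a new row" is the transformation below. It is only a sketch on nested dicts; the composite key format `row:supercolumn` and the function `promote_supercolumns` are assumptions, not something prescribed in this thread.

```python
# Super column family modelled as row key -> supercolumn name -> subcolumns.
super_cf = {
    "user42": {
        "email1": {"subject": "hello", "flags": "read"},
        "email2": {"subject": "lunch?", "flags": "unread"},
    },
}

def promote_supercolumns(super_cf, sep=":"):
    """Turn each (row key, supercolumn) pair into its own standard row."""
    standard_cf = {}
    for row_key, supercolumns in super_cf.items():
        for sc_name, subcolumns in supercolumns.items():
            standard_cf[f"{row_key}{sep}{sc_name}"] = dict(subcolumns)
    return standard_cf

standard_cf = promote_supercolumns(super_cf)
```

After the change, each former supercolumn is an ordinary row whose columns can be read individually.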

On Tue, Apr 20, 2010 at 11:08 AM, Mark Jones <MJ...@imagehawk.com> wrote:
> When I first read this, it bothered me because it seemed like it couldn't be so.  So I read the link, and it says the whole thing, so I have to ask for some clarification here.
>
> I had always assumed a super column was similar to a local keyspace, and that the SubColumns under it were similar to keys, that way you could localize the data for a user or a website.
>
> So Keyspace:Email
>  Key:UserID
>     SuperColumn Entries:
>        Individual Email 1:  Columns {body, header, tags, recipients, flags, whatever}
>        Individual Email 2:  Columns {body, header, tags, recipients, flags, whatever}
>        Individual Email 3:  Columns {body, header, tags, recipients, flags, whatever}
>
> I think now this is probably the wrong concept.
>
> It is really more like:
>        Primary Key: Name:Value pairs
>
> And with Supercolumns, the Value part can be another Hash:
>        Primary Key: Name: {Name:Value pairs} pairs
>
> But when I look up by Primary Key, ALL of the data associated with the key will be brought into memory!  So, if I wanted to display the inbox of a user with several years of email, it would be one HUGE read to suck his entire inbox into memory to get down to the point I could display one message.
>
> Is this more correct?

RE: How to increase cassandra's performance in read?

Posted by Mark Jones <MJ...@imagehawk.com>.

When I first read this, it bothered me because it seemed like it couldn't be so.  So I read the link, and it says the whole thing, so I have to ask for some clarification here.

I had always assumed a super column was similar to a local keyspace, and that the SubColumns under it were similar to keys, that way you could localize the data for a user or a website.

So Keyspace:Email
  Key:UserID
     SuperColumn Entries:
        Individual Email 1:  Columns {body, header, tags, recipients, flags, whatever}
        Individual Email 2:  Columns {body, header, tags, recipients, flags, whatever}
        Individual Email 3:  Columns {body, header, tags, recipients, flags, whatever}

I think now this is probably the wrong concept.

It is really more like:
        Primary Key: Name:Value pairs

And with Supercolumns, the Value part can be another Hash:
        Primary Key: Name: {Name:Value pairs} pairs

But when I look up by Primary Key, ALL of the data associated with the key will be brought into memory!  So, if I wanted to display the inbox of a user with several years of email, it would be one HUGE read to suck his entire inbox into memory to get down to the point I could display one message.

Is this more correct?

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com]
Sent: Tuesday, April 20, 2010 10:47 AM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

How many columns are in the supercolumn total?

"in super columnfamilies there is a third level of subcolumns; these
are not indexed, and any request for a subcolumn deserializes _all_
the subcolumns in that supercolumn"

http://wiki.apache.org/cassandra/CassandraLimitations


RE: How to increase cassandra's performance in read?

Posted by Mark Jones <MJ...@imagehawk.com>.

Sorry, I didn't answer your question in my response. At this point I have:


Key(ID)
    When/Where SuperColumn Tag:  and Columns {Data: One Value (not yet written, tags, flags)}


Under some keys (very small #) there will be 2 values like:

Key(ID)
    When/Where SuperColumn Tag:  and Columns {Data: One Value (not yet written, tags, flags)}
    When/Where SuperColumn Tag:  and Columns {Data: One Value (not yet written, tags, flags)}
    Long term, this list will be in the thousands, possibly millions.

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com]
Sent: Tuesday, April 20, 2010 10:47 AM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

How many columns are in the supercolumn total?

"in super columnfamilies there is a third level of subcolumns; these
are not indexed, and any request for a subcolumn deserializes _all_
the subcolumns in that supercolumn"

http://wiki.apache.org/cassandra/CassandraLimitations


Re: How to increase cassandra's performance in read?

Posted by Jonathan Ellis <jb...@gmail.com>.

How many columns are in the supercolumn total?

"in super columnfamilies there is a third level of subcolumns; these
are not indexed, and any request for a subcolumn deserializes _all_
the subcolumns in that supercolumn"

http://wiki.apache.org/cassandra/CassandraLimitations
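
To make the quoted limitation concrete, here is a toy model of reading one subcolumn when the subcolumns are not indexed. It is an illustration of the cost pattern only, not Cassandra's actual read path; `read_subcolumn` and the byte-string encoding are assumptions.

```python
def read_subcolumn(supercolumn, name):
    """Return (value, number of subcolumns deserialized) for one lookup."""
    deserialized = 0
    decoded = {}
    # No index: every serialized entry is walked and decoded,
    # even though only one value is wanted.
    for sub_name, raw in supercolumn:
        decoded[sub_name] = raw.decode("utf-8")
        deserialized += 1
    return decoded.get(name), deserialized

small = [(f"c{i}", b"x") for i in range(3)]
large = [(f"c{i}", b"x") for i in range(10_000)]

small_result = read_subcolumn(small, "c1")
large_result = read_subcolumn(large, "c9999")
```

In this model the work per lookup grows with the total number of subcolumns, which is why the question of how many columns the supercolumn holds matters.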

On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones <MJ...@imagehawk.com> wrote:
> I too am seeing very slow performance while testing worst case scenarios of
> 1 key leading to 1 supercolumn and 1 column beyond that.
>
>
>
> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>
>
>
> Drive utilization is 80-90% and I’m only dealing with 50-70 million rows
> (with NO swapping).  So far, I’ve found nothing that helps, including
> increasing the key cache from 200k to 500k keys; I'm guessing the hashing
> prevents better cache performance.
>
>
>
> Read performance is definitely not 3 IOs based on the utilization factors on
> my drives.  I'm not sure the issue was ever settled in the previous e-mails
> as to how to calculate how many IOs were being done for each read.  I've
> been testing with clusters of 1, 2, 3, or 4 machines, and so far all I’m seeing
> with multiple machines is lower performance in a cluster than alone.  I
> keep assuming that at some number of nodes, the performance will begin to
> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
> the fastest performer on inserts, but definitely not the fastest on reads.
>
>
>
> I'm suspecting the read path is relying heavily on the fact that you want to
> get many columns that are closely related, because lookup by key appears to
> be incredibly slow.

RE: How to increase cassandra's performance in read?

Posted by Mark Jones <MJ...@imagehawk.com>.

I too am seeing very slow performance while testing worst case scenarios of 1 key leading to 1 supercolumn and 1 column beyond that.

Key -> SuperColumn -> 1 Column (of ~ 500 bytes)

Drive utilization is 80-90% and I'm only dealing with 50-70 million rows (with NO swapping).  So far, I've found nothing that helps, including increasing the key cache from 200k to 500k keys; I'm guessing the hashing prevents better cache performance.

Read performance is definitely not 3 IOs based on the utilization factors on my drives.  I'm not sure the issue was ever settled in the previous e-mails as to how to calculate how many IOs were being done for each read.  I've been testing with clusters of 1, 2, 3, or 4 machines, and so far all I'm seeing with multiple machines is lower performance in a cluster than alone.  I keep assuming that at some number of nodes, the performance will begin to pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is the fastest performer on inserts, but definitely not the fastest on reads.

I'm suspecting the read path is relying heavily on the fact that you want to get many columns that are closely related, because lookup by key appears to be incredibly slow.

From: yangfeng [mailto:yeahyf@gmail.com]
Sent: Tuesday, April 20, 2010 7:59 AM
To: user@cassandra.apache.org; dev@cassandra.apache.org
Subject: How to increase cassandra's performance in read?

I get 10 column families by key, and one column family has 30 columns.
I use multigetSlice once to get the 10 column families, but the performance is very poor.
Does anyone have other ideas for improving the performance?