Posted to user@cassandra.apache.org by Renato Bacelar da Silveira <re...@indabamobile.co.za> on 2011/08/31 11:13:30 UTC

Cassandra Reads a little slow - 900 keys takes 4 seconds.

Hi All

I am running a query against a node with about 50 Column Families.

At present, one of the column families has 2,502,000 rows, and each
row contains 100 columns.

I am searching for 3 specific columns, and am doing so with Thrift's
multiget_slice(). I prepare a request with about 900 row keys, each
asking for a slice of the same 3 columns.
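
For reference, here is a minimal sketch of the call I am making, using
the Python bindings generated from cassandra.thrift (the keyspace,
column family, column names, host and port are placeholders, not my
real ones):

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import ColumnParent, SlicePredicate, \
    ConsistencyLevel

# Standard framed-transport Thrift connection to a single node.
socket = TSocket.TSocket('localhost', 9160)
transport = TTransport.TFramedTransport(socket)
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()
client.set_keyspace('MyKeyspace')

keys = ['row%06d' % i for i in range(900)]    # ~900 row keys
parent = ColumnParent(column_family='MyCF')
predicate = SlicePredicate(column_names=['col_a', 'col_b', 'col_c'])

# One round trip asking for the same 3 named columns from each of the
# 900 rows; returns a map of {row key: [ColumnOrSuperColumn, ...]}.
result = client.multiget_slice(keys, parent, predicate,
                               ConsistencyLevel.ONE)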

My average time to return from multiget_slice() is about 4 seconds.
I performed a comparable query in MySQL, and the results were returned
to me in 0.75 seconds on average.

Is 4 seconds way too much time for Cassandra? I am sure this could
be under 1 second, like MySQL.

I have resized the Thrift transport size down to just 1MB so as not
to encounter any timeouts, which I understand can happen if you push
too many queries through at once. Is this a correct assumption?

So is it too much to push 900 keys through a multiget_slice() at
once? I read that it does a concurrent fetch. I can understand threads
racing for cycles and causing waits, but I suspect I am going wrong
somewhere.

Regards to ALL!



Renato da Silveira
Senior Developer
www.indabamobile.co.za



-- 

Re: Cassandra Reads a little slow - 900 keys takes 4 seconds.

Posted by Renato Bacelar da Silveira <re...@indabamobile.co.za>.
Thank you Yang and Dan

I have done all the tweaking that I can, and have even looked at wire
latency as a contributor to the 4 seconds, but that is the figure I
keep coming up with.

Concerning the memory issue mentioned, the machine is working fine.
There is a lot of fluctuation on the Cassandra heap itself, but no
out-of-memory errors yet, even under full-blown query loads such as
200 threads issuing 10,000 queries each, so things seem to be stable.

But if one thread issues a single query with 900 keys, as described in
my original message, on a CF with 2.5 million rows, the query takes 4
seconds. A comparable query in MySQL, with similar data in both the
query string and the tables, averaged 0.75 seconds against the same
machine in the cluster. So wire latency was not the issue, and I think
hardware was not either.

I will do some more tweaking, and when the result time gets to
something comparable I will post my findings.

Regards to ALL!


-- 

Re: Cassandra Reads a little slow - 900 keys takes 4 seconds.

Posted by Yang <te...@gmail.com>.
You might also want to check whether it's due to disk seeking.

You can verify this by increasing your memory/heap size, or by writing
your files to a RAM disk (tmpfs).
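
A quick sketch of what I mean, assuming the pycassa client and
placeholder keyspace/CF/column names: time the same 900-key multiget
twice, cold and then warm.

import time
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'MyCF')
keys = ['row%06d' % i for i in range(900)]

# The second run should be served from the OS page cache / key cache,
# so a large cold-vs-warm gap points at seek-bound reads rather than
# at Cassandra itself.
for run in ('cold', 'warm'):
    start = time.time()
    cf.multiget(keys, columns=['col_a', 'col_b', 'col_c'])
    print '%s run: %.2f seconds' % (run, time.time() - start)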




Re: Cassandra Reads a little slow - 900 keys takes 4 seconds.

Posted by Dan Kuebrich <da...@gmail.com>.
There might be some tuning you can do--key cache, etc.--though I can't
speak to that in your particular case, and with 50 column families
you'd probably run into pretty bad memory limits.

However, having found myself in a similar situation in the past, you
might consider experimentally trying different batch sizes for the
number of rows (e.g. 1 request for 900 rows vs. 9 requests for 100
rows each, etc.). This has helped me solve timeout problems when
retrieving "large" numbers of rows, and it reduced overall retrieval
time. I know that at least the pycassa client supports this type of
batched multiget out of the box; see the sketch below.
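
A rough sketch of the idea, assuming pycassa and placeholder names
(the batch size is the knob to experiment with; if memory serves,
multiget's buffer_size argument will also do this splitting for you):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'MyCF')

keys = ['row%06d' % i for i in range(900)]
wanted = ['col_a', 'col_b', 'col_c']

# Split the 900 keys into smaller multigets: try 1x900, 9x100, 18x50,
# etc., and time each combination.
results = {}
batch_size = 100
for i in range(0, len(keys), batch_size):
    results.update(cf.multiget(keys[i:i + batch_size], columns=wanted))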
