Posted to user@cassandra.apache.org by Keith Wright <kw...@nanigans.com> on 2013/04/19 19:05:33 UTC

Batch get queries

Hi all,

I am using C* 1.2.4 with CQL3 and Astyanax to consume a large amount of user-based data (around 50-100K / sec).  Requests come in based on user cookies, which I then need to link to a user (as users can change their cookies).  This is done using a link table:

CREATE TABLE cookie_user_lookup (
    cookie TEXT PRIMARY KEY,
    user_id BIGINT,
    creation_time TIMESTAMP
) WITH compression = {'crc_check_chance': 0.1, 'sstable_compression': 'LZ4Compressor'}
  AND compaction = {'class': 'LeveledCompactionStrategy'}
  AND gc_grace_seconds = 86400;

As I said, I am handling a large number of these per second and wanted to get your take on how best to do the lookup.  I find that there are 3 ways:

 *   Serially fetch 1 by 1.  The latency is very low at 0.1 ms, but multiplying that by thousands per second becomes substantial.  This is too slow.
 *   Serially fetch 1 by 1 but on separate threads.  This would require a very large number of concurrent connections (unless I change to DataStax's binary protocol) as well as threads.  Seems heavy.
 *   Batch fetch.  This is what I'm doing now, where I build a very large select * from cookie_user_lookup where cookie in (a, b, c, ...).  I am actually doing around 10K of these at a time and getting a response time in my cluster of around 100 ms.  This is very acceptable, but I wanted to get everyone's take as I have seen messages about this "starving" the request pool.  Note that I'm running in HSHA and am rarely seeing any reads waiting.
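
For illustration, the batch fetch option can also be issued as several smaller IN queries instead of one 10K-key statement. The following is a minimal Python sketch of that splitting step only, not Astyanax code: `chunked`, `build_in_query`, and `plan_lookups` are hypothetical helper names, the `?` bind markers assume a client that supports prepared statements, and the batch size of 100 is an arbitrary starting point rather than a recommendation.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(keys: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `keys`, each holding at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def build_in_query(n_keys: int) -> str:
    """Build a CQL SELECT with one bind marker per cookie.

    The cookie column is selected so each returned row can be matched
    back to the cookie that was asked for.
    """
    placeholders = ", ".join(["?"] * n_keys)
    return (
        "SELECT cookie, user_id FROM cookie_user_lookup "
        f"WHERE cookie IN ({placeholders})"
    )


def plan_lookups(cookies: Sequence[str], batch_size: int) -> List[Tuple[str, List[str]]]:
    """Return (query, bound values) pairs, one pair per batch."""
    return [
        (build_in_query(len(chunk)), list(chunk))
        for chunk in chunked(cookies, batch_size)
    ]


if __name__ == "__main__":
    # 250 made-up cookies split into batches of at most 100 keys.
    plan = plan_lookups([f"cookie-{i}" for i in range(250)], batch_size=100)
    for query, values in plan:
        print(len(values), "keys:", query[:60], "...")
```

Each (query, values) pair would then be handed to whatever client call actually executes the statement; that call is deliberately left out here because it depends on the driver in use.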

I appreciate your input!

Re: Batch get queries

Posted by aaron morton <aa...@thelastpickle.com>.
> This is very acceptable but wanted to get everyone's take as I have seen messages about this "starving" the request pool. 
The issue with sending large multi gets or batch mutations is that it can reduce overall request throughput. Every row in your 10K multi get becomes RF number of tasks that are placed into the read thread pools. If these pools are full servicing one request (which is more likely with smaller clusters), they are not servicing requests from other clients. 
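
As a rough back-of-the-envelope illustration of the estimate above, the sketch below multiplies rows per request by the replication factor. The thread does not state the RF or the cluster size, so the values used here (RF = 3, 6 nodes) are assumptions for the sake of the arithmetic, and the function names are hypothetical.

```python
def replica_read_tasks(rows_per_request: int, replication_factor: int) -> int:
    """Read tasks generated by one multi get, using rows * RF as the estimate."""
    return rows_per_request * replication_factor


def tasks_per_node(rows_per_request: int, replication_factor: int, nodes: int) -> float:
    """Average share of those tasks per node, assuming an even key distribution."""
    return replica_read_tasks(rows_per_request, replication_factor) / nodes


if __name__ == "__main__":
    rows = 10_000  # size of the multi get described in the thread
    rf = 3         # ASSUMPTION: replication factor is not given in the thread
    nodes = 6      # ASSUMPTION: cluster size is not given in the thread
    print("read tasks per request:", replica_read_tasks(rows, rf))
    print("average tasks per node:", tasks_per_node(rows, rf, nodes))
```

Under these assumed values, a single request queues far more tasks on each node than a read pool can run at once, which is why one large request can delay reads from other clients.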

Additionally, large requests are more likely to upset the delicate flower that is JVM GC. 

10K feels like a lot to me. I would run a test to see the overall throughput for a single thread at 100, 200, 400, 800, etc. rows per request. At some point the gains in overall throughput for that one client will drop off. 
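
A test along these lines could be sketched as follows. This is a minimal single-threaded harness in Python, not a definitive benchmark: `measure_throughput` is a hypothetical name, and `fetch` stands in for whatever client call actually runs the IN query. The `fake_fetch` stand-in simply sleeps, and its cost model (a fixed per-request overhead plus a per-row cost) is an assumption made so the script runs without a cluster.

```python
import time
from typing import Callable, Dict, Sequence


def measure_throughput(
    fetch: Callable[[Sequence[str]], object],
    keys: Sequence[str],
    batch_sizes: Sequence[int],
) -> Dict[int, float]:
    """Return rows per second for each batch size, using a single thread."""
    results: Dict[int, float] = {}
    for size in batch_sizes:
        start = time.perf_counter()
        # Issue the whole key set as consecutive requests of `size` keys.
        for i in range(0, len(keys), size):
            fetch(keys[i:i + size])
        elapsed = time.perf_counter() - start
        results[size] = len(keys) / elapsed if elapsed > 0 else float("inf")
    return results


def fake_fetch(batch: Sequence[str]) -> None:
    """Stand-in for a real query: fixed overhead plus an assumed per-row cost."""
    time.sleep(0.001 + 0.00001 * len(batch))


if __name__ == "__main__":
    keys = [f"cookie-{i}" for i in range(2000)]
    for size, rate in measure_throughput(fake_fetch, keys, [100, 200, 400, 800]).items():
        print(f"{size:>4} rows/request: {rate:,.0f} rows/sec")
```

Replacing `fake_fetch` with the real lookup and running against the actual cluster, ideally while other clients are active, would show where the per-client gains level off.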

Cheers
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
