Posted to user@cassandra.apache.org by Shu Zhang <sz...@mediosystems.com> on 2011/01/18 01:53:58 UTC

please help with multiget

Here's the method declaration for quick reference:
map<string,list<ColumnOrSuperColumn>> multiget_slice(string keyspace, list<string> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)

It looks like you must have the same SlicePredicate for every key in your batch retrieval, so what are you supposed to do when you need to retrieve different columns for different keys? I mean, it seems like to fully take advantage of Cassandra's data structure, you often want to put dynamic data as column names, and different rows may have totally different column names. That's pretty standard practice, right? Then it seems like you should be able to batch get-requests mapping different SlicePredicates to different keys in an efficient way.

The only way I can think of to retrieve different columns for different keys (besides breaking them into individual requests) is to set the SlicePredicate so that you retrieve entire rows and then filter them on the client side... but that seems a little inefficient and a bit of a pain. Is that what people do? I can see this not being TOO much more inefficient since a single row is always kept together physically.
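
A minimal sketch of the "fetch whole rows, filter on the client" workaround described above. It assumes the multiget result has already been turned into plain dicts (row key -> column name -> value); filter_rows and the sample data are hypothetical and not part of any Cassandra client.

```python
def filter_rows(rows, wanted):
    """Keep, for each row key, only the columns requested for that key.

    rows:   row key -> {column name: value}, standing in for a multiget result
    wanted: row key -> set of column names wanted for that key
    """
    return {
        key: {
            name: value
            for name, value in rows.get(key, {}).items()
            if name in wanted_names
        }
        for key, wanted_names in wanted.items()
    }


# Whole rows, as they might come back from one multiget with a "select all" predicate.
full_rows = {
    "key1": {"colA": "1", "colB": "2", "colC": "3"},
    "key2": {"colX": "7", "colY": "8"},
}

# Different columns are wanted for different keys.
wanted = {"key1": {"colA"}, "key2": {"colX", "colY"}}

filtered = filter_rows(full_rows, wanted)
```

The cost is that every column of every row crosses the wire before the filter runs, which is the inefficiency the paragraph above worries about.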

I haven't found a lot of other complaints about this, so maybe I am missing something. But a get request takes a key and a column path, so it seems like a batch-get should allow you to specify any combination of key-columnPath or key-slicePredicate pairs. I mean, intuitive design-wise, for any batch operation, it makes sense to allow for batching together any number of corresponding non-batch operations. I.e., if I can make a non-batch get request for (key1, colName1), and I can make a non-batch get request for (key2, colName2), then I should be able to make a batch request for (key1, colName1) and (key2, colName2).

Furthermore, a batch-get method signature like

map<string,list<ColumnOrSuperColumn>> multiget_slice(string keyspace, map<string, list<SlicePredicate>> mutation_map, ConsistencyLevel consistency_level)

looks a lot more symmetrical to the batch_mutate method
void batch_mutate(string keyspace, map<string,map<string,list<Mutation>>> mutation_map, ConsistencyLevel consistency_level)
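
With the existing signature, a per-key request map can be approximated on the client by grouping keys that share the same set of columns and issuing one multiget per group. The sketch below assumes a hypothetical callable multiget_slice(keys, columns) that returns row key -> {column: value}; it is a stand-in for a real client call, and fake_multiget_slice only simulates it against an in-memory dict.

```python
from collections import defaultdict


def multiget_by_predicate(multiget_slice, predicate_map):
    """predicate_map: row key -> iterable of column names wanted for that key.

    Groups keys that want the same columns, calls multiget_slice once per
    group, and merges the per-group results into one dict.
    """
    groups = defaultdict(list)
    for key, columns in predicate_map.items():
        groups[tuple(sorted(columns))].append(key)

    merged = {}
    for columns, keys in groups.items():
        merged.update(multiget_slice(keys, columns))
    return merged


# In-memory stand-in for a column family (hypothetical data).
STORE = {
    "key1": {"colA": "1", "colB": "2"},
    "key2": {"colA": "3"},
    "key3": {"colX": "7", "colY": "8", "colZ": "9"},
}
calls = []  # records each simulated round trip


def fake_multiget_slice(keys, columns):
    calls.append((tuple(keys), columns))
    return {k: {c: STORE[k][c] for c in columns if c in STORE.get(k, {})} for k in keys}


result = multiget_by_predicate(
    fake_multiget_slice,
    {"key1": ["colA"], "key2": ["colA"], "key3": ["colX", "colY"]},
)
```

This costs one round trip per distinct predicate rather than one per key, so it only helps when many keys share the same column set.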

Thoughts? 

Thanks,
Shu

Re: please help with multiget

Posted by Edward Capriolo <ed...@gmail.com>.

On Tue, Jan 18, 2011 at 4:29 PM, Shu Zhang <sz...@mediosystems.com> wrote:
> Well, I don't think what I'm describing is complicated semantics. I think I've described general batch operation design and something that is symmetrical to the batch_mutate method already on the Cassandra API. You are right, I can solve the problem with further denormalization, and the approach of making individual gets in parallel as described by Brandon will work too. I'll be doing one of these for now. But I think neither is as efficient, and I guess I'm still not sure why the multiget is designed the way it is.
>
> The problem with denormalization is you gotta make multiple row writes in place of one, adding load to the server, adding required physical space and losing atomicity on write operations. I know writes are cheap in Cassandra, and you can catch failed writes and retry so these problems are not major, but it still seems clear that having a batch-get that works appropriately is at least a little better...

multiget_slice is very useful, IMHO. In my testing, the round-trip time
for 1000 get requests all being acked individually is much higher than
the round-trip time for 200 get_slice requests grouped 5 at a time.
Anyone who needs that type of access is in good shape.
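
One way to read the comparison above is 1000 keys fetched one per request versus the same 1000 keys fetched in batches of 5. The sketch below only counts round trips under that reading; it does not measure latency, and chunked is a hypothetical helper rather than part of any Cassandra client.

```python
def chunked(keys, size):
    """Yield consecutive slices of `keys`, each holding at most `size` items."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


keys = ["row%d" % i for i in range(1000)]

# One request per key versus one multiget per batch of 5 keys.
individual_round_trips = len(keys)
batches = list(chunked(keys, 5))
batched_round_trips = len(batches)
```

Each batch would become the `keys` argument of a single multiget_slice call, so the per-request overhead is paid 200 times instead of 1000.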

I was also theorizing that a CF using RowCache with very, very high
read rate would benefit from "pooling" a bunch of reads together with
multiget.

I do agree that the first time I looked at the multiget_slice
signature I realized I couldn't do many of the things I was expecting
from a multi-get.

RE: please help with multiget

Posted by Shu Zhang <sz...@mediosystems.com>.

Well, I don't think what I'm describing is complicated semantics. I think I've described general batch operation design and something that is symmetrical to the batch_mutate method already on the Cassandra API. You are right, I can solve the problem with further denormalization, and the approach of making individual gets in parallel as described by Brandon will work too. I'll be doing one of these for now. But I think neither is as efficient, and I guess I'm still not sure why the multiget is designed the way it is.

The problem with denormalization is you gotta make multiple row writes in place of one, adding load to the server, adding required physical space and losing atomicity on write operations. I know writes are cheap in Cassandra, and you can catch failed writes and retry so these problems are not major, but it still seems clear that having a batch-get that works appropriately is at least a little better...
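
To make the write-side cost concrete, here is a rough sketch of how one logical write turns into two row writes once the data is denormalized. The map has the same shape as the batch_mutate argument (row key -> column family -> list of mutations), but the column family names, the sample values, and the use of plain tuples in place of Mutation objects are all assumptions for illustration.

```python
def denormalized_mutations(user, item, value):
    """Build a batch_mutate-shaped map that stores one fact under two rows.

    Shape: row key -> column family -> list of (column name, value) pairs.
    "ItemsByUser" and "UsersByItem" are hypothetical column family names.
    """
    return {
        user: {"ItemsByUser": [(item, value)]},
        item: {"UsersByItem": [(user, value)]},
    }


mutation_map = denormalized_mutations("user42", "item7", "5 stars")

# One logical write now touches two rows, each of which can fail independently.
rows_written = len(mutation_map)
```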
________________________________________
From: Aaron Morton [aaron@thelastpickle.com]
Sent: Tuesday, January 18, 2011 12:55 PM
To: user@cassandra.apache.org
Subject: Re: please help with multiget

I think the general approach is to denormalise data to remove the need for complicated semantics when reading.

Aaron

Re: please help with multiget

Posted by Aaron Morton <aa...@thelastpickle.com>.

I think the general approach is to denormalise data to remove the need for complicated semantics when reading. 

Aaron

On 19/01/2011, at 7:57 AM, Shu Zhang <sz...@mediosystems.com> wrote:

> Well, maybe making a batch-get is not any more efficient on the server side, but without it you can get bottlenecked on client-server connections and client resources. If the number of requests you want to batch is on the order of the number of connections in your pool, then yes, making gets in parallel is as good or maybe better. But what if you want to batch thousands of requests?
>
> The server I can scale out; I would want to get my requests there without needing to wait for connections on my client to free up.
>
> I just don't really understand the reasoning for designing multiget_slice the way it is. I still think if you're gonna have a batch-get request (multiget_slice), you should be able to add to the batch a reasonable number of ANY corresponding non-batch get requests. And you can't do that... Plus, it's not symmetrical to the batch-mutate. Is there a good reason for that?

RE: please help with multiget

Posted by Shu Zhang <sz...@mediosystems.com>.

Well, maybe making a batch-get is not any more efficient on the server side, but without it you can get bottlenecked on client-server connections and client resources. If the number of requests you want to batch is on the order of the number of connections in your pool, then yes, making gets in parallel is as good or maybe better. But what if you want to batch thousands of requests?

The server I can scale out; I would want to get my requests there without needing to wait for connections on my client to free up.

I just don't really understand the reasoning for designing multiget_slice the way it is. I still think if you're gonna have a batch-get request (multiget_slice), you should be able to add to the batch a reasonable number of ANY corresponding non-batch get requests. And you can't do that... Plus, it's not symmetrical to the batch-mutate. Is there a good reason for that?
________________________________________
From: Brandon Williams [driftx@gmail.com]
Sent: Monday, January 17, 2011 5:09 PM
To: user@cassandra.apache.org
Cc: hector-users@googlegroups.com
Subject: Re: please help with multiget

On Mon, Jan 17, 2011 at 6:53 PM, Shu Zhang <sz...@mediosystems.com> wrote:
Here's the method declaration for quick reference:
map<string,list<ColumnOrSuperColumn>> multiget_slice(string keyspace, list<string> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)

It looks like you must have the same SlicePredicate for every key in your batch retrieval, so what are you supposed to do when you need to retrieve different columns for different keys?

Issue multiple gets in parallel yourself. Keep in mind that multiget is not an optimization; in fact, it can work against you when one key exceeds the rpc timeout, because you get nothing back.

-Brandon

Re: please help with multiget

Posted by Brandon Williams <dr...@gmail.com>.

On Mon, Jan 17, 2011 at 6:53 PM, Shu Zhang <sz...@mediosystems.com> wrote:

> Here's the method declaration for quick reference:
> map<string,list<ColumnOrSuperColumn>> multiget_slice(string keyspace,
> list<string> keys, ColumnParent column_parent, SlicePredicate predicate,
> ConsistencyLevel consistency_level)
>
> It looks like you must have the same SlicePredicate for every key in your
> batch retrieval, so what are you supposed to do when you need to retrieve
> different columns for different keys?

Issue multiple gets in parallel yourself. Keep in mind that multiget is not
an optimization; in fact, it can work against you when one key exceeds the
rpc timeout, because you get nothing back.
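
A minimal sketch of issuing the gets in parallel from the client, using Python's standard thread pool. The get_slice callable is a hypothetical stand-in for a single-key client call, and fake_get_slice simulates one key timing out; the point is that a failure on one key is recorded separately instead of discarding the results for the other keys.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_get(get_slice, requests, max_workers=8):
    """Run one get per key concurrently.

    requests:  row key -> list of column names wanted for that key
    get_slice: callable(key, columns) -> {column: value} (hypothetical client call)
    Returns (results, errors), both keyed by row key.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(get_slice, key, columns): key
            for key, columns in requests.items()
        }
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # keep going; record the failure per key
                errors[key] = exc
    return results, errors


def fake_get_slice(key, columns):
    # Simulated backend: one key always times out.
    if key == "slow_key":
        raise TimeoutError("simulated rpc timeout")
    return {column: key + ":" + column for column in columns}


results, errors = parallel_get(
    fake_get_slice,
    {"key1": ["colA"], "key2": ["colX", "colY"], "slow_key": ["colA"]},
)
```

max_workers plays the role of the client connection pool size here, so thousands of requests would still queue behind it on the client.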

-Brandon

Re: please help with multiget

Posted by Aaron Morton <aa...@thelastpickle.com>.

If you can provide some more information on a specific use case we may be able to help with the modelling. 

The general approach is to denormalise the data to the point where each request/activity/feature in your application results in a call to get data from one or more rows in one CF. It's not always possible; it's just the goal I use when modelling. I also lean towards making fewer calls that return more data, rather than more calls that return the exact amount of data. IMHO, additional filtering and ordering on the client side will reduce server load at scale.

You may be able to use a multiget for a super_column for multiple rows, which will return the super columns and their (potentially different) lists of columns. Or, if the rows have only a few standard columns, pull back all columns for the rows.

Hope that helps.
Aaron

