You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br> on 2014/06/20 02:56:25 UTC

Best way to do a multi_get using CQL

I was taking a look at Cassandra anti-patterns list:

http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html

Among then is

SELECT ... IN or index lookups¶
<http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html?scroll=archPlanAntiPattern__AntiPatMultiGet>

SELECT ... IN and index lookups (formerly secondary indexes) should be
avoided except for specific scenarios. See *When not to use IN* in SELECT
<http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html>
 and *When not to use an index* in Indexing
<http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_primary_index_c.html>
 in
*CQL for Cassandra 2.0*"

And Looking at the SELECT doc, I saw:
When *not* to use IN¶
<http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html?scroll=reference_ds_d35_v2q_xj__selectInNot>
The recommendations about when not to use an index
<http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html>
 apply to using IN in the WHERE clause. Under most conditions, using IN in
the WHERE clause is not recommended. Using IN can degrade performance
because usually many nodes must be queried. For example, in a single, local
data center cluster having 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes being
queried are most likely even higher, up to 20 nodes depending on where the
keys fall in the token range."

In my system, I have a column family called "entity_lookup":

CREATE KEYSPACE IF NOT EXISTS Identification1
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
  'DC1' : 3 };
USE Identification1;

CREATE TABLE IF NOT EXISTS entity_lookup (
  name varchar,
  value varchar,
  entity_id uuid,
  PRIMARY KEY ((name, value), entity_id));

And I use the following select to query it:

SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)

Is this an anti-pattern?

If not using SELECT IN, which other way would you recomend for lookups like
that? I have several values I would like to search in cassandra and they
might not be in the same particion, as above.

Is Cassandra the wrong tool for lookups like that?

Best regards,
Marcelo Valle.

Re: Best way to do a multi_get using CQL

Posted by DuyHai Doan <do...@gmail.com>.

"The bad design part (just my opinion, no intention to offend) is not allow
the possibility of sending batches directly to the data nodes, without
using a coordinator."

 Well it's normal that it's not possible.

 What is a batch ? It's a bunch of insert/update/delete statements put
together. Now each individual statement can operate on a different
partition key. In the end, sending all of those statements to a single node
will result ultimately by a subsequent dispatch of each statement to the
right replicas, depending on which range of tokens each partition key is
hashed to.

 The only case I can see where all the statements in a batch are
effectively executed by a single node is when they share the same partition
key.




On Fri, Jun 20, 2014 at 9:41 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> I forgot to add that each connection can handle multiple simultaneous
> queries.  This was part of the original protocol as of C* 1.2:
> http://www.datastax.com/dev/blog/binary-protocol
>
> Asynchronous: each connection can handle more than one active request
> at the same time. In practice, this means that client libraries will
> only need to maintain a relatively low amount of open connections to a
> given Cassandra node to achieve good performance. This particularly
> matters with Cassandra where a client usually wants to keep connection
> to all (or at least a good part of) the nodes of the Cluster and so
> having a low number of per-node connections helps scaling to large
> clusters.
> Technically, this is achieved by giving each messages a stream ID, and
> by having responses to a request preserve the request’s stream ID.
> Clients can thus send multiple requests with different stream IDs on
> the same connection (i.e. without waiting for the response to a
> request to send the next one) while still being able to associate each
> received response to the right request, even if said responses comes
> in a different order than the one in which requests were submitted.
> That asynchronicity is of course optional in the sense that a client
> library can still choose to use the protocol in a synchronous way if
> that is simpler.
>
> On Fri, Jun 20, 2014 at 12:30 PM, Jeremy Jongsma <je...@barchart.com>
> wrote:
> > There is nothing preventing that in Cassandra, it's just a matter of how
> > intelligent the driver API is. Submit a feature request to Astyanax or
> > Datastax driver projects.
> >
> >
> > On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle
> > <ma...@s1mbi0se.com.br> wrote:
> >>
> >> The bad design part (just my opinion, no intention to offend) is not
> allow
> >> the possibility of sending batches directly to the data nodes, without
> using
> >> a coordinator.
> >> I would choose that option.
> >> []s
> >>
> >>
> >> 2014-06-20 16:05 GMT-03:00 DuyHai Doan <do...@gmail.com>:
> >>>
> >>> Well it's kind of a trade-off.
> >>>
> >>>  Either you send data directly to the primary replica nodes to take
> >>> advantage of data-locality using token-aware strategy and the price to
> pay
> >>> is a high number of opened connections from client side.
> >>>
> >>> Or you just batch data to a random node playing the coordinator role to
> >>> dispatch requests to the right nodes. The price to pay is then spike
> load on
> >>> 1 node (the coordinator) and intra-cluster bandwdith usage.
> >>>
> >>>  The choice is yours, it has nothing to do with good or bad design.
> >>>
> >>>
> >>> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle
> >>> <ma...@s1mbi0se.com.br> wrote:
> >>>>
> >>>> I am using python + CQL Driver.
> >>>> I wonder how they do...
> >>>> These things seems little important, but they are fundamental to get a
> >>>> good performance in Cassandra...
> >>>> I wish there was a simpler way to query in batches. Opening a large
> >>>> amount of connections and sending 1 message at a time seems bad to
> me, as
> >>>> sometimes you want to work with small rows.
> >>>> It's no surprise Cassandra performs better when we use average row
> >>>> sizes. But honestly I disagree with this part of Cassandra/Driver's
> design.
> >>>> []s
> >>>>
> >>>>
> >>>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
> >>>>
> >>>>> That depends on the connection pooling implementation in your driver.
> >>>>> Astyanax will keep N connections open to each node (configurable)
> and route
> >>>>> each query in a separate message over an existing connection,
> waiting until
> >>>>> one becomes available if all are in use.
> >>>>>
> >>>>>
> >>>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle
> >>>>> <ma...@s1mbi0se.com.br> wrote:
> >>>>>>
> >>>>>> A question, not sure if you guys know the answer:
> >>>>>> Supose I async query 1000 rows using token aware and suppose I have
> 10
> >>>>>> nodes. Suppose also each node would receive 100 row queries each.
> >>>>>> How does async work in this case? Would it send each row query to
> each
> >>>>>> node in a different connection? Different message?
> >>>>>> I guess if there was a way to use batch with async, once you commit
> >>>>>> the batch for the 1000 queries, it would create 1 connection to
> each host
> >>>>>> and query 100 rows in a single message to each host.
> >>>>>> This would decrease resource usage, am I wrong?
> >>>>>>
> >>>>>> []s
> >>>>>>
> >>>>>>
> >>>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
> >>>>>>
> >>>>>>> I've found that if you have any amount of latency between your
> client
> >>>>>>> and nodes, and you are executing a large batch of queries, you'll
> usually
> >>>>>>> want to send them together to one node unless execution time is of
> no
> >>>>>>> concern. The tradeoff is resource usage on the connected node vs.
> time to
> >>>>>>> complete all the queries, because you'll need fewer client -> node
> network
> >>>>>>> round trips.
> >>>>>>>
> >>>>>>> With large numbers of queries you will still want to make sure you
> >>>>>>> split them into manageable batches before sending them, to control
> memory
> >>>>>>> usage on the executing node. I've been limiting queries to batches
> of 100
> >>>>>>> keys in scenarios like this.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael
> >>>>>>> <mi...@nytimes.com> wrote:
> >>>>>>>>
> >>>>>>>> However my extensive benchmarking this week of the python driver
> >>>>>>>> from master shows a performance decrease when using 'token_aware'.
> >>>>>>>>
> >>>>>>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
> >>>>>>>>
> >>>>>>>> Also why do the work the coordinator will do for you: send all the
> >>>>>>>> queries, wait for everything to come back in whatever order, and
> sort the
> >>>>>>>> result.
> >>>>>>>>
> >>>>>>>> I would rather keep my app code simple.
> >>>>>>>>
> >>>>>>>> But the real point is that you should benchmark in your own
> >>>>>>>> environment.
> >>>>>>>>
> >>>>>>>> ml
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle
> >>>>>>>> <ma...@s1mbi0se.com.br> wrote:
> >>>>>>>>>
> >>>>>>>>> Yes, I am using the CQL datastax drivers.
> >>>>>>>>> It was a good advice, thanks a lot Janathan.
> >>>>>>>>> []s
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
> >>>>>>>>>
> >>>>>>>>>> The only case in which it might be better to use an IN clause is
> >>>>>>>>>> if
> >>>>>>>>>> the entire query can be satisfied from that machine.  Otherwise,
> >>>>>>>>>> go
> >>>>>>>>>> async.
> >>>>>>>>>>
> >>>>>>>>>> The native driver reuses connections and intelligently manages
> the
> >>>>>>>>>> pool for you.  It can also multiplex queries over a single
> >>>>>>>>>> connection.
> >>>>>>>>>>
> >>>>>>>>>> I am assuming you're using one of the datastax drivers for CQL,
> >>>>>>>>>> btw.
> >>>>>>>>>>
> >>>>>>>>>> Jon
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
> >>>>>>>>>> <ma...@s1mbi0se.com.br> wrote:
> >>>>>>>>>> > This is interesting, I didn't know that!
> >>>>>>>>>> > It might make sense then to use select = + async + token
> aware,
> >>>>>>>>>> > I will try
> >>>>>>>>>> > to change my code.
> >>>>>>>>>> >
> >>>>>>>>>> > But would it be a "recomended solution" for these cases? Any
> >>>>>>>>>> > other options?
> >>>>>>>>>> >
> >>>>>>>>>> > I still would if this is the right use case for Cassandra, to
> >>>>>>>>>> > look for
> >>>>>>>>>> > random keys in a huge cluster. After all, the amount of
> >>>>>>>>>> > connections to
> >>>>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a
> problem?
> >>>>>>>>>> > Or when you use async the driver reuses the connection?
> >>>>>>>>>> >
> >>>>>>>>>> > []s
> >>>>>>>>>> >
> >>>>>>>>>> >
> >>>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jon@jonhaddad.com
> >:
> >>>>>>>>>> >
> >>>>>>>>>> >> If you use async and your driver is token aware, it will go
> to
> >>>>>>>>>> >> the
> >>>>>>>>>> >> proper node, rather than requiring the coordinator to do so.
> >>>>>>>>>> >>
> >>>>>>>>>> >> Realistically you're going to have a connection open to every
> >>>>>>>>>> >> server
> >>>>>>>>>> >> anyways.  It's the difference between you querying for the
> data
> >>>>>>>>>> >> directly and using a coordinator as a proxy.  It's faster to
> >>>>>>>>>> >> just ask
> >>>>>>>>>> >> the node with the data.
> >>>>>>>>>> >>
> >>>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
> >>>>>>>>>> >> <ma...@s1mbi0se.com.br> wrote:
> >>>>>>>>>> >> > But using async queries wouldn't be even worse than using
> >>>>>>>>>> >> > SELECT IN?
> >>>>>>>>>> >> > The justification in the docs is I could query many nodes,
> >>>>>>>>>> >> > but I would
> >>>>>>>>>> >> > still
> >>>>>>>>>> >> > do it.
> >>>>>>>>>> >> >
> >>>>>>>>>> >> > Today, I use both async queries AND SELECT IN:
> >>>>>>>>>> >> >
> >>>>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " +
> >>>>>>>>>> >> > ENTITY_LOOKUP + "
> >>>>>>>>>> >> > WHERE
> >>>>>>>>>> >> > name=%s and value in(%s)"
> >>>>>>>>>> >> >
> >>>>>>>>>> >> > for name, values in identifiers.items():
> >>>>>>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
> >>>>>>>>>> >> > ','.join(['%s']*len(values)))
> >>>>>>>>>> >> >    args = [name] + values
> >>>>>>>>>> >> >    query_msg = query % tuple(args)
> >>>>>>>>>> >> >    futures.append((query_msg,
> >>>>>>>>>> >> > self.session.execute_async(query, args)))
> >>>>>>>>>> >> >
> >>>>>>>>>> >> > for query_msg, future in futures:
> >>>>>>>>>> >> >    try:
> >>>>>>>>>> >> >       rows = future.result(timeout=100000)
> >>>>>>>>>> >> >       for row in rows:
> >>>>>>>>>> >> >         entity_ids.add(row.entity_id)
> >>>>>>>>>> >> >    except:
> >>>>>>>>>> >> >       logging.error("Query '%s' returned ERROR " %
> >>>>>>>>>> >> > (query_msg))
> >>>>>>>>>> >> >       raise
> >>>>>>>>>> >> >
> >>>>>>>>>> >> > Using async just with select = would mean instead of 1
> async
> >>>>>>>>>> >> > query
> >>>>>>>>>> >> > (example:
> >>>>>>>>>> >> > in (0, 1, 2)), I would do several, one for each value of
> >>>>>>>>>> >> > "values" array
> >>>>>>>>>> >> > above.
> >>>>>>>>>> >> > In my head, this would mean more connections to Cassandra
> and
> >>>>>>>>>> >> > the same
> >>>>>>>>>> >> > amount of work, right? What would be the advantage?
> >>>>>>>>>> >> >
> >>>>>>>>>> >> > []s
> >>>>>>>>>> >> >
> >>>>>>>>>> >> >
> >>>>>>>>>> >> >
> >>>>>>>>>> >> >
> >>>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad
> >>>>>>>>>> >> > <jo...@jonhaddad.com>:
> >>>>>>>>>> >> >
> >>>>>>>>>> >> >> Your other option is to fire off async queries.  It's
> pretty
> >>>>>>>>>> >> >> straightforward w/ the java or python drivers.
> >>>>>>>>>> >> >>
> >>>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
> >>>>>>>>>> >> >> <ma...@s1mbi0se.com.br> wrote:
> >>>>>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > Among then is
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > SELECT ... IN or index lookups¶
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary
> >>>>>>>>>> >> >> > indexes) should
> >>>>>>>>>> >> >> > be
> >>>>>>>>>> >> >> > avoided except for specific scenarios. See When not to
> use
> >>>>>>>>>> >> >> > IN in
> >>>>>>>>>> >> >> > SELECT
> >>>>>>>>>> >> >> > and
> >>>>>>>>>> >> >> > When not to use an index in Indexing in
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > CQL for Cassandra 2.0"
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > And Looking at the SELECT doc, I saw:
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > When not to use IN¶
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > The recommendations about when not to use an index apply
> >>>>>>>>>> >> >> > to using IN
> >>>>>>>>>> >> >> > in
> >>>>>>>>>> >> >> > the
> >>>>>>>>>> >> >> > WHERE clause. Under most conditions, using IN in the
> WHERE
> >>>>>>>>>> >> >> > clause is
> >>>>>>>>>> >> >> > not
> >>>>>>>>>> >> >> > recommended. Using IN can degrade performance because
> >>>>>>>>>> >> >> > usually many
> >>>>>>>>>> >> >> > nodes
> >>>>>>>>>> >> >> > must be queried. For example, in a single, local data
> >>>>>>>>>> >> >> > center cluster
> >>>>>>>>>> >> >> > having
> >>>>>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency
> >>>>>>>>>> >> >> > level of
> >>>>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes,
> >>>>>>>>>> >> >> > but if the
> >>>>>>>>>> >> >> > query
> >>>>>>>>>> >> >> > uses the IN condition, the number of nodes being queried
> >>>>>>>>>> >> >> > are most
> >>>>>>>>>> >> >> > likely
> >>>>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys
> >>>>>>>>>> >> >> > fall in the
> >>>>>>>>>> >> >> > token
> >>>>>>>>>> >> >> > range."
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > In my system, I have a column family called
> >>>>>>>>>> >> >> > "entity_lookup":
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
> >>>>>>>>>> >> >> >   WITH REPLICATION = { 'class' :
> >>>>>>>>>> >> >> > 'NetworkTopologyStrategy',
> >>>>>>>>>> >> >> >   'DC1' : 3 };
> >>>>>>>>>> >> >> > USE Identification1;
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
> >>>>>>>>>> >> >> >   name varchar,
> >>>>>>>>>> >> >> >   value varchar,
> >>>>>>>>>> >> >> >   entity_id uuid,
> >>>>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > And I use the following select to query it:
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and
> >>>>>>>>>> >> >> > value in(%s)
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > Is this an anti-pattern?
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > If not using SELECT IN, which other way would you
> recomend
> >>>>>>>>>> >> >> > for
> >>>>>>>>>> >> >> > lookups
> >>>>>>>>>> >> >> > like
> >>>>>>>>>> >> >> > that? I have several values I would like to search in
> >>>>>>>>>> >> >> > cassandra and
> >>>>>>>>>> >> >> > they
> >>>>>>>>>> >> >> > might not be in the same particion, as above.
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> > Best regards,
> >>>>>>>>>> >> >> > Marcelo Valle.
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >> >
> >>>>>>>>>> >> >>
> >>>>>>>>>> >> >>
> >>>>>>>>>> >> >>
> >>>>>>>>>> >> >> --
> >>>>>>>>>> >> >> Jon Haddad
> >>>>>>>>>> >> >> http://www.rustyrazorblade.com
> >>>>>>>>>> >> >> skype: rustyrazorblade
> >>>>>>>>>> >> >
> >>>>>>>>>> >> >
> >>>>>>>>>> >>
> >>>>>>>>>> >>
> >>>>>>>>>> >>
> >>>>>>>>>> >> --
> >>>>>>>>>> >> Jon Haddad
> >>>>>>>>>> >> http://www.rustyrazorblade.com
> >>>>>>>>>> >> skype: rustyrazorblade
> >>>>>>>>>> >
> >>>>>>>>>> >
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Jon Haddad
> >>>>>>>>>> http://www.rustyrazorblade.com
> >>>>>>>>>> skype: rustyrazorblade
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> skype: rustyrazorblade
>

Re: Best way to do a multi_get using CQL

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

I forgot to add that each connection can handle multiple simultaneous
queries.  This was part of the original protocol as of C* 1.2:
http://www.datastax.com/dev/blog/binary-protocol

Asynchronous: each connection can handle more than one active request
at the same time. In practice, this means that client libraries will
only need to maintain a relatively low amount of open connections to a
given Cassandra node to achieve good performance. This particularly
matters with Cassandra where a client usually wants to keep connection
to all (or at least a good part of) the nodes of the Cluster and so
having a low number of per-node connections helps scaling to large
clusters.
Technically, this is achieved by giving each messages a stream ID, and
by having responses to a request preserve the request’s stream ID.
Clients can thus send multiple requests with different stream IDs on
the same connection (i.e. without waiting for the response to a
request to send the next one) while still being able to associate each
received response to the right request, even if said responses comes
in a different order than the one in which requests were submitted.
That asynchronicity is of course optional in the sense that a client
library can still choose to use the protocol in a synchronous way if
that is simpler.

On Fri, Jun 20, 2014 at 12:30 PM, Jeremy Jongsma <je...@barchart.com> wrote:
> There is nothing preventing that in Cassandra, it's just a matter of how
> intelligent the driver API is. Submit a feature request to Astyanax or
> Datastax driver projects.
>
>
> On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle
> <ma...@s1mbi0se.com.br> wrote:
>>
>> The bad design part (just my opinion, no intention to offend) is not allow
>> the possibility of sending batches directly to the data nodes, without using
>> a coordinator.
>> I would choose that option.
>> []s
>>
>>
>> 2014-06-20 16:05 GMT-03:00 DuyHai Doan <do...@gmail.com>:
>>>
>>> Well it's kind of a trade-off.
>>>
>>>  Either you send data directly to the primary replica nodes to take
>>> advantage of data-locality using token-aware strategy and the price to pay
>>> is a high number of opened connections from client side.
>>>
>>> Or you just batch data to a random node playing the coordinator role to
>>> dispatch requests to the right nodes. The price to pay is then spike load on
>>> 1 node (the coordinator) and intra-cluster bandwdith usage.
>>>
>>>  The choice is yours, it has nothing to do with good or bad design.
>>>
>>>
>>> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle
>>> <ma...@s1mbi0se.com.br> wrote:
>>>>
>>>> I am using python + CQL Driver.
>>>> I wonder how they do...
>>>> These things seems little important, but they are fundamental to get a
>>>> good performance in Cassandra...
>>>> I wish there was a simpler way to query in batches. Opening a large
>>>> amount of connections and sending 1 message at a time seems bad to me, as
>>>> sometimes you want to work with small rows.
>>>> It's no surprise Cassandra performs better when we use average row
>>>> sizes. But honestly I disagree with this part of Cassandra/Driver's design.
>>>> []s
>>>>
>>>>
>>>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>>>>
>>>>> That depends on the connection pooling implementation in your driver.
>>>>> Astyanax will keep N connections open to each node (configurable) and route
>>>>> each query in a separate message over an existing connection, waiting until
>>>>> one becomes available if all are in use.
>>>>>
>>>>>
>>>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle
>>>>> <ma...@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>> A question, not sure if you guys know the answer:
>>>>>> Supose I async query 1000 rows using token aware and suppose I have 10
>>>>>> nodes. Suppose also each node would receive 100 row queries each.
>>>>>> How does async work in this case? Would it send each row query to each
>>>>>> node in a different connection? Different message?
>>>>>> I guess if there was a way to use batch with async, once you commit
>>>>>> the batch for the 1000 queries, it would create 1 connection to each host
>>>>>> and query 100 rows in a single message to each host.
>>>>>> This would decrease resource usage, am I wrong?
>>>>>>
>>>>>> []s
>>>>>>
>>>>>>
>>>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>>>>>>
>>>>>>> I've found that if you have any amount of latency between your client
>>>>>>> and nodes, and you are executing a large batch of queries, you'll usually
>>>>>>> want to send them together to one node unless execution time is of no
>>>>>>> concern. The tradeoff is resource usage on the connected node vs. time to
>>>>>>> complete all the queries, because you'll need fewer client -> node network
>>>>>>> round trips.
>>>>>>>
>>>>>>> With large numbers of queries you will still want to make sure you
>>>>>>> split them into manageable batches before sending them, to control memory
>>>>>>> usage on the executing node. I've been limiting queries to batches of 100
>>>>>>> keys in scenarios like this.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael
>>>>>>> <mi...@nytimes.com> wrote:
>>>>>>>>
>>>>>>>> However my extensive benchmarking this week of the python driver
>>>>>>>> from master shows a performance decrease when using 'token_aware'.
>>>>>>>>
>>>>>>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>>>>
>>>>>>>> Also why do the work the coordinator will do for you: send all the
>>>>>>>> queries, wait for everything to come back in whatever order, and sort the
>>>>>>>> result.
>>>>>>>>
>>>>>>>> I would rather keep my app code simple.
>>>>>>>>
>>>>>>>> But the real point is that you should benchmark in your own
>>>>>>>> environment.
>>>>>>>>
>>>>>>>> ml
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle
>>>>>>>> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>>>
>>>>>>>>> Yes, I am using the CQL datastax drivers.
>>>>>>>>> It was a good advice, thanks a lot Janathan.
>>>>>>>>> []s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>>>>
>>>>>>>>>> The only case in which it might be better to use an IN clause is
>>>>>>>>>> if
>>>>>>>>>> the entire query can be satisfied from that machine.  Otherwise,
>>>>>>>>>> go
>>>>>>>>>> async.
>>>>>>>>>>
>>>>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>>>>> pool for you.  It can also multiplex queries over a single
>>>>>>>>>> connection.
>>>>>>>>>>
>>>>>>>>>> I am assuming you're using one of the datastax drivers for CQL,
>>>>>>>>>> btw.
>>>>>>>>>>
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>>>>> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>>>> > This is interesting, I didn't know that!
>>>>>>>>>> > It might make sense then to use select = + async + token aware,
>>>>>>>>>> > I will try
>>>>>>>>>> > to change my code.
>>>>>>>>>> >
>>>>>>>>>> > But would it be a "recomended solution" for these cases? Any
>>>>>>>>>> > other options?
>>>>>>>>>> >
>>>>>>>>>> > I still would if this is the right use case for Cassandra, to
>>>>>>>>>> > look for
>>>>>>>>>> > random keys in a huge cluster. After all, the amount of
>>>>>>>>>> > connections to
>>>>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>>>>>> > Or when you use async the driver reuses the connection?
>>>>>>>>>> >
>>>>>>>>>> > []s
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>>>>> >
>>>>>>>>>> >> If you use async and your driver is token aware, it will go to
>>>>>>>>>> >> the
>>>>>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>>>>>> >>
>>>>>>>>>> >> Realistically you're going to have a connection open to every
>>>>>>>>>> >> server
>>>>>>>>>> >> anyways.  It's the difference between you querying for the data
>>>>>>>>>> >> directly and using a coordinator as a proxy.  It's faster to
>>>>>>>>>> >> just ask
>>>>>>>>>> >> the node with the data.
>>>>>>>>>> >>
>>>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>>>>> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>>>> >> > But using async queries wouldn't be even worse than using
>>>>>>>>>> >> > SELECT IN?
>>>>>>>>>> >> > The justification in the docs is I could query many nodes,
>>>>>>>>>> >> > but I would
>>>>>>>>>> >> > still
>>>>>>>>>> >> > do it.
>>>>>>>>>> >> >
>>>>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>>>>> >> >
>>>>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " +
>>>>>>>>>> >> > ENTITY_LOOKUP + "
>>>>>>>>>> >> > WHERE
>>>>>>>>>> >> > name=%s and value in(%s)"
>>>>>>>>>> >> >
>>>>>>>>>> >> > for name, values in identifiers.items():
>>>>>>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>>>>>>>>> >> > ','.join(['%s']*len(values)))
>>>>>>>>>> >> >    args = [name] + values
>>>>>>>>>> >> >    query_msg = query % tuple(args)
>>>>>>>>>> >> >    futures.append((query_msg,
>>>>>>>>>> >> > self.session.execute_async(query, args)))
>>>>>>>>>> >> >
>>>>>>>>>> >> > for query_msg, future in futures:
>>>>>>>>>> >> >    try:
>>>>>>>>>> >> >       rows = future.result(timeout=100000)
>>>>>>>>>> >> >       for row in rows:
>>>>>>>>>> >> >         entity_ids.add(row.entity_id)
>>>>>>>>>> >> >    except:
>>>>>>>>>> >> >       logging.error("Query '%s' returned ERROR " %
>>>>>>>>>> >> > (query_msg))
>>>>>>>>>> >> >       raise
>>>>>>>>>> >> >
>>>>>>>>>> >> > Using async just with select = would mean instead of 1 async
>>>>>>>>>> >> > query
>>>>>>>>>> >> > (example:
>>>>>>>>>> >> > in (0, 1, 2)), I would do several, one for each value of
>>>>>>>>>> >> > "values" array
>>>>>>>>>> >> > above.
>>>>>>>>>> >> > In my head, this would mean more connections to Cassandra and
>>>>>>>>>> >> > the same
>>>>>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>>>>>> >> >
>>>>>>>>>> >> > []s
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad
>>>>>>>>>> >> > <jo...@jonhaddad.com>:
>>>>>>>>>> >> >
>>>>>>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>>>>> >> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > Among then is
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > SELECT ... IN or index lookups¶
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary
>>>>>>>>>> >> >> > indexes) should
>>>>>>>>>> >> >> > be
>>>>>>>>>> >> >> > avoided except for specific scenarios. See When not to use
>>>>>>>>>> >> >> > IN in
>>>>>>>>>> >> >> > SELECT
>>>>>>>>>> >> >> > and
>>>>>>>>>> >> >> > When not to use an index in Indexing in
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > CQL for Cassandra 2.0"
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > And Looking at the SELECT doc, I saw:
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > When not to use IN¶
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > The recommendations about when not to use an index apply
>>>>>>>>>> >> >> > to using IN
>>>>>>>>>> >> >> > in
>>>>>>>>>> >> >> > the
>>>>>>>>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE
>>>>>>>>>> >> >> > clause is
>>>>>>>>>> >> >> > not
>>>>>>>>>> >> >> > recommended. Using IN can degrade performance because
>>>>>>>>>> >> >> > usually many
>>>>>>>>>> >> >> > nodes
>>>>>>>>>> >> >> > must be queried. For example, in a single, local data
>>>>>>>>>> >> >> > center cluster
>>>>>>>>>> >> >> > having
>>>>>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency
>>>>>>>>>> >> >> > level of
>>>>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes,
>>>>>>>>>> >> >> > but if the
>>>>>>>>>> >> >> > query
>>>>>>>>>> >> >> > uses the IN condition, the number of nodes being queried
>>>>>>>>>> >> >> > are most
>>>>>>>>>> >> >> > likely
>>>>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys
>>>>>>>>>> >> >> > fall in the
>>>>>>>>>> >> >> > token
>>>>>>>>>> >> >> > range."
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > In my system, I have a column family called
>>>>>>>>>> >> >> > "entity_lookup":
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>>>>> >> >> >   WITH REPLICATION = { 'class' :
>>>>>>>>>> >> >> > 'NetworkTopologyStrategy',
>>>>>>>>>> >> >> >   'DC1' : 3 };
>>>>>>>>>> >> >> > USE Identification1;
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>>> >> >> >   name varchar,
>>>>>>>>>> >> >> >   value varchar,
>>>>>>>>>> >> >> >   entity_id uuid,
>>>>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > And I use the following select to query it:
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and
>>>>>>>>>> >> >> > value in(%s)
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > If not using SELECT IN, which other way would you recomend
>>>>>>>>>> >> >> > for
>>>>>>>>>> >> >> > lookups
>>>>>>>>>> >> >> > like
>>>>>>>>>> >> >> > that? I have several values I would like to search in
>>>>>>>>>> >> >> > cassandra and
>>>>>>>>>> >> >> > they
>>>>>>>>>> >> >> > might not be in the same particion, as above.
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > Best regards,
>>>>>>>>>> >> >> > Marcelo Valle.
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> --
>>>>>>>>>> >> >> Jon Haddad
>>>>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>>>>> >> >> skype: rustyrazorblade
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >> --
>>>>>>>>>> >> Jon Haddad
>>>>>>>>>> >> http://www.rustyrazorblade.com
>>>>>>>>>> >> skype: rustyrazorblade
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jon Haddad
>>>>>>>>>> http://www.rustyrazorblade.com
>>>>>>>>>> skype: rustyrazorblade
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade

Re: Best way to do a multi_get using CQL

Posted by Jeremy Jongsma <je...@barchart.com>.

There is nothing preventing that in Cassandra, it's just a matter of how
intelligent the driver API is. Submit a feature request to Astyanax or
Datastax driver projects.


On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle <
marcelo@s1mbi0se.com.br> wrote:

> The bad design part (just my opinion, no intention to offend) is not allow
> the possibility of sending batches directly to the data nodes, without
> using a coordinator.
> I would choose that option.
> []s
>
>
> 2014-06-20 16:05 GMT-03:00 DuyHai Doan <do...@gmail.com>:
>
>> Well it's kind of a trade-off.
>>
>>  Either you send data directly to the primary replica nodes to take
>> advantage of data-locality using token-aware strategy and the price to pay
>> is a high number of opened connections from client side.
>>
>> Or you just batch data to a random node playing the coordinator role to
>> dispatch requests to the right nodes. The price to pay is then spike load
>> on 1 node (the coordinator) and intra-cluster bandwdith usage.
>>
>>  The choice is yours, it has nothing to do with good or bad design.
>>
>>
>> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle <
>> marcelo@s1mbi0se.com.br> wrote:
>>
>>> I am using python + CQL Driver.
>>> I wonder how they do...
>>> These things seems little important, but they are fundamental to get a
>>> good performance in Cassandra...
>>> I wish there was a simpler way to query in batches. Opening a large
>>> amount of connections and sending 1 message at a time seems bad to me, as
>>> sometimes you want to work with small rows.
>>> It's no surprise Cassandra performs better when we use average row
>>> sizes. But honestly I disagree with this part of Cassandra/Driver's design.
>>> []s
>>>
>>>
>>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>>>
>>> That depends on the connection pooling implementation in your driver.
>>>> Astyanax will keep N connections open to each node (configurable) and route
>>>> each query in a separate message over an existing connection, waiting until
>>>> one becomes available if all are in use.
>>>>
>>>>
>>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
>>>> marcelo@s1mbi0se.com.br> wrote:
>>>>
>>>>> A question, not sure if you guys know the answer:
>>>>> Supose I async query 1000 rows using token aware and suppose I have 10
>>>>> nodes. Suppose also each node would receive 100 row queries each.
>>>>> How does async work in this case? Would it send each row query to each
>>>>> node in a different connection? Different message?
>>>>> I guess if there was a way to use batch with async, once you commit
>>>>> the batch for the 1000 queries, it would create 1 connection to each host
>>>>> and query 100 rows in a single message to each host.
>>>>> This would decrease resource usage, am I wrong?
>>>>>
>>>>> []s
>>>>>
>>>>>
>>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>>>>>
>>>>> I've found that if you have any amount of latency between your client
>>>>>> and nodes, and you are executing a large batch of queries, you'll usually
>>>>>> want to send them together to one node unless execution time is of no
>>>>>> concern. The tradeoff is resource usage on the connected node vs. time to
>>>>>> complete all the queries, because you'll need fewer client -> node network
>>>>>> round trips.
>>>>>>
>>>>>> With large numbers of queries you will still want to make sure you
>>>>>> split them into manageable batches before sending them, to control memory
>>>>>> usage on the executing node. I've been limiting queries to batches of 100
>>>>>> keys in scenarios like this.
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>>>>>> michael.laing@nytimes.com> wrote:
>>>>>>
>>>>>>> However my extensive benchmarking this week of the python driver
>>>>>>> from master shows a performance *decrease* when using 'token_aware'.
>>>>>>>
>>>>>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>>>
>>>>>>> Also why do the work the coordinator will do for you: send all the
>>>>>>> queries, wait for everything to come back in whatever order, and sort the
>>>>>>> result.
>>>>>>>
>>>>>>> I would rather keep my app code simple.
>>>>>>>
>>>>>>> But the real point is that you should benchmark in your own
>>>>>>> environment.
>>>>>>>
>>>>>>> ml
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>>>>>> marcelo@s1mbi0se.com.br> wrote:
>>>>>>>
>>>>>>>> Yes, I am using the CQL datastax drivers.
>>>>>>>> It was a good advice, thanks a lot Janathan.
>>>>>>>> []s
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>>>
>>>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>>>>>>> async.
>>>>>>>>>
>>>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>>>> pool for you.  It can also multiplex queries over a single
>>>>>>>>> connection.
>>>>>>>>>
>>>>>>>>> I am assuming you're using one of the datastax drivers for CQL,
>>>>>>>>> btw.
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>>>> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>>> > This is interesting, I didn't know that!
>>>>>>>>> > It might make sense then to use select = + async + token aware,
>>>>>>>>> I will try
>>>>>>>>> > to change my code.
>>>>>>>>> >
>>>>>>>>> > But would it be a "recomended solution" for these cases? Any
>>>>>>>>> other options?
>>>>>>>>> >
>>>>>>>>> > I still would if this is the right use case for Cassandra, to
>>>>>>>>> look for
>>>>>>>>> > random keys in a huge cluster. After all, the amount of
>>>>>>>>> connections to
>>>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>>>>> > Or when you use async the driver reuses the connection?
>>>>>>>>> >
>>>>>>>>> > []s
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>>>> >
>>>>>>>>> >> If you use async and your driver is token aware, it will go to
>>>>>>>>> the
>>>>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>>>>> >>
>>>>>>>>> >> Realistically you're going to have a connection open to every
>>>>>>>>> server
>>>>>>>>> >> anyways.  It's the difference between you querying for the data
>>>>>>>>> >> directly and using a coordinator as a proxy.  It's faster to
>>>>>>>>> just ask
>>>>>>>>> >> the node with the data.
>>>>>>>>> >>
>>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>>>> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>>> >> > But using async queries wouldn't be even worse than using
>>>>>>>>> SELECT IN?
>>>>>>>>> >> > The justification in the docs is I could query many nodes,
>>>>>>>>> but I would
>>>>>>>>> >> > still
>>>>>>>>> >> > do it.
>>>>>>>>> >> >
>>>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>>>> >> >
>>>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " +
>>>>>>>>> ENTITY_LOOKUP + "
>>>>>>>>> >> > WHERE
>>>>>>>>> >> > name=%s and value in(%s)"
>>>>>>>>> >> >
>>>>>>>>> >> > for name, values in identifiers.items():
>>>>>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>>>>>>>> >> > ','.join(['%s']*len(values)))
>>>>>>>>> >> >    args = [name] + values
>>>>>>>>> >> >    query_msg = query % tuple(args)
>>>>>>>>> >> >    futures.append((query_msg,
>>>>>>>>> self.session.execute_async(query, args)))
>>>>>>>>> >> >
>>>>>>>>> >> > for query_msg, future in futures:
>>>>>>>>> >> >    try:
>>>>>>>>> >> >       rows = future.result(timeout=100000)
>>>>>>>>> >> >       for row in rows:
>>>>>>>>> >> >         entity_ids.add(row.entity_id)
>>>>>>>>> >> >    except:
>>>>>>>>> >> >       logging.error("Query '%s' returned ERROR " %
>>>>>>>>> (query_msg))
>>>>>>>>> >> >       raise
>>>>>>>>> >> >
>>>>>>>>> >> > Using async just with select = would mean instead of 1 async
>>>>>>>>> query
>>>>>>>>> >> > (example:
>>>>>>>>> >> > in (0, 1, 2)), I would do several, one for each value of
>>>>>>>>> "values" array
>>>>>>>>> >> > above.
>>>>>>>>> >> > In my head, this would mean more connections to Cassandra and
>>>>>>>>> the same
>>>>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>>>>> >> >
>>>>>>>>> >> > []s
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jon@jonhaddad.com
>>>>>>>>> >:
>>>>>>>>> >> >
>>>>>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>>>>> >> >>
>>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>>>> >> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Among then is
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > SELECT ... IN or index lookups¶
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary
>>>>>>>>> indexes) should
>>>>>>>>> >> >> > be
>>>>>>>>> >> >> > avoided except for specific scenarios. See When not to use
>>>>>>>>> IN in
>>>>>>>>> >> >> > SELECT
>>>>>>>>> >> >> > and
>>>>>>>>> >> >> > When not to use an index in Indexing in
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > CQL for Cassandra 2.0"
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > And Looking at the SELECT doc, I saw:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > When not to use IN¶
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > The recommendations about when not to use an index apply
>>>>>>>>> to using IN
>>>>>>>>> >> >> > in
>>>>>>>>> >> >> > the
>>>>>>>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE
>>>>>>>>> clause is
>>>>>>>>> >> >> > not
>>>>>>>>> >> >> > recommended. Using IN can degrade performance because
>>>>>>>>> usually many
>>>>>>>>> >> >> > nodes
>>>>>>>>> >> >> > must be queried. For example, in a single, local data
>>>>>>>>> center cluster
>>>>>>>>> >> >> > having
>>>>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency
>>>>>>>>> level of
>>>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes,
>>>>>>>>> but if the
>>>>>>>>> >> >> > query
>>>>>>>>> >> >> > uses the IN condition, the number of nodes being queried
>>>>>>>>> are most
>>>>>>>>> >> >> > likely
>>>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys
>>>>>>>>> fall in the
>>>>>>>>> >> >> > token
>>>>>>>>> >> >> > range."
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > In my system, I have a column family called
>>>>>>>>> "entity_lookup":
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>>>>>> >> >> >   'DC1' : 3 };
>>>>>>>>> >> >> > USE Identification1;
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>> >> >> >   name varchar,
>>>>>>>>> >> >> >   value varchar,
>>>>>>>>> >> >> >   entity_id uuid,
>>>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > And I use the following select to query it:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and
>>>>>>>>> value in(%s)
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > If not using SELECT IN, which other way would you recomend
>>>>>>>>> for
>>>>>>>>> >> >> > lookups
>>>>>>>>> >> >> > like
>>>>>>>>> >> >> > that? I have several values I would like to search in
>>>>>>>>> cassandra and
>>>>>>>>> >> >> > they
>>>>>>>>> >> >> > might not be in the same particion, as above.
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Best regards,
>>>>>>>>> >> >> > Marcelo Valle.
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> --
>>>>>>>>> >> >> Jon Haddad
>>>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>>>> >> >> skype: rustyrazorblade
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> --
>>>>>>>>> >> Jon Haddad
>>>>>>>>> >> http://www.rustyrazorblade.com
>>>>>>>>> >> skype: rustyrazorblade
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jon Haddad
>>>>>>>>> http://www.rustyrazorblade.com
>>>>>>>>> skype: rustyrazorblade
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Best way to do a multi_get using CQL

Posted by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br>.

The bad design part (just my opinion, no intention to offend) is not allow
the possibility of sending batches directly to the data nodes, without
using a coordinator.
I would choose that option.
[]s


2014-06-20 16:05 GMT-03:00 DuyHai Doan <do...@gmail.com>:

> Well it's kind of a trade-off.
>
>  Either you send data directly to the primary replica nodes to take
> advantage of data-locality using token-aware strategy and the price to pay
> is a high number of opened connections from client side.
>
> Or you just batch data to a random node playing the coordinator role to
> dispatch requests to the right nodes. The price to pay is then spike load
> on 1 node (the coordinator) and intra-cluster bandwdith usage.
>
>  The choice is yours, it has nothing to do with good or bad design.
>
>
> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle <
> marcelo@s1mbi0se.com.br> wrote:
>
>> I am using python + CQL Driver.
>> I wonder how they do...
>> These things seems little important, but they are fundamental to get a
>> good performance in Cassandra...
>> I wish there was a simpler way to query in batches. Opening a large
>> amount of connections and sending 1 message at a time seems bad to me, as
>> sometimes you want to work with small rows.
>> It's no surprise Cassandra performs better when we use average row sizes.
>> But honestly I disagree with this part of Cassandra/Driver's design.
>> []s
>>
>>
>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>>
>> That depends on the connection pooling implementation in your driver.
>>> Astyanax will keep N connections open to each node (configurable) and route
>>> each query in a separate message over an existing connection, waiting until
>>> one becomes available if all are in use.
>>>
>>>
>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
>>> marcelo@s1mbi0se.com.br> wrote:
>>>
>>>> A question, not sure if you guys know the answer:
>>>> Supose I async query 1000 rows using token aware and suppose I have 10
>>>> nodes. Suppose also each node would receive 100 row queries each.
>>>> How does async work in this case? Would it send each row query to each
>>>> node in a different connection? Different message?
>>>> I guess if there was a way to use batch with async, once you commit the
>>>> batch for the 1000 queries, it would create 1 connection to each host and
>>>> query 100 rows in a single message to each host.
>>>> This would decrease resource usage, am I wrong?
>>>>
>>>> []s
>>>>
>>>>
>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>>>>
>>>> I've found that if you have any amount of latency between your client
>>>>> and nodes, and you are executing a large batch of queries, you'll usually
>>>>> want to send them together to one node unless execution time is of no
>>>>> concern. The tradeoff is resource usage on the connected node vs. time to
>>>>> complete all the queries, because you'll need fewer client -> node network
>>>>> round trips.
>>>>>
>>>>> With large numbers of queries you will still want to make sure you
>>>>> split them into manageable batches before sending them, to control memory
>>>>> usage on the executing node. I've been limiting queries to batches of 100
>>>>> keys in scenarios like this.
>>>>>
>>>>>
>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>>>>> michael.laing@nytimes.com> wrote:
>>>>>
>>>>>> However my extensive benchmarking this week of the python driver from
>>>>>> master shows a performance *decrease* when using 'token_aware'.
>>>>>>
>>>>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>>
>>>>>> Also why do the work the coordinator will do for you: send all the
>>>>>> queries, wait for everything to come back in whatever order, and sort the
>>>>>> result.
>>>>>>
>>>>>> I would rather keep my app code simple.
>>>>>>
>>>>>> But the real point is that you should benchmark in your own
>>>>>> environment.
>>>>>>
>>>>>> ml
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>>>>> marcelo@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>>> Yes, I am using the CQL datastax drivers.
>>>>>>> It was a good advice, thanks a lot Janathan.
>>>>>>> []s
>>>>>>>
>>>>>>>
>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>>
>>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>>>>>> async.
>>>>>>>>
>>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>>> pool for you.  It can also multiplex queries over a single
>>>>>>>> connection.
>>>>>>>>
>>>>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>>> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>> > This is interesting, I didn't know that!
>>>>>>>> > It might make sense then to use select = + async + token aware, I
>>>>>>>> will try
>>>>>>>> > to change my code.
>>>>>>>> >
>>>>>>>> > But would it be a "recomended solution" for these cases? Any
>>>>>>>> other options?
>>>>>>>> >
>>>>>>>> > I still would if this is the right use case for Cassandra, to
>>>>>>>> look for
>>>>>>>> > random keys in a huge cluster. After all, the amount of
>>>>>>>> connections to
>>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>>>> > Or when you use async the driver reuses the connection?
>>>>>>>> >
>>>>>>>> > []s
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>>> >
>>>>>>>> >> If you use async and your driver is token aware, it will go to
>>>>>>>> the
>>>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>>>> >>
>>>>>>>> >> Realistically you're going to have a connection open to every
>>>>>>>> server
>>>>>>>> >> anyways.  It's the difference between you querying for the data
>>>>>>>> >> directly and using a coordinator as a proxy.  It's faster to
>>>>>>>> just ask
>>>>>>>> >> the node with the data.
>>>>>>>> >>
>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>>> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>> >> > But using async queries wouldn't be even worse than using
>>>>>>>> SELECT IN?
>>>>>>>> >> > The justification in the docs is I could query many nodes, but
>>>>>>>> I would
>>>>>>>> >> > still
>>>>>>>> >> > do it.
>>>>>>>> >> >
>>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>>> >> >
>>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " +
>>>>>>>> ENTITY_LOOKUP + "
>>>>>>>> >> > WHERE
>>>>>>>> >> > name=%s and value in(%s)"
>>>>>>>> >> >
>>>>>>>> >> > for name, values in identifiers.items():
>>>>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>>>>>>> >> > ','.join(['%s']*len(values)))
>>>>>>>> >> >    args = [name] + values
>>>>>>>> >> >    query_msg = query % tuple(args)
>>>>>>>> >> >    futures.append((query_msg,
>>>>>>>> self.session.execute_async(query, args)))
>>>>>>>> >> >
>>>>>>>> >> > for query_msg, future in futures:
>>>>>>>> >> >    try:
>>>>>>>> >> >       rows = future.result(timeout=100000)
>>>>>>>> >> >       for row in rows:
>>>>>>>> >> >         entity_ids.add(row.entity_id)
>>>>>>>> >> >    except:
>>>>>>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>>>> >> >       raise
>>>>>>>> >> >
>>>>>>>> >> > Using async just with select = would mean instead of 1 async
>>>>>>>> query
>>>>>>>> >> > (example:
>>>>>>>> >> > in (0, 1, 2)), I would do several, one for each value of
>>>>>>>> "values" array
>>>>>>>> >> > above.
>>>>>>>> >> > In my head, this would mean more connections to Cassandra and
>>>>>>>> the same
>>>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>>>> >> >
>>>>>>>> >> > []s
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jon@jonhaddad.com
>>>>>>>> >:
>>>>>>>> >> >
>>>>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>>>> >> >>
>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>>> >> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>>> >> >> >
>>>>>>>> >> >> > Among then is
>>>>>>>> >> >> >
>>>>>>>> >> >> > SELECT ... IN or index lookups¶
>>>>>>>> >> >> >
>>>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary
>>>>>>>> indexes) should
>>>>>>>> >> >> > be
>>>>>>>> >> >> > avoided except for specific scenarios. See When not to use
>>>>>>>> IN in
>>>>>>>> >> >> > SELECT
>>>>>>>> >> >> > and
>>>>>>>> >> >> > When not to use an index in Indexing in
>>>>>>>> >> >> >
>>>>>>>> >> >> > CQL for Cassandra 2.0"
>>>>>>>> >> >> >
>>>>>>>> >> >> > And Looking at the SELECT doc, I saw:
>>>>>>>> >> >> >
>>>>>>>> >> >> > When not to use IN¶
>>>>>>>> >> >> >
>>>>>>>> >> >> > The recommendations about when not to use an index apply to
>>>>>>>> using IN
>>>>>>>> >> >> > in
>>>>>>>> >> >> > the
>>>>>>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE
>>>>>>>> clause is
>>>>>>>> >> >> > not
>>>>>>>> >> >> > recommended. Using IN can degrade performance because
>>>>>>>> usually many
>>>>>>>> >> >> > nodes
>>>>>>>> >> >> > must be queried. For example, in a single, local data
>>>>>>>> center cluster
>>>>>>>> >> >> > having
>>>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency
>>>>>>>> level of
>>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but
>>>>>>>> if the
>>>>>>>> >> >> > query
>>>>>>>> >> >> > uses the IN condition, the number of nodes being queried
>>>>>>>> are most
>>>>>>>> >> >> > likely
>>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys
>>>>>>>> fall in the
>>>>>>>> >> >> > token
>>>>>>>> >> >> > range."
>>>>>>>> >> >> >
>>>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>>>> >> >> >
>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>>>>> >> >> >   'DC1' : 3 };
>>>>>>>> >> >> > USE Identification1;
>>>>>>>> >> >> >
>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>> >> >> >   name varchar,
>>>>>>>> >> >> >   value varchar,
>>>>>>>> >> >> >   entity_id uuid,
>>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>>> >> >> >
>>>>>>>> >> >> > And I use the following select to query it:
>>>>>>>> >> >> >
>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value
>>>>>>>> in(%s)
>>>>>>>> >> >> >
>>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>>> >> >> >
>>>>>>>> >> >> > If not using SELECT IN, which other way would you recomend
>>>>>>>> for
>>>>>>>> >> >> > lookups
>>>>>>>> >> >> > like
>>>>>>>> >> >> > that? I have several values I would like to search in
>>>>>>>> cassandra and
>>>>>>>> >> >> > they
>>>>>>>> >> >> > might not be in the same particion, as above.
>>>>>>>> >> >> >
>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>>> >> >> >
>>>>>>>> >> >> > Best regards,
>>>>>>>> >> >> > Marcelo Valle.
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >> >
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> --
>>>>>>>> >> >> Jon Haddad
>>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>>> >> >> skype: rustyrazorblade
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Jon Haddad
>>>>>>>> >> http://www.rustyrazorblade.com
>>>>>>>> >> skype: rustyrazorblade
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jon Haddad
>>>>>>>> http://www.rustyrazorblade.com
>>>>>>>> skype: rustyrazorblade
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Best way to do a multi_get using CQL

Posted by DuyHai Doan <do...@gmail.com>.

Well it's kind of a trade-off.

 Either you send data directly to the primary replica nodes to take
advantage of data-locality using token-aware strategy and the price to pay
is a high number of opened connections from client side.

Or you just batch data to a random node playing the coordinator role to
dispatch requests to the right nodes. The price to pay is then spike load
on 1 node (the coordinator) and intra-cluster bandwdith usage.

 The choice is yours, it has nothing to do with good or bad design.


On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle <
marcelo@s1mbi0se.com.br> wrote:

> I am using python + CQL Driver.
> I wonder how they do...
> These things seems little important, but they are fundamental to get a
> good performance in Cassandra...
> I wish there was a simpler way to query in batches. Opening a large amount
> of connections and sending 1 message at a time seems bad to me, as
> sometimes you want to work with small rows.
> It's no surprise Cassandra performs better when we use average row sizes.
> But honestly I disagree with this part of Cassandra/Driver's design.
> []s
>
>
> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>
> That depends on the connection pooling implementation in your driver.
>> Astyanax will keep N connections open to each node (configurable) and route
>> each query in a separate message over an existing connection, waiting until
>> one becomes available if all are in use.
>>
>>
>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
>> marcelo@s1mbi0se.com.br> wrote:
>>
>>> A question, not sure if you guys know the answer:
>>> Supose I async query 1000 rows using token aware and suppose I have 10
>>> nodes. Suppose also each node would receive 100 row queries each.
>>> How does async work in this case? Would it send each row query to each
>>> node in a different connection? Different message?
>>> I guess if there was a way to use batch with async, once you commit the
>>> batch for the 1000 queries, it would create 1 connection to each host and
>>> query 100 rows in a single message to each host.
>>> This would decrease resource usage, am I wrong?
>>>
>>> []s
>>>
>>>
>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>>>
>>> I've found that if you have any amount of latency between your client
>>>> and nodes, and you are executing a large batch of queries, you'll usually
>>>> want to send them together to one node unless execution time is of no
>>>> concern. The tradeoff is resource usage on the connected node vs. time to
>>>> complete all the queries, because you'll need fewer client -> node network
>>>> round trips.
>>>>
>>>> With large numbers of queries you will still want to make sure you
>>>> split them into manageable batches before sending them, to control memory
>>>> usage on the executing node. I've been limiting queries to batches of 100
>>>> keys in scenarios like this.
>>>>
>>>>
>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>>>> michael.laing@nytimes.com> wrote:
>>>>
>>>>> However my extensive benchmarking this week of the python driver from
>>>>> master shows a performance *decrease* when using 'token_aware'.
>>>>>
>>>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>
>>>>> Also why do the work the coordinator will do for you: send all the
>>>>> queries, wait for everything to come back in whatever order, and sort the
>>>>> result.
>>>>>
>>>>> I would rather keep my app code simple.
>>>>>
>>>>> But the real point is that you should benchmark in your own
>>>>> environment.
>>>>>
>>>>> ml
>>>>>
>>>>>
>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>>>> marcelo@s1mbi0se.com.br> wrote:
>>>>>
>>>>>> Yes, I am using the CQL datastax drivers.
>>>>>> It was a good advice, thanks a lot Janathan.
>>>>>> []s
>>>>>>
>>>>>>
>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>
>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>>>>> async.
>>>>>>>
>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>> pool for you.  It can also multiplex queries over a single
>>>>>>> connection.
>>>>>>>
>>>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>> <ma...@s1mbi0se.com.br> wrote:
>>>>>>> > This is interesting, I didn't know that!
>>>>>>> > It might make sense then to use select = + async + token aware, I
>>>>>>> will try
>>>>>>> > to change my code.
>>>>>>> >
>>>>>>> > But would it be a "recomended solution" for these cases? Any other
>>>>>>> options?
>>>>>>> >
>>>>>>> > I still would if this is the right use case for Cassandra, to look
>>>>>>> for
>>>>>>> > random keys in a huge cluster. After all, the amount of
>>>>>>> connections to
>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>>> > Or when you use async the driver reuses the connection?
>>>>>>> >
>>>>>>> > []s
>>>>>>> >
>>>>>>> >
>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>> >
>>>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>>> >>
>>>>>>> >> Realistically you're going to have a connection open to every
>>>>>>> server
>>>>>>> >> anyways.  It's the difference between you querying for the data
>>>>>>> >> directly and using a coordinator as a proxy.  It's faster to just
>>>>>>> ask
>>>>>>> >> the node with the data.
>>>>>>> >>
>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>>> >> > But using async queries wouldn't be even worse than using
>>>>>>> SELECT IN?
>>>>>>> >> > The justification in the docs is I could query many nodes, but
>>>>>>> I would
>>>>>>> >> > still
>>>>>>> >> > do it.
>>>>>>> >> >
>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>> >> >
>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP
>>>>>>> + "
>>>>>>> >> > WHERE
>>>>>>> >> > name=%s and value in(%s)"
>>>>>>> >> >
>>>>>>> >> > for name, values in identifiers.items():
>>>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>>>>>> >> > ','.join(['%s']*len(values)))
>>>>>>> >> >    args = [name] + values
>>>>>>> >> >    query_msg = query % tuple(args)
>>>>>>> >> >    futures.append((query_msg, self.session.execute_async(query,
>>>>>>> args)))
>>>>>>> >> >
>>>>>>> >> > for query_msg, future in futures:
>>>>>>> >> >    try:
>>>>>>> >> >       rows = future.result(timeout=100000)
>>>>>>> >> >       for row in rows:
>>>>>>> >> >         entity_ids.add(row.entity_id)
>>>>>>> >> >    except:
>>>>>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>>> >> >       raise
>>>>>>> >> >
>>>>>>> >> > Using async just with select = would mean instead of 1 async
>>>>>>> query
>>>>>>> >> > (example:
>>>>>>> >> > in (0, 1, 2)), I would do several, one for each value of
>>>>>>> "values" array
>>>>>>> >> > above.
>>>>>>> >> > In my head, this would mean more connections to Cassandra and
>>>>>>> the same
>>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>>> >> >
>>>>>>> >> > []s
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>>> >> >
>>>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>>> >> >>
>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>> >> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>> >> >> >
>>>>>>> >> >> > Among then is
>>>>>>> >> >> >
>>>>>>> >> >> > SELECT ... IN or index lookups¶
>>>>>>> >> >> >
>>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes)
>>>>>>> should
>>>>>>> >> >> > be
>>>>>>> >> >> > avoided except for specific scenarios. See When not to use
>>>>>>> IN in
>>>>>>> >> >> > SELECT
>>>>>>> >> >> > and
>>>>>>> >> >> > When not to use an index in Indexing in
>>>>>>> >> >> >
>>>>>>> >> >> > CQL for Cassandra 2.0"
>>>>>>> >> >> >
>>>>>>> >> >> > And Looking at the SELECT doc, I saw:
>>>>>>> >> >> >
>>>>>>> >> >> > When not to use IN¶
>>>>>>> >> >> >
>>>>>>> >> >> > The recommendations about when not to use an index apply to
>>>>>>> using IN
>>>>>>> >> >> > in
>>>>>>> >> >> > the
>>>>>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE
>>>>>>> clause is
>>>>>>> >> >> > not
>>>>>>> >> >> > recommended. Using IN can degrade performance because
>>>>>>> usually many
>>>>>>> >> >> > nodes
>>>>>>> >> >> > must be queried. For example, in a single, local data center
>>>>>>> cluster
>>>>>>> >> >> > having
>>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency level
>>>>>>> of
>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but
>>>>>>> if the
>>>>>>> >> >> > query
>>>>>>> >> >> > uses the IN condition, the number of nodes being queried are
>>>>>>> most
>>>>>>> >> >> > likely
>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys fall
>>>>>>> in the
>>>>>>> >> >> > token
>>>>>>> >> >> > range."
>>>>>>> >> >> >
>>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>>> >> >> >
>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>>>> >> >> >   'DC1' : 3 };
>>>>>>> >> >> > USE Identification1;
>>>>>>> >> >> >
>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>> >> >> >   name varchar,
>>>>>>> >> >> >   value varchar,
>>>>>>> >> >> >   entity_id uuid,
>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>> >> >> >
>>>>>>> >> >> > And I use the following select to query it:
>>>>>>> >> >> >
>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value
>>>>>>> in(%s)
>>>>>>> >> >> >
>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>> >> >> >
>>>>>>> >> >> > If not using SELECT IN, which other way would you recomend
>>>>>>> for
>>>>>>> >> >> > lookups
>>>>>>> >> >> > like
>>>>>>> >> >> > that? I have several values I would like to search in
>>>>>>> cassandra and
>>>>>>> >> >> > they
>>>>>>> >> >> > might not be in the same particion, as above.
>>>>>>> >> >> >
>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>> >> >> >
>>>>>>> >> >> > Best regards,
>>>>>>> >> >> > Marcelo Valle.
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> --
>>>>>>> >> >> Jon Haddad
>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>> >> >> skype: rustyrazorblade
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> Jon Haddad
>>>>>>> >> http://www.rustyrazorblade.com
>>>>>>> >> skype: rustyrazorblade
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jon Haddad
>>>>>>> http://www.rustyrazorblade.com
>>>>>>> skype: rustyrazorblade
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Best way to do a multi_get using CQL

Posted by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br>.

I am using python + CQL Driver.
I wonder how they do...
These things seems little important, but they are fundamental to get a good
performance in Cassandra...
I wish there was a simpler way to query in batches. Opening a large amount
of connections and sending 1 message at a time seems bad to me, as
sometimes you want to work with small rows.
It's no surprise Cassandra performs better when we use average row sizes.
But honestly I disagree with this part of Cassandra/Driver's design.
[]s


2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:

> That depends on the connection pooling implementation in your driver.
> Astyanax will keep N connections open to each node (configurable) and route
> each query in a separate message over an existing connection, waiting until
> one becomes available if all are in use.
>
>
> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
> marcelo@s1mbi0se.com.br> wrote:
>
>> A question, not sure if you guys know the answer:
>> Supose I async query 1000 rows using token aware and suppose I have 10
>> nodes. Suppose also each node would receive 100 row queries each.
>> How does async work in this case? Would it send each row query to each
>> node in a different connection? Different message?
>> I guess if there was a way to use batch with async, once you commit the
>> batch for the 1000 queries, it would create 1 connection to each host and
>> query 100 rows in a single message to each host.
>> This would decrease resource usage, am I wrong?
>>
>> []s
>>
>>
>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>>
>> I've found that if you have any amount of latency between your client and
>>> nodes, and you are executing a large batch of queries, you'll usually want
>>> to send them together to one node unless execution time is of no concern.
>>> The tradeoff is resource usage on the connected node vs. time to complete
>>> all the queries, because you'll need fewer client -> node network round
>>> trips.
>>>
>>> With large numbers of queries you will still want to make sure you split
>>> them into manageable batches before sending them, to control memory usage
>>> on the executing node. I've been limiting queries to batches of 100 keys in
>>> scenarios like this.
>>>
>>>
>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>>> michael.laing@nytimes.com> wrote:
>>>
>>>> However my extensive benchmarking this week of the python driver from
>>>> master shows a performance *decrease* when using 'token_aware'.
>>>>
>>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>
>>>> Also why do the work the coordinator will do for you: send all the
>>>> queries, wait for everything to come back in whatever order, and sort the
>>>> result.
>>>>
>>>> I would rather keep my app code simple.
>>>>
>>>> But the real point is that you should benchmark in your own environment.
>>>>
>>>> ml
>>>>
>>>>
>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>>> marcelo@s1mbi0se.com.br> wrote:
>>>>
>>>>> Yes, I am using the CQL datastax drivers.
>>>>> It was a good advice, thanks a lot Janathan.
>>>>> []s
>>>>>
>>>>>
>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>
>>>>> The only case in which it might be better to use an IN clause is if
>>>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>>>> async.
>>>>>>
>>>>>> The native driver reuses connections and intelligently manages the
>>>>>> pool for you.  It can also multiplex queries over a single connection.
>>>>>>
>>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>> <ma...@s1mbi0se.com.br> wrote:
>>>>>> > This is interesting, I didn't know that!
>>>>>> > It might make sense then to use select = + async + token aware, I
>>>>>> will try
>>>>>> > to change my code.
>>>>>> >
>>>>>> > But would it be a "recomended solution" for these cases? Any other
>>>>>> options?
>>>>>> >
>>>>>> > I still would if this is the right use case for Cassandra, to look
>>>>>> for
>>>>>> > random keys in a huge cluster. After all, the amount of connections
>>>>>> to
>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>> > Or when you use async the driver reuses the connection?
>>>>>> >
>>>>>> > []s
>>>>>> >
>>>>>> >
>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>> >
>>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>> >>
>>>>>> >> Realistically you're going to have a connection open to every
>>>>>> server
>>>>>> >> anyways.  It's the difference between you querying for the data
>>>>>> >> directly and using a coordinator as a proxy.  It's faster to just
>>>>>> ask
>>>>>> >> the node with the data.
>>>>>> >>
>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>> >> > But using async queries wouldn't be even worse than using SELECT
>>>>>> IN?
>>>>>> >> > The justification in the docs is I could query many nodes, but I
>>>>>> would
>>>>>> >> > still
>>>>>> >> > do it.
>>>>>> >> >
>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>> >> >
>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP
>>>>>> + "
>>>>>> >> > WHERE
>>>>>> >> > name=%s and value in(%s)"
>>>>>> >> >
>>>>>> >> > for name, values in identifiers.items():
>>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>>>>> >> > ','.join(['%s']*len(values)))
>>>>>> >> >    args = [name] + values
>>>>>> >> >    query_msg = query % tuple(args)
>>>>>> >> >    futures.append((query_msg, self.session.execute_async(query,
>>>>>> args)))
>>>>>> >> >
>>>>>> >> > for query_msg, future in futures:
>>>>>> >> >    try:
>>>>>> >> >       rows = future.result(timeout=100000)
>>>>>> >> >       for row in rows:
>>>>>> >> >         entity_ids.add(row.entity_id)
>>>>>> >> >    except:
>>>>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>> >> >       raise
>>>>>> >> >
>>>>>> >> > Using async just with select = would mean instead of 1 async
>>>>>> query
>>>>>> >> > (example:
>>>>>> >> > in (0, 1, 2)), I would do several, one for each value of
>>>>>> "values" array
>>>>>> >> > above.
>>>>>> >> > In my head, this would mean more connections to Cassandra and
>>>>>> the same
>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>> >> >
>>>>>> >> > []s
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>>> >> >
>>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>> >> >>
>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>> >> >> <ma...@s1mbi0se.com.br> wrote:
>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>> >> >> >
>>>>>> >> >> > Among then is
>>>>>> >> >> >
>>>>>> >> >> > SELECT ... IN or index lookups¶
>>>>>> >> >> >
>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes)
>>>>>> should
>>>>>> >> >> > be
>>>>>> >> >> > avoided except for specific scenarios. See When not to use IN
>>>>>> in
>>>>>> >> >> > SELECT
>>>>>> >> >> > and
>>>>>> >> >> > When not to use an index in Indexing in
>>>>>> >> >> >
>>>>>> >> >> > CQL for Cassandra 2.0"
>>>>>> >> >> >
>>>>>> >> >> > And Looking at the SELECT doc, I saw:
>>>>>> >> >> >
>>>>>> >> >> > When not to use IN¶
>>>>>> >> >> >
>>>>>> >> >> > The recommendations about when not to use an index apply to
>>>>>> using IN
>>>>>> >> >> > in
>>>>>> >> >> > the
>>>>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE
>>>>>> clause is
>>>>>> >> >> > not
>>>>>> >> >> > recommended. Using IN can degrade performance because usually
>>>>>> many
>>>>>> >> >> > nodes
>>>>>> >> >> > must be queried. For example, in a single, local data center
>>>>>> cluster
>>>>>> >> >> > having
>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency level
>>>>>> of
>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but
>>>>>> if the
>>>>>> >> >> > query
>>>>>> >> >> > uses the IN condition, the number of nodes being queried are
>>>>>> most
>>>>>> >> >> > likely
>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys fall
>>>>>> in the
>>>>>> >> >> > token
>>>>>> >> >> > range."
>>>>>> >> >> >
>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>> >> >> >
>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>>> >> >> >   'DC1' : 3 };
>>>>>> >> >> > USE Identification1;
>>>>>> >> >> >
>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>> >> >> >   name varchar,
>>>>>> >> >> >   value varchar,
>>>>>> >> >> >   entity_id uuid,
>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>> >> >> >
>>>>>> >> >> > And I use the following select to query it:
>>>>>> >> >> >
>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value
>>>>>> in(%s)
>>>>>> >> >> >
>>>>>> >> >> > Is this an anti-pattern?
>>>>>> >> >> >
>>>>>> >> >> > If not using SELECT IN, which other way would you recomend for
>>>>>> >> >> > lookups
>>>>>> >> >> > like
>>>>>> >> >> > that? I have several values I would like to search in
>>>>>> cassandra and
>>>>>> >> >> > they
>>>>>> >> >> > might not be in the same particion, as above.
>>>>>> >> >> >
>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>> >> >> >
>>>>>> >> >> > Best regards,
>>>>>> >> >> > Marcelo Valle.
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> --
>>>>>> >> >> Jon Haddad
>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>> >> >> skype: rustyrazorblade
>>>>>> >> >
>>>>>> >> >
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Jon Haddad
>>>>>> >> http://www.rustyrazorblade.com
>>>>>> >> skype: rustyrazorblade
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jon Haddad
>>>>>> http://www.rustyrazorblade.com
>>>>>> skype: rustyrazorblade
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Best way to do a multi_get using CQL

Posted by Jeremy Jongsma <je...@barchart.com>.

That depends on the connection pooling implementation in your driver.
Astyanax will keep N connections open to each node (configurable) and route
each query in a separate message over an existing connection, waiting until
one becomes available if all are in use.


On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
marcelo@s1mbi0se.com.br> wrote:

> A question, not sure if you guys know the answer:
> Supose I async query 1000 rows using token aware and suppose I have 10
> nodes. Suppose also each node would receive 100 row queries each.
> How does async work in this case? Would it send each row query to each
> node in a different connection? Different message?
> I guess if there was a way to use batch with async, once you commit the
> batch for the 1000 queries, it would create 1 connection to each host and
> query 100 rows in a single message to each host.
> This would decrease resource usage, am I wrong?
>
> []s
>
>
> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:
>
> I've found that if you have any amount of latency between your client and
>> nodes, and you are executing a large batch of queries, you'll usually want
>> to send them together to one node unless execution time is of no concern.
>> The tradeoff is resource usage on the connected node vs. time to complete
>> all the queries, because you'll need fewer client -> node network round
>> trips.
>>
>> With large numbers of queries you will still want to make sure you split
>> them into manageable batches before sending them, to control memory usage
>> on the executing node. I've been limiting queries to batches of 100 keys in
>> scenarios like this.
>>
>>
>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>> michael.laing@nytimes.com> wrote:
>>
>>> However my extensive benchmarking this week of the python driver from
>>> master shows a performance *decrease* when using 'token_aware'.
>>>
>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>
>>> Also why do the work the coordinator will do for you: send all the
>>> queries, wait for everything to come back in whatever order, and sort the
>>> result.
>>>
>>> I would rather keep my app code simple.
>>>
>>> But the real point is that you should benchmark in your own environment.
>>>
>>> ml
>>>
>>>
>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>> marcelo@s1mbi0se.com.br> wrote:
>>>
>>>> Yes, I am using the CQL datastax drivers.
>>>> It was a good advice, thanks a lot Janathan.
>>>> []s
>>>>
>>>>
>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>
>>>> The only case in which it might be better to use an IN clause is if
>>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>>> async.
>>>>>
>>>>> The native driver reuses connections and intelligently manages the
>>>>> pool for you.  It can also multiplex queries over a single connection.
>>>>>
>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>> <ma...@s1mbi0se.com.br> wrote:
>>>>> > This is interesting, I didn't know that!
>>>>> > It might make sense then to use select = + async + token aware, I
>>>>> will try
>>>>> > to change my code.
>>>>> >
>>>>> > But would it be a "recomended solution" for these cases? Any other
>>>>> options?
>>>>> >
>>>>> > I still would if this is the right use case for Cassandra, to look
>>>>> for
>>>>> > random keys in a huge cluster. After all, the amount of connections
>>>>> to
>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>> > Or when you use async the driver reuses the connection?
>>>>> >
>>>>> > []s
>>>>> >
>>>>> >
>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>> >
>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>> >>
>>>>> >> Realistically you're going to have a connection open to every server
>>>>> >> anyways.  It's the difference between you querying for the data
>>>>> >> directly and using a coordinator as a proxy.  It's faster to just
>>>>> ask
>>>>> >> the node with the data.
>>>>> >>
>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>> >> <ma...@s1mbi0se.com.br> wrote:
>>>>> >> > But using async queries wouldn't be even worse than using SELECT
>>>>> IN?
>>>>> >> > The justification in the docs is I could query many nodes, but I
>>>>> would
>>>>> >> > still
>>>>> >> > do it.
>>>>> >> >
>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>> >> >
>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP +
>>>>> "
>>>>> >> > WHERE
>>>>> >> > name=%s and value in(%s)"
>>>>> >> >
>>>>> >> > for name, values in identifiers.items():
>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>>>> >> > ','.join(['%s']*len(values)))
>>>>> >> >    args = [name] + values
>>>>> >> >    query_msg = query % tuple(args)
>>>>> >> >    futures.append((query_msg, self.session.execute_async(query,
>>>>> args)))
>>>>> >> >
>>>>> >> > for query_msg, future in futures:
>>>>> >> >    try:
>>>>> >> >       rows = future.result(timeout=100000)
>>>>> >> >       for row in rows:
>>>>> >> >         entity_ids.add(row.entity_id)
>>>>> >> >    except:
>>>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>> >> >       raise
>>>>> >> >
>>>>> >> > Using async just with select = would mean instead of 1 async query
>>>>> >> > (example:
>>>>> >> > in (0, 1, 2)), I would do several, one for each value of "values"
>>>>> array
>>>>> >> > above.
>>>>> >> > In my head, this would mean more connections to Cassandra and the
>>>>> same
>>>>> >> > amount of work, right? What would be the advantage?
>>>>> >> >
>>>>> >> > []s
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>>> >> >
>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>> >> >> straightforward w/ the java or python drivers.
>>>>> >> >>
>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>> >> >> <ma...@s1mbi0se.com.br> wrote:
>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>> >> >> >
>>>>> >> >> > Among then is
>>>>> >> >> >
>>>>> >> >> > SELECT ... IN or index lookups¶
>>>>> >> >> >
>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes)
>>>>> should
>>>>> >> >> > be
>>>>> >> >> > avoided except for specific scenarios. See When not to use IN
>>>>> in
>>>>> >> >> > SELECT
>>>>> >> >> > and
>>>>> >> >> > When not to use an index in Indexing in
>>>>> >> >> >
>>>>> >> >> > CQL for Cassandra 2.0"
>>>>> >> >> >
>>>>> >> >> > And Looking at the SELECT doc, I saw:
>>>>> >> >> >
>>>>> >> >> > When not to use IN¶
>>>>> >> >> >
>>>>> >> >> > The recommendations about when not to use an index apply to
>>>>> using IN
>>>>> >> >> > in
>>>>> >> >> > the
>>>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE
>>>>> clause is
>>>>> >> >> > not
>>>>> >> >> > recommended. Using IN can degrade performance because usually
>>>>> many
>>>>> >> >> > nodes
>>>>> >> >> > must be queried. For example, in a single, local data center
>>>>> cluster
>>>>> >> >> > having
>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency level of
>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if
>>>>> the
>>>>> >> >> > query
>>>>> >> >> > uses the IN condition, the number of nodes being queried are
>>>>> most
>>>>> >> >> > likely
>>>>> >> >> > even higher, up to 20 nodes depending on where the keys fall
>>>>> in the
>>>>> >> >> > token
>>>>> >> >> > range."
>>>>> >> >> >
>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>> >> >> >
>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>> >> >> >   'DC1' : 3 };
>>>>> >> >> > USE Identification1;
>>>>> >> >> >
>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>> >> >> >   name varchar,
>>>>> >> >> >   value varchar,
>>>>> >> >> >   entity_id uuid,
>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>> >> >> >
>>>>> >> >> > And I use the following select to query it:
>>>>> >> >> >
>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value
>>>>> in(%s)
>>>>> >> >> >
>>>>> >> >> > Is this an anti-pattern?
>>>>> >> >> >
>>>>> >> >> > If not using SELECT IN, which other way would you recomend for
>>>>> >> >> > lookups
>>>>> >> >> > like
>>>>> >> >> > that? I have several values I would like to search in
>>>>> cassandra and
>>>>> >> >> > they
>>>>> >> >> > might not be in the same particion, as above.
>>>>> >> >> >
>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>> >> >> >
>>>>> >> >> > Best regards,
>>>>> >> >> > Marcelo Valle.
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Jon Haddad
>>>>> >> >> http://www.rustyrazorblade.com
>>>>> >> >> skype: rustyrazorblade
>>>>> >> >
>>>>> >> >
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Jon Haddad
>>>>> >> http://www.rustyrazorblade.com
>>>>> >> skype: rustyrazorblade
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jon Haddad
>>>>> http://www.rustyrazorblade.com
>>>>> skype: rustyrazorblade
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Best way to do a multi_get using CQL

Posted by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br>.

A question, not sure if you guys know the answer:
Supose I async query 1000 rows using token aware and suppose I have 10
nodes. Suppose also each node would receive 100 row queries each.
How does async work in this case? Would it send each row query to each node
in a different connection? Different message?
I guess if there was a way to use batch with async, once you commit the
batch for the 1000 queries, it would create 1 connection to each host and
query 100 rows in a single message to each host.
This would decrease resource usage, am I wrong?

[]s


2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <je...@barchart.com>:

> I've found that if you have any amount of latency between your client and
> nodes, and you are executing a large batch of queries, you'll usually want
> to send them together to one node unless execution time is of no concern.
> The tradeoff is resource usage on the connected node vs. time to complete
> all the queries, because you'll need fewer client -> node network round
> trips.
>
> With large numbers of queries you will still want to make sure you split
> them into manageable batches before sending them, to control memory usage
> on the executing node. I've been limiting queries to batches of 100 keys in
> scenarios like this.
>
>
> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <michael.laing@nytimes.com
> > wrote:
>
>> However my extensive benchmarking this week of the python driver from
>> master shows a performance *decrease* when using 'token_aware'.
>>
>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>
>> Also why do the work the coordinator will do for you: send all the
>> queries, wait for everything to come back in whatever order, and sort the
>> result.
>>
>> I would rather keep my app code simple.
>>
>> But the real point is that you should benchmark in your own environment.
>>
>> ml
>>
>>
>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>> marcelo@s1mbi0se.com.br> wrote:
>>
>>> Yes, I am using the CQL datastax drivers.
>>> It was a good advice, thanks a lot Janathan.
>>> []s
>>>
>>>
>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>
>>> The only case in which it might be better to use an IN clause is if
>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>> async.
>>>>
>>>> The native driver reuses connections and intelligently manages the
>>>> pool for you.  It can also multiplex queries over a single connection.
>>>>
>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>
>>>> Jon
>>>>
>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>> <ma...@s1mbi0se.com.br> wrote:
>>>> > This is interesting, I didn't know that!
>>>> > It might make sense then to use select = + async + token aware, I
>>>> will try
>>>> > to change my code.
>>>> >
>>>> > But would it be a "recomended solution" for these cases? Any other
>>>> options?
>>>> >
>>>> > I still would if this is the right use case for Cassandra, to look for
>>>> > random keys in a huge cluster. After all, the amount of connections to
>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>> > Or when you use async the driver reuses the connection?
>>>> >
>>>> > []s
>>>> >
>>>> >
>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>> >
>>>> >> If you use async and your driver is token aware, it will go to the
>>>> >> proper node, rather than requiring the coordinator to do so.
>>>> >>
>>>> >> Realistically you're going to have a connection open to every server
>>>> >> anyways.  It's the difference between you querying for the data
>>>> >> directly and using a coordinator as a proxy.  It's faster to just ask
>>>> >> the node with the data.
>>>> >>
>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>> >> <ma...@s1mbi0se.com.br> wrote:
>>>> >> > But using async queries wouldn't be even worse than using SELECT
>>>> IN?
>>>> >> > The justification in the docs is I could query many nodes, but I
>>>> would
>>>> >> > still
>>>> >> > do it.
>>>> >> >
>>>> >> > Today, I use both async queries AND SELECT IN:
>>>> >> >
>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + "
>>>> >> > WHERE
>>>> >> > name=%s and value in(%s)"
>>>> >> >
>>>> >> > for name, values in identifiers.items():
>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>>> >> > ','.join(['%s']*len(values)))
>>>> >> >    args = [name] + values
>>>> >> >    query_msg = query % tuple(args)
>>>> >> >    futures.append((query_msg, self.session.execute_async(query,
>>>> args)))
>>>> >> >
>>>> >> > for query_msg, future in futures:
>>>> >> >    try:
>>>> >> >       rows = future.result(timeout=100000)
>>>> >> >       for row in rows:
>>>> >> >         entity_ids.add(row.entity_id)
>>>> >> >    except:
>>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>>> >> >       raise
>>>> >> >
>>>> >> > Using async just with select = would mean instead of 1 async query
>>>> >> > (example:
>>>> >> > in (0, 1, 2)), I would do several, one for each value of "values"
>>>> array
>>>> >> > above.
>>>> >> > In my head, this would mean more connections to Cassandra and the
>>>> same
>>>> >> > amount of work, right? What would be the advantage?
>>>> >> >
>>>> >> > []s
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>>> >> >
>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>> >> >> straightforward w/ the java or python drivers.
>>>> >> >>
>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>> >> >> <ma...@s1mbi0se.com.br> wrote:
>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>> >> >> >
>>>> >> >> > Among then is
>>>> >> >> >
>>>> >> >> > SELECT ... IN or index lookups¶
>>>> >> >> >
>>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes)
>>>> should
>>>> >> >> > be
>>>> >> >> > avoided except for specific scenarios. See When not to use IN in
>>>> >> >> > SELECT
>>>> >> >> > and
>>>> >> >> > When not to use an index in Indexing in
>>>> >> >> >
>>>> >> >> > CQL for Cassandra 2.0"
>>>> >> >> >
>>>> >> >> > And Looking at the SELECT doc, I saw:
>>>> >> >> >
>>>> >> >> > When not to use IN¶
>>>> >> >> >
>>>> >> >> > The recommendations about when not to use an index apply to
>>>> using IN
>>>> >> >> > in
>>>> >> >> > the
>>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE
>>>> clause is
>>>> >> >> > not
>>>> >> >> > recommended. Using IN can degrade performance because usually
>>>> many
>>>> >> >> > nodes
>>>> >> >> > must be queried. For example, in a single, local data center
>>>> cluster
>>>> >> >> > having
>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency level of
>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if
>>>> the
>>>> >> >> > query
>>>> >> >> > uses the IN condition, the number of nodes being queried are
>>>> most
>>>> >> >> > likely
>>>> >> >> > even higher, up to 20 nodes depending on where the keys fall in
>>>> the
>>>> >> >> > token
>>>> >> >> > range."
>>>> >> >> >
>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>> >> >> >
>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>> >> >> >   'DC1' : 3 };
>>>> >> >> > USE Identification1;
>>>> >> >> >
>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>> >> >> >   name varchar,
>>>> >> >> >   value varchar,
>>>> >> >> >   entity_id uuid,
>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>> >> >> >
>>>> >> >> > And I use the following select to query it:
>>>> >> >> >
>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value
>>>> in(%s)
>>>> >> >> >
>>>> >> >> > Is this an anti-pattern?
>>>> >> >> >
>>>> >> >> > If not using SELECT IN, which other way would you recomend for
>>>> >> >> > lookups
>>>> >> >> > like
>>>> >> >> > that? I have several values I would like to search in cassandra
>>>> and
>>>> >> >> > they
>>>> >> >> > might not be in the same particion, as above.
>>>> >> >> >
>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>> >> >> >
>>>> >> >> > Best regards,
>>>> >> >> > Marcelo Valle.
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> --
>>>> >> >> Jon Haddad
>>>> >> >> http://www.rustyrazorblade.com
>>>> >> >> skype: rustyrazorblade
>>>> >> >
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Jon Haddad
>>>> >> http://www.rustyrazorblade.com
>>>> >> skype: rustyrazorblade
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Jon Haddad
>>>> http://www.rustyrazorblade.com
>>>> skype: rustyrazorblade
>>>>
>>>
>>>
>>
>

Re: Best way to do a multi_get using CQL

Posted by Jeremy Jongsma <je...@barchart.com>.

I've found that if you have any amount of latency between your client and
nodes, and you are executing a large batch of queries, you'll usually want
to send them together to one node unless execution time is of no concern.
The tradeoff is resource usage on the connected node vs. time to complete
all the queries, because you'll need fewer client -> node network round
trips.

With large numbers of queries you will still want to make sure you split
them into manageable batches before sending them, to control memory usage
on the executing node. I've been limiting queries to batches of 100 keys in
scenarios like this.


On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <mi...@nytimes.com>
wrote:

> However my extensive benchmarking this week of the python driver from
> master shows a performance *decrease* when using 'token_aware'.
>
> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>
> Also why do the work the coordinator will do for you: send all the
> queries, wait for everything to come back in whatever order, and sort the
> result.
>
> I would rather keep my app code simple.
>
> But the real point is that you should benchmark in your own environment.
>
> ml
>
>
> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
> marcelo@s1mbi0se.com.br> wrote:
>
>> Yes, I am using the CQL datastax drivers.
>> It was a good advice, thanks a lot Janathan.
>> []s
>>
>>
>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>
>> The only case in which it might be better to use an IN clause is if
>>> the entire query can be satisfied from that machine.  Otherwise, go
>>> async.
>>>
>>> The native driver reuses connections and intelligently manages the
>>> pool for you.  It can also multiplex queries over a single connection.
>>>
>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>
>>> Jon
>>>
>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>> <ma...@s1mbi0se.com.br> wrote:
>>> > This is interesting, I didn't know that!
>>> > It might make sense then to use select = + async + token aware, I will
>>> try
>>> > to change my code.
>>> >
>>> > But would it be a "recomended solution" for these cases? Any other
>>> options?
>>> >
>>> > I still would if this is the right use case for Cassandra, to look for
>>> > random keys in a huge cluster. After all, the amount of connections to
>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>> > Or when you use async the driver reuses the connection?
>>> >
>>> > []s
>>> >
>>> >
>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>> >
>>> >> If you use async and your driver is token aware, it will go to the
>>> >> proper node, rather than requiring the coordinator to do so.
>>> >>
>>> >> Realistically you're going to have a connection open to every server
>>> >> anyways.  It's the difference between you querying for the data
>>> >> directly and using a coordinator as a proxy.  It's faster to just ask
>>> >> the node with the data.
>>> >>
>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>> >> <ma...@s1mbi0se.com.br> wrote:
>>> >> > But using async queries wouldn't be even worse than using SELECT IN?
>>> >> > The justification in the docs is I could query many nodes, but I
>>> would
>>> >> > still
>>> >> > do it.
>>> >> >
>>> >> > Today, I use both async queries AND SELECT IN:
>>> >> >
>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + "
>>> >> > WHERE
>>> >> > name=%s and value in(%s)"
>>> >> >
>>> >> > for name, values in identifiers.items():
>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>> >> > ','.join(['%s']*len(values)))
>>> >> >    args = [name] + values
>>> >> >    query_msg = query % tuple(args)
>>> >> >    futures.append((query_msg, self.session.execute_async(query,
>>> args)))
>>> >> >
>>> >> > for query_msg, future in futures:
>>> >> >    try:
>>> >> >       rows = future.result(timeout=100000)
>>> >> >       for row in rows:
>>> >> >         entity_ids.add(row.entity_id)
>>> >> >    except:
>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>> >> >       raise
>>> >> >
>>> >> > Using async just with select = would mean instead of 1 async query
>>> >> > (example:
>>> >> > in (0, 1, 2)), I would do several, one for each value of "values"
>>> array
>>> >> > above.
>>> >> > In my head, this would mean more connections to Cassandra and the
>>> same
>>> >> > amount of work, right? What would be the advantage?
>>> >> >
>>> >> > []s
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>>> >> >
>>> >> >> Your other option is to fire off async queries.  It's pretty
>>> >> >> straightforward w/ the java or python drivers.
>>> >> >>
>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>> >> >> <ma...@s1mbi0se.com.br> wrote:
>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>> >> >> >
>>> >> >> > Among then is
>>> >> >> >
>>> >> >> > SELECT ... IN or index lookups¶
>>> >> >> >
>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes)
>>> should
>>> >> >> > be
>>> >> >> > avoided except for specific scenarios. See When not to use IN in
>>> >> >> > SELECT
>>> >> >> > and
>>> >> >> > When not to use an index in Indexing in
>>> >> >> >
>>> >> >> > CQL for Cassandra 2.0"
>>> >> >> >
>>> >> >> > And Looking at the SELECT doc, I saw:
>>> >> >> >
>>> >> >> > When not to use IN¶
>>> >> >> >
>>> >> >> > The recommendations about when not to use an index apply to
>>> using IN
>>> >> >> > in
>>> >> >> > the
>>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE
>>> clause is
>>> >> >> > not
>>> >> >> > recommended. Using IN can degrade performance because usually
>>> many
>>> >> >> > nodes
>>> >> >> > must be queried. For example, in a single, local data center
>>> cluster
>>> >> >> > having
>>> >> >> > 30 nodes, a replication factor of 3, and a consistency level of
>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if
>>> the
>>> >> >> > query
>>> >> >> > uses the IN condition, the number of nodes being queried are most
>>> >> >> > likely
>>> >> >> > even higher, up to 20 nodes depending on where the keys fall in
>>> the
>>> >> >> > token
>>> >> >> > range."
>>> >> >> >
>>> >> >> > In my system, I have a column family called "entity_lookup":
>>> >> >> >
>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>> >> >> >   'DC1' : 3 };
>>> >> >> > USE Identification1;
>>> >> >> >
>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>> >> >> >   name varchar,
>>> >> >> >   value varchar,
>>> >> >> >   entity_id uuid,
>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>> >> >> >
>>> >> >> > And I use the following select to query it:
>>> >> >> >
>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value
>>> in(%s)
>>> >> >> >
>>> >> >> > Is this an anti-pattern?
>>> >> >> >
>>> >> >> > If not using SELECT IN, which other way would you recomend for
>>> >> >> > lookups
>>> >> >> > like
>>> >> >> > that? I have several values I would like to search in cassandra
>>> and
>>> >> >> > they
>>> >> >> > might not be in the same particion, as above.
>>> >> >> >
>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>> >> >> >
>>> >> >> > Best regards,
>>> >> >> > Marcelo Valle.
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Jon Haddad
>>> >> >> http://www.rustyrazorblade.com
>>> >> >> skype: rustyrazorblade
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jon Haddad
>>> >> http://www.rustyrazorblade.com
>>> >> skype: rustyrazorblade
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jon Haddad
>>> http://www.rustyrazorblade.com
>>> skype: rustyrazorblade
>>>
>>
>>
>

Re: Best way to do a multi_get using CQL

Posted by "Laing, Michael" <mi...@nytimes.com>.

However my extensive benchmarking this week of the python driver from
master shows a performance *decrease* when using 'token_aware'.

This is on 12-node, 2-datacenter, RF-3 cluster in AWS.

Also why do the work the coordinator will do for you: send all the queries,
wait for everything to come back in whatever order, and sort the result.

I would rather keep my app code simple.

But the real point is that you should benchmark in your own environment.

ml


On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
marcelo@s1mbi0se.com.br> wrote:

> Yes, I am using the CQL datastax drivers.
> It was a good advice, thanks a lot Janathan.
> []s
>
>
> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>
> The only case in which it might be better to use an IN clause is if
>> the entire query can be satisfied from that machine.  Otherwise, go
>> async.
>>
>> The native driver reuses connections and intelligently manages the
>> pool for you.  It can also multiplex queries over a single connection.
>>
>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>
>> Jon
>>
>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>> <ma...@s1mbi0se.com.br> wrote:
>> > This is interesting, I didn't know that!
>> > It might make sense then to use select = + async + token aware, I will
>> try
>> > to change my code.
>> >
>> > But would it be a "recomended solution" for these cases? Any other
>> options?
>> >
>> > I still would if this is the right use case for Cassandra, to look for
>> > random keys in a huge cluster. After all, the amount of connections to
>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>> > Or when you use async the driver reuses the connection?
>> >
>> > []s
>> >
>> >
>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>> >
>> >> If you use async and your driver is token aware, it will go to the
>> >> proper node, rather than requiring the coordinator to do so.
>> >>
>> >> Realistically you're going to have a connection open to every server
>> >> anyways.  It's the difference between you querying for the data
>> >> directly and using a coordinator as a proxy.  It's faster to just ask
>> >> the node with the data.
>> >>
>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>> >> <ma...@s1mbi0se.com.br> wrote:
>> >> > But using async queries wouldn't be even worse than using SELECT IN?
>> >> > The justification in the docs is I could query many nodes, but I
>> would
>> >> > still
>> >> > do it.
>> >> >
>> >> > Today, I use both async queries AND SELECT IN:
>> >> >
>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + "
>> >> > WHERE
>> >> > name=%s and value in(%s)"
>> >> >
>> >> > for name, values in identifiers.items():
>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>> >> > ','.join(['%s']*len(values)))
>> >> >    args = [name] + values
>> >> >    query_msg = query % tuple(args)
>> >> >    futures.append((query_msg, self.session.execute_async(query,
>> args)))
>> >> >
>> >> > for query_msg, future in futures:
>> >> >    try:
>> >> >       rows = future.result(timeout=100000)
>> >> >       for row in rows:
>> >> >         entity_ids.add(row.entity_id)
>> >> >    except:
>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>> >> >       raise
>> >> >
>> >> > Using async just with select = would mean instead of 1 async query
>> >> > (example:
>> >> > in (0, 1, 2)), I would do several, one for each value of "values"
>> array
>> >> > above.
>> >> > In my head, this would mean more connections to Cassandra and the
>> same
>> >> > amount of work, right? What would be the advantage?
>> >> >
>> >> > []s
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>> >> >
>> >> >> Your other option is to fire off async queries.  It's pretty
>> >> >> straightforward w/ the java or python drivers.
>> >> >>
>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>> >> >> <ma...@s1mbi0se.com.br> wrote:
>> >> >> > I was taking a look at Cassandra anti-patterns list:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>> >> >> >
>> >> >> > Among then is
>> >> >> >
>> >> >> > SELECT ... IN or index lookups¶
>> >> >> >
>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes)
>> should
>> >> >> > be
>> >> >> > avoided except for specific scenarios. See When not to use IN in
>> >> >> > SELECT
>> >> >> > and
>> >> >> > When not to use an index in Indexing in
>> >> >> >
>> >> >> > CQL for Cassandra 2.0"
>> >> >> >
>> >> >> > And Looking at the SELECT doc, I saw:
>> >> >> >
>> >> >> > When not to use IN¶
>> >> >> >
>> >> >> > The recommendations about when not to use an index apply to using
>> IN
>> >> >> > in
>> >> >> > the
>> >> >> > WHERE clause. Under most conditions, using IN in the WHERE clause
>> is
>> >> >> > not
>> >> >> > recommended. Using IN can degrade performance because usually many
>> >> >> > nodes
>> >> >> > must be queried. For example, in a single, local data center
>> cluster
>> >> >> > having
>> >> >> > 30 nodes, a replication factor of 3, and a consistency level of
>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the
>> >> >> > query
>> >> >> > uses the IN condition, the number of nodes being queried are most
>> >> >> > likely
>> >> >> > even higher, up to 20 nodes depending on where the keys fall in
>> the
>> >> >> > token
>> >> >> > range."
>> >> >> >
>> >> >> > In my system, I have a column family called "entity_lookup":
>> >> >> >
>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>> >> >> >   'DC1' : 3 };
>> >> >> > USE Identification1;
>> >> >> >
>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>> >> >> >   name varchar,
>> >> >> >   value varchar,
>> >> >> >   entity_id uuid,
>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>> >> >> >
>> >> >> > And I use the following select to query it:
>> >> >> >
>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>> >> >> >
>> >> >> > Is this an anti-pattern?
>> >> >> >
>> >> >> > If not using SELECT IN, which other way would you recomend for
>> >> >> > lookups
>> >> >> > like
>> >> >> > that? I have several values I would like to search in cassandra
>> and
>> >> >> > they
>> >> >> > might not be in the same particion, as above.
>> >> >> >
>> >> >> > Is Cassandra the wrong tool for lookups like that?
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Marcelo Valle.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Jon Haddad
>> >> >> http://www.rustyrazorblade.com
>> >> >> skype: rustyrazorblade
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jon Haddad
>> >> http://www.rustyrazorblade.com
>> >> skype: rustyrazorblade
>> >
>> >
>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>>
>
>

Re: Best way to do a multi_get using CQL

Posted by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br>.

Yes, I am using the CQL datastax drivers.
It was a good advice, thanks a lot Janathan.
[]s


2014-06-20 0:28 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:

> The only case in which it might be better to use an IN clause is if
> the entire query can be satisfied from that machine.  Otherwise, go
> async.
>
> The native driver reuses connections and intelligently manages the
> pool for you.  It can also multiplex queries over a single connection.
>
> I am assuming you're using one of the datastax drivers for CQL, btw.
>
> Jon
>
> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
> <ma...@s1mbi0se.com.br> wrote:
> > This is interesting, I didn't know that!
> > It might make sense then to use select = + async + token aware, I will
> try
> > to change my code.
> >
> > But would it be a "recomended solution" for these cases? Any other
> options?
> >
> > I still would if this is the right use case for Cassandra, to look for
> > random keys in a huge cluster. After all, the amount of connections to
> > Cassandra will still be huge, right... Wouldn't it be a problem?
> > Or when you use async the driver reuses the connection?
> >
> > []s
> >
> >
> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
> >
> >> If you use async and your driver is token aware, it will go to the
> >> proper node, rather than requiring the coordinator to do so.
> >>
> >> Realistically you're going to have a connection open to every server
> >> anyways.  It's the difference between you querying for the data
> >> directly and using a coordinator as a proxy.  It's faster to just ask
> >> the node with the data.
> >>
> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
> >> <ma...@s1mbi0se.com.br> wrote:
> >> > But using async queries wouldn't be even worse than using SELECT IN?
> >> > The justification in the docs is I could query many nodes, but I would
> >> > still
> >> > do it.
> >> >
> >> > Today, I use both async queries AND SELECT IN:
> >> >
> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + "
> >> > WHERE
> >> > name=%s and value in(%s)"
> >> >
> >> > for name, values in identifiers.items():
> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
> >> > ','.join(['%s']*len(values)))
> >> >    args = [name] + values
> >> >    query_msg = query % tuple(args)
> >> >    futures.append((query_msg, self.session.execute_async(query,
> args)))
> >> >
> >> > for query_msg, future in futures:
> >> >    try:
> >> >       rows = future.result(timeout=100000)
> >> >       for row in rows:
> >> >         entity_ids.add(row.entity_id)
> >> >    except:
> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
> >> >       raise
> >> >
> >> > Using async just with select = would mean instead of 1 async query
> >> > (example:
> >> > in (0, 1, 2)), I would do several, one for each value of "values"
> array
> >> > above.
> >> > In my head, this would mean more connections to Cassandra and the same
> >> > amount of work, right? What would be the advantage?
> >> >
> >> > []s
> >> >
> >> >
> >> >
> >> >
> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
> >> >
> >> >> Your other option is to fire off async queries.  It's pretty
> >> >> straightforward w/ the java or python drivers.
> >> >>
> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
> >> >> <ma...@s1mbi0se.com.br> wrote:
> >> >> > I was taking a look at Cassandra anti-patterns list:
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
> >> >> >
> >> >> > Among then is
> >> >> >
> >> >> > SELECT ... IN or index lookups¶
> >> >> >
> >> >> > SELECT ... IN and index lookups (formerly secondary indexes) should
> >> >> > be
> >> >> > avoided except for specific scenarios. See When not to use IN in
> >> >> > SELECT
> >> >> > and
> >> >> > When not to use an index in Indexing in
> >> >> >
> >> >> > CQL for Cassandra 2.0"
> >> >> >
> >> >> > And Looking at the SELECT doc, I saw:
> >> >> >
> >> >> > When not to use IN¶
> >> >> >
> >> >> > The recommendations about when not to use an index apply to using
> IN
> >> >> > in
> >> >> > the
> >> >> > WHERE clause. Under most conditions, using IN in the WHERE clause
> is
> >> >> > not
> >> >> > recommended. Using IN can degrade performance because usually many
> >> >> > nodes
> >> >> > must be queried. For example, in a single, local data center
> cluster
> >> >> > having
> >> >> > 30 nodes, a replication factor of 3, and a consistency level of
> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the
> >> >> > query
> >> >> > uses the IN condition, the number of nodes being queried are most
> >> >> > likely
> >> >> > even higher, up to 20 nodes depending on where the keys fall in the
> >> >> > token
> >> >> > range."
> >> >> >
> >> >> > In my system, I have a column family called "entity_lookup":
> >> >> >
> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
> >> >> >   'DC1' : 3 };
> >> >> > USE Identification1;
> >> >> >
> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
> >> >> >   name varchar,
> >> >> >   value varchar,
> >> >> >   entity_id uuid,
> >> >> >   PRIMARY KEY ((name, value), entity_id));
> >> >> >
> >> >> > And I use the following select to query it:
> >> >> >
> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
> >> >> >
> >> >> > Is this an anti-pattern?
> >> >> >
> >> >> > If not using SELECT IN, which other way would you recomend for
> >> >> > lookups
> >> >> > like
> >> >> > that? I have several values I would like to search in cassandra and
> >> >> > they
> >> >> > might not be in the same particion, as above.
> >> >> >
> >> >> > Is Cassandra the wrong tool for lookups like that?
> >> >> >
> >> >> > Best regards,
> >> >> > Marcelo Valle.
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Jon Haddad
> >> >> http://www.rustyrazorblade.com
> >> >> skype: rustyrazorblade
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jon Haddad
> >> http://www.rustyrazorblade.com
> >> skype: rustyrazorblade
> >
> >
>
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> skype: rustyrazorblade
>

Re: Best way to do a multi_get using CQL

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

The only case in which it might be better to use an IN clause is if
the entire query can be satisfied from that machine.  Otherwise, go
async.

The native driver reuses connections and intelligently manages the
pool for you.  It can also multiplex queries over a single connection.

I am assuming you're using one of the datastax drivers for CQL, btw.

Jon

On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
<ma...@s1mbi0se.com.br> wrote:
> This is interesting, I didn't know that!
> It might make sense then to use select = + async + token aware, I will try
> to change my code.
>
> But would it be a "recomended solution" for these cases? Any other options?
>
> I still would if this is the right use case for Cassandra, to look for
> random keys in a huge cluster. After all, the amount of connections to
> Cassandra will still be huge, right... Wouldn't it be a problem?
> Or when you use async the driver reuses the connection?
>
> []s
>
>
> 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>
>> If you use async and your driver is token aware, it will go to the
>> proper node, rather than requiring the coordinator to do so.
>>
>> Realistically you're going to have a connection open to every server
>> anyways.  It's the difference between you querying for the data
>> directly and using a coordinator as a proxy.  It's faster to just ask
>> the node with the data.
>>
>> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>> <ma...@s1mbi0se.com.br> wrote:
>> > But using async queries wouldn't be even worse than using SELECT IN?
>> > The justification in the docs is I could query many nodes, but I would
>> > still
>> > do it.
>> >
>> > Today, I use both async queries AND SELECT IN:
>> >
>> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + "
>> > WHERE
>> > name=%s and value in(%s)"
>> >
>> > for name, values in identifiers.items():
>> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>> > ','.join(['%s']*len(values)))
>> >    args = [name] + values
>> >    query_msg = query % tuple(args)
>> >    futures.append((query_msg, self.session.execute_async(query, args)))
>> >
>> > for query_msg, future in futures:
>> >    try:
>> >       rows = future.result(timeout=100000)
>> >       for row in rows:
>> >         entity_ids.add(row.entity_id)
>> >    except:
>> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>> >       raise
>> >
>> > Using async just with select = would mean instead of 1 async query
>> > (example:
>> > in (0, 1, 2)), I would do several, one for each value of "values" array
>> > above.
>> > In my head, this would mean more connections to Cassandra and the same
>> > amount of work, right? What would be the advantage?
>> >
>> > []s
>> >
>> >
>> >
>> >
>> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>> >
>> >> Your other option is to fire off async queries.  It's pretty
>> >> straightforward w/ the java or python drivers.
>> >>
>> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>> >> <ma...@s1mbi0se.com.br> wrote:
>> >> > I was taking a look at Cassandra anti-patterns list:
>> >> >
>> >> >
>> >> >
>> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>> >> >
>> >> > Among then is
>> >> >
>> >> > SELECT ... IN or index lookups¶
>> >> >
>> >> > SELECT ... IN and index lookups (formerly secondary indexes) should
>> >> > be
>> >> > avoided except for specific scenarios. See When not to use IN in
>> >> > SELECT
>> >> > and
>> >> > When not to use an index in Indexing in
>> >> >
>> >> > CQL for Cassandra 2.0"
>> >> >
>> >> > And Looking at the SELECT doc, I saw:
>> >> >
>> >> > When not to use IN¶
>> >> >
>> >> > The recommendations about when not to use an index apply to using IN
>> >> > in
>> >> > the
>> >> > WHERE clause. Under most conditions, using IN in the WHERE clause is
>> >> > not
>> >> > recommended. Using IN can degrade performance because usually many
>> >> > nodes
>> >> > must be queried. For example, in a single, local data center cluster
>> >> > having
>> >> > 30 nodes, a replication factor of 3, and a consistency level of
>> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the
>> >> > query
>> >> > uses the IN condition, the number of nodes being queried are most
>> >> > likely
>> >> > even higher, up to 20 nodes depending on where the keys fall in the
>> >> > token
>> >> > range."
>> >> >
>> >> > In my system, I have a column family called "entity_lookup":
>> >> >
>> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>> >> >   'DC1' : 3 };
>> >> > USE Identification1;
>> >> >
>> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>> >> >   name varchar,
>> >> >   value varchar,
>> >> >   entity_id uuid,
>> >> >   PRIMARY KEY ((name, value), entity_id));
>> >> >
>> >> > And I use the following select to query it:
>> >> >
>> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>> >> >
>> >> > Is this an anti-pattern?
>> >> >
>> >> > If not using SELECT IN, which other way would you recomend for
>> >> > lookups
>> >> > like
>> >> > that? I have several values I would like to search in cassandra and
>> >> > they
>> >> > might not be in the same particion, as above.
>> >> >
>> >> > Is Cassandra the wrong tool for lookups like that?
>> >> >
>> >> > Best regards,
>> >> > Marcelo Valle.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jon Haddad
>> >> http://www.rustyrazorblade.com
>> >> skype: rustyrazorblade
>> >
>> >
>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade

Re: Best way to do a multi_get using CQL

Posted by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br>.

This is interesting, I didn't know that!
It might make sense then to use select = + async + token aware, I will try
to change my code.

But would it be a "recomended solution" for these cases? Any other options?

I still would if this is the right use case for Cassandra, to look for
random keys in a huge cluster. After all, the amount of connections to
Cassandra will still be huge, right... Wouldn't it be a problem?
Or when you use async the driver reuses the connection?

[]s


2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:

> If you use async and your driver is token aware, it will go to the
> proper node, rather than requiring the coordinator to do so.
>
> Realistically you're going to have a connection open to every server
> anyways.  It's the difference between you querying for the data
> directly and using a coordinator as a proxy.  It's faster to just ask
> the node with the data.
>
> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
> <ma...@s1mbi0se.com.br> wrote:
> > But using async queries wouldn't be even worse than using SELECT IN?
> > The justification in the docs is I could query many nodes, but I would
> still
> > do it.
> >
> > Today, I use both async queries AND SELECT IN:
> >
> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE
> > name=%s and value in(%s)"
> >
> > for name, values in identifiers.items():
> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
> ','.join(['%s']*len(values)))
> >    args = [name] + values
> >    query_msg = query % tuple(args)
> >    futures.append((query_msg, self.session.execute_async(query, args)))
> >
> > for query_msg, future in futures:
> >    try:
> >       rows = future.result(timeout=100000)
> >       for row in rows:
> >         entity_ids.add(row.entity_id)
> >    except:
> >       logging.error("Query '%s' returned ERROR " % (query_msg))
> >       raise
> >
> > Using async just with select = would mean instead of 1 async query
> (example:
> > in (0, 1, 2)), I would do several, one for each value of "values" array
> > above.
> > In my head, this would mean more connections to Cassandra and the same
> > amount of work, right? What would be the advantage?
> >
> > []s
> >
> >
> >
> >
> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
> >
> >> Your other option is to fire off async queries.  It's pretty
> >> straightforward w/ the java or python drivers.
> >>
> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
> >> <ma...@s1mbi0se.com.br> wrote:
> >> > I was taking a look at Cassandra anti-patterns list:
> >> >
> >> >
> >> >
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
> >> >
> >> > Among then is
> >> >
> >> > SELECT ... IN or index lookups¶
> >> >
> >> > SELECT ... IN and index lookups (formerly secondary indexes) should be
> >> > avoided except for specific scenarios. See When not to use IN in
> SELECT
> >> > and
> >> > When not to use an index in Indexing in
> >> >
> >> > CQL for Cassandra 2.0"
> >> >
> >> > And Looking at the SELECT doc, I saw:
> >> >
> >> > When not to use IN¶
> >> >
> >> > The recommendations about when not to use an index apply to using IN
> in
> >> > the
> >> > WHERE clause. Under most conditions, using IN in the WHERE clause is
> not
> >> > recommended. Using IN can degrade performance because usually many
> nodes
> >> > must be queried. For example, in a single, local data center cluster
> >> > having
> >> > 30 nodes, a replication factor of 3, and a consistency level of
> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the
> query
> >> > uses the IN condition, the number of nodes being queried are most
> likely
> >> > even higher, up to 20 nodes depending on where the keys fall in the
> >> > token
> >> > range."
> >> >
> >> > In my system, I have a column family called "entity_lookup":
> >> >
> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
> >> >   'DC1' : 3 };
> >> > USE Identification1;
> >> >
> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
> >> >   name varchar,
> >> >   value varchar,
> >> >   entity_id uuid,
> >> >   PRIMARY KEY ((name, value), entity_id));
> >> >
> >> > And I use the following select to query it:
> >> >
> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
> >> >
> >> > Is this an anti-pattern?
> >> >
> >> > If not using SELECT IN, which other way would you recomend for lookups
> >> > like
> >> > that? I have several values I would like to search in cassandra and
> they
> >> > might not be in the same particion, as above.
> >> >
> >> > Is Cassandra the wrong tool for lookups like that?
> >> >
> >> > Best regards,
> >> > Marcelo Valle.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jon Haddad
> >> http://www.rustyrazorblade.com
> >> skype: rustyrazorblade
> >
> >
>
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> skype: rustyrazorblade
>

Re: Best way to do a multi_get using CQL

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

If you use async and your driver is token aware, it will go to the
proper node, rather than requiring the coordinator to do so.

Realistically you're going to have a connection open to every server
anyways.  It's the difference between you querying for the data
directly and using a coordinator as a proxy.  It's faster to just ask
the node with the data.

On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
<ma...@s1mbi0se.com.br> wrote:
> But using async queries wouldn't be even worse than using SELECT IN?
> The justification in the docs is I could query many nodes, but I would still
> do it.
>
> Today, I use both async queries AND SELECT IN:
>
> SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE
> name=%s and value in(%s)"
>
> for name, values in identifiers.items():
>    query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>    args = [name] + values
>    query_msg = query % tuple(args)
>    futures.append((query_msg, self.session.execute_async(query, args)))
>
> for query_msg, future in futures:
>    try:
>       rows = future.result(timeout=100000)
>       for row in rows:
>         entity_ids.add(row.entity_id)
>    except:
>       logging.error("Query '%s' returned ERROR " % (query_msg))
>       raise
>
> Using async just with select = would mean instead of 1 async query (example:
> in (0, 1, 2)), I would do several, one for each value of "values" array
> above.
> In my head, this would mean more connections to Cassandra and the same
> amount of work, right? What would be the advantage?
>
> []s
>
>
>
>
> 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:
>
>> Your other option is to fire off async queries.  It's pretty
>> straightforward w/ the java or python drivers.
>>
>> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>> <ma...@s1mbi0se.com.br> wrote:
>> > I was taking a look at Cassandra anti-patterns list:
>> >
>> >
>> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>> >
>> > Among then is
>> >
>> > SELECT ... IN or index lookups¶
>> >
>> > SELECT ... IN and index lookups (formerly secondary indexes) should be
>> > avoided except for specific scenarios. See When not to use IN in SELECT
>> > and
>> > When not to use an index in Indexing in
>> >
>> > CQL for Cassandra 2.0"
>> >
>> > And Looking at the SELECT doc, I saw:
>> >
>> > When not to use IN¶
>> >
>> > The recommendations about when not to use an index apply to using IN in
>> > the
>> > WHERE clause. Under most conditions, using IN in the WHERE clause is not
>> > recommended. Using IN can degrade performance because usually many nodes
>> > must be queried. For example, in a single, local data center cluster
>> > having
>> > 30 nodes, a replication factor of 3, and a consistency level of
>> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
>> > uses the IN condition, the number of nodes being queried are most likely
>> > even higher, up to 20 nodes depending on where the keys fall in the
>> > token
>> > range."
>> >
>> > In my system, I have a column family called "entity_lookup":
>> >
>> > CREATE KEYSPACE IF NOT EXISTS Identification1
>> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>> >   'DC1' : 3 };
>> > USE Identification1;
>> >
>> > CREATE TABLE IF NOT EXISTS entity_lookup (
>> >   name varchar,
>> >   value varchar,
>> >   entity_id uuid,
>> >   PRIMARY KEY ((name, value), entity_id));
>> >
>> > And I use the following select to query it:
>> >
>> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>> >
>> > Is this an anti-pattern?
>> >
>> > If not using SELECT IN, which other way would you recomend for lookups
>> > like
>> > that? I have several values I would like to search in cassandra and they
>> > might not be in the same particion, as above.
>> >
>> > Is Cassandra the wrong tool for lookups like that?
>> >
>> > Best regards,
>> > Marcelo Valle.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade

Re: Best way to do a multi_get using CQL

Posted by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br>.

But using async queries wouldn't be even worse than using SELECT IN?
The justification in the docs is I could query many nodes, but I would
still do it.

Today, I use both async queries AND SELECT IN:

SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE
name=%s and value in(%s)"

for name, values in identifiers.items():
   query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
   args = [name] + values
   query_msg = query % tuple(args)
   futures.append((query_msg, self.session.execute_async(query, args)))

for query_msg, future in futures:
   try:
      rows = future.result(timeout=100000)
      for row in rows:
        entity_ids.add(row.entity_id)
   except:
      logging.error("Query '%s' returned ERROR " % (query_msg))
      raise

Using async just with select = would mean instead of 1 async query
(example: in (0, 1, 2)), I would do several, one for each value of "values"
array above.
In my head, this would mean more connections to Cassandra and the same
amount of work, right? What would be the advantage?

[]s




2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jo...@jonhaddad.com>:

> Your other option is to fire off async queries.  It's pretty
> straightforward w/ the java or python drivers.
>
> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
> <ma...@s1mbi0se.com.br> wrote:
> > I was taking a look at Cassandra anti-patterns list:
> >
> >
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
> >
> > Among then is
> >
> > SELECT ... IN or index lookups¶
> >
> > SELECT ... IN and index lookups (formerly secondary indexes) should be
> > avoided except for specific scenarios. See When not to use IN in SELECT
> and
> > When not to use an index in Indexing in
> >
> > CQL for Cassandra 2.0"
> >
> > And Looking at the SELECT doc, I saw:
> >
> > When not to use IN¶
> >
> > The recommendations about when not to use an index apply to using IN in
> the
> > WHERE clause. Under most conditions, using IN in the WHERE clause is not
> > recommended. Using IN can degrade performance because usually many nodes
> > must be queried. For example, in a single, local data center cluster
> having
> > 30 nodes, a replication factor of 3, and a consistency level of
> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
> > uses the IN condition, the number of nodes being queried are most likely
> > even higher, up to 20 nodes depending on where the keys fall in the token
> > range."
> >
> > In my system, I have a column family called "entity_lookup":
> >
> > CREATE KEYSPACE IF NOT EXISTS Identification1
> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
> >   'DC1' : 3 };
> > USE Identification1;
> >
> > CREATE TABLE IF NOT EXISTS entity_lookup (
> >   name varchar,
> >   value varchar,
> >   entity_id uuid,
> >   PRIMARY KEY ((name, value), entity_id));
> >
> > And I use the following select to query it:
> >
> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
> >
> > Is this an anti-pattern?
> >
> > If not using SELECT IN, which other way would you recomend for lookups
> like
> > that? I have several values I would like to search in cassandra and they
> > might not be in the same particion, as above.
> >
> > Is Cassandra the wrong tool for lookups like that?
> >
> > Best regards,
> > Marcelo Valle.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> skype: rustyrazorblade
>

Re: Best way to do a multi_get using CQL

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

Your other option is to fire off async queries.  It's pretty
straightforward w/ the java or python drivers.

On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
<ma...@s1mbi0se.com.br> wrote:
> I was taking a look at Cassandra anti-patterns list:
>
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>
> Among then is
>
> SELECT ... IN or index lookups¶
>
> SELECT ... IN and index lookups (formerly secondary indexes) should be
> avoided except for specific scenarios. See When not to use IN in SELECT and
> When not to use an index in Indexing in
>
> CQL for Cassandra 2.0"
>
> And Looking at the SELECT doc, I saw:
>
> When not to use IN¶
>
> The recommendations about when not to use an index apply to using IN in the
> WHERE clause. Under most conditions, using IN in the WHERE clause is not
> recommended. Using IN can degrade performance because usually many nodes
> must be queried. For example, in a single, local data center cluster having
> 30 nodes, a replication factor of 3, and a consistency level of
> LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
> uses the IN condition, the number of nodes being queried are most likely
> even higher, up to 20 nodes depending on where the keys fall in the token
> range."
>
> In my system, I have a column family called "entity_lookup":
>
> CREATE KEYSPACE IF NOT EXISTS Identification1
>   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>   'DC1' : 3 };
> USE Identification1;
>
> CREATE TABLE IF NOT EXISTS entity_lookup (
>   name varchar,
>   value varchar,
>   entity_id uuid,
>   PRIMARY KEY ((name, value), entity_id));
>
> And I use the following select to query it:
>
> SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>
> Is this an anti-pattern?
>
> If not using SELECT IN, which other way would you recomend for lookups like
> that? I have several values I would like to search in cassandra and they
> might not be in the same particion, as above.
>
> Is Cassandra the wrong tool for lookups like that?
>
> Best regards,
> Marcelo Valle.
>
>
>
>
>
>
>
>
>
>
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade