Posted to user@ignite.apache.org by "Roger Fischer (CW)" <rf...@Brocade.com> on 2017/07/27 00:03:09 UTC

Question on efficient loading from Cassandra

Hello,

What is the best way to efficiently load data from a backing store such as Cassandra? I am looking for a solution that minimizes work in both Ignite and Cassandra.

As I understand:

The simplest way is to call loadCache() with a single select statement.
cache.loadCache(null, "select * from a_table where a_date_time >= '2017-07-25 10:00:00';");

Is it correct that:
1) Each Ignite node gets the same loadCache() request.
2) Each Ignite node sends the same query to Cassandra.
3) Each Ignite node gets all matched objects (rows) back from Cassandra.
4) Each Ignite node stores only the objects for which it has the primary partition, or a backup partition.

Unless I misunderstand, this simple approach has the following inefficiencies:
a) Cassandra executes the same query multiple times, once for each Ignite node.
b) The query results are transferred multiple times, once for each Ignite node.
c) Each Ignite node receives a lot of data that it does not need (it holds neither the primary nor a backup partition).
d) Each Cassandra node has to query all partitions.
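To put rough numbers on points a) to c), here is a back-of-the-envelope sketch in Java. It assumes rows are spread evenly over the Ignite nodes and that every node runs the same query; the class name, method names, and example figures are hypothetical and only serve the illustration.

```java
// Rough cost model for the single-query loadCache() approach described above.
// Assumption: rows are distributed evenly across Ignite nodes.
final class LoadCost {
    // Every Ignite node receives every matched row, so Cassandra ships
    // matchedRows once per node.
    static long rowsTransferred(long matchedRows, int igniteNodes) {
        return matchedRows * igniteNodes;
    }

    // Across the cluster, each row is stored once as primary plus once per backup.
    static long rowsKept(long matchedRows, int backups) {
        return matchedRows * (1 + backups);
    }

    // Share of the rows a single node receives that it actually stores.
    static double usefulFraction(int igniteNodes, int backups) {
        return Math.min(1.0, (1.0 + backups) / igniteNodes);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical result size
        int nodes = 10;         // hypothetical cluster size
        int backups = 1;        // hypothetical backup count
        System.out.println("transferred: " + rowsTransferred(rows, nodes));
        System.out.println("kept:        " + rowsKept(rows, backups));
        System.out.println("useful:      " + usefulFraction(nodes, backups));
    }
}
```

With these example figures, Cassandra ships ten million rows while the cluster keeps two million, so each node discards about 80% of what it receives.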

loadCache() supports multiple queries. This allows the query to be broken down, ideally (for this case) into one query per Cassandra partition.

cache.loadCache(null, "select * from a_table where partition_key = 0 and a_date_time >= '2017-07-25 10:00:00';", "select * from a_table where partition_key = 1 and a_date_time >= '2017-07-25 10:00:00';", ...);

This optimizes the Cassandra query, as each query is constrained to one Cassandra partition.
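Rather than writing each statement by hand, the per-partition queries could be generated. The following is a minimal sketch, assuming that partition_key takes the integer values 0 to partitionCount - 1; the class name, the build() helper, and the partition count are hypothetical. The loadCache() call is shown only as a comment so that the sketch runs without Ignite on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Generates one CQL statement per Cassandra partition key value.
final class PartitionQueries {
    // Assumption: partition_key is an integer in the range [0, partitionCount).
    static String[] build(int partitionCount, String since) {
        List<String> queries = new ArrayList<>();
        for (int p = 0; p < partitionCount; p++) {
            queries.add("select * from a_table where partition_key = " + p
                + " and a_date_time >= '" + since + "';");
        }
        return queries.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] queries = build(4, "2017-07-25 10:00:00");
        for (String q : queries) {
            System.out.println(q);
        }
        // With an IgniteCache instance named cache in scope, the array would be
        // passed as the varargs of loadCache():
        // cache.loadCache(null, (Object[]) queries);
    }
}
```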

But, I think, each node still needs to execute each query. Thus none of the other inefficiencies are eliminated.

I believe that, when multiple cores (worker threads) are available, the Ignite nodes will execute multiple queries in parallel. So, there is a reduction in elapsed time. Correct?

Now, is there any way to avoid Cassandra executing the same query multiple times and the data being transferred multiple times?

One approach would be that an Ignite node modifies the query so that it only includes the partitions for which it has the primary or a backup partition. That eliminates some duplication, but may not result in efficient queries in Cassandra.

Another approach is that an Ignite node forwards objects for which it holds neither the primary nor a backup partition (similar to when an application does a put()). That would optimize the Cassandra query, but would require additional communication between Ignite nodes.

What if Ignite and Cassandra partitions were aligned? Then queries could be created that return only data relevant to the node and query only a subset of Cassandra partitions. But this does not seem practical for a generalized system (I think).
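To make the aligned case concrete, here is a minimal sketch, assuming the Cassandra partition_key value is identical to the Ignite partition number, which is exactly the alignment that may not be practical in general. The class and method names are hypothetical. In a real cluster the owned partitions would presumably come from Ignite's affinity API (I believe Affinity.allPartitions(ClusterNode) returns them), and each node would have to load locally, for example through localLoadCache(), instead of the broadcast loadCache(); neither call is exercised here.

```java
import java.util.ArrayList;
import java.util.List;

// Builds only the queries for partitions that one Ignite node owns.
// Assumption: Cassandra partition_key == Ignite partition number.
final class NodeLocalQueries {
    static List<String> forPartitions(int[] ownedPartitions, String since) {
        List<String> queries = new ArrayList<>();
        for (int p : ownedPartitions) {
            queries.add("select * from a_table where partition_key = " + p
                + " and a_date_time >= '" + since + "';");
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical node: primary for partitions 0 and 2, backup for partition 1.
        int[] owned = {0, 2, 1};
        for (String q : forPartitions(owned, "2017-07-25 10:00:00")) {
            System.out.println(q);
        }
    }
}
```

Under this assumption each Cassandra partition is read once per owning Ignite node (primary plus backups) rather than once per node in the cluster.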

Any other suggestions?

Thanks...

Roger

PS: The use case for this is to use Ignite as an SQL cache for a large data set in the Cassandra DB. The most recent data is pre-loaded (and updated) in Ignite. When older data is required, it is loaded first into Ignite, and then processed. It is this dynamic loading that should be quick (and efficient).


Re: Question on efficient loading from Cassandra

Posted by Igor Rudyak <ir...@gmail.com>.
Hi Nikolai,

As of now, the Ignite-Cassandra module always executes the same CQL query on
each node when doing loadCache(...).

But your assumptions are right, and there is a ticket for this:
https://issues.apache.org/jira/browse/IGNITE-3962


Igor



On Thu, Jul 27, 2017 at 10:28 AM, Nikolai Tikhonov <nt...@apache.org>
wrote:

> Hello,
>
> >So, there is a reduction in elapsed time. Correct?
>
> I think that is not correct in every case. If you have a significant number
> of nodes (for example, 20 nodes with 4 cores each), then in a short period of
> time Ignite will be querying Cassandra from ~80 threads. I'm not sure that
> this high load will give better performance than one thread per node. BTW,
> do you know whether Cassandra caches queries?

Re: Question on efficient loading from Cassandra

Posted by Nikolai Tikhonov <nt...@apache.org>.
Hello,

>So, there is a reduction in elapsed time. Correct?

I think that is not correct in every case. If you have a significant number
of nodes (for example, 20 nodes with 4 cores each), then in a short period of
time Ignite will be querying Cassandra from ~80 threads. I'm not sure that
this high load will give better performance than one thread per node. BTW,
do you know whether Cassandra caches queries?

