You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by "Hartzman, Leslie" <le...@medtronic.com> on 2013/09/21 00:25:50 UTC

Ad-hoc queries question

I know that for NoSQL the idea is to figure out your queries beforehand and then plan your data architecture to support them. And this typically is accomplished with a denormalized database.

So are ad-hoc queries more awkward or not feasible?

Thanks.

Les


[CONFIDENTIALITY AND PRIVACY NOTICE]

Information transmitted by this email is proprietary to Medtronic and is intended for use only by the individual or entity to which it is addressed, and may contain information that is private, privileged, confidential or exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly prohibited. In such cases, please delete this mail from your records.
 
To view this notice in other languages you can either select the following link or manually copy and paste the link into the address bar of a web browser: http://emaildisclaimer.medtronic.com

RE: Ad-hoc queries question

Posted by "Hartzman, Leslie" <le...@medtronic.com>.

Cool! Thanks for the suggestions.

From: Peter Lin [mailto:woolfel@gmail.com]
Sent: Friday, September 20, 2013 4:52 PM
To: user@cassandra.apache.org
Subject: Re: Ad-hoc queries question

there are several ways of handling these types of use cases. Some people take a soft real-time approach by calculating aggregates in-memory and saving it to tables periodically. One example of this is twitter and storm. Other techniques includes using batch process to extract summaries and storing them in a OLAP cube, for reporting purposes.
If your application doesn't need ad-hoc queries results immediately, usually mapreduce is sufficient. Many people use Pig and Hive to do this type of operation.

On Fri, Sep 20, 2013 at 7:41 PM, Hartzman, Leslie <le...@medtronic.com>> wrote:
By ad-hoc queries I mean exactly what you've described. The need to access data from multiple column families, typically addressed in RDBs with JOINs.

I haven't really become familiar enough with MapReduce yet, so I'll have to delve deeper into that. I'm hoping that the de-normalized nature of things would obviate the need for complex subquery-type of operations.

From: Peter Lin [mailto:woolfel@gmail.com<ma...@gmail.com>]
Sent: Friday, September 20, 2013 4:30 PM

To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Ad-hoc queries question

What do you mean by ad-hoc queries?
Most NoSql databases do not support cross table joins, due to the distributed nature of NoSql databases. If we compare this to partitioned databases in the RDB world, cross partition joins is also more expensive than non-partitioned databases.
you can do ad-hoc queries on a single table as long as the columns have secondary indexes defined. You can do multi-table joins using MapReduce or using CQL handle that logic in your application. In some cases, you can use the concept of summary tables to speed up complex multi-table adhoc queries that have nasty joins. One thing that is very hard to do with all NoSql databases is complex correlated subqueries. For those kinds of use cases, MapReduce is the "preferred" technique.

for comparison, databases like Oracle RAC distribute table indexes and perform index joins to speed up complex multi-table joins. The downside is a full Oracle RAC is very expensive and has a high up front cost.

On Fri, Sep 20, 2013 at 7:20 PM, Hartzman, Leslie <le...@medtronic.com>> wrote:
Thanks Rob. I thought that might have been the situation but wasn't sure. So does this negate the use of cqlsh to do this then? I'd hate to have to provide custom code to support ad-hoc queries.

Les

From: Robert Coli [mailto:rcoli@eventbrite.com<ma...@eventbrite.com>]
Sent: Friday, September 20, 2013 4:06 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Ad-hoc queries question

On Fri, Sep 20, 2013 at 3:25 PM, Hartzman, Leslie <le...@medtronic.com>> wrote:
So are ad-hoc queries more awkward or not feasible?

Yes.

To expand slightly, you will probably end up querying multiple columnfamilies and doing the ad-hoc JOIN-esque aspect in application code.

=Rob

[CONFIDENTIALITY AND PRIVACY NOTICE] Information transmitted by this email is proprietary to Medtronic and is intended for use only by the individual or entity to which it is addressed, and may contain information that is private, privileged, confidential or exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly prohibited. In such cases, please delete this mail from your records. To view this notice in other languages you can either select the following link or manually copy and paste the link into the address bar of a web browser: http://emaildisclaimer.medtronic.com

Re: Ad-hoc queries question

Posted by Peter Lin <wo...@gmail.com>.

there are several ways of handling these types of use cases. Some people
take a soft real-time approach by calculating aggregates in-memory and
saving it to tables periodically. One example of this is twitter and storm.
Other techniques includes using batch process to extract summaries and
storing them in a OLAP cube, for reporting purposes.

If your application doesn't need ad-hoc queries results immediately,
usually mapreduce is sufficient. Many people use Pig and Hive to do this
type of operation.



On Fri, Sep 20, 2013 at 7:41 PM, Hartzman, Leslie <
leslie.d.hartzman@medtronic.com> wrote:

>  By ad-hoc queries I mean exactly what you’ve described. The need to
> access data from multiple column families, typically addressed in RDBs with
> JOINs. ****
>
> ** **
>
> I haven’t really become familiar enough with MapReduce yet, so I’ll have
> to delve deeper into that. I’m hoping that the de-normalized nature of
> things would obviate the need for complex subquery-type of operations.****
>
> ** **
>
> *From:* Peter Lin [mailto:woolfel@gmail.com]
> *Sent:* Friday, September 20, 2013 4:30 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Ad-hoc queries question****
>
> ** **
>
> ** **
>
> What do you mean by ad-hoc queries?****
>
> Most NoSql databases do not support cross table joins, due to the
> distributed nature of NoSql databases. If we compare this to partitioned
> databases in the RDB world, cross partition joins is also more expensive
> than non-partitioned databases.****
>
> you can do ad-hoc queries on a single table as long as the columns have
> secondary indexes defined. You can do multi-table joins using MapReduce or
> using CQL handle that logic in your application. In some cases, you can use
> the concept of summary tables to speed up complex multi-table adhoc queries
> that have nasty joins. One thing that is very hard to do with all NoSql
> databases is complex correlated subqueries. For those kinds of use cases,
> MapReduce is the "preferred" technique.
>
> for comparison, databases like Oracle RAC distribute table indexes and
> perform index joins to speed up complex multi-table joins. The downside is
> a full Oracle RAC is very expensive and has a high up front cost.****
>
> ** **
>
> On Fri, Sep 20, 2013 at 7:20 PM, Hartzman, Leslie <
> leslie.d.hartzman@medtronic.com> wrote:****
>
> Thanks Rob. I thought that might have been the situation but wasn’t sure.
> So does this negate the use of cqlsh to do this then? I’d hate to have to
> provide custom code to support ad-hoc queries.****
>
>  ****
>
> Les****
>
>  ****
>
> *From:* Robert Coli [mailto:rcoli@eventbrite.com]
> *Sent:* Friday, September 20, 2013 4:06 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Ad-hoc queries question****
>
>  ****
>
> On Fri, Sep 20, 2013 at 3:25 PM, Hartzman, Leslie <
> leslie.d.hartzman@medtronic.com> wrote:****
>
>  So are ad-hoc queries more awkward or not feasible?****
>
>   ****
>
> Yes.****
>
>  ****
>
> To expand slightly, you will probably end up querying multiple
> columnfamilies and doing the ad-hoc JOIN-esque aspect in application code.
> ****
>
>  ****
>
> =Rob****
>
>  ****
>
> [CONFIDENTIALITY AND PRIVACY NOTICE] Information transmitted by this email
> is proprietary to Medtronic and is intended for use only by the individual
> or entity to which it is addressed, and may contain information that is
> private, privileged, confidential or exempt from disclosure under
> applicable law. If you are not the intended recipient or it appears that
> this mail has been forwarded to you without proper authority, you are
> notified that any use or dissemination of this information in any manner is
> strictly prohibited. In such cases, please delete this mail from your
> records. To view this notice in other languages you can either select the
> following link or manually copy and paste the link into the address bar of
> a web browser: http://emaildisclaimer.medtronic.com****
>
> ** **
>

RE: Ad-hoc queries question

Posted by "Hartzman, Leslie" <le...@medtronic.com>.

By ad-hoc queries I mean exactly what you've described. The need to access data from multiple column families, typically addressed in RDBs with JOINs.

I haven't really become familiar enough with MapReduce yet, so I'll have to delve deeper into that. I'm hoping that the de-normalized nature of things would obviate the need for complex subquery-type of operations.

From: Peter Lin [mailto:woolfel@gmail.com]
Sent: Friday, September 20, 2013 4:30 PM
To: user@cassandra.apache.org
Subject: Re: Ad-hoc queries question

What do you mean by ad-hoc queries?
Most NoSql databases do not support cross table joins, due to the distributed nature of NoSql databases. If we compare this to partitioned databases in the RDB world, cross partition joins is also more expensive than non-partitioned databases.
you can do ad-hoc queries on a single table as long as the columns have secondary indexes defined. You can do multi-table joins using MapReduce or using CQL handle that logic in your application. In some cases, you can use the concept of summary tables to speed up complex multi-table adhoc queries that have nasty joins. One thing that is very hard to do with all NoSql databases is complex correlated subqueries. For those kinds of use cases, MapReduce is the "preferred" technique.

for comparison, databases like Oracle RAC distribute table indexes and perform index joins to speed up complex multi-table joins. The downside is a full Oracle RAC is very expensive and has a high up front cost.

On Fri, Sep 20, 2013 at 7:20 PM, Hartzman, Leslie <le...@medtronic.com>> wrote:
Thanks Rob. I thought that might have been the situation but wasn't sure. So does this negate the use of cqlsh to do this then? I'd hate to have to provide custom code to support ad-hoc queries.

Les

From: Robert Coli [mailto:rcoli@eventbrite.com<ma...@eventbrite.com>]
Sent: Friday, September 20, 2013 4:06 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Ad-hoc queries question

On Fri, Sep 20, 2013 at 3:25 PM, Hartzman, Leslie <le...@medtronic.com>> wrote:
So are ad-hoc queries more awkward or not feasible?

Yes.

To expand slightly, you will probably end up querying multiple columnfamilies and doing the ad-hoc JOIN-esque aspect in application code.

=Rob

[CONFIDENTIALITY AND PRIVACY NOTICE] Information transmitted by this email is proprietary to Medtronic and is intended for use only by the individual or entity to which it is addressed, and may contain information that is private, privileged, confidential or exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly prohibited. In such cases, please delete this mail from your records. To view this notice in other languages you can either select the following link or manually copy and paste the link into the address bar of a web browser: http://emaildisclaimer.medtronic.com

Re: Ad-hoc queries question

Posted by Peter Lin <wo...@gmail.com>.

What do you mean by ad-hoc queries?

Most NoSql databases do not support cross table joins, due to the
distributed nature of NoSql databases. If we compare this to partitioned
databases in the RDB world, cross partition joins is also more expensive
than non-partitioned databases.

you can do ad-hoc queries on a single table as long as the columns have
secondary indexes defined. You can do multi-table joins using MapReduce or
using CQL handle that logic in your application. In some cases, you can use
the concept of summary tables to speed up complex multi-table adhoc queries
that have nasty joins. One thing that is very hard to do with all NoSql
databases is complex correlated subqueries. For those kinds of use cases,
MapReduce is the "preferred" technique.

for comparison, databases like Oracle RAC distribute table indexes and
perform index joins to speed up complex multi-table joins. The downside is
a full Oracle RAC is very expensive and has a high up front cost.

On Fri, Sep 20, 2013 at 7:20 PM, Hartzman, Leslie <
leslie.d.hartzman@medtronic.com> wrote:

>  Thanks Rob. I thought that might have been the situation but wasn’t
> sure. So does this negate the use of cqlsh to do this then? I’d hate to
> have to provide custom code to support ad-hoc queries.****
>
> ** **
>
> Les****
>
> ** **
>
> *From:* Robert Coli [mailto:rcoli@eventbrite.com]
> *Sent:* Friday, September 20, 2013 4:06 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Ad-hoc queries question****
>
> ** **
>
> On Fri, Sep 20, 2013 at 3:25 PM, Hartzman, Leslie <
> leslie.d.hartzman@medtronic.com> wrote:****
>
>  So are ad-hoc queries more awkward or not feasible?****
>
>  ** **
>
> Yes.****
>
> ** **
>
> To expand slightly, you will probably end up querying multiple
> columnfamilies and doing the ad-hoc JOIN-esque aspect in application code.
> ****
>
> ** **
>
> =Rob****
>
> ** **
>
> [CONFIDENTIALITY AND PRIVACY NOTICE] Information transmitted by this email
> is proprietary to Medtronic and is intended for use only by the individual
> or entity to which it is addressed, and may contain information that is
> private, privileged, confidential or exempt from disclosure under
> applicable law. If you are not the intended recipient or it appears that
> this mail has been forwarded to you without proper authority, you are
> notified that any use or dissemination of this information in any manner is
> strictly prohibited. In such cases, please delete this mail from your
> records. To view this notice in other languages you can either select the
> following link or manually copy and paste the link into the address bar of
> a web browser: http://emaildisclaimer.medtronic.com
>

RE: Ad-hoc queries question

Posted by "Hartzman, Leslie" <le...@medtronic.com>.

Yeah, I know it was vague, but that is due to the fact that I'm still coming up to speed on the project and have yet to hear some of the details. Since I had heard that there has always been a requirement for ad-hoc queries against the Oracle DB for data-mining purpsoes, that was the best I could do. The database definition to support the day-to-day needs was not a worry. It was how some analysts or business people might use it.

My hope is that with a better understanding of how Oracle is used now, a better database definition will remove/minimize the need for some of the existing complexity.

Les

From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Friday, September 20, 2013 5:10 PM
To: user@cassandra.apache.org
Subject: Re: Ad-hoc queries question

On Fri, Sep 20, 2013 at 4:20 PM, Hartzman, Leslie <le...@medtronic.com>> wrote:
Thanks Rob. I thought that might have been the situation but wasn't sure. So does this negate the use of cqlsh to do this then? I'd hate to have to provide custom code to support ad-hoc queries.

The form of your question is pretty vague. CQLsh, and CQL generally, give you a bit more flexibility to construct complex queries than the old thrift interface. The more complex these queries, however, the worse/less predictably they are likely to perform. An example would be ALLOW FILTERING, which the docs describe thusly.

"
By default, CQL only allows select queries that don't involve "filtering" server side, i.e. queries where we know that all (live) record read will be returned (maybe partly) in the result set. The reasoning is that those "non filtering" queries have predictable performance in the sense that they will execute in a time that is proportional to the amount of data returned by the query (which can be controlled through LIMIT).

The ALLOW FILTERING option allows to explicitely [sic] allow (some) queries that require filtering. Please note that a query using ALLOW FILTERING may thus have unpredictable performance (for the definition above), i.e. even a query that selects a handful of records may exhibit performance that depends on the total amount of data stored in the cluster.
"

=Rob

[CONFIDENTIALITY AND PRIVACY NOTICE]

Information transmitted by this email is proprietary to Medtronic and is intended for use only by the individual or entity to which it is addressed, and may contain information that is private, privileged, confidential or exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly prohibited. In such cases, please delete this mail from your records.

To view this notice in other languages you can either select the following link or manually copy and paste the link into the address bar of a web browser: http://emaildisclaimer.medtronic.com

Re: Ad-hoc queries question

Posted by Robert Coli <rc...@eventbrite.com>.

On Fri, Sep 20, 2013 at 4:20 PM, Hartzman, Leslie <
leslie.d.hartzman@medtronic.com> wrote:

>  Thanks Rob. I thought that might have been the situation but wasn’t
> sure. So does this negate the use of cqlsh to do this then? I’d hate to
> have to provide custom code to support ad-hoc queries.
>

The form of your question is pretty vague. CQLsh, and CQL generally, give
you a bit more flexibility to construct complex queries than the old thrift
interface. The more complex these queries, however, the worse/less
predictably they are likely to perform. An example would be ALLOW
FILTERING, which the docs describe thusly.

"
By default, CQL only allows select queries that don’t involve “filtering”
server side, i.e. queries where we know that all (live) record read will be
returned (maybe partly) in the result set. The reasoning is that those “non
filtering” queries have predictable performance in the sense that they will
execute in a time that is proportional to the amount of data returned by
the query (which can be controlled through LIMIT).

The ALLOW FILTERING option allows to explicitely [sic] allow (some) queries
that require filtering. Please note that a query using ALLOW FILTERING may
thus have unpredictable performance (for the definition above), i.e. even a
query that selects a handful of records may exhibit performance that
depends on the total amount of data stored in the cluster.
"

=Rob

RE: Ad-hoc queries question

Posted by "Hartzman, Leslie" <le...@medtronic.com>.

Thanks Rob. I thought that might have been the situation but wasn't sure. So does this negate the use of cqlsh to do this then? I'd hate to have to provide custom code to support ad-hoc queries.

Les

From: Robert Coli [mailto:rcoli@eventbrite.com]
Sent: Friday, September 20, 2013 4:06 PM
To: user@cassandra.apache.org
Subject: Re: Ad-hoc queries question

On Fri, Sep 20, 2013 at 3:25 PM, Hartzman, Leslie <le...@medtronic.com>> wrote:
So are ad-hoc queries more awkward or not feasible?

Yes.

To expand slightly, you will probably end up querying multiple columnfamilies and doing the ad-hoc JOIN-esque aspect in application code.

=Rob

[CONFIDENTIALITY AND PRIVACY NOTICE]

Information transmitted by this email is proprietary to Medtronic and is intended for use only by the individual or entity to which it is addressed, and may contain information that is private, privileged, confidential or exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly prohibited. In such cases, please delete this mail from your records.

To view this notice in other languages you can either select the following link or manually copy and paste the link into the address bar of a web browser: http://emaildisclaimer.medtronic.com

Re: Ad-hoc queries question

Posted by Robert Coli <rc...@eventbrite.com>.

On Fri, Sep 20, 2013 at 3:25 PM, Hartzman, Leslie <
leslie.d.hartzman@medtronic.com> wrote:

>  So are ad-hoc queries more awkward or not feasible?
>

Yes.

To expand slightly, you will probably end up querying multiple
columnfamilies and doing the ad-hoc JOIN-esque aspect in application code.

=Rob