Posted to user@spark.apache.org by "renga.kannan" <re...@gmail.com> on 2015/07/21 02:04:56 UTC

Is Spark the right choice for traditional OLAP query processing?

All,
I would really appreciate anyone's input on this. We have a very simple,
traditional OLAP query processing use case, as follows.


1. We have customer sales order data coming from an RDBMS table.
2. There are many dimension columns in the sales order table. For each of
those dimensions, we have an individual dimension table that stores the
dimension records.
3. We also have some BI-like hierarchies defined on the dimension data.

What we want for business users is as follows (a rough query sketch
follows the list):

1. We want to show some aggregated values from the sales order transaction
table columns.
2. Users would like to filter these by specific dimension values from the
dimension tables.
3. Users should be able to drill down from a higher level to a lower level
by traversing a dimension hierarchy.


We want these user actions to respond within 2 to 5 seconds.


We are thinking about using Spark as our backend engine to serve data to
these front-end applications.


Has anyone tried using Spark for this kind of use case? These are all
traditional use cases in the BI space. If so, can Spark respond to these
queries within 2 to 5 seconds for large data sets?

Thanks,
Renga





Re: Is Spark the right choice for traditional OLAP query processing?

Posted by Paolo Platter <pa...@agilelab.it>.
Take a look at Zoomdata. They are Spark-based and offer BI features with good performance.

Paolo

Sent from my Windows Phone



Re: Is Spark the right choice for traditional OLAP query processing?

Posted by Ruslan Dautkhanov <da...@gmail.com>.
>> We want these user actions to respond within 2 to 5 seconds.

I think this goal is a stretch for Spark. Some queries may run faster than
that on a large dataset, but in general you can't put an SLA like this on
it. For example, if you have to join some huge datasets, you'll likely be
well over that. Spark is great for huge jobs, and it'll be much faster than
MR. I don't think Spark was designed with interactive queries in mind. For
example, although Spark is "in-memory", its in-memory caching lives only as
long as the application that created it. It's not like traditional RDBMS
systems, where you have a persistent "buffer cache" or "in-memory columnar
storage" (both are Oracle terms). If you have multiple users running
interactive BI queries, results cached for the first user won't be reused
by the second, unless you build something that keeps a persistent Spark
context, serves users' requests, and decides which RDDs to cache, when,
and how. At least that's my understanding of how Spark works. If I'm wrong,
I will be glad to hear it, as we ran into the same questions.
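
To make that "persistent context" idea concrete, here is a minimal sketch
of one long-lived driver that caches shared tables and exposes them over
JDBC/ODBC via Spark's Thrift server (Scala, Spark 1.x API; the table name
and path are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    object SharedOlapServer {
      def main(args: Array[String]): Unit = {
        // One long-lived context, so the in-memory columnar cache is
        // shared by every user's query instead of dying with each job.
        val sc = new SparkContext(new SparkConf().setAppName("shared-olap"))
        val sqlContext = new HiveContext(sc)

        sqlContext.read.parquet("/warehouse/sales_orders")
          .registerTempTable("sales_orders")
        sqlContext.cacheTable("sales_orders")

        // BI front ends connect over JDBC/ODBC and hit the cached table.
        HiveThriftServer2.startWithContext(sqlContext)
      }
    }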

As we use Cloudera's CDH, I'm not sure where Hortonworks is with their Tez
project, but Tez has components that more closely resemble the "buffer
cache" and "in-memory columnar storage" caching of traditional RDBMS
systems, and it may give better and/or more predictable performance on BI
queries.



-- 
Ruslan Dautkhanov


Re: Is Spark the right choice for traditional OLAP query processing?

Posted by Jörn Franke <jo...@gmail.com>.
You may check out Apache Phoenix on top of HBase for this. However, it does
not have ODBC drivers, only JDBC ones. Maybe Hive 1.2 with a new version of
Tez will also serve your purpose. You should run a proof of concept with
these technologies using real or generated data (a quick way to generate
test data is sketched below). How much data are we talking about in the
fact and dimension tables? Are the dimension tables small or large? Does
your current software setup not satisfy your requirements?
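
A hedged sketch of generating synthetic star-schema data for such a proof
of concept (Scala, runnable in spark-shell for Spark 1.x, where sc and
sqlContext are predefined; all names and sizes are placeholders):

    import sqlContext.implicits._

    case class Sale(productId: Int, saleAmount: Double)
    case class Product(productId: Int, category: String, subcategory: String)

    // A 10M-row fact table and a 1,000-row dimension table,
    // filled with deterministic dummy values.
    val sales = sc.parallelize(1 to 10000000)
      .map(i => Sale(i % 1000, (i % 97).toDouble))
      .toDF()
    val products = sc.parallelize(0 until 1000)
      .map(i => Product(i, s"cat-${i % 10}", s"subcat-${i % 100}"))
      .toDF()

    sales.registerTempTable("sales_orders")
    products.registerTempTable("product_dim")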


Re: Is Spark the right choice for traditional OLAP query processing?

Posted by chandan prakash <ch...@gmail.com>.
Apache Drill is also a very good candidate for this.





-- 
Chandan Prakash

Re: Is Spark the right choice for traditional OLAP query processing?

Posted by Hitoshi Ozawa <oz...@worksap.co.jp>.
It depends on how much data needs to be processed. A data warehouse with
indexes is going to be faster when there is not much data. If you have big
data, Spark Streaming and maybe Spark SQL may interest you.


