Posted to user@spark.apache.org by Mahender Sarangam <Ma...@outlook.com> on 2018/04/04 09:29:00 UTC

Building Data Warehouse Application in Spark

Hi,
Does anyone have a good architecture document or design principles for building a warehouse application using Spark?

Is it better to create a HiveContext and perform the transformations with HQL, or to load the files directly into DataFrames and perform the data transformations there?
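
To make the question concrete, here is roughly what I mean by the two options (a minimal sketch in Scala; paths, database, and column names are just placeholders):

  // Option A (sketch): query an existing Hive table with SQL.
  // Assumes Spark is configured with Hive support; all names are placeholders.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("warehouse-sketch")
    .enableHiveSupport()
    .getOrCreate()

  val fromHive = spark.sql(
    "SELECT customer_id, amount FROM staging.orders WHERE load_date = '2018-04-04'")

  // Option B (sketch): load the raw files straight into a DataFrame and
  // express the same transformation with the DataFrame API.
  val raw = spark.read
    .option("header", "true")
    .csv("/data/raw/orders/2018-04-04/")   // placeholder path

  val fromFiles = raw.select("customer_id", "amount")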

We need to implement SCD Type 2 in Spark. Is there a good document/reference for building Type 2 warehouse objects?

Thanks in advance

/Mahender

Re: Building Data Warehouse Application in Spark

Posted by "Richard A. Bross" <rb...@oaktreepeak.com>.
Mahender,

To really address your question, I think you'd have to supply a bit more information, such as the kind of data you want to store, whether you need RDBMS-type lookups or key/value/index lookups, the insert velocity, etc.  These technologies are suited to different use cases, although they overlap in some areas.

In a previous position we used Spark on Cassandra to solve a similar problem.  The DataStax distribution puts Spark worker nodes directly on the Cassandra nodes.  Because Cassandra partitions the data across nodes by row key, it's a nice match: if the key is chosen properly, the Spark workers are typically reading data that is local to their Cassandra node, so there are very few shuffles for direct queries, and inserts go directly to the proper Cassandra nodes.  We had time-series data, with unique row keys and the timestamps as the column keys.  Most of our queries were done directly with the Cassandra clients, with Spark SQL used primarily for ad-hoc queries.
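
Purely as an illustration (the keyspace, table, and column names here are invented, not our actual schema), reading one of those time-series tables through the DataStax Spark/Cassandra connector looks roughly like this:

  // Sketch only: assumes the spark-cassandra-connector is on the classpath
  // and that a keyspace "metrics" contains a table "sensor_readings".
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("cassandra-read-sketch")
    .config("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    .getOrCreate()
  import spark.implicits._

  val readings = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "metrics", "table" -> "sensor_readings"))
    .load()

  // The filter on the row (partition) key is pushed down to Cassandra, so each
  // Spark worker mostly reads data owned by the Cassandra node it sits on.
  readings
    .filter($"sensor_id" === "sensor-42")   // row key
    .orderBy($"event_time")                 // column key: the timestamp
    .show()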

At my current position, we load raw data directly into Hive (using HiveQL) and then use Presto for queries.  That's our OLAP data store.  You can use any number of other tools to query Hive-created data stores as well.

Then we have another pipeline that takes the same raw data, uses Spark for the ETL, and inserts the results into Aurora (MySQL).  The schema is designed for specific queries, so the Spark ETL transforms the data to fit that schema and allow efficient updates to those tables.  That's our OLTP data store, and we use standard SQL for queries.
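
As a very rough sketch of that last step (the connection details, table, and column names are all invented), the Spark side ends up writing the transformed DataFrame out over JDBC:

  // Sketch only: writes a transformed DataFrame to a MySQL-compatible
  // endpoint (e.g. Aurora) over JDBC. Every name and credential is a placeholder.
  import java.util.Properties
  import org.apache.spark.sql.{DataFrame, SaveMode}

  def writeToAurora(transformed: DataFrame): Unit = {
    val props = new Properties()
    props.setProperty("user", "etl_user")          // placeholder
    props.setProperty("password", "etl_password")  // placeholder
    props.setProperty("driver", "com.mysql.jdbc.Driver")

    transformed.write
      .mode(SaveMode.Append)  // append rows shaped for the target schema
      .jdbc("jdbc:mysql://aurora-endpoint:3306/warehouse", "fact_orders", props)
  }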

Rick



Re: Building Data Warehouse Application in Spark

Posted by Furcy Pin <pi...@gmail.com>.
Hi Mahender,

Did you look at this? https://www.snappydata.io/blog/the-spark-database

But I believe that most people handle this use case by using either:
- their favorite regular RDBMS (MySQL, Postgres, Oracle, SQL Server, ...) if
the data is not too big, or
- their favorite NoSQL store (Cassandra, HBase) if the data is too big and
needs to be distributed.

Spark generally makes it easy enough to query these other databases to
allow you to perform analytics.
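
For example (purely illustrative, with made-up connection details and table names), pulling a table from an external RDBMS into Spark for analytics is just the built-in JDBC source:

  // Sketch only: reads a table from an external relational database into a
  // DataFrame. URL, table, and credentials are placeholders.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("jdbc-read-sketch").getOrCreate()

  val orders = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "analyst")       // placeholder
    .option("password", "secret")    // placeholder
    .load()

  orders.groupBy("customer_id").count().show()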

Hive and Spark have been designed as OLAP tools, not OLTP.
I'm not sure which features you need for your SCD, but they probably
won't be part of Spark's core design.
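
Just to illustrate what people usually end up writing by hand (this is only a sketch with invented table and column names, not a built-in Spark feature): a Type 2 merge is typically expressed as closing the current rows whose tracked attributes changed and appending new versions, e.g.:

  // Sketch only: a hand-rolled SCD Type 2 merge with plain DataFrames.
  // The dimension has columns (customer_id, address, valid_from, valid_to,
  // is_current); the incoming batch has (customer_id, address). All hypothetical.
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions._

  def scd2Merge(dim: DataFrame, updates: DataFrame, loadDate: String): DataFrame = {
    val d = dim.alias("d")
    val u = updates.alias("u")

    // Keys whose tracked attribute changed in this load.
    val changedKeys = d.filter(col("d.is_current"))
      .join(u, col("d.customer_id") === col("u.customer_id"))
      .filter(col("d.address") =!= col("u.address"))
      .select(col("d.customer_id").alias("customer_id"))

    // Close the current version of each changed key.
    val closed = dim.filter(col("is_current"))
      .join(changedKeys, Seq("customer_id"), "left_semi")
      .withColumn("valid_to", lit(loadDate))
      .withColumn("is_current", lit(false))

    // Rows that are not affected: all history plus current rows of unchanged keys.
    val untouched = dim.filter(!col("is_current"))
      .unionByName(dim.filter(col("is_current"))
        .join(changedKeys, Seq("customer_id"), "left_anti"))

    // Open a new current version for each changed key (brand-new keys are
    // omitted here to keep the sketch short).
    val opened = updates
      .join(changedKeys, Seq("customer_id"), "left_semi")
      .withColumn("valid_from", lit(loadDate))
      .withColumn("valid_to", lit(null).cast("string"))
      .withColumn("is_current", lit(true))

    untouched.unionByName(closed).unionByName(opened)
  }

The resulting DataFrame then has to overwrite the old dimension table, since Hive/Spark tables are not designed for in-place row updates.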

Hope this helps,

Furcy



On 4 April 2018 at 11:29, Mahender Sarangam <Ma...@outlook.com>
wrote:

> Hi,
> Does anyone have a good architecture document or design principles for
> building a warehouse application using Spark?
>
> Is it better to create a HiveContext and perform the transformations with
> HQL, or to load the files directly into DataFrames and perform the data
> transformations there?
>
> We need to implement SCD Type 2 in Spark. Is there a good
> document/reference for building Type 2 warehouse objects?
>
> Thanks in advance
>
> /Mahender
>