Posted to user@spark.apache.org by Adaryl Wakefield <ad...@hotmail.com> on 2017/02/28 00:18:28 UTC

using spark to load a data warehouse in real time

Is anybody using Spark Streaming/SQL to load a relational data warehouse in real time? There isn't a lot of information on this use case out there. When I Google real-time data warehouse loading, nothing I find is up to date; it's all turn-of-the-century stuff that doesn't take into account advancements in database technology. Additionally, whenever I try to learn Spark, it's always the same thing: play with Twitter data, never structured data. All the CEP use cases are about data science.

I'd like to use Spark to load Greenplum in real time. Intuitively, this should be possible. I was thinking Spark Streaming with Spark SQL along with an ORM should do it. Am I off base with this? Is the reason there are no examples that there is a better way to do what I want?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData



RE: using spark to load a data warehouse in real time

Posted by Adaryl Wakefield <ad...@hotmail.com>.
That does, thanks. I'm starting to think a straight Kafka solution would be more appropriate.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Sam Elamin [mailto:hussam.elamin@gmail.com]
Sent: Wednesday, March 1, 2017 2:29 AM
To: Adaryl Wakefield <ad...@hotmail.com>; Jörn Franke <jo...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time

Hi Adaryl

Having come from a Web background myself, I completely understand your confusion, so let me try to clarify a few things.

First and foremost, Spark is a data processing engine, not a general application framework. In the Web applications and frameworks world, you load the entities, map them to the UI, serve them up to the users, and then save whatever you need back to the database via some sort of entity mapping, whether that's an ORM, stored procedures, or some other mechanism.

Spark, as I mentioned, is a data processing engine, so there is no concept of an ORM or data mapper. You can give it the schema of what you expect the data to look like, and it also works well with most of the data formats used in the industry, like CSV, JSON, Avro, and Parquet, including inferring the schema from the data provided, which makes it much easier to develop and maintain.
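
To make that concrete, here is a minimal sketch in Scala for Spark 2.x showing both options; the bucket path and column names are invented:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("landing-read").getOrCreate()

// Option 1: let Spark infer the schema from the data itself
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://my-bucket/landing/orders/")

// Option 2: declare the schema you expect the data to look like
val orderSchema = StructType(Seq(
  StructField("order_id", LongType, nullable = false),
  StructField("customer_id", LongType, nullable = false),
  StructField("amount", DoubleType, nullable = true),
  StructField("order_ts", TimestampType, nullable = true)))

val typed = spark.read.schema(orderSchema).json("s3a://my-bucket/landing/orders/")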

Now, as to your question of loading data in real time: it absolutely can be done. Traditionally, incoming data arrives at a location most people call the landing zone. This is where the extract part of ETL begins.

As Jörn mentioned, Spark Streaming isn't meant to write to a database, but you can write to Kafka or Kinesis as a pipeline and then have another process consume from it and write to your end datastore.
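
For instance, a rough sketch of pushing a streaming DataFrame out to Kafka. This assumes the spark-sql-kafka-0-10 package and the Kafka sink that shipped in Spark 2.2; the broker, topic, checkpoint path, and the 'events' streaming DataFrame are all made up:

import org.apache.spark.sql.functions.{col, struct, to_json}

val toKafka = events  // 'events' stands in for a streaming DataFrame read from your landing source
  .select(
    col("order_id").cast("string").as("key"),
    to_json(struct(events.columns.map(col): _*)).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "warehouse-staging")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/kafka-sink/")
  .start()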

The creators of Spark realised that your use case is absolutely valid, and almost everyone they talked to said that streaming on its own wasn't enough; for this very reason the concept of Structured Streaming was introduced.

See this blog post from Databricks:

https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html


You can potentially use the Structured Streaming APIs to continually read changes from HDFS, or in your case S3, and then write them out via JDBC to your end datastore.
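
A minimal sketch of that pattern, with a couple of caveats: there is no built-in streaming JDBC sink, so this uses foreachBatch, which only appeared in Spark 2.4 (before that you would hand-roll a ForeachWriter); the URL, table, and credentials are placeholders, and the Postgres driver is assumed since Greenplum speaks the Postgres wire protocol. It reuses the orderSchema declared above:

import java.util.Properties
import org.apache.spark.sql.DataFrame

val props = new Properties()
props.setProperty("user", "etl_user")
props.setProperty("password", sys.env("DW_PASSWORD"))
props.setProperty("driver", "org.postgresql.Driver")

// Streaming file sources require an explicit schema up front
val incoming = spark.readStream
  .schema(orderSchema)
  .json("s3a://my-bucket/landing/orders/")

// Each micro-batch is a plain DataFrame, so the ordinary JDBC writer applies
def writeBatch(batch: DataFrame, batchId: Long): Unit =
  batch.write
    .mode("append")
    .jdbc("jdbc:postgresql://gp-master:5432/dw", "staging.orders", props)

val query = incoming.writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/orders/")
  .start()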

I have done this before, so I'll give you a few gotchas to be aware of.

The most important one is whether your end datastore or data warehouse supports streaming inserts well; some are better than others. Redshift specifically is really bad when it comes to small, very frequent deltas, which is exactly what streaming at high scale produces.

The second is that Structured Streaming is still in its alpha phase and the code is marked as experimental. That's not to say it will die the minute you push any load through it; I found that it handled GBs of data well. The pain I found is that the underlying goal of Structured Streaming is to reuse the DataFrame API, unifying the batch and stream data types so you only need to learn one. However, some methods don't yet work on streaming DataFrames, such as dropDuplicates.


That's pretty much it. So really it comes down to your use case: if you need the pipeline to be reliable and never go down, then implement Kafka or Kinesis. If it's a proof of concept or you are trying to validate a theory, use Structured Streaming, as it's much quicker to get going: a few hours versus weeks or months of setup.


I hope I clarified things for you

Regards
Sam

Sent from my iPhone




On Wed, 1 Mar 2017 at 07:34, Jörn Franke <jo...@gmail.com> wrote:
I am not sure that Spark Streaming is what you want to do. It is for streaming analytics, not for loading a DWH.

You also need to define what real time means and what is needed there - it will differ significantly from client to client.

From my experience, SQL alone will not be enough for users in the future. Large data volumes especially require much more than just aggregations, which become less useful at that scale. Users will have to learn new ways of dealing with the data from a business perspective, employing proper sampling from large datasets, machine learning approaches, etc. These are new methods that are business driven, not technically driven. I think it is wrong to assume that users learning new skills is a bad thing; in the future it may be a necessity.

On 28 Feb 2017, at 23:18, Adaryl Wakefield <ad...@hotmail.com> wrote:
I’m actually trying to come up with a generalized use case that I can take from client to client. We have structured data coming from some application. Instead of dropping it into Hadoop and then using yet another technology to query that data, I just want to dump it into a relational MPP DW so nobody has to learn new skills or new tech just to do some analysis. Everybody and their mom can write SQL. Designing relational databases is a rare skill but not as rare as what is necessary for designing some NoSQL solutions.

I’m looking for the fastest path to move a company from batch to real time analytical processing.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Mohammad Tariq [mailto:dontariq@gmail.com]
Sent: Tuesday, February 28, 2017 12:57 PM
To: Adaryl Wakefield <ad...@hotmail.com>
Cc: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time

Hi Adaryl,

You could definitely load data into a warehouse using Spark's JDBC support through DataFrames. Could you please explain your use case a bit more? That'll help us answer your query better.
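
For reference, that DataFrame JDBC write looks roughly like this; the connection details and table name are placeholders, and the Postgres driver is assumed since Greenplum is Postgres-based:

df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://gp-master:5432/dw")
  .option("dbtable", "analytics.fact_orders")
  .option("user", "etl_user")
  .option("password", sys.env("DW_PASSWORD"))
  .option("driver", "org.postgresql.Driver")
  .mode("append")
  .save()

Note that mode("append") issues plain INSERTs under the hood; the DataFrame writer has no MERGE/upsert, so updates have to be handled on the database side, e.g. via a staging table plus a merge job.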




Tariq, Mohammad
about.me/mti








On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <ad...@hotmail.com> wrote:
I haven't heard of Kafka Connect. I'll have to look into it. Kafka would, of course, have to be in any architecture, but it looks like they are suggesting that Kafka is all you need.

My primary concern is the complexity of loading warehouses. I have a web development background, so I have somewhat of an idea of how to insert data into a database from an application. I've since moved on to straight database programming and don't work with anything that reads from an app anymore.

Loading a warehouse requires a lot of cleaning of data and looking up and generating keys to maintain referential integrity. Usually that's done in a batch process. Now I have to do it record by record (or a few records at a time). I have some ideas, but I'm not quite there yet.
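
One way to sketch that key lookup in Spark, set-based rather than record by record (the 'incoming' DataFrame, table, and column names here are all invented): join the incoming records against the dimension table to resolve surrogate keys, and divert the rows that don't match instead of violating referential integrity:

import org.apache.spark.sql.functions.col

val customerDim = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://gp-master:5432/dw")
  .option("dbtable", "analytics.dim_customer")
  .option("user", "etl_user")
  .option("password", sys.env("DW_PASSWORD"))
  .load()
  .select("customer_natural_key", "customer_sk")

val keyed = incoming.join(customerDim,
  incoming("customer_id") === customerDim("customer_natural_key"), "left")

val resolved = keyed.filter(col("customer_sk").isNotNull)  // safe to load into the fact table
val orphans  = keyed.filter(col("customer_sk").isNull)     // park these until the dimension catches up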

I thought SparkSQL would be the way to get this done, but so far all the examples I've seen are just SELECT statements, no INSERT or MERGE statements.
Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Femi Anthony [mailto:femibyte@gmail.com]
Sent: Tuesday, February 28, 2017 4:13 AM
To: Adaryl Wakefield <ad...@hotmail.com>
Cc: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time

Have you checked to see if there are any drivers that enable you to write to Greenplum directly from Spark?

You can also take a look at this link:

https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q

Apparently GPDB is based on Postgres, so that approach may work.
Another approach may be for Spark Streaming to write to Kafka, and then have another process read from Kafka and write to Greenplum.

Kafka Connect may be useful in this case:

https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/

Femi Anthony



RE: using spark to load a data warehouse in real time

Posted by Adaryl Wakefield <ad...@hotmail.com>.
For all the work that is necessary to load a warehouse, couldn't that work be considered a special case of CEP? Real time means I'm trying to get to zero lag between an event happening in the transactional system and someone being able to do analytics on that data, and not just data from that one application. If it were just the one application, I'd use an in-memory solution and be done. I want the business to be able to do 360-degree analysis of their business using up-to-the-second data.

This really shouldn’t be hard and I feel like I am missing something.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData




RE: using spark to load a data warehouse in real time

Posted by Adaryl Wakefield <ad...@hotmail.com>.
Hi Henry,
I didn't catch your email until now. When you wrote to the database, how did you enforce the schema? Did the DataFrames just spit everything out with the necessary keys?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Henry Tremblay [mailto:paulhtremblay@gmail.com]
Sent: Tuesday, February 28, 2017 3:56 PM
To: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time


We did this all the time at my last position.

1. We had unstructured data in S3.

2. We read directly from S3 and then gave structure to the data via a DataFrame in Spark.

3. We wrote the results back to S3.

4. We used Redshift's super-fast parallel COPY to load the results into a table.
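
As a rough illustration of steps 3 and 4 (the bucket, table, IAM role, and the 'structured' DataFrame from step 2 are all made up; the COPY is issued over a plain JDBC connection, assuming the Redshift JDBC driver is on the classpath):

import java.sql.DriverManager

// Step 3: stage the structured result back in S3
structured.write.mode("overwrite").csv("s3a://my-bucket/staged/orders/")

// Step 4: let Redshift load all the staged files in parallel across its slices
val conn = DriverManager.getConnection(
  "jdbc:redshift://cluster.example.com:5439/dw", "etl_user", sys.env("DW_PASSWORD"))
try {
  conn.createStatement().execute(
    """COPY analytics.orders
      |FROM 's3://my-bucket/staged/orders/'
      |IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
      |FORMAT AS CSV""".stripMargin)
} finally conn.close()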

Henry

On 02/28/2017 11:04 AM, Mohammad Tariq wrote:
You could try this as a blueprint:

Read the data in through Spark Streaming. Iterate over it and convert each RDD into a DataFrame. Use these DataFrames to perform whatever processing is required, and then save the result into your target relational warehouse.
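
A hedged sketch of that blueprint with the DStream API; the source, parsing, and connection details are all stand-ins:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
val lines = ssc.socketTextStream("ingest-host", 9999)  // substitute your real source

lines.foreachRDD { rdd =>
  import spark.implicits._
  // Give the micro-batch structure by converting the RDD into a DataFrame
  val batch = rdd.map(_.split(","))
    .map(f => (f(0).toLong, f(1), f(2).toDouble))
    .toDF("order_id", "customer_id", "amount")

  // Append the processed batch to the warehouse over JDBC
  batch.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://gp-master:5432/dw")
    .option("dbtable", "staging.orders")
    .option("user", "etl_user")
    .option("password", sys.env("DW_PASSWORD"))
    .mode("append")
    .save()
}

ssc.start()
ssc.awaitTermination()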

HTH






Re: using spark to load a data warehouse in real time

Posted by Mohammad Tariq <do...@gmail.com>.
Hi Adaryl,

You could definitely load data into a warehouse through Spark's JDBC
support through DataFrames. Could you please explain your use case a bit
more? That'll help us in answering your query better.




Tariq, Mohammad
about.me/mti



RE: using spark to load a data warehouse in real time

Posted by Adaryl Wakefield <ad...@hotmail.com>.
I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would, of course, have to be in any architecture, but it looks like they’re suggesting that Kafka is all you need.

My primary concern is the complexity of loading warehouses. I have a web development background, so I have somewhat of an idea of how to insert data into a database from an application. I’ve since moved on to straight database programming and don’t work with anything that reads from an app anymore.

Loading a warehouse requires a lot of data cleansing, plus looking up and generating keys to maintain referential integrity. Usually that’s done in a batch process. Now I have to do it record by record (or a few records at a time). I have some ideas, but I’m not quite there yet.

I thought Spark SQL would be the way to get this done, but so far all the examples I’ve seen are just SELECT statements, no INSERT or MERGE statements.
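
The rough shape of what I’ve been sketching is: land each micro-batch in a staging table with a plain JDBC write, then resolve keys and insert with one set-based statement on the database side. Every table, column, and connection detail below is invented for illustration:

import java.sql.DriverManager
import java.util.Properties

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MicroBatchLoad").getOrCreate()
import spark.implicits._

// Stand-in for one micro-batch of incoming fact rows
val batchDf = Seq(("C042", "2017-02-27 19:18:00", 99.95))
  .toDF("customer_id", "order_ts", "amount")

val url = "jdbc:postgresql://gp-master:5432/warehouse"  // placeholder
val props = new Properties()
props.setProperty("user", "gpadmin")      // placeholder credentials
props.setProperty("password", "secret")

// 1) Land the micro-batch in a staging table
batchDf.write.mode("overwrite").jdbc(url, "staging.orders", props)

// 2) Resolve surrogate keys and insert in one statement, so the
//    referential-integrity work stays set-based instead of row by row
val conn = DriverManager.getConnection(url, props)
try {
  conn.createStatement().execute(
    """INSERT INTO dw.fact_orders (customer_key, order_ts, amount)
      |SELECT c.customer_key, s.order_ts::timestamp, s.amount
      |FROM staging.orders s
      |JOIN dw.dim_customer c ON c.customer_id = s.customer_id""".stripMargin)
} finally {
  conn.close()
}

Whether that holds up at streaming rates is the part I haven’t worked out.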

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData




Re: using spark to load a data warehouse in real time

Posted by Femi Anthony <fe...@gmail.com>.
Have you checked to see if there are any drivers to enable you to write to Greenplum directly from Spark?

You can also take a look at this link:

https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q

Apparently GPDB is based on Postgres, so that approach may work.
Another approach may be for Spark Streaming to write to Kafka, and then have another process read from Kafka and write to Greenplum (there is a sketch of that first hop below).

Kafka Connect may be useful in this case -

https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
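
Here is that sketch (broker address, topic name, and the surrounding stream wiring are all placeholders; this assumes the standard kafka-clients producer API):

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Push each micro-batch to a Kafka topic; a downstream consumer
// (or a Kafka Connect JDBC sink) then writes into Greenplum.
def publish(records: DStream[String]): Unit = {
  records.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val props = new Properties()
      props.put("bootstrap.servers", "kafka1:9092")  // placeholder broker
      props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      // One short-lived producer per partition keeps the sketch simple;
      // reusing a pooled producer is the usual optimization.
      val producer = new KafkaProducer[String, String](props)
      partition.foreach { rec =>
        producer.send(new ProducerRecord[String, String]("warehouse_events", rec))
      }
      producer.close()
    }
  }
}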

Femi Anthony


