Posted to user@spark.apache.org by agfung <ag...@gmail.com> on 2014/11/05 00:11:19 UTC

Spark v Redshift

I'm in the midst of a heated debate with a colleague about the use of
Redshift v Spark.  We keep trading anecdotes and links back and forth (e.g.
the Airbnb post from 2013 or the AMPLab benchmarks), and we don't seem to be
getting anywhere.

So before we start down the prototype/benchmark road, and in the desperate
hope of finding *some* kind of objective third-party perspective, I was
wondering if anyone who has used both in 2014 would care to comment on the
sweet-spot use cases / gotchas for non-trivial use (e.g. a simple filter
scan isn't really interesting).  Soft issues like operational maintenance
and time spent developing v out of the box are interesting too...



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Spark v Redshift

Posted by Matei Zaharia <ma...@gmail.com>.

BTW while I haven't actually used Redshift, I've seen many companies that use both, usually using Spark for ETL and advanced analytics and Redshift for SQL on the cleaned / summarized data. Xiangrui Meng also wrote https://github.com/mengxr/redshift-input-format to make it easy to read data exported from Redshift into Spark or Hadoop.
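
For readers wondering what "reading data exported from Redshift" involves, here is a minimal sketch in plain Python. It is not the API of the redshift-input-format project; it only illustrates the kind of parsing such a reader has to do, assuming the export is pipe-delimited text in which a backslash escapes the delimiter and the backslash itself. The function name `split_escaped` and its parameters are hypothetical.

```python
def split_escaped(record, delimiter="|", escape="\\"):
    """Split one exported text record into fields.

    Assumption: a character preceded by the escape character is taken
    literally, so an escaped delimiter does not end the field.
    """
    fields = []
    current = []
    i = 0
    while i < len(record):
        ch = record[i]
        if ch == escape and i + 1 < len(record):
            # Escaped character: keep the next character as data.
            current.append(record[i + 1])
            i += 2
        elif ch == delimiter:
            # Unescaped delimiter: close the current field.
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


# The first field contains a literal pipe, written as "\|" in the export.
print(split_escaped("a\\|b|c"))  # ['a|b', 'c']
```

A naive `record.split("|")` would break the first field in two, which is why escaped exports need a dedicated reader.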

Matei

> On Nov 4, 2014, at 3:51 PM, Matei Zaharia <ma...@gmail.com> wrote:
> 
> Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose languages (Java, Scala, Python) and libraries for things like machine learning and graph processing. For example, you might use Spark to do the ETL that will put data into a database such as Redshift, or you might pull data out of Redshift into Spark for machine learning. On the other hand, if *all* you want to do is SQL and you are okay with the set of data formats and features in Redshift (i.e. you can express everything using its UDFs and you have a way to get data in), then Redshift is a complete service which will do more management out of the box.
> 
> Matei

Re: Spark v Redshift

Posted by Vladimir Rodionov <vr...@splicemachine.com>.

>> We service templated queries from the appserver, i.e. user fills
>> out some forms, dropdowns: we translate to a query.

and

>> The target data
>> size is about a billion records, 20'ish fields, distributed throughout a
>> year (about 50GB on disk as CSV, uncompressed).

tells me that a proprietary in-memory app will be the best option for you.

I do not see any need for either Spark or Redshift in your case.
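
The sizing argument can be checked with some back-of-the-envelope arithmetic. The sketch below uses only the figures quoted above (a billion records, about 20 fields, about 50GB of CSV). The RAM size and the in-memory overhead factor are hypothetical placeholders, not numbers from the thread, and decimal gigabytes are assumed.

```python
# Figures quoted in the thread (decimal gigabytes assumed).
CSV_BYTES = 50 * 10**9
RECORDS = 10**9
FIELDS = 20

bytes_per_record = CSV_BYTES / RECORDS       # average CSV bytes per record
bytes_per_field = bytes_per_record / FIELDS  # average CSV bytes per field


def fits_in_memory(data_bytes, ram_bytes, overhead_factor):
    """Return True if the data, inflated by an in-memory overhead factor,
    fits in the given amount of RAM."""
    return data_bytes * overhead_factor <= ram_bytes


# Hypothetical values for illustration only: one 256 GiB machine and a 3x
# blow-up when CSV text becomes in-memory objects.
HYPOTHETICAL_RAM_BYTES = 256 * 2**30
HYPOTHETICAL_OVERHEAD = 3.0

print(bytes_per_record, bytes_per_field)
print(fits_in_memory(CSV_BYTES, HYPOTHETICAL_RAM_BYTES, HYPOTHETICAL_OVERHEAD))
```

At roughly 50 bytes per record the rows are narrow, so whether a single in-memory process is enough depends mostly on the overhead factor of the chosen representation, which would have to be measured.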



On Tue, Nov 4, 2014 at 5:41 PM, agfung <ag...@gmail.com> wrote:

> [...]

Re: Spark v Redshift

Posted by agfung <ag...@gmail.com>.

Sounds like context would help; I just didn't want to subject people to a
wall of text if it wasn't necessary :)

Currently we use neither Spark SQL (nor anything else in the Hadoop stack)
nor Redshift. We service templated queries from the appserver, i.e. the user
fills out some forms and dropdowns, and we translate that to a query.

The data is "basically" one table containing thousands of independent time
series, with one or two tables of reference data to join to. A typical
query: the median value of Field1 from Table1 where Field2 from Table2
matches filter X, with Table1 and Table2 joined on a surrogate key, grouped
by a different Field3. The data structure is a little bit dynamic: a user
can upload any CSV, as long as they tell us the name of each column and its
programmatic type. The target data size is about a billion records with
20'ish fields, distributed throughout a year (about 50GB on disk as
uncompressed CSV).
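
To make the query shape concrete, here is a minimal sketch in plain Python of "median of Field1, filtered on Field2 from the reference table, joined on a surrogate key, grouped by Field3". The sample rows are made up, and placing Field3 in Table1 is an assumption, since the description does not say which table it lives in.

```python
from collections import defaultdict
from statistics import median

# Table1: the time-series table (made-up rows; "key" is the surrogate key).
table1 = [
    {"key": 1, "field1": 10.0, "field3": "a"},
    {"key": 1, "field1": 30.0, "field3": "a"},
    {"key": 2, "field1": 20.0, "field3": "a"},
    {"key": 2, "field1": 5.0, "field3": "b"},
    {"key": 3, "field1": 100.0, "field3": "b"},
]

# Table2: reference data, indexed by the surrogate key.
table2 = {
    1: {"field2": "x"},
    2: {"field2": "x"},
    3: {"field2": "y"},
}


def median_by_group(rows, reference, wanted_field2):
    """Join rows to reference on the surrogate key, keep rows whose
    reference field2 matches, and return the median of field1 per field3."""
    groups = defaultdict(list)
    for row in rows:
        ref = reference.get(row["key"])
        if ref is not None and ref["field2"] == wanted_field2:
            groups[row["field3"]].append(row["field1"])
    return {group: median(values) for group, values in groups.items()}


print(median_by_group(table1, table2, "x"))  # {'a': 20.0, 'b': 5.0}
```

One thing worth noting for a benchmark: unlike a sum or a count, a median cannot be computed by combining per-partition medians, so any engine has to gather or sort each group's values to answer this query exactly.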

So we're currently doing "historical" analytics (e.g. see analytic results
of only yesterday's data or older, but want to see the result "quickly").
We eventually intend to do "realtime" (or "streaming") analytics (i.e. see
the impact of new data on analytics "quickly"). Machine learning is also on
the roadmap.

One proposition is Spark SQL as a complete replacement for Redshift. It
would simplify the architecture, since our long-term strategy is to handle
data intake and ETL on HDFS (regardless of Redshift or Spark SQL). Which
other parts of the Hadoop family would come into play for ETL is
undetermined right now. Spark SQL appears to have relational ability, and
if we're going to use the Hadoop stack for ML and streaming analytics, and
it has the ability, why not do it all on one stack and not shovel data
around? Also, lots of people are talking about it.

The other proposition is Redshift as the historical analytics solution, and
something else (could be Spark, doesn't matter) for streaming analytics and
ML. If we need to relate the two, we'll have an API or process to stitch
them together. I've read about the "lambda architecture", which more or less
describes this approach. The motivation is threefold: Redshift has the AWS
reliability/scalability/operational concerns worked out; it has a richer
query language (SQL and pgsql functions are designed for slice-n-dice
analytics), so we can spend our coding time elsewhere; and it offers a
measure of safety against design issues and bugs. Spark just came out of
incubator status this year, and it's much easier to find people on the web
raving about Redshift in real-world usage (i.e. as part of a live,
client-facing system) than about Spark.
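
The "API or process to stitch it together" can be sketched as follows. In a lambda-style setup, a query merges a precomputed batch view (historical data) with a speed-layer view (data that arrived since the last batch run). This is a minimal plain-Python illustration with made-up keys and counts; the function name `merged_totals` is hypothetical.

```python
def merged_totals(batch_view, speed_view):
    """Combine per-key totals from the batch layer with per-key totals
    from the speed layer. Only valid for additive aggregates."""
    totals = dict(batch_view)
    for key, value in speed_view.items():
        totals[key] = totals.get(key, 0) + value
    return totals


# Made-up example: record counts per time series.
batch_view = {"series_a": 100, "series_b": 40}  # up to yesterday
speed_view = {"series_a": 5, "series_c": 2}     # today's new data

print(merged_totals(batch_view, speed_view))
# {'series_a': 105, 'series_b': 40, 'series_c': 2}
```

The merge is this simple only for additive aggregates such as counts and sums. An aggregate like the median described earlier cannot be stitched from two partial results this way, which is one of the design questions a two-system setup would need to answer.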

category_theory's observation that most of the speed comes from fitting in
memory is helpful. It's what I would have surmised from the AMPLab Big Data
benchmark, but confirmation from the hands-on community is invaluable, thank
you.

I understand a lot of it simply comes down to what-do-you-value-more
weightings, and we'll do prototypes/benchmarks if we have to; I just wasn't
sure if there were any other "key assumptions/requirements/gotchas" to
consider.


Re: Spark v Redshift

Posted by Akshar Dave <ak...@softnets.com>.

There is no one-size-fits-all solution available in the market today. If
somebody tells you there is, they are simply lying :)

The two solutions cater to different sets of problems. My recommendation is
to focus on getting a better understanding of the problems you are trying to
solve with Spark and Redshift, and pick the tool based on how effectively it
handles those problems. Like Matei said, both might be relevant in some
cases.

Thanks
Akshar


On Tue, Nov 4, 2014 at 4:00 PM, Jimmy McErlain <ji...@sellpoints.com> wrote:

> This is pretty spot on, though I would also add that the speed features
> Spark touts all depend on caching the data in memory. Reading off the
> disk, i.e. pulling the data into an RDD, still takes time. This is the
> reason that Spark is great for ML: the data is used over and over again to
> fit models, so it's pulled into memory once and then analyzed repeatedly
> by the algorithms. Other systems, such as Mahout, read and write to disk
> repeatedly and are thus slower (though Mahout is getting ported over to
> Spark as well, to compete with MLlib).
>
> J


-- 
Akshar Dave
Principal – Big Data
SoftNet Solutions
Office: 408.542.0888 | Mobile: 408.896.1486
940 Hamlin Court, Sunnyvale, CA 94089
www.softnets.com/bigdata

Re: Spark v Redshift

Posted by Jimmy McErlain <ji...@sellpoints.com>.

This is pretty spot on, though I would also add that the speed features
Spark touts all depend on caching the data in memory. Reading off the
disk, i.e. pulling the data into an RDD, still takes time. This is the
reason that Spark is great for ML: the data is used over and over again to
fit models, so it's pulled into memory once and then analyzed repeatedly
by the algorithms. Other systems, such as Mahout, read and write to disk
repeatedly and are thus slower (though Mahout is getting ported over to
Spark as well, to compete with MLlib).
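
The load-once, reuse-many-times point can be illustrated without Spark. The sketch below is a plain-Python analogy, not Spark code: `CountingSource` is a hypothetical stand-in for a slow disk read, and `fit_mean` is a toy iterative fit (gradient descent towards the mean) that either reuses one in-memory copy or reloads the data on every iteration.

```python
class CountingSource:
    """Hypothetical stand-in for data on disk; counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def fit_mean(source, iterations, learning_rate=0.5, cache=True):
    """Toy iterative fit: gradient descent on squared error, which
    converges to the mean of the data."""
    cached = source.load() if cache else None
    estimate = 0.0
    for _ in range(iterations):
        # With caching, every pass reuses the same in-memory copy.
        data = cached if cache else source.load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= learning_rate * gradient
    return estimate


cached_source = CountingSource([1.0, 2.0, 3.0])
uncached_source = CountingSource([1.0, 2.0, 3.0])
fit_mean(cached_source, iterations=20, cache=True)
fit_mean(uncached_source, iterations=20, cache=False)
print(cached_source.reads, uncached_source.reads)  # 1 20
```

Both runs produce the same estimate; the difference is one read versus twenty. A single-pass query gets no such benefit, which matches the caveat above that the first read off disk still takes time.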

J

JIMMY MCERLAIN
DATA SCIENTIST (NERD)
IF WE CAN’T DOUBLE YOUR SALES, ONE OF US IS IN THE WRONG BUSINESS.
E: jimmy@sellpoints.com
M: 510.303.7751

On Tue, Nov 4, 2014 at 3:51 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> Is this about Spark SQL vs Redshift, or Spark in general? Spark in general
> provides a broader set of capabilities than Redshift because it has APIs in
> general-purpose languages (Java, Scala, Python) and libraries for things
> like machine learning and graph processing. For example, you might use
> Spark to do the ETL that will put data into a database such as Redshift, or
> you might pull data out of Redshift into Spark for machine learning. On the
> other hand, if *all* you want to do is SQL and you are okay with the set of
> data formats and features in Redshift (i.e. you can express everything
> using its UDFs and you have a way to get data in), then Redshift is a
> complete service which will do more management out of the box.
>
> Matei

Re: Spark v Redshift

Posted by Matei Zaharia <ma...@gmail.com>.

Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose languages (Java, Scala, Python) and libraries for things like machine learning and graph processing. For example, you might use Spark to do the ETL that will put data into a database such as Redshift, or you might pull data out of Redshift into Spark for machine learning. On the other hand, if *all* you want to do is SQL and you are okay with the set of data formats and features in Redshift (i.e. you can express everything using its UDFs and you have a way to get data in), then Redshift is a complete service which will do more management out of the box.

Matei

> On Nov 4, 2014, at 3:11 PM, agfung <ag...@gmail.com> wrote:
> 
> I'm in the midst of a heated debate with a colleague about the use of
> Redshift v Spark.  We keep trading anecdotes and links back and forth (e.g.
> the Airbnb post from 2013 or the AMPLab benchmarks), and we don't seem to be
> getting anywhere.
>
> So before we start down the prototype/benchmark road, and in the desperate
> hope of finding *some* kind of objective third-party perspective, I was
> wondering if anyone who has used both in 2014 would care to comment on the
> sweet-spot use cases / gotchas for non-trivial use (e.g. a simple filter
> scan isn't really interesting).  Soft issues like operational maintenance
> and time spent developing v out of the box are interesting too...