Posted to dev@spark.apache.org by vha14 <vh...@msn.com> on 2015/02/12 17:56:12 UTC

Spark SQL value proposition in batch pipelines

My team is building a batch data processing pipeline using the Spark API, and
we are trying to understand whether Spark SQL can help us. Here is what we have
found so far:

- SQL's declarative style may be more readable in some cases (e.g. joining
more than two RDDs), although some devs prefer the fluent style regardless.
- Cogrouping more than four RDDs is not supported, and it is not clear whether
Spark SQL supports joining an arbitrary number of RDDs (a sketch of the limit
we mean follows this list).
- Spark SQL features such as optimization based on predicate pushdown and
dynamic schema inference seem less applicable in a batch environment.
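
For concreteness, a minimal sketch of the cogroup limit we mean, using the
plain RDD API (the RDD names and contents are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions such as cogroup and join

object CogroupLimit {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cogroup-limit").setMaster("local[*]"))

    // Five tiny pair RDDs keyed on the same id, purely for illustration.
    val a = sc.parallelize(Seq(1 -> "a", 2 -> "a"))
    val b = sc.parallelize(Seq(1 -> "b"))
    val c = sc.parallelize(Seq(2 -> "c"))
    val d = sc.parallelize(Seq(1 -> "d"))
    val e = sc.parallelize(Seq(1 -> "e"))

    // The widest cogroup overload takes three other RDDs (four in total)...
    val four = a.cogroup(b, c, d)

    // ...so a fifth RDD has to be folded in with another cogroup (or a join),
    // which nests the grouped values one level deeper.
    val five = four.cogroup(e)

    five.collect().foreach(println)
    sc.stop()
  }
}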

Your inputs/suggestions are most welcome!

Thanks,
Vu Ha
CTO, Semantic Scholar
http://www.quora.com/What-is-Semantic-Scholar-and-how-will-it-work





Re: Spark SQL value proposition in batch pipelines

Posted by vha14 <vh...@msn.com>.
This is super helpful, thanks Evan and Reynold!






Re: Spark SQL value proposition in batch pipelines

Posted by Reynold Xin <rx...@databricks.com>.
Evan articulated it well.



Re: Spark SQL value proposition in batch pipelines

Posted by "Evan R. Sparks" <ev...@gmail.com>.
Well, you can always join as many RDDs as you want by chaining them together,
e.g. a.join(b).join(c)... I probably wouldn't join thousands of RDDs this way,
but ten is probably doable.
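
For example, a self-contained sketch of that chaining (the data is made up).
Note how each extra join nests the value tuple one level deeper, which is part
of why the SQL form can read better:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions such as join (needed before 1.3)

object ChainedJoins {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chained-joins").setMaster("local[*]"))

    // Four small pair RDDs keyed on the same id.
    val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
    val b = sc.parallelize(Seq(1 -> "b1", 2 -> "b2"))
    val c = sc.parallelize(Seq(1 -> "c1", 2 -> "c2"))
    val d = sc.parallelize(Seq(1 -> "d1", 2 -> "d2"))

    // Each join brings in one more RDD and one more level of tuple nesting.
    val joined = a.join(b)   // (id, (a, b))
      .join(c)               // (id, ((a, b), c))
      .join(d)               // (id, (((a, b), c), d))
      .mapValues { case (((av, bv), cv), dv) => (av, bv, cv, dv) }  // flatten for readability

    joined.collect().foreach(println)
    sc.stop()
  }
}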

That said, SparkSQL has an optimizer under the covers that can make clever
decisions, e.g. pushing the predicates in the WHERE clause down to the base
data (even to external data sources if you have them), ordering joins, and
choosing between join implementations (like using broadcast joins instead of
the default shuffle-based hash join in RDD.join). These decisions can make
your queries run orders of magnitude faster than they would if you implemented
them using basic RDD transformations. The best part is that, at this stage,
I'd expect the optimizer to keep improving, meaning many of your queries will
get faster with each new release.
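
A rough sketch of what that can look like, assuming the Spark 1.2-era API (the
Parquet paths and column names here are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object OptimizedQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("optimized-query"))
    val sqlContext = new SQLContext(sc)

    // Parquet files carry their own schema, so no manual schema wiring is needed.
    sqlContext.parquetFile("hdfs:///data/papers.parquet").registerTempTable("papers")
    sqlContext.parquetFile("hdfs:///data/citations.parquet").registerTempTable("citations")

    // The WHERE predicate on p.year is a candidate for pushdown into the Parquet
    // scan, and the optimizer chooses the join strategy (e.g. a broadcast join
    // if one side is under spark.sql.autoBroadcastJoinThreshold).
    val recentCitations = sqlContext.sql(
      """SELECT c.citingId, p.title
        |FROM citations c
        |JOIN papers p ON c.citedId = p.paperId
        |WHERE p.year >= 2010""".stripMargin)

    recentCitations.collect().foreach(println)
    sc.stop()
  }
}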

I'm sure the SparkSQL devs can enumerate many other benefits, but as soon as
you're working with multiple tables and doing fairly textbook SQL stuff, you
likely want the engine figuring this out for you rather than hand-coding it
yourself. That said, with Spark you can always drop back to plain old RDDs and
use map/reduce/filter/cogroup, etc. when you need to.
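
And a small sketch of that mixing (made-up data; Spark 1.2-era API, where the
result of sql() is a SchemaRDD and therefore also an RDD of Rows):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, just for illustration.
case class Event(userId: Long, kind: String, durationMs: Long)

object MixSqlAndRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mix-sql-and-rdd").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD-of-case-classes -> SchemaRDD

    val events = sc.parallelize(Seq(
      Event(1L, "click", 30L), Event(1L, "view", 1200L), Event(2L, "click", 45L)))
    events.registerTempTable("events")

    // Declarative step: let the SQL engine do the grouping and aggregation.
    val totals = sqlContext.sql(
      "SELECT userId, SUM(durationMs) AS total FROM events GROUP BY userId")

    // A SchemaRDD is still an RDD[Row], so plain RDD operations apply directly.
    val active = totals
      .map(row => (row.getLong(0), row.getLong(1)))
      .filter { case (_, total) => total > 100L }

    active.collect().foreach(println)
    sc.stop()
  }
}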
