Posted to user@spark.apache.org by kant kodali <ka...@gmail.com> on 2017/04/07 05:34:20 UTC

Apache Drill vs Spark SQL

Hi All,

I am very impressed with the work done on Spark SQL; however, when I
have to pick something to serve real-time queries, I am in a dilemma
for the following reasons.

1. Even though Spark SQL has logical plans, physical plans, runtime
code generation, and all that, it still doesn't look like the right
tool for serving real-time queries the way we normally do from a
database. I tend to think this is because every query has to go
through job submission first. I don't want to call this overhead, but
that is what it appears to be. Compare this with the other setup,
where the data we want to serve sits in a database and we simply
issue a SQL query and get the response back. For that use case, what
would be an appropriate tool? I tend to think it's Drill, but I would
like to hear if there are any interesting arguments. To make the
contrast concrete, I've put a rough sketch of the two paths after
point 2 below.

2. I can see a case for Spark SQL with queries that need to be
expressed in an iterative fashion: for example, graph traversals such
as BFS or DFS, or even simple pre-order, in-order, and post-order
traversals of a BST. All of these are very hard to express in a
declarative syntax like SQL (I've sketched what I mean by the
iterative style below as well). I also tend to think ad-hoc
distributed joins (by ad-hoc I mean one is not certain about one's
query patterns) are better expressed in a map-reduce style than in
SQL, unless one knows the query patterns well enough in advance that
queries requiring redistribution are rare. I am also sure there are
plenty of other cases where Spark SQL will excel, but I wanted to ask
what a good choice is for simply serving the data.
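
Here is the contrast from point 1, as a minimal sketch rather than
anything definitive (Scala; the Parquet path, table and column names,
JDBC URL, and credentials are all made up, and the JDBC half assumes
the PostgreSQL driver is on the classpath):

    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    object ServingComparison {
      def main(args: Array[String]): Unit = {
        // Path 1: Spark SQL. The query is parsed, planned,
        // code-generated, and then submitted as a distributed job;
        // show() is what triggers the run.
        val spark = SparkSession.builder()
          .appName("serving-comparison")
          .master("local[*]")
          .getOrCreate()
        spark.read.parquet("/data/events").createOrReplaceTempView("events")
        spark.sql("SELECT user_id, count(*) AS c FROM events GROUP BY user_id").show()

        // Path 2: a plain JDBC round trip. The statement goes straight
        // to the database and rows stream back, with no job scheduling
        // in between.
        val conn = DriverManager.getConnection(
          "jdbc:postgresql://localhost:5432/db", "user", "secret")
        val rs = conn.createStatement().executeQuery(
          "SELECT user_id, count(*) FROM events GROUP BY user_id")
        while (rs.next()) println(rs.getString(1) + ": " + rs.getLong(2))
        conn.close()
        spark.stop()
      }
    }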
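
And here is a rough sketch of the iterative style from point 2: BFS
over an edge list using DataFrame joins, where the loop has to live
in the driver program rather than in a single SQL statement. The
column names ("src", "dst") and the hop limit are assumptions:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Breadth-first search from a start vertex, up to maxHops hops.
    def bfs(spark: SparkSession, edges: DataFrame,
            start: Long, maxHops: Int): DataFrame = {
      import spark.implicits._
      // Begin with just the source vertex; "visited" accumulates
      // every vertex reached so far.
      var frontier = Seq(start).toDF("id")
      var visited  = frontier
      for (_ <- 1 to maxHops) {
        // One hop: follow every edge leaving the current frontier,
        // then drop the vertices we have already visited.
        frontier = frontier
          .join(edges, frontier("id") === edges("src"))
          .select(edges("dst").as("id"))
          .distinct()
          .except(visited)
        visited = visited.union(frontier)
      }
      visited.distinct()
    }

Each pass through that loop is another distributed join, which is
exactly the part plain SQL has no natural way to express.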

Any suggestions are appreciated.

Thanks!

Re: Apache Drill vs Spark SQL

Posted by Pierce Lamb <ri...@gmail.com>.
Hi Kant,

If you are interested in using Spark alongside a database to serve
real-time queries, there are many options; almost every popular
database has built some sort of connector to Spark. I've listed most
of them, and tried to categorize them, in this StackOverflow answer:

http://stackoverflow.com/a/39753976/3723346

As an employee of SnappyData <https://github.com/SnappyDataInc/snappydata>,
I'm biased toward its solution, in which Spark and the database are
deeply integrated and run in the same JVM. But there are many options
depending on your needs; a generic sketch of the connector pattern is
below.
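
As a generic illustration of that pattern, here is a minimal sketch
using Spark's built-in JDBC data source (not any particular vendor's
API; the URL, table, and credentials are placeholders, and it assumes
a SparkSession named spark is already in scope):

    // Load a database table as a DataFrame over JDBC.
    val customers = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/db")
      .option("dbtable", "public.customers")
      .option("user", "user")
      .option("password", "secret")
      .load()

    // Once loaded, the table can be served through Spark SQL like
    // any other DataFrame-backed view.
    customers.createOrReplaceTempView("customers")
    spark.sql("SELECT count(*) FROM customers").show()

Most vendor connectors follow this same read/format/options shape;
how much filtering gets pushed down to the database, and whether the
data is colocated with Spark, varies by product.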

I'm not sure if the above link also answers your second question, but there
are two graph databases listed that connect to Spark as well.

Hope this helps,

Pierce

On Thu, Apr 6, 2017 at 10:34 PM, kant kodali <ka...@gmail.com> wrote:

> [quoted message trimmed]