Posted to user@drill.apache.org by Tridib Samanta <tr...@live.com> on 2014/10/29 18:00:28 UTC

Apache Drill Vs Spark SQL

Hello Experts,
I am new to Apache Drill. To me it looks very similar to Spark SQL. I was wondering how it differs from Spark SQL. What are the use cases where Apache Drill thrives compared to Spark SQL?
 
Thanks & Regards
Tridib

Re: Apache Drill Vs Spark SQL

Posted by Adam Hunt <ad...@gmail.com>.

Hi Tridib and Neeraja,

Although Spark SQL needs some boilerplate, it can discover the schema of
Parquet files just like Drill. You are correct that Hive and Impala still
require you to create a table.
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v1/latest/Installing-and-Using-Impala/ciiu_parquet.html

Adam


Re: Apache Drill Vs Spark SQL

Posted by Neeraja Rentachintala <nr...@maprtech.com>.

Tridib

If you are getting started with Drill, you can also refer to a tutorial
that walks through various Drill capabilities.
https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Tutorial

You are spot on about the metadata part. Discovering metadata dynamically and
providing the ability to work with complex data types such as JSON without
transformation is a key differentiator for Drill compared to Spark SQL and
other SQL options.

-Neeraja



RE: Apache Drill Vs Spark SQL

Posted by Tridib Samanta <tr...@live.com>.

Hi Adam,
Thanks for sharing this! Apache Drill is very easy to get started with. I like that Drill manages the metadata by itself and does not require Hive (as Spark does).
 
Thanks
Tridib
 

Re: Apache Drill Vs Spark SQL

Posted by Ted Dunning <te...@gmail.com>.

On Wed, Oct 29, 2014 at 10:50 AM, Adam Hunt <ad...@gmail.com> wrote:

> Although Drill seems very promising, it seems that it has a few issues to
> work out, and since I already use Spark I'm going to stick with Spark SQL
> for now.
>

Adam,

Really excellent feedback.

If you check back a few weeks from now, I expect you will find many or most
of these issues resolved.

(Frankly, your scenario should become a unit test.)

Re: Apache Drill Vs Spark SQL

Posted by Christopher Matta <cm...@mapr.com>.

Adam,
I had a similar experience with queries not returning immediately. This
setting seems like it might help:

ALTER SESSION SET `planner.add_producer_consumer` = false;



Chris Matta
cmatta@mapr.com
215-701-3146


Re: Apache Drill Vs Spark SQL

Posted by Adam Hunt <ad...@gmail.com>.

Hi Tridib,

I just completed a simple evaluation of Drill 0.6.0 and Spark SQL 1.1.0. I
ran a few queries over 14 GB of Snappy-compressed Parquet files on a
four-server MapR cluster (96 cores, 256 GB). Here are the results.

Spark SQL requires some very minor setup, whereas Drill doesn't:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val testData = sqlContext.parquetFile("/user/ahunt/test/2014/10/28/")
testData.registerTempTable("testData")

In Drill, a simple count query took 19s the first time and 0.9s the second
time:
SELECT count(*) FROM dfs.`/user/ahunt/test/2014/10/28/part-*`;

In Spark SQL, it took 17s the first time and 1.7s the second:
sqlContext.sql("SELECT count(*) FROM testData").collect().foreach(println)
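
For anyone repeating the first-run versus second-run comparison, a small timing wrapper makes the numbers easier to collect. This is only a sketch and was not part of the evaluation above: `timed` and `workload` are hypothetical names, and the summation is a stand-in so the snippet runs in a plain Scala REPL. In spark-shell, the stand-in would be replaced by the `sqlContext.sql(...).collect()` call.

```scala
// Hypothetical helper (not from the original post): runs a block once and
// returns its result together with the elapsed wall-clock time in seconds.
def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  val seconds = (System.nanoTime() - start) / 1e9
  (result, seconds)
}

// Stand-in workload (an assumption for illustration). In spark-shell,
// replace it with e.g.
//   sqlContext.sql("SELECT count(*) FROM testData").collect()
def workload(): Long = (1L to 1000000L).sum

// Run the same block twice to separate first-run cost from the repeat run.
val (firstResult, firstSeconds) = timed(workload())
val (secondResult, secondSeconds) = timed(workload())
println(f"first run:  $firstSeconds%.3f s")
println(f"second run: $secondSeconds%.3f s")
```

Timing both runs with the same wrapper keeps the measurement method identical, so any gap between the two numbers comes from the engine (startup, planning, caching) rather than from how the clock was read.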

In Drill, a simple GROUP BY query printed the results, but would not return
to the prompt until I hit Ctrl-C (after 6s):
SELECT httpResponseCode, count(*)
FROM dfs.`/user/ahunt/test/2014/10/28/part-*`
GROUP BY httpResponseCode;

In Spark SQL, it finished in 3.6s:
sqlContext.sql(
  "SELECT httpResponseCode, count(*) FROM testData " +
  "GROUP BY httpResponseCode"
).collect().foreach(println)

In Drill, this query never finished (probably due to the issue described
above):
SELECT httpResponseCode, count(*)
FROM dfs.`/user/ahunt/test/2014/10/28/`
GROUP BY httpResponseCode
ORDER BY httpResponseCode DESC;

In Spark SQL, the same query finished in 5s:
sqlContext.sql(
  "SELECT httpResponseCode, count(*) FROM testData " +
  "GROUP BY httpResponseCode ORDER BY httpResponseCode DESC"
).collect().foreach(println)
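
To sanity-check that both engines return the same answer for the query above, it can help to compare against the same aggregation done on a tiny in-memory sample. The sketch below uses plain Scala collections only. The sample values are made up for illustration and are not data from the evaluation, and `countsByCode` is a hypothetical name.

```scala
// Made-up sample of httpResponseCode values; not data from the evaluation.
val httpResponseCodes = Seq(200, 404, 200, 500, 200, 404)

// Equivalent of: GROUP BY httpResponseCode with count(*),
// followed by ORDER BY httpResponseCode DESC.
val countsByCode: Seq[(Int, Int)] =
  httpResponseCodes
    .groupBy(identity)
    .map { case (code, rows) => (code, rows.size) }
    .toSeq
    .sortBy { case (code, _) => -code }

// Prints (500,1), (404,2), (200,3), one pair per line.
countsByCode.foreach(println)
```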

Although Drill seems very promising, it still has a few issues to work out,
and since I already use Spark, I'm going to stick with Spark SQL for now.

Adam
