Posted to user@drill.apache.org by Ming Han Teh <te...@gmail.com> on 2015/08/16 10:47:17 UTC

Benchmarks for Apache Drill

Hi,

Are there any benchmarks on Apache Drill?
(standalone benchmarks OR vs Impala/Presto)

Thanks,
Ming Han

Re: Benchmarks for Apache Drill

Posted by Ted Dunning <te...@gmail.com>.

The Drill project itself has not focussed on performance, other than in the
basic architecture.

There have been some external benchmarks independent of the Drill project
by Intel and another group whose name escapes me.

The Intel work is presented here:
http://www.slideshare.net/Hadoop_Summit/notonlyhadoop-the-dag-showdown

Some notes and cautions are in order, however:

1) the team doing the benchmarking worked completely independently of the
Drill project and didn't get any advice about configuration of Drill (they
got information from some other projects).

2) the version of Drill used was 0.9 which is considerably slower than 1.0
and 1.1 (the most current).

3) the language limitations cited have been substantially reduced since
0.9.

Re: Benchmarks for Apache Drill

Posted by Ted Dunning <te...@gmail.com>.

On Mon, Aug 17, 2015 at 8:30 PM, Andrew Brust <
andrew.brust@bluebadgeinsights.com> wrote:

> Thanks!  Amazing how much that reminds me of writing .NET CLR functions
> and aggregates for SQL Server, something I've covered in our SQL Server
> book for the last 10 years.
>

It is similar, I think, particularly with regard to the annotation style. I
imagine that there may be some changes here and there to the style,
particularly in the semantic and lexical constraints on UDFs in Drill
(which, as you point out, is considerably less mature than SQL Server).

> Meanwhile, and forgive me if I'm being thick, but how does that
> architecture lend itself to vectorization of the code?
>

Well, the code that you see isn't the code that runs.

What Drill does is use your compiled code to find the annotations to
understand the code. Then it uses the source to generate the actual code
that is run. But then because your code is inserted in-line with all the
other code in the query (with appropriate lexical constraints), the JIT
optimizer can see everything it needs in order to heavily rewrite your
code.  For simple operations like additions and such, the optimizer can
even insert vectorized code. Drill can also have special purpose operators
that recognize the potential for vectorization and insert those vectorized
operators as practical.
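
The contrast described above can be sketched in plain Java. This is only an illustration under stated assumptions: it is not Drill's generated code and does not use any Drill API, and the names InlineSketch, applyPerRow and addInlined are hypothetical. It shows the same addition written two ways: reached through an interface call for every value, and written directly into a tight loop over two columns of primitives, which is the shape a JIT compiler is more likely to vectorize. Whether vectorization actually happens depends on the JVM and the hardware.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

// Hypothetical illustration only: not Drill's generated code or API.
final class InlineSketch {

    // Per-value style: the function body sits behind an interface call,
    // so the loop itself does not contain the arithmetic.
    static int[] applyPerRow(int[] left, int[] right, IntBinaryOperator fn) {
        int n = Math.min(left.length, right.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = fn.applyAsInt(left[i], right[i]);
        }
        return out;
    }

    // Inlined style: the function body is written directly into the loop
    // over two columns of primitives, so the optimizer sees all of it.
    static int[] addInlined(int[] left, int[] right) {
        int n = Math.min(left.length, right.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = left[i] + right[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        // Both print [11, 22, 33, 44]; only the shape of the loop differs.
        System.out.println(Arrays.toString(applyPerRow(a, b, (x, y) -> x + y)));
        System.out.println(Arrays.toString(addInlined(a, b)));
    }
}
```

Both methods return the same result. The point of the sketch is only that the second form leaves nothing opaque inside the loop.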

RE: Benchmarks for Apache Drill

Posted by Andrew Brust <an...@bluebadgeinsights.com>.

Thanks!  Amazing how much that reminds me of writing .NET CLR functions and aggregates for SQL Server, something I've covered in our SQL Server book for the last 10 years.

Meanwhile, and forgive me if I'm being thick, but how does that architecture lend itself to vectorization of the code?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, August 17, 2015 3:37 AM
To: user <us...@drill.apache.org>
Subject: Re: Benchmarks for Apache Drill

Docs:

https://drill.apache.org/docs/develop-custom-functions-introduction/

Some usable examples:

https://github.com/mapr-demos/simple-drill-functions


Re: Benchmarks for Apache Drill

Posted by Ted Dunning <te...@gmail.com>.

Docs:

https://drill.apache.org/docs/develop-custom-functions-introduction/

Some usable examples:

https://github.com/mapr-demos/simple-drill-functions


On Sun, Aug 16, 2015 at 11:06 PM, Andrew Brust <
andrew.brust@bluebadgeinsights.com> wrote:

> >> the unusual code-embedding UDF system that Drill has <<
> Have a good link where I could read more about that?
>

RE: Benchmarks for Apache Drill

Posted by Andrew Brust <an...@bluebadgeinsights.com>.

>> the unusual code-embedding UDF system that Drill has <<
Have a good link where I could read more about that?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, August 17, 2015 1:52 AM
To: user
Subject: Re: Benchmarks for Apache Drill

On Sun, Aug 16, 2015 at 10:22 PM, Andrew Brust < andrew.brust@bluebadgeinsights.com> wrote:

> I have to admit, I didn't realize columnar was such a big part of Drill.
> I guess that's consistent with Dremel, so it makes sense.  I always
> thought the emphasis was on heterogeneous data access, not on perf.  Cool!
>

The focus of Drill is to combine polymorphic heterogeneous data with performance.  Some systems have polymorphism, some have performance.
Essentially none have both.

>
> So with that in mind, does Drill do much with vector processing/SIMD
> operation?
>

Actually, a number of basic operations in Drill do get vectorized to use the SIMD instructions of the underlying processor.  Enabling that is the rationale behind the unusual code-embedding UDF system that Drill has.

Re: Benchmarks for Apache Drill

Posted by Ted Dunning <te...@gmail.com>.

On Sun, Aug 16, 2015 at 10:22 PM, Andrew Brust <
andrew.brust@bluebadgeinsights.com> wrote:

> I have to admit, I didn't realize columnar was such a big part of Drill.
> I guess that's consistent with Dremel, so it makes sense.  I always thought
> the emphasis was on heterogeneous data access, not on perf.  Cool!
>

The focus of Drill is to combine polymorphic heterogeneous data with
performance.  Some systems have polymorphism, some have performance.
Essentially none have both.

>
> So with that in mind, does Drill do much with vector processing/SIMD
> operation?
>

Actually, a number of basic operations in Drill do get vectorized to use
the SIMD instructions of the underlying processor.  Enabling that is the
rationale behind the unusual code-embedding UDF system that Drill has.

RE: Benchmarks for Apache Drill

Posted by Andrew Brust <an...@bluebadgeinsights.com>.

I have to admit, I didn't realize columnar was such a big part of Drill.  I guess that's consistent with Dremel, so it makes sense.  I always thought the emphasis was on heterogeneous data access, not on perf.  Cool!

So with that in mind, does Drill do much with vector processing/SIMD operation?

-----Original Message-----
From: Jacques Nadeau [mailto:jacques@dremio.com] 
Sent: Monday, August 17, 2015 1:17 AM
To: user@drill.apache.org
Subject: Re: Benchmarks for Apache Drill

Drill is very fast.  This is because nearly everybody on the Drill team is focused on performance.  We haven't published any formal benchmarks yet.
That being said, there are a few out there.  I see that Ted mentioned the Intel one.  Another is here [1]. As Ted mentioned, these blogs test older and pre-release versions of Drill.  Nonetheless, Drill already outshines nearly all of the competition.  That being said, the reality is that most benchmarks are very skewed and poorly executed so I strongly recommend you try out Drill on your workload.  Once you get set up, ask the community for help to tune the system.  Many others are finding it to be incredibly fast and it has repeatedly displaced commercial MPP databases and older open source technologies.

Drill is the only open source pure columnar in-memory execution engine today.  This means that Drill has the right architecture to continue to increase its lead over other engines. (Think of this as future-proofing.)  We'll be enhancing the engine with items including columnar functions, compilation optimizations and customized relational operators in the coming months.  This will simply extend Drill's performance lead.

thanks,
Jacques

[1] http://allegro.tech/fast-data-hackathon.html

--
Jacques Nadeau
CTO and Co-Founder, Dremio

Re: Benchmarks for Apache Drill

Posted by Jacques Nadeau <ja...@dremio.com>.

Drill is very fast.  This is because nearly everybody on the Drill team is
focused on performance.  We haven't published any formal benchmarks yet.
That being said, there are a few out there.  I see that Ted mentioned the
Intel one.  Another is here [1]. As Ted mentioned, these blogs test older
and pre-release versions of Drill.  Nonetheless, Drill already outshines
nearly all of the competition.  That being said, the reality is that most
benchmarks are very skewed and poorly executed so I strongly recommend you
try out Drill on your workload.  Once you get set up, ask the community for
help to tune the system.  Many others are finding it to be incredibly fast
and it has repeatedly displaced commercial MPP databases and older open
source technologies.
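
For anyone following the advice to measure their own workload, a minimal timing harness might look like the sketch below. It is an assumption-laden illustration, not part of Drill: the class name WorkloadTimer and its methods are hypothetical, and the Runnable is a placeholder for whatever code submits your real query. It discards a few warm-up runs and reports the median of the timed runs, which helps avoid two of the common ways benchmarks end up skewed.

```java
import java.util.Arrays;

// Hypothetical helper for timing your own workload; not a Drill API.
final class WorkloadTimer {

    // Median of the samples; for an even count, the mean of the two middle values.
    static double median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Runs the workload `warmups` times untimed, then `runs` times timed,
    // and returns the elapsed nanoseconds of each timed run.
    static long[] time(Runnable workload, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            workload.run();
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with code that submits your real query.
        Runnable workload = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };
        long[] samples = time(workload, 3, 7);
        System.out.printf("median: %.3f ms%n", median(samples) / 1e6);
    }
}
```

The median is used rather than the mean so that a single slow run (a cold cache, a GC pause) does not dominate the reported figure.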

Drill is the only open source pure columnar in-memory execution engine
today.  This means that Drill has the right architecture to continue to
increase its lead over other engines. (Think of this as future-proofing.)
We'll be enhancing the engine with items including columnar functions,
compilation optimizations and customized relational operators in the coming
months.  This will simply extend Drill's performance lead.

thanks,
Jacques

[1] http://allegro.tech/fast-data-hackathon.html

--
Jacques Nadeau
CTO and Co-Founder, Dremio
