You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Karan Hegde <ka...@premexcorp.com> on 2018/03/09 19:10:28 UTC

Drill-SQL Server Performance Issues

Hi,

First of, Drill is an amazing tool and makes my work so much easier when I
have different data types and sources to deal with, its a real life-saver.
So thank you all for creating it!

I've been testing Drill locally with an MS SQL Server, using the PyDrill
Python driver. My larger queries are taking a long time to run. For e.g. a
query extracting ~1 million rows and 8 columns takes 30-40 minutes to
execute.

I have a few questions I'm hoping to get some clarity on:

   1. Is this performance expected or unusual and if so, is it because
   Drill is not optimized for RDBMS querying?

   2. Is there any way for me to speed up queries (apart from running it on
   a cluster in distributed mode), for e.g. by specifying the schema in the
   query so that Drill doesn't need to spend time in schema discovery?

   3. When I look the Physical Plan on the Drill interface for the query, I
   don't see any values in *cumulative cost *in any of my fragment profiles
   *:*

   cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}

   Is there a particular reason for this?

If you've got any other tips/tricks for working with relational databases
in Drill, please do let me know.

Thank You!!
Karan Hegde

Re: Drill-SQL Server Performance Issues

Posted by Kunal Khatua <ku...@apache.org>.

   1. Drill connects to JDBC sources using the JDBC Storage plugin and you
   should probably try that.That said, the JDBC storage plugin, AFAIK, does
   not do push down of filters to the database. But you could try this to
   confirm.
   2. What you need to look at first is the query profile for these queries
   that you are running. The operators overview section will tell you where is
   most of the time being spent.
   3. The phsyical plan is basically a text that describes what Drill
   thinks is the nature of teh underlying data and will create an execution
   plan based on that. For things, such as network cost, we don't currently
   have any information which is why those are zero.

Coming back to the details you provided, here is what you should know..

   - The pyDrill python driver is a community-provided driver which makes
   use of REST APIs to do the actual query. REST APIs are inefficient for
   getching millions of rows because it involces using HTTP requests and
   numerous layers between your application and Drill.
   - You could try connecting to Drill with it's JDBC driver by using the
   JayDeBeApi package in Python (
   https://pypi.python.org/pypi/JayDeBeApi/#usage).

Disclaimer: I've never tested Drill with JayDeBeApi , but it looked like
something more performant than pyDrill. If you can get this to work, we'd
love it if you can share your experience (and the steps) with us.

~ Kunal

On Fri, Mar 9, 2018 at 11:10 AM, Karan Hegde <ka...@premexcorp.com>
wrote:

> Hi,
>
> First of, Drill is an amazing tool and makes my work so much easier when I
> have different data types and sources to deal with, its a real life-saver.
> So thank you all for creating it!
>
> I've been testing Drill locally with an MS SQL Server, using the PyDrill
> Python driver. My larger queries are taking a long time to run. For e.g. a
> query extracting ~1 million rows and 8 columns takes 30-40 minutes to
> execute.
>
> I have a few questions I'm hoping to get some clarity on:
>
>    1. Is this performance expected or unusual and if so, is it because
>    Drill is not optimized for RDBMS querying?
>
>    2. Is there any way for me to speed up queries (apart from running it on
>    a cluster in distributed mode), for e.g. by specifying the schema in the
>    query so that Drill doesn't need to spend time in schema discovery?
>
>    3. When I look the Physical Plan on the Drill interface for the query, I
>    don't see any values in *cumulative cost *in any of my fragment profiles
>    *:*
>
>    cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}
>
>    Is there a particular reason for this?
>
> If you've got any other tips/tricks for working with relational databases
> in Drill, please do let me know.
>
> Thank You!!
> Karan Hegde
>