You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Ivan Kovacevic <iv...@gmail.com> on 2017/04/26 14:12:42 UTC

Apache Drill Query Planning Performance

Dear Sir or Madam,

I would like to ask a question regarding query planning, since I am writing
a chapter about Apache Drill in my master thesis.
My DrillBit is installed within a Cloudera VM, and there is a separate VM
with MongoDb installed.
At the time of writing, I'm performing analysis on the yelp academic
dataset contained in Hive tables, and joining it with separate data in
MongoDb.
When running queries, I have noticed that there is a significant difference
in the duration of the first planning phase of a query and the following
planning phases of the same query, e.g.:

SELECT COUNT(*) FROM `hive.yelp_academic_dataset`.review_impala;

   - The first time the query is run:
      - PLANNING:* 30.230 sec*
      - EXECUTION: 27.968 sec
      - [image: Ugrađena slika 1]
      - [image: Ugrađena slika 2]
      - The next time the same query is run (given that other queries are
   not run in the meantime)
      - PLANNING: *0.087 sec*
      - EXECUTION: 34.682 sec
      - [image: Ugrađena slika 3]
      - [image: Ugrađena slika 4]


The reason I find it rather odd is that if another query runs in the
meantime, the next time the first query is re-run, it will again take a
long time to finish the query planning phase.
What causes such difference in the query planning phase duration?
I'm looking forward to Your answer.

Best Regards,
Ivan Kovačević

Re: Apache Drill Query Planning Performance

Posted by Jinfeng Ni <jn...@apache.org>.

Try to increase config `hive.metastore.cache-ttl-seconds`. The default
is 60 seconds.  I guess the reason you saw long planning time after
another query is probably the cache from the first run expired.


1. https://drill.apache.org/docs/hive-metadata-caching/

On Wed, Apr 26, 2017 at 3:47 PM, rahul challapalli
<ch...@gmail.com> wrote:
> If your hive metastore contains a lot of metadata (many databases, tables,
> columns etc), then drill might spend a significant time in fetching the
> metadata the first time. It caches the metadata, so subsequent runs should
> be faster. The fact that other queries are run in-between the first and
> second run of your query does not invalidate the cached metadata. Its not
> clear from what you mentioned whether the second run's planning time is as
> long as the first run when you run some other queries in the middle. If so
> there is something else going on.
>
> Also, if you have attached any images (or files), they will be filtered
> out. If you want to share something (logs, profiles etc), then go ahead and
> raise a jira with all the information you have.
>
> - Rahul
>
> On Wed, Apr 26, 2017 at 7:12 AM, Ivan Kovacevic <iv...@gmail.com>
> wrote:
>
>> Dear Sir or Madam,
>>
>> I would like to ask a question regarding query planning, since I am
>> writing a chapter about Apache Drill in my master thesis.
>> My DrillBit is installed within a Cloudera VM, and there is a separate VM
>> with MongoDb installed.
>> At the time of writing, I'm performing analysis on the yelp academic
>> dataset contained in Hive tables, and joining it with separate data in
>> MongoDb.
>> When running queries, I have noticed that there is a significant
>> difference in the duration of the first planning phase of a query and the
>> following planning phases of the same query, e.g.:
>>
>> SELECT COUNT(*) FROM `hive.yelp_academic_dataset`.review_impala;
>>
>>    - The first time the query is run:
>>       - PLANNING:* 30.230 sec*
>>       - EXECUTION: 27.968 sec
>>       - [image: Ugrađena slika 1]
>>       - [image: Ugrađena slika 2]
>>       - The next time the same query is run (given that other queries are
>>    not run in the meantime)
>>       - PLANNING: *0.087 sec*
>>       - EXECUTION: 34.682 sec
>>       - [image: Ugrađena slika 3]
>>       - [image: Ugrađena slika 4]
>>
>>
>> The reason I find it rather odd is that if another query runs in the
>> meantime, the next time the first query is re-run, it will again take a
>> long time to finish the query planning phase.
>> What causes such difference in the query planning phase duration?
>> I'm looking forward to Your answer.
>>
>> Best Regards,
>> Ivan Kovačević
>>

Re: Apache Drill Query Planning Performance

Posted by rahul challapalli <ch...@gmail.com>.

If your hive metastore contains a lot of metadata (many databases, tables,
columns etc), then drill might spend a significant time in fetching the
metadata the first time. It caches the metadata, so subsequent runs should
be faster. The fact that other queries are run in-between the first and
second run of your query does not invalidate the cached metadata. Its not
clear from what you mentioned whether the second run's planning time is as
long as the first run when you run some other queries in the middle. If so
there is something else going on.

Also, if you have attached any images (or files), they will be filtered
out. If you want to share something (logs, profiles etc), then go ahead and
raise a jira with all the information you have.

- Rahul

On Wed, Apr 26, 2017 at 7:12 AM, Ivan Kovacevic <iv...@gmail.com>
wrote:

> Dear Sir or Madam,
>
> I would like to ask a question regarding query planning, since I am
> writing a chapter about Apache Drill in my master thesis.
> My DrillBit is installed within a Cloudera VM, and there is a separate VM
> with MongoDb installed.
> At the time of writing, I'm performing analysis on the yelp academic
> dataset contained in Hive tables, and joining it with separate data in
> MongoDb.
> When running queries, I have noticed that there is a significant
> difference in the duration of the first planning phase of a query and the
> following planning phases of the same query, e.g.:
>
> SELECT COUNT(*) FROM `hive.yelp_academic_dataset`.review_impala;
>
>    - The first time the query is run:
>       - PLANNING:* 30.230 sec*
>       - EXECUTION: 27.968 sec
>       - [image: Ugrađena slika 1]
>       - [image: Ugrađena slika 2]
>       - The next time the same query is run (given that other queries are
>    not run in the meantime)
>       - PLANNING: *0.087 sec*
>       - EXECUTION: 34.682 sec
>       - [image: Ugrađena slika 3]
>       - [image: Ugrađena slika 4]
>
>
> The reason I find it rather odd is that if another query runs in the
> meantime, the next time the first query is re-run, it will again take a
> long time to finish the query planning phase.
> What causes such difference in the query planning phase duration?
> I'm looking forward to Your answer.
>
> Best Regards,
> Ivan Kovačević
>