You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Divya Gehlot <di...@gmail.com> on 2017/08/17 03:15:41 UTC

Query Optimization

Hi,
I have data in parquet file format .
when I run the query the data and see the execution plan I could see
following
statistics

> TOTAL FRAGMENTS: 1
>> DURATION: 01 min 0.233 sec
>> PLANNING: 59.818 sec
>> QUEUED: Not Available
>> EXECUTION: 0.415 sec
>
>

As its a paquet file format I tried enabling refresh meta data
and run below command
REFRESH TABLE METADATA <path to table> ;
then run the same query again on the same table same data (no changes in
data)  and could find the statistics as show below :

TOTAL FRAGMENTS: 13
>> DURATION: 14 min 14.604 sec
>> PLANNING: 33.087 sec
>> QUEUED: Not Available
>> EXECUTION: Not Available
>
>
The query is still running .

Can somebody help me  understand why the query taking so long once I issue
the refresh metadata command.

Aprreciate the help !

Thanks,
Divya

Re: Query Optimization

Posted by Divya Gehlot <di...@gmail.com>.

Hi ,
Here is the JIRA  DRILL-5739
<https://issues.apache.org/jira/browse/DRILL-5739>  link  of the issue .

Thanks,
Divya

On 22 August 2017 at 04:28, Padma Penumarthy <pp...@mapr.com> wrote:

> That is definitely not the design strategy. Also, I don’t think what you
> are seeing is same
> as DRILL-3846.  The difference between with and without metadata caching
> is a
> factor of 2-4 times in DRILL-3846 where as what you see is huge order of
> magnitude
> different.
>
> You should file a JIRA and include details that will help us reproduce the
> problem.
> Please add as much information as possible.
> A sample dataset, how you are creating the table (i.e. partition info),
> logs, query profiles will be very helpful.
>
> Thanks,
> Padma
>
>
> > On Aug 20, 2017, at 7:03 PM, Divya Gehlot <di...@gmail.com>
> wrote:
> >
> > Hi ,
> > Yes As Rahul mentioned I am running into a bug
> > https://issues.apache.org/jira/browse/DRILL-3846 ?
> >
> > As asked the usedMetadataFile is true once I run the Metadata cache
> query .
> > Any tentative or workaorund for the bug?
> >
> > Now my ask is if metadata cache is enabled the does Drill reads all the
> > files instead of intended ones ?
> > Is it Drill design strategy ?
> >
> > Thanks,
> > divya
> >
> > On 18 August 2017 at 12:13, Padma Penumarthy <pp...@mapr.com>
> wrote:
> >
> >> It is supposed to work like you expected. May be you are running into a
> >> bug.
> >> Why is it reading all files after metadata refresh ? That is difficult
> to
> >> answer without
> >> looking at the logs and query profile. If you look at the query profile,
> >> you can may
> >> be check what usedMetadataFile flag says for scan.
> >> Also, I am thinking if you created so many files, your metadata
> >> cache file could be big. May be you can manually sanity
> >> check if it looks ok (look for .drill.parquet.metadata file in the root
> >> directory) and not
> >> corrupted ?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <kfaraaz@mapr.com<mailto:
> kfara
> >> az@mapr.com>> wrote:
> >>
> >> Please share your SQL query and the query plan.
> >>
> >> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;
> >>
> >>
> >> Thanks,
> >>
> >> Khurram
> >>
> >> ________________________________
> >> From: Divya Gehlot <divya.htconex@gmail.com<mailto:
> divya.htconex@gmail.com
> >>>>
> >> Sent: Friday, August 18, 2017 7:15:18 AM
> >> To: user@drill.apache.org<ma...@drill.apache.org>
> >> Subject: Re: Query Optimization
> >>
> >> Hi ,
> >> Yes its the same query its just the ran the metadata refresh command .
> >> My understanding is metadata refresh command saves reading the metadata.
> >> How about column values ... Why is it reading all the files after
> metedata
> >> refresh ?
> >> Partition helps to retrieve data faster .
> >> Like in hive how it happens when you mention the partition column in
> where
> >> condition
> >> it just goes and read and improves the query performace .
> >> In my query also I where conidtion has  partioning column it should go
> and
> >> read those partitioned files right ?
> >> Why is it taking more time ?
> >> Does the Drill works in different way compare to hive ?
> >>
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 18 August 2017 at 07:37, Padma Penumarthy <ppenumarthy@mapr.com
> <mailto:
> >> ppenumarthy@mapr.com>> wrote:
> >>
> >> It might read all those files if some new data gets added after running
> >> refresh metadata cache.
> >> If everything is same before and after metadata refresh i.e. no
> >> new data added and query is exactly the same, then it should not do
> that.
> >> Also, check if you can partition in  a way that will not create so many
> >> files in the
> >> first place.
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <divya.htconex@gmail.com<
> >> mailto:divya.htconex@gmail.com>>
> >> wrote:
> >>
> >> Hi,
> >> Another observation is
> >> My query had where conditions based on the partition values
> >>
> >> Total number of parquet files in directory  - 102290
> >> Before Metadata refresh - Its reading only 4 files
> >> After metadata refresh - its reading 102290 files
> >>
> >>
> >> This is how the refresh metadata works I mean it scans each and every
> >> files
> >> and get the results ?
> >>
> >> I dont  have access to logs now .
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 17 August 2017 at 13:48, Divya Gehlot <divya.htconex@gmail.com<
> mailto:
> >> divya.htconex@gmail.com>>
> >> wrote:
> >>
> >> Hi,
> >> Another observation is
> >> My query had where conditions based on the partition values
> >> Before Metadata refresh - Its reading only 4 files
> >> After metadata refresh - its reading 102290 files
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 17 August 2017 at 13:03, Padma Penumarthy <ppenumarthy@mapr.com
> <mailto:
> >> ppenumarthy@mapr.com>>
> >> wrote:
> >>
> >> Does your query have partition filter ?
> >> Execution time is increased most likely because partition pruning is
> >> not
> >> happening.
> >> Did you get a chance to look at the logs ?  That might give some clues.
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <divya.htconex@gmail.com<
> mailto:
> >> divya.htconex@gmail.com>>
> >> wrote:
> >>
> >> Hi,
> >> Even I am surprised .
> >> I am running Drill version 1.10  on MapR enterprise version.
> >> *Query *- Selecting all the columns on partitioned parquet table
> >>
> >> I observed few things from Query statistics :
> >>
> >> Value
> >>
> >> Before Refresh Metadata
> >>
> >> After Refresh Metadata
> >>
> >> Fragments
> >>
> >> 1
> >>
> >> 13
> >>
> >> DURATION
> >>
> >> 01 min 0.233 sec
> >>
> >> 18 min 0.744 sec
> >>
> >> PLANNING
> >>
> >> 59.818 sec
> >>
> >> 33.087 sec
> >>
> >> QUEUED
> >>
> >> Not Available
> >>
> >> Not Available
> >>
> >> EXECUTION
> >>
> >> 0.415 sec
> >>
> >> 17 min 27.657 sec
> >>
> >> The planning time is being reduced by approx 60% but the execution
> >> time
> >> increased  drastically.
> >> I would like to understand why the exceution time increases after the
> >> metadata refresh .
> >>
> >>
> >> Appreciate the help.
> >>
> >> Thanks,
> >> divya
> >>
> >>
> >> On 17 August 2017 at 11:54, Padma Penumarthy <ppenumarthy@mapr.com
> <mailto:
> >> ppenumarthy@mapr.com>>
> >> wrote:
> >>
> >> Refresh table metadata should  help reduce query planning time.
> >> It is odd that it went up after you did refresh table metadata.
> >> Did you check the logs to see what is happening ? You might have to
> >> turn on some debugs if needed.
> >> BTW, what version of Drill are you running ?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <divya.htconex@gmail.com<
> mailto:
> >> divya.htconex@gmail.com>>
> >> wrote:
> >>
> >> Hi,
> >> I have data in parquet file format .
> >> when I run the query the data and see the execution plan I could see
> >> following
> >> statistics
> >>
> >> TOTAL FRAGMENTS: 1
> >> DURATION: 01 min 0.233 sec
> >> PLANNING: 59.818 sec
> >> QUEUED: Not Available
> >> EXECUTION: 0.415 sec
> >>
> >>
> >>
> >> As its a paquet file format I tried enabling refresh meta data
> >> and run below command
> >> REFRESH TABLE METADATA <path to table> ;
> >> then run the same query again on the same table same data (no
> >> changes
> >> in
> >> data)  and could find the statistics as show below :
> >>
> >> TOTAL FRAGMENTS: 13
> >> DURATION: 14 min 14.604 sec
> >> PLANNING: 33.087 sec
> >> QUEUED: Not Available
> >> EXECUTION: Not Available
> >>
> >>
> >> The query is still running .
> >>
> >> Can somebody help me  understand why the query taking so long once I
> >> issue
> >> the refresh metadata command.
> >>
> >> Aprreciate the help !
> >>
> >> Thanks,
> >> Divya
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
>
>

Re: Query Optimization

Posted by Padma Penumarthy <pp...@mapr.com>.

That is definitely not the design strategy. Also, I don’t think what you are seeing is same
as DRILL-3846.  The difference between with and without metadata caching is a
factor of 2-4 times in DRILL-3846 where as what you see is huge order of magnitude
different.

You should file a JIRA and include details that will help us reproduce the problem.
Please add as much information as possible.
A sample dataset, how you are creating the table (i.e. partition info), 
logs, query profiles will be very helpful.

Thanks,
Padma


> On Aug 20, 2017, at 7:03 PM, Divya Gehlot <di...@gmail.com> wrote:
> 
> Hi ,
> Yes As Rahul mentioned I am running into a bug
> https://issues.apache.org/jira/browse/DRILL-3846 ?
> 
> As asked the usedMetadataFile is true once I run the Metadata cache query .
> Any tentative or workaorund for the bug?
> 
> Now my ask is if metadata cache is enabled the does Drill reads all the
> files instead of intended ones ?
> Is it Drill design strategy ?
> 
> Thanks,
> divya
> 
> On 18 August 2017 at 12:13, Padma Penumarthy <pp...@mapr.com> wrote:
> 
>> It is supposed to work like you expected. May be you are running into a
>> bug.
>> Why is it reading all files after metadata refresh ? That is difficult to
>> answer without
>> looking at the logs and query profile. If you look at the query profile,
>> you can may
>> be check what usedMetadataFile flag says for scan.
>> Also, I am thinking if you created so many files, your metadata
>> cache file could be big. May be you can manually sanity
>> check if it looks ok (look for .drill.parquet.metadata file in the root
>> directory) and not
>> corrupted ?
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <kfaraaz@mapr.com<mailto:kfara
>> az@mapr.com>> wrote:
>> 
>> Please share your SQL query and the query plan.
>> 
>> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;
>> 
>> 
>> Thanks,
>> 
>> Khurram
>> 
>> ________________________________
>> From: Divya Gehlot <divya.htconex@gmail.com<mailto:divya.htconex@gmail.com
>>>> 
>> Sent: Friday, August 18, 2017 7:15:18 AM
>> To: user@drill.apache.org<ma...@drill.apache.org>
>> Subject: Re: Query Optimization
>> 
>> Hi ,
>> Yes its the same query its just the ran the metadata refresh command .
>> My understanding is metadata refresh command saves reading the metadata.
>> How about column values ... Why is it reading all the files after metedata
>> refresh ?
>> Partition helps to retrieve data faster .
>> Like in hive how it happens when you mention the partition column in where
>> condition
>> it just goes and read and improves the query performace .
>> In my query also I where conidtion has  partioning column it should go and
>> read those partitioned files right ?
>> Why is it taking more time ?
>> Does the Drill works in different way compare to hive ?
>> 
>> 
>> Thanks,
>> Divya
>> 
>> On 18 August 2017 at 07:37, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
>> ppenumarthy@mapr.com>> wrote:
>> 
>> It might read all those files if some new data gets added after running
>> refresh metadata cache.
>> If everything is same before and after metadata refresh i.e. no
>> new data added and query is exactly the same, then it should not do that.
>> Also, check if you can partition in  a way that will not create so many
>> files in the
>> first place.
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <divya.htconex@gmail.com<
>> mailto:divya.htconex@gmail.com>>
>> wrote:
>> 
>> Hi,
>> Another observation is
>> My query had where conditions based on the partition values
>> 
>> Total number of parquet files in directory  - 102290
>> Before Metadata refresh - Its reading only 4 files
>> After metadata refresh - its reading 102290 files
>> 
>> 
>> This is how the refresh metadata works I mean it scans each and every
>> files
>> and get the results ?
>> 
>> I dont  have access to logs now .
>> 
>> Thanks,
>> Divya
>> 
>> On 17 August 2017 at 13:48, Divya Gehlot <divya.htconex@gmail.com<mailto:
>> divya.htconex@gmail.com>>
>> wrote:
>> 
>> Hi,
>> Another observation is
>> My query had where conditions based on the partition values
>> Before Metadata refresh - Its reading only 4 files
>> After metadata refresh - its reading 102290 files
>> 
>> Thanks,
>> Divya
>> 
>> On 17 August 2017 at 13:03, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
>> ppenumarthy@mapr.com>>
>> wrote:
>> 
>> Does your query have partition filter ?
>> Execution time is increased most likely because partition pruning is
>> not
>> happening.
>> Did you get a chance to look at the logs ?  That might give some clues.
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <divya.htconex@gmail.com<mailto:
>> divya.htconex@gmail.com>>
>> wrote:
>> 
>> Hi,
>> Even I am surprised .
>> I am running Drill version 1.10  on MapR enterprise version.
>> *Query *- Selecting all the columns on partitioned parquet table
>> 
>> I observed few things from Query statistics :
>> 
>> Value
>> 
>> Before Refresh Metadata
>> 
>> After Refresh Metadata
>> 
>> Fragments
>> 
>> 1
>> 
>> 13
>> 
>> DURATION
>> 
>> 01 min 0.233 sec
>> 
>> 18 min 0.744 sec
>> 
>> PLANNING
>> 
>> 59.818 sec
>> 
>> 33.087 sec
>> 
>> QUEUED
>> 
>> Not Available
>> 
>> Not Available
>> 
>> EXECUTION
>> 
>> 0.415 sec
>> 
>> 17 min 27.657 sec
>> 
>> The planning time is being reduced by approx 60% but the execution
>> time
>> increased  drastically.
>> I would like to understand why the exceution time increases after the
>> metadata refresh .
>> 
>> 
>> Appreciate the help.
>> 
>> Thanks,
>> divya
>> 
>> 
>> On 17 August 2017 at 11:54, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
>> ppenumarthy@mapr.com>>
>> wrote:
>> 
>> Refresh table metadata should  help reduce query planning time.
>> It is odd that it went up after you did refresh table metadata.
>> Did you check the logs to see what is happening ? You might have to
>> turn on some debugs if needed.
>> BTW, what version of Drill are you running ?
>> 
>> Thanks,
>> Padma
>> 
>> 
>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <divya.htconex@gmail.com<mailto:
>> divya.htconex@gmail.com>>
>> wrote:
>> 
>> Hi,
>> I have data in parquet file format .
>> when I run the query the data and see the execution plan I could see
>> following
>> statistics
>> 
>> TOTAL FRAGMENTS: 1
>> DURATION: 01 min 0.233 sec
>> PLANNING: 59.818 sec
>> QUEUED: Not Available
>> EXECUTION: 0.415 sec
>> 
>> 
>> 
>> As its a paquet file format I tried enabling refresh meta data
>> and run below command
>> REFRESH TABLE METADATA <path to table> ;
>> then run the same query again on the same table same data (no
>> changes
>> in
>> data)  and could find the statistics as show below :
>> 
>> TOTAL FRAGMENTS: 13
>> DURATION: 14 min 14.604 sec
>> PLANNING: 33.087 sec
>> QUEUED: Not Available
>> EXECUTION: Not Available
>> 
>> 
>> The query is still running .
>> 
>> Can somebody help me  understand why the query taking so long once I
>> issue
>> the refresh metadata command.
>> 
>> Aprreciate the help !
>> 
>> Thanks,
>> Divya
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>

Re: Query Optimization

Posted by Divya Gehlot <di...@gmail.com>.

Hi ,
Yes As Rahul mentioned I am running into a bug
 https://issues.apache.org/jira/browse/DRILL-3846 ?

As asked the usedMetadataFile is true once I run the Metadata cache query .
Any tentative or workaorund for the bug?

Now my ask is if metadata cache is enabled the does Drill reads all the
files instead of intended ones ?
Is it Drill design strategy ?

Thanks,
divya

On 18 August 2017 at 12:13, Padma Penumarthy <pp...@mapr.com> wrote:

> It is supposed to work like you expected. May be you are running into a
> bug.
> Why is it reading all files after metadata refresh ? That is difficult to
> answer without
> looking at the logs and query profile. If you look at the query profile,
> you can may
> be check what usedMetadataFile flag says for scan.
> Also, I am thinking if you created so many files, your metadata
> cache file could be big. May be you can manually sanity
> check if it looks ok (look for .drill.parquet.metadata file in the root
> directory) and not
> corrupted ?
>
> Thanks,
> Padma
>
>
> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <kfaraaz@mapr.com<mailto:kfara
> az@mapr.com>> wrote:
>
> Please share your SQL query and the query plan.
>
> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;
>
>
> Thanks,
>
> Khurram
>
> ________________________________
> From: Divya Gehlot <divya.htconex@gmail.com<mailto:divya.htconex@gmail.com
> >>
> Sent: Friday, August 18, 2017 7:15:18 AM
> To: user@drill.apache.org<ma...@drill.apache.org>
> Subject: Re: Query Optimization
>
> Hi ,
> Yes its the same query its just the ran the metadata refresh command .
> My understanding is metadata refresh command saves reading the metadata.
> How about column values ... Why is it reading all the files after metedata
> refresh ?
> Partition helps to retrieve data faster .
> Like in hive how it happens when you mention the partition column in where
> condition
> it just goes and read and improves the query performace .
> In my query also I where conidtion has  partioning column it should go and
> read those partitioned files right ?
> Why is it taking more time ?
> Does the Drill works in different way compare to hive ?
>
>
> Thanks,
> Divya
>
> On 18 August 2017 at 07:37, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
> ppenumarthy@mapr.com>> wrote:
>
> It might read all those files if some new data gets added after running
> refresh metadata cache.
> If everything is same before and after metadata refresh i.e. no
> new data added and query is exactly the same, then it should not do that.
> Also, check if you can partition in  a way that will not create so many
> files in the
> first place.
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <divya.htconex@gmail.com<
> mailto:divya.htconex@gmail.com>>
> wrote:
>
> Hi,
> Another observation is
> My query had where conditions based on the partition values
>
> Total number of parquet files in directory  - 102290
> Before Metadata refresh - Its reading only 4 files
> After metadata refresh - its reading 102290 files
>
>
> This is how the refresh metadata works I mean it scans each and every
> files
> and get the results ?
>
> I dont  have access to logs now .
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:48, Divya Gehlot <divya.htconex@gmail.com<mailto:
> divya.htconex@gmail.com>>
> wrote:
>
> Hi,
> Another observation is
> My query had where conditions based on the partition values
> Before Metadata refresh - Its reading only 4 files
> After metadata refresh - its reading 102290 files
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:03, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
> ppenumarthy@mapr.com>>
> wrote:
>
> Does your query have partition filter ?
> Execution time is increased most likely because partition pruning is
> not
> happening.
> Did you get a chance to look at the logs ?  That might give some clues.
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <divya.htconex@gmail.com<mailto:
> divya.htconex@gmail.com>>
> wrote:
>
> Hi,
> Even I am surprised .
> I am running Drill version 1.10  on MapR enterprise version.
> *Query *- Selecting all the columns on partitioned parquet table
>
> I observed few things from Query statistics :
>
> Value
>
> Before Refresh Metadata
>
> After Refresh Metadata
>
> Fragments
>
> 1
>
> 13
>
> DURATION
>
> 01 min 0.233 sec
>
> 18 min 0.744 sec
>
> PLANNING
>
> 59.818 sec
>
> 33.087 sec
>
> QUEUED
>
> Not Available
>
> Not Available
>
> EXECUTION
>
> 0.415 sec
>
> 17 min 27.657 sec
>
> The planning time is being reduced by approx 60% but the execution
> time
> increased  drastically.
> I would like to understand why the exceution time increases after the
> metadata refresh .
>
>
> Appreciate the help.
>
> Thanks,
> divya
>
>
> On 17 August 2017 at 11:54, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
> ppenumarthy@mapr.com>>
> wrote:
>
> Refresh table metadata should  help reduce query planning time.
> It is odd that it went up after you did refresh table metadata.
> Did you check the logs to see what is happening ? You might have to
> turn on some debugs if needed.
> BTW, what version of Drill are you running ?
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <divya.htconex@gmail.com<mailto:
> divya.htconex@gmail.com>>
> wrote:
>
> Hi,
> I have data in parquet file format .
> when I run the query the data and see the execution plan I could see
> following
> statistics
>
> TOTAL FRAGMENTS: 1
> DURATION: 01 min 0.233 sec
> PLANNING: 59.818 sec
> QUEUED: Not Available
> EXECUTION: 0.415 sec
>
>
>
> As its a paquet file format I tried enabling refresh meta data
> and run below command
> REFRESH TABLE METADATA <path to table> ;
> then run the same query again on the same table same data (no
> changes
> in
> data)  and could find the statistics as show below :
>
> TOTAL FRAGMENTS: 13
> DURATION: 14 min 14.604 sec
> PLANNING: 33.087 sec
> QUEUED: Not Available
> EXECUTION: Not Available
>
>
> The query is still running .
>
> Can somebody help me  understand why the query taking so long once I
> issue
> the refresh metadata command.
>
> Aprreciate the help !
>
> Thanks,
> Divya
>
>
>
>
>
>
>
>
>

Re: Query Optimization

Posted by rahul challapalli <ch...@gmail.com>.

Could you be running into https://issues.apache.org/jira/browse/DRILL-3846 ?

- Rahul

On Thu, Aug 17, 2017 at 9:13 PM, Padma Penumarthy <pp...@mapr.com>
wrote:

> It is supposed to work like you expected. May be you are running into a
> bug.
> Why is it reading all files after metadata refresh ? That is difficult to
> answer without
> looking at the logs and query profile. If you look at the query profile,
> you can may
> be check what usedMetadataFile flag says for scan.
> Also, I am thinking if you created so many files, your metadata
> cache file could be big. May be you can manually sanity
> check if it looks ok (look for .drill.parquet.metadata file in the root
> directory) and not
> corrupted ?
>
> Thanks,
> Padma
>
>
> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <kfaraaz@mapr.com<mailto:kfara
> az@mapr.com>> wrote:
>
> Please share your SQL query and the query plan.
>
> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;
>
>
> Thanks,
>
> Khurram
>
> ________________________________
> From: Divya Gehlot <divya.htconex@gmail.com<mailto:divya.htconex@gmail.com
> >>
> Sent: Friday, August 18, 2017 7:15:18 AM
> To: user@drill.apache.org<ma...@drill.apache.org>
> Subject: Re: Query Optimization
>
> Hi ,
> Yes its the same query its just the ran the metadata refresh command .
> My understanding is metadata refresh command saves reading the metadata.
> How about column values ... Why is it reading all the files after metedata
> refresh ?
> Partition helps to retrieve data faster .
> Like in hive how it happens when you mention the partition column in where
> condition
> it just goes and read and improves the query performace .
> In my query also I where conidtion has  partioning column it should go and
> read those partitioned files right ?
> Why is it taking more time ?
> Does the Drill works in different way compare to hive ?
>
>
> Thanks,
> Divya
>
> On 18 August 2017 at 07:37, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
> ppenumarthy@mapr.com>> wrote:
>
> It might read all those files if some new data gets added after running
> refresh metadata cache.
> If everything is same before and after metadata refresh i.e. no
> new data added and query is exactly the same, then it should not do that.
> Also, check if you can partition in  a way that will not create so many
> files in the
> first place.
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <divya.htconex@gmail.com<
> mailto:divya.htconex@gmail.com>>
> wrote:
>
> Hi,
> Another observation is
> My query had where conditions based on the partition values
>
> Total number of parquet files in directory  - 102290
> Before Metadata refresh - Its reading only 4 files
> After metadata refresh - its reading 102290 files
>
>
> This is how the refresh metadata works I mean it scans each and every
> files
> and get the results ?
>
> I dont  have access to logs now .
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:48, Divya Gehlot <divya.htconex@gmail.com<mailto:
> divya.htconex@gmail.com>>
> wrote:
>
> Hi,
> Another observation is
> My query had where conditions based on the partition values
> Before Metadata refresh - Its reading only 4 files
> After metadata refresh - its reading 102290 files
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:03, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
> ppenumarthy@mapr.com>>
> wrote:
>
> Does your query have partition filter ?
> Execution time is increased most likely because partition pruning is
> not
> happening.
> Did you get a chance to look at the logs ?  That might give some clues.
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <divya.htconex@gmail.com<mailto:
> divya.htconex@gmail.com>>
> wrote:
>
> Hi,
> Even I am surprised .
> I am running Drill version 1.10  on MapR enterprise version.
> *Query *- Selecting all the columns on partitioned parquet table
>
> I observed few things from Query statistics :
>
> Value
>
> Before Refresh Metadata
>
> After Refresh Metadata
>
> Fragments
>
> 1
>
> 13
>
> DURATION
>
> 01 min 0.233 sec
>
> 18 min 0.744 sec
>
> PLANNING
>
> 59.818 sec
>
> 33.087 sec
>
> QUEUED
>
> Not Available
>
> Not Available
>
> EXECUTION
>
> 0.415 sec
>
> 17 min 27.657 sec
>
> The planning time is being reduced by approx 60% but the execution
> time
> increased  drastically.
> I would like to understand why the exceution time increases after the
> metadata refresh .
>
>
> Appreciate the help.
>
> Thanks,
> divya
>
>
> On 17 August 2017 at 11:54, Padma Penumarthy <ppenumarthy@mapr.com<mailto:
> ppenumarthy@mapr.com>>
> wrote:
>
> Refresh table metadata should  help reduce query planning time.
> It is odd that it went up after you did refresh table metadata.
> Did you check the logs to see what is happening ? You might have to
> turn on some debugs if needed.
> BTW, what version of Drill are you running ?
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <divya.htconex@gmail.com<mailto:
> divya.htconex@gmail.com>>
> wrote:
>
> Hi,
> I have data in parquet file format .
> when I run the query the data and see the execution plan I could see
> following
> statistics
>
> TOTAL FRAGMENTS: 1
> DURATION: 01 min 0.233 sec
> PLANNING: 59.818 sec
> QUEUED: Not Available
> EXECUTION: 0.415 sec
>
>
>
> As its a paquet file format I tried enabling refresh meta data
> and run below command
> REFRESH TABLE METADATA <path to table> ;
> then run the same query again on the same table same data (no
> changes
> in
> data)  and could find the statistics as show below :
>
> TOTAL FRAGMENTS: 13
> DURATION: 14 min 14.604 sec
> PLANNING: 33.087 sec
> QUEUED: Not Available
> EXECUTION: Not Available
>
>
> The query is still running .
>
> Can somebody help me  understand why the query taking so long once I
> issue
> the refresh metadata command.
>
> Aprreciate the help !
>
> Thanks,
> Divya
>
>
>
>
>
>
>
>
>

Re: Query Optimization

Posted by Padma Penumarthy <pp...@mapr.com>.

It is supposed to work like you expected. May be you are running into a bug.
Why is it reading all files after metadata refresh ? That is difficult to answer without
looking at the logs and query profile. If you look at the query profile, you can may
be check what usedMetadataFile flag says for scan.
Also, I am thinking if you created so many files, your metadata
cache file could be big. May be you can manually sanity
check if it looks ok (look for .drill.parquet.metadata file in the root directory) and not
corrupted ?

Thanks,
Padma


On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <kf...@mapr.com>> wrote:

Please share your SQL query and the query plan.

To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;


Thanks,

Khurram

________________________________
From: Divya Gehlot <di...@gmail.com>>
Sent: Friday, August 18, 2017 7:15:18 AM
To: user@drill.apache.org<ma...@drill.apache.org>
Subject: Re: Query Optimization

Hi ,
Yes its the same query its just the ran the metadata refresh command .
My understanding is metadata refresh command saves reading the metadata.
How about column values ... Why is it reading all the files after metedata
refresh ?
Partition helps to retrieve data faster .
Like in hive how it happens when you mention the partition column in where
condition
it just goes and read and improves the query performace .
In my query also I where conidtion has  partioning column it should go and
read those partitioned files right ?
Why is it taking more time ?
Does the Drill works in different way compare to hive ?


Thanks,
Divya

On 18 August 2017 at 07:37, Padma Penumarthy <pp...@mapr.com>> wrote:

It might read all those files if some new data gets added after running
refresh metadata cache.
If everything is same before and after metadata refresh i.e. no
new data added and query is exactly the same, then it should not do that.
Also, check if you can partition in  a way that will not create so many
files in the
first place.

Thanks,
Padma


On Aug 16, 2017, at 10:54 PM, Divya Gehlot <di...@gmail.com>>
wrote:

Hi,
Another observation is
My query had where conditions based on the partition values

Total number of parquet files in directory  - 102290
Before Metadata refresh - Its reading only 4 files
After metadata refresh - its reading 102290 files


This is how the refresh metadata works I mean it scans each and every
files
and get the results ?

I dont  have access to logs now .

Thanks,
Divya

On 17 August 2017 at 13:48, Divya Gehlot <di...@gmail.com>>
wrote:

Hi,
Another observation is
My query had where conditions based on the partition values
Before Metadata refresh - Its reading only 4 files
After metadata refresh - its reading 102290 files

Thanks,
Divya

On 17 August 2017 at 13:03, Padma Penumarthy <pp...@mapr.com>>
wrote:

Does your query have partition filter ?
Execution time is increased most likely because partition pruning is
not
happening.
Did you get a chance to look at the logs ?  That might give some clues.

Thanks,
Padma


On Aug 16, 2017, at 9:32 PM, Divya Gehlot <di...@gmail.com>>
wrote:

Hi,
Even I am surprised .
I am running Drill version 1.10  on MapR enterprise version.
*Query *- Selecting all the columns on partitioned parquet table

I observed few things from Query statistics :

Value

Before Refresh Metadata

After Refresh Metadata

Fragments

1

13

DURATION

01 min 0.233 sec

18 min 0.744 sec

PLANNING

59.818 sec

33.087 sec

QUEUED

Not Available

Not Available

EXECUTION

0.415 sec

17 min 27.657 sec

The planning time is being reduced by approx 60% but the execution
time
increased  drastically.
I would like to understand why the exceution time increases after the
metadata refresh .


Appreciate the help.

Thanks,
divya


On 17 August 2017 at 11:54, Padma Penumarthy <pp...@mapr.com>>
wrote:

Refresh table metadata should  help reduce query planning time.
It is odd that it went up after you did refresh table metadata.
Did you check the logs to see what is happening ? You might have to
turn on some debugs if needed.
BTW, what version of Drill are you running ?

Thanks,
Padma


On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com>>
wrote:

Hi,
I have data in parquet file format .
when I run the query the data and see the execution plan I could see
following
statistics

TOTAL FRAGMENTS: 1
DURATION: 01 min 0.233 sec
PLANNING: 59.818 sec
QUEUED: Not Available
EXECUTION: 0.415 sec



As its a paquet file format I tried enabling refresh meta data
and run below command
REFRESH TABLE METADATA <path to table> ;
then run the same query again on the same table same data (no
changes
in
data)  and could find the statistics as show below :

TOTAL FRAGMENTS: 13
DURATION: 14 min 14.604 sec
PLANNING: 33.087 sec
QUEUED: Not Available
EXECUTION: Not Available


The query is still running .

Can somebody help me  understand why the query taking so long once I
issue
the refresh metadata command.

Aprreciate the help !

Thanks,
Divya

Re: Query Optimization

Posted by Khurram Faraaz <kf...@mapr.com>.

Please share your SQL query and the query plan.

To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;


Thanks,

Khurram

________________________________
From: Divya Gehlot <di...@gmail.com>
Sent: Friday, August 18, 2017 7:15:18 AM
To: user@drill.apache.org
Subject: Re: Query Optimization

Hi ,
Yes its the same query its just the ran the metadata refresh command .
My understanding is metadata refresh command saves reading the metadata.
How about column values ... Why is it reading all the files after metedata
refresh ?
Partition helps to retrieve data faster .
Like in hive how it happens when you mention the partition column in where
condition
it just goes and read and improves the query performace .
In my query also I where conidtion has  partioning column it should go and
read those partitioned files right ?
Why is it taking more time ?
Does the Drill works in different way compare to hive ?


Thanks,
Divya

On 18 August 2017 at 07:37, Padma Penumarthy <pp...@mapr.com> wrote:

> It might read all those files if some new data gets added after running
> refresh metadata cache.
> If everything is same before and after metadata refresh i.e. no
> new data added and query is exactly the same, then it should not do that.
> Also, check if you can partition in  a way that will not create so many
> files in the
> first place.
>
> Thanks,
> Padma
>
>
> > On Aug 16, 2017, at 10:54 PM, Divya Gehlot <di...@gmail.com>
> wrote:
> >
> > Hi,
> > Another observation is
> > My query had where conditions based on the partition values
> >
> > Total number of parquet files in directory  - 102290
> >> Before Metadata refresh - Its reading only 4 files
> >> After metadata refresh - its reading 102290 files
> >
> >
> > This is how the refresh metadata works I mean it scans each and every
> files
> > and get the results ?
> >
> > I dont  have access to logs now .
> >
> > Thanks,
> > Divya
> >
> > On 17 August 2017 at 13:48, Divya Gehlot <di...@gmail.com>
> wrote:
> >
> >> Hi,
> >> Another observation is
> >> My query had where conditions based on the partition values
> >> Before Metadata refresh - Its reading only 4 files
> >> After metadata refresh - its reading 102290 files
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 17 August 2017 at 13:03, Padma Penumarthy <pp...@mapr.com>
> wrote:
> >>
> >>> Does your query have partition filter ?
> >>> Execution time is increased most likely because partition pruning is
> not
> >>> happening.
> >>> Did you get a chance to look at the logs ?  That might give some clues.
> >>>
> >>> Thanks,
> >>> Padma
> >>>
> >>>
> >>>> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <di...@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi,
> >>>> Even I am surprised .
> >>>> I am running Drill version 1.10  on MapR enterprise version.
> >>>> *Query *- Selecting all the columns on partitioned parquet table
> >>>>
> >>>> I observed few things from Query statistics :
> >>>>
> >>>> Value
> >>>>
> >>>> Before Refresh Metadata
> >>>>
> >>>> After Refresh Metadata
> >>>>
> >>>> Fragments
> >>>>
> >>>> 1
> >>>>
> >>>> 13
> >>>>
> >>>> DURATION
> >>>>
> >>>> 01 min 0.233 sec
> >>>>
> >>>> 18 min 0.744 sec
> >>>>
> >>>> PLANNING
> >>>>
> >>>> 59.818 sec
> >>>>
> >>>> 33.087 sec
> >>>>
> >>>> QUEUED
> >>>>
> >>>> Not Available
> >>>>
> >>>> Not Available
> >>>>
> >>>> EXECUTION
> >>>>
> >>>> 0.415 sec
> >>>>
> >>>> 17 min 27.657 sec
> >>>>
> >>>> The planning time is being reduced by approx 60% but the execution
> time
> >>>> increased  drastically.
> >>>> I would like to understand why the exceution time increases after the
> >>>> metadata refresh .
> >>>>
> >>>>
> >>>> Appreciate the help.
> >>>>
> >>>> Thanks,
> >>>> divya
> >>>>
> >>>>
> >>>> On 17 August 2017 at 11:54, Padma Penumarthy <pp...@mapr.com>
> >>> wrote:
> >>>>
> >>>>> Refresh table metadata should  help reduce query planning time.
> >>>>> It is odd that it went up after you did refresh table metadata.
> >>>>> Did you check the logs to see what is happening ? You might have to
> >>>>> turn on some debugs if needed.
> >>>>> BTW, what version of Drill are you running ?
> >>>>>
> >>>>> Thanks,
> >>>>> Padma
> >>>>>
> >>>>>
> >>>>>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>> I have data in parquet file format .
> >>>>>> when I run the query the data and see the execution plan I could see
> >>>>>> following
> >>>>>> statistics
> >>>>>>
> >>>>>>> TOTAL FRAGMENTS: 1
> >>>>>>>> DURATION: 01 min 0.233 sec
> >>>>>>>> PLANNING: 59.818 sec
> >>>>>>>> QUEUED: Not Available
> >>>>>>>> EXECUTION: 0.415 sec
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> As its a paquet file format I tried enabling refresh meta data
> >>>>>> and run below command
> >>>>>> REFRESH TABLE METADATA <path to table> ;
> >>>>>> then run the same query again on the same table same data (no
> changes
> >>> in
> >>>>>> data)  and could find the statistics as show below :
> >>>>>>
> >>>>>> TOTAL FRAGMENTS: 13
> >>>>>>>> DURATION: 14 min 14.604 sec
> >>>>>>>> PLANNING: 33.087 sec
> >>>>>>>> QUEUED: Not Available
> >>>>>>>> EXECUTION: Not Available
> >>>>>>>
> >>>>>>>
> >>>>>> The query is still running .
> >>>>>>
> >>>>>> Can somebody help me  understand why the query taking so long once I
> >>>>> issue
> >>>>>> the refresh metadata command.
> >>>>>>
> >>>>>> Aprreciate the help !
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Divya
> >>>>>
> >>>>>
> >>>
> >>>
> >>
>
>

Re: Query Optimization

Posted by Divya Gehlot <di...@gmail.com>.

Hi ,
Yes its the same query its just the ran the metadata refresh command .
My understanding is metadata refresh command saves reading the metadata.
How about column values ... Why is it reading all the files after metedata
refresh ?
Partition helps to retrieve data faster .
Like in hive how it happens when you mention the partition column in where
condition
it just goes and read and improves the query performace .
In my query also I where conidtion has  partioning column it should go and
read those partitioned files right ?
Why is it taking more time ?
Does the Drill works in different way compare to hive ?


Thanks,
Divya

On 18 August 2017 at 07:37, Padma Penumarthy <pp...@mapr.com> wrote:

> It might read all those files if some new data gets added after running
> refresh metadata cache.
> If everything is same before and after metadata refresh i.e. no
> new data added and query is exactly the same, then it should not do that.
> Also, check if you can partition in  a way that will not create so many
> files in the
> first place.
>
> Thanks,
> Padma
>
>
> > On Aug 16, 2017, at 10:54 PM, Divya Gehlot <di...@gmail.com>
> wrote:
> >
> > Hi,
> > Another observation is
> > My query had where conditions based on the partition values
> >
> > Total number of parquet files in directory  - 102290
> >> Before Metadata refresh - Its reading only 4 files
> >> After metadata refresh - its reading 102290 files
> >
> >
> > This is how the refresh metadata works I mean it scans each and every
> files
> > and get the results ?
> >
> > I dont  have access to logs now .
> >
> > Thanks,
> > Divya
> >
> > On 17 August 2017 at 13:48, Divya Gehlot <di...@gmail.com>
> wrote:
> >
> >> Hi,
> >> Another observation is
> >> My query had where conditions based on the partition values
> >> Before Metadata refresh - Its reading only 4 files
> >> After metadata refresh - its reading 102290 files
> >>
> >> Thanks,
> >> Divya
> >>
> >> On 17 August 2017 at 13:03, Padma Penumarthy <pp...@mapr.com>
> wrote:
> >>
> >>> Does your query have partition filter ?
> >>> Execution time is increased most likely because partition pruning is
> not
> >>> happening.
> >>> Did you get a chance to look at the logs ?  That might give some clues.
> >>>
> >>> Thanks,
> >>> Padma
> >>>
> >>>
> >>>> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <di...@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi,
> >>>> Even I am surprised .
> >>>> I am running Drill version 1.10  on MapR enterprise version.
> >>>> *Query *- Selecting all the columns on partitioned parquet table
> >>>>
> >>>> I observed few things from Query statistics :
> >>>>
> >>>> Value
> >>>>
> >>>> Before Refresh Metadata
> >>>>
> >>>> After Refresh Metadata
> >>>>
> >>>> Fragments
> >>>>
> >>>> 1
> >>>>
> >>>> 13
> >>>>
> >>>> DURATION
> >>>>
> >>>> 01 min 0.233 sec
> >>>>
> >>>> 18 min 0.744 sec
> >>>>
> >>>> PLANNING
> >>>>
> >>>> 59.818 sec
> >>>>
> >>>> 33.087 sec
> >>>>
> >>>> QUEUED
> >>>>
> >>>> Not Available
> >>>>
> >>>> Not Available
> >>>>
> >>>> EXECUTION
> >>>>
> >>>> 0.415 sec
> >>>>
> >>>> 17 min 27.657 sec
> >>>>
> >>>> The planning time is being reduced by approx 60% but the execution
> time
> >>>> increased  drastically.
> >>>> I would like to understand why the exceution time increases after the
> >>>> metadata refresh .
> >>>>
> >>>>
> >>>> Appreciate the help.
> >>>>
> >>>> Thanks,
> >>>> divya
> >>>>
> >>>>
> >>>> On 17 August 2017 at 11:54, Padma Penumarthy <pp...@mapr.com>
> >>> wrote:
> >>>>
> >>>>> Refresh table metadata should  help reduce query planning time.
> >>>>> It is odd that it went up after you did refresh table metadata.
> >>>>> Did you check the logs to see what is happening ? You might have to
> >>>>> turn on some debugs if needed.
> >>>>> BTW, what version of Drill are you running ?
> >>>>>
> >>>>> Thanks,
> >>>>> Padma
> >>>>>
> >>>>>
> >>>>>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>> I have data in parquet file format .
> >>>>>> when I run the query the data and see the execution plan I could see
> >>>>>> following
> >>>>>> statistics
> >>>>>>
> >>>>>>> TOTAL FRAGMENTS: 1
> >>>>>>>> DURATION: 01 min 0.233 sec
> >>>>>>>> PLANNING: 59.818 sec
> >>>>>>>> QUEUED: Not Available
> >>>>>>>> EXECUTION: 0.415 sec
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> As its a paquet file format I tried enabling refresh meta data
> >>>>>> and run below command
> >>>>>> REFRESH TABLE METADATA <path to table> ;
> >>>>>> then run the same query again on the same table same data (no
> changes
> >>> in
> >>>>>> data)  and could find the statistics as show below :
> >>>>>>
> >>>>>> TOTAL FRAGMENTS: 13
> >>>>>>>> DURATION: 14 min 14.604 sec
> >>>>>>>> PLANNING: 33.087 sec
> >>>>>>>> QUEUED: Not Available
> >>>>>>>> EXECUTION: Not Available
> >>>>>>>
> >>>>>>>
> >>>>>> The query is still running .
> >>>>>>
> >>>>>> Can somebody help me  understand why the query taking so long once I
> >>>>> issue
> >>>>>> the refresh metadata command.
> >>>>>>
> >>>>>> Aprreciate the help !
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Divya
> >>>>>
> >>>>>
> >>>
> >>>
> >>
>
>

Re: Query Optimization

Posted by Padma Penumarthy <pp...@mapr.com>.

It might read all those files if some new data gets added after running
refresh metadata cache. 
If everything is same before and after metadata refresh i.e. no 
new data added and query is exactly the same, then it should not do that.
Also, check if you can partition in  a way that will not create so many files in the
first place.

Thanks,
Padma


> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <di...@gmail.com> wrote:
> 
> Hi,
> Another observation is
> My query had where conditions based on the partition values
> 
> Total number of parquet files in directory  - 102290
>> Before Metadata refresh - Its reading only 4 files
>> After metadata refresh - its reading 102290 files
> 
> 
> This is how the refresh metadata works I mean it scans each and every files
> and get the results ?
> 
> I dont  have access to logs now .
> 
> Thanks,
> Divya
> 
> On 17 August 2017 at 13:48, Divya Gehlot <di...@gmail.com> wrote:
> 
>> Hi,
>> Another observation is
>> My query had where conditions based on the partition values
>> Before Metadata refresh - Its reading only 4 files
>> After metadata refresh - its reading 102290 files
>> 
>> Thanks,
>> Divya
>> 
>> On 17 August 2017 at 13:03, Padma Penumarthy <pp...@mapr.com> wrote:
>> 
>>> Does your query have partition filter ?
>>> Execution time is increased most likely because partition pruning is not
>>> happening.
>>> Did you get a chance to look at the logs ?  That might give some clues.
>>> 
>>> Thanks,
>>> Padma
>>> 
>>> 
>>>> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <di...@gmail.com>
>>> wrote:
>>>> 
>>>> Hi,
>>>> Even I am surprised .
>>>> I am running Drill version 1.10  on MapR enterprise version.
>>>> *Query *- Selecting all the columns on partitioned parquet table
>>>> 
>>>> I observed few things from Query statistics :
>>>> 
>>>> Value
>>>> 
>>>> Before Refresh Metadata
>>>> 
>>>> After Refresh Metadata
>>>> 
>>>> Fragments
>>>> 
>>>> 1
>>>> 
>>>> 13
>>>> 
>>>> DURATION
>>>> 
>>>> 01 min 0.233 sec
>>>> 
>>>> 18 min 0.744 sec
>>>> 
>>>> PLANNING
>>>> 
>>>> 59.818 sec
>>>> 
>>>> 33.087 sec
>>>> 
>>>> QUEUED
>>>> 
>>>> Not Available
>>>> 
>>>> Not Available
>>>> 
>>>> EXECUTION
>>>> 
>>>> 0.415 sec
>>>> 
>>>> 17 min 27.657 sec
>>>> 
>>>> The planning time is being reduced by approx 60% but the execution time
>>>> increased  drastically.
>>>> I would like to understand why the exceution time increases after the
>>>> metadata refresh .
>>>> 
>>>> 
>>>> Appreciate the help.
>>>> 
>>>> Thanks,
>>>> divya
>>>> 
>>>> 
>>>> On 17 August 2017 at 11:54, Padma Penumarthy <pp...@mapr.com>
>>> wrote:
>>>> 
>>>>> Refresh table metadata should  help reduce query planning time.
>>>>> It is odd that it went up after you did refresh table metadata.
>>>>> Did you check the logs to see what is happening ? You might have to
>>>>> turn on some debugs if needed.
>>>>> BTW, what version of Drill are you running ?
>>>>> 
>>>>> Thanks,
>>>>> Padma
>>>>> 
>>>>> 
>>>>>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> I have data in parquet file format .
>>>>>> when I run the query the data and see the execution plan I could see
>>>>>> following
>>>>>> statistics
>>>>>> 
>>>>>>> TOTAL FRAGMENTS: 1
>>>>>>>> DURATION: 01 min 0.233 sec
>>>>>>>> PLANNING: 59.818 sec
>>>>>>>> QUEUED: Not Available
>>>>>>>> EXECUTION: 0.415 sec
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> As its a paquet file format I tried enabling refresh meta data
>>>>>> and run below command
>>>>>> REFRESH TABLE METADATA <path to table> ;
>>>>>> then run the same query again on the same table same data (no changes
>>> in
>>>>>> data)  and could find the statistics as show below :
>>>>>> 
>>>>>> TOTAL FRAGMENTS: 13
>>>>>>>> DURATION: 14 min 14.604 sec
>>>>>>>> PLANNING: 33.087 sec
>>>>>>>> QUEUED: Not Available
>>>>>>>> EXECUTION: Not Available
>>>>>>> 
>>>>>>> 
>>>>>> The query is still running .
>>>>>> 
>>>>>> Can somebody help me  understand why the query taking so long once I
>>>>> issue
>>>>>> the refresh metadata command.
>>>>>> 
>>>>>> Aprreciate the help !
>>>>>> 
>>>>>> Thanks,
>>>>>> Divya
>>>>> 
>>>>> 
>>> 
>>> 
>>

Re: Query Optimization

Posted by Divya Gehlot <di...@gmail.com>.

Hi,
Another observation is
My query had where conditions based on the partition values

Total number of parquet files in directory  - 102290
> Before Metadata refresh - Its reading only 4 files
> After metadata refresh - its reading 102290 files


This is how the refresh metadata works I mean it scans each and every files
and get the results ?

I dont  have access to logs now .

Thanks,
Divya

On 17 August 2017 at 13:48, Divya Gehlot <di...@gmail.com> wrote:

> Hi,
> Another observation is
> My query had where conditions based on the partition values
> Before Metadata refresh - Its reading only 4 files
> After metadata refresh - its reading 102290 files
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:03, Padma Penumarthy <pp...@mapr.com> wrote:
>
>> Does your query have partition filter ?
>> Execution time is increased most likely because partition pruning is not
>> happening.
>> Did you get a chance to look at the logs ?  That might give some clues.
>>
>> Thanks,
>> Padma
>>
>>
>> > On Aug 16, 2017, at 9:32 PM, Divya Gehlot <di...@gmail.com>
>> wrote:
>> >
>> > Hi,
>> > Even I am surprised .
>> > I am running Drill version 1.10  on MapR enterprise version.
>> > *Query *- Selecting all the columns on partitioned parquet table
>> >
>> > I observed few things from Query statistics :
>> >
>> > Value
>> >
>> > Before Refresh Metadata
>> >
>> > After Refresh Metadata
>> >
>> > Fragments
>> >
>> > 1
>> >
>> > 13
>> >
>> > DURATION
>> >
>> > 01 min 0.233 sec
>> >
>> > 18 min 0.744 sec
>> >
>> > PLANNING
>> >
>> > 59.818 sec
>> >
>> > 33.087 sec
>> >
>> > QUEUED
>> >
>> > Not Available
>> >
>> > Not Available
>> >
>> > EXECUTION
>> >
>> > 0.415 sec
>> >
>> > 17 min 27.657 sec
>> >
>> > The planning time is being reduced by approx 60% but the execution time
>> > increased  drastically.
>> > I would like to understand why the exceution time increases after the
>> > metadata refresh .
>> >
>> >
>> > Appreciate the help.
>> >
>> > Thanks,
>> > divya
>> >
>> >
>> > On 17 August 2017 at 11:54, Padma Penumarthy <pp...@mapr.com>
>> wrote:
>> >
>> >> Refresh table metadata should  help reduce query planning time.
>> >> It is odd that it went up after you did refresh table metadata.
>> >> Did you check the logs to see what is happening ? You might have to
>> >> turn on some debugs if needed.
>> >> BTW, what version of Drill are you running ?
>> >>
>> >> Thanks,
>> >> Padma
>> >>
>> >>
>> >>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>> I have data in parquet file format .
>> >>> when I run the query the data and see the execution plan I could see
>> >>> following
>> >>> statistics
>> >>>
>> >>>> TOTAL FRAGMENTS: 1
>> >>>>> DURATION: 01 min 0.233 sec
>> >>>>> PLANNING: 59.818 sec
>> >>>>> QUEUED: Not Available
>> >>>>> EXECUTION: 0.415 sec
>> >>>>
>> >>>>
>> >>>
>> >>> As its a paquet file format I tried enabling refresh meta data
>> >>> and run below command
>> >>> REFRESH TABLE METADATA <path to table> ;
>> >>> then run the same query again on the same table same data (no changes
>> in
>> >>> data)  and could find the statistics as show below :
>> >>>
>> >>> TOTAL FRAGMENTS: 13
>> >>>>> DURATION: 14 min 14.604 sec
>> >>>>> PLANNING: 33.087 sec
>> >>>>> QUEUED: Not Available
>> >>>>> EXECUTION: Not Available
>> >>>>
>> >>>>
>> >>> The query is still running .
>> >>>
>> >>> Can somebody help me  understand why the query taking so long once I
>> >> issue
>> >>> the refresh metadata command.
>> >>>
>> >>> Aprreciate the help !
>> >>>
>> >>> Thanks,
>> >>> Divya
>> >>
>> >>
>>
>>
>

Re: Query Optimization

Posted by Divya Gehlot <di...@gmail.com>.

Hi,
Another observation is
My query had where conditions based on the partition values
Before Metadata refresh - Its reading only 4 files
After metadata refresh - its reading 102290 files

Thanks,
Divya

On 17 August 2017 at 13:03, Padma Penumarthy <pp...@mapr.com> wrote:

> Does your query have partition filter ?
> Execution time is increased most likely because partition pruning is not
> happening.
> Did you get a chance to look at the logs ?  That might give some clues.
>
> Thanks,
> Padma
>
>
> > On Aug 16, 2017, at 9:32 PM, Divya Gehlot <di...@gmail.com>
> wrote:
> >
> > Hi,
> > Even I am surprised .
> > I am running Drill version 1.10  on MapR enterprise version.
> > *Query *- Selecting all the columns on partitioned parquet table
> >
> > I observed few things from Query statistics :
> >
> > Value
> >
> > Before Refresh Metadata
> >
> > After Refresh Metadata
> >
> > Fragments
> >
> > 1
> >
> > 13
> >
> > DURATION
> >
> > 01 min 0.233 sec
> >
> > 18 min 0.744 sec
> >
> > PLANNING
> >
> > 59.818 sec
> >
> > 33.087 sec
> >
> > QUEUED
> >
> > Not Available
> >
> > Not Available
> >
> > EXECUTION
> >
> > 0.415 sec
> >
> > 17 min 27.657 sec
> >
> > The planning time is being reduced by approx 60% but the execution time
> > increased  drastically.
> > I would like to understand why the exceution time increases after the
> > metadata refresh .
> >
> >
> > Appreciate the help.
> >
> > Thanks,
> > divya
> >
> >
> > On 17 August 2017 at 11:54, Padma Penumarthy <pp...@mapr.com>
> wrote:
> >
> >> Refresh table metadata should  help reduce query planning time.
> >> It is odd that it went up after you did refresh table metadata.
> >> Did you check the logs to see what is happening ? You might have to
> >> turn on some debugs if needed.
> >> BTW, what version of Drill are you running ?
> >>
> >> Thanks,
> >> Padma
> >>
> >>
> >>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>> I have data in parquet file format .
> >>> when I run the query the data and see the execution plan I could see
> >>> following
> >>> statistics
> >>>
> >>>> TOTAL FRAGMENTS: 1
> >>>>> DURATION: 01 min 0.233 sec
> >>>>> PLANNING: 59.818 sec
> >>>>> QUEUED: Not Available
> >>>>> EXECUTION: 0.415 sec
> >>>>
> >>>>
> >>>
> >>> As its a paquet file format I tried enabling refresh meta data
> >>> and run below command
> >>> REFRESH TABLE METADATA <path to table> ;
> >>> then run the same query again on the same table same data (no changes
> in
> >>> data)  and could find the statistics as show below :
> >>>
> >>> TOTAL FRAGMENTS: 13
> >>>>> DURATION: 14 min 14.604 sec
> >>>>> PLANNING: 33.087 sec
> >>>>> QUEUED: Not Available
> >>>>> EXECUTION: Not Available
> >>>>
> >>>>
> >>> The query is still running .
> >>>
> >>> Can somebody help me  understand why the query taking so long once I
> >> issue
> >>> the refresh metadata command.
> >>>
> >>> Aprreciate the help !
> >>>
> >>> Thanks,
> >>> Divya
> >>
> >>
>
>

Re: Query Optimization

Posted by Padma Penumarthy <pp...@mapr.com>.

Does your query have partition filter ? 
Execution time is increased most likely because partition pruning is not happening.
Did you get a chance to look at the logs ?  That might give some clues.

Thanks,
Padma


> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <di...@gmail.com> wrote:
> 
> Hi,
> Even I am surprised .
> I am running Drill version 1.10  on MapR enterprise version.
> *Query *- Selecting all the columns on partitioned parquet table
> 
> I observed few things from Query statistics :
> 
> Value
> 
> Before Refresh Metadata
> 
> After Refresh Metadata
> 
> Fragments
> 
> 1
> 
> 13
> 
> DURATION
> 
> 01 min 0.233 sec
> 
> 18 min 0.744 sec
> 
> PLANNING
> 
> 59.818 sec
> 
> 33.087 sec
> 
> QUEUED
> 
> Not Available
> 
> Not Available
> 
> EXECUTION
> 
> 0.415 sec
> 
> 17 min 27.657 sec
> 
> The planning time is being reduced by approx 60% but the execution time
> increased  drastically.
> I would like to understand why the exceution time increases after the
> metadata refresh .
> 
> 
> Appreciate the help.
> 
> Thanks,
> divya
> 
> 
> On 17 August 2017 at 11:54, Padma Penumarthy <pp...@mapr.com> wrote:
> 
>> Refresh table metadata should  help reduce query planning time.
>> It is odd that it went up after you did refresh table metadata.
>> Did you check the logs to see what is happening ? You might have to
>> turn on some debugs if needed.
>> BTW, what version of Drill are you running ?
>> 
>> Thanks,
>> Padma
>> 
>> 
>>> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com>
>> wrote:
>>> 
>>> Hi,
>>> I have data in parquet file format .
>>> when I run the query the data and see the execution plan I could see
>>> following
>>> statistics
>>> 
>>>> TOTAL FRAGMENTS: 1
>>>>> DURATION: 01 min 0.233 sec
>>>>> PLANNING: 59.818 sec
>>>>> QUEUED: Not Available
>>>>> EXECUTION: 0.415 sec
>>>> 
>>>> 
>>> 
>>> As its a paquet file format I tried enabling refresh meta data
>>> and run below command
>>> REFRESH TABLE METADATA <path to table> ;
>>> then run the same query again on the same table same data (no changes in
>>> data)  and could find the statistics as show below :
>>> 
>>> TOTAL FRAGMENTS: 13
>>>>> DURATION: 14 min 14.604 sec
>>>>> PLANNING: 33.087 sec
>>>>> QUEUED: Not Available
>>>>> EXECUTION: Not Available
>>>> 
>>>> 
>>> The query is still running .
>>> 
>>> Can somebody help me  understand why the query taking so long once I
>> issue
>>> the refresh metadata command.
>>> 
>>> Aprreciate the help !
>>> 
>>> Thanks,
>>> Divya
>> 
>>

Re: Query Optimization

Posted by Divya Gehlot <di...@gmail.com>.

Hi,
Even I am surprised .
I am running Drill version 1.10  on MapR enterprise version.
*Query *- Selecting all the columns on partitioned parquet table

I observed few things from Query statistics :

Value

Before Refresh Metadata

After Refresh Metadata

Fragments

1

13

DURATION

01 min 0.233 sec

 18 min 0.744 sec

PLANNING

59.818 sec

33.087 sec

QUEUED

Not Available

Not Available

EXECUTION

0.415 sec

17 min 27.657 sec

The planning time is being reduced by approx 60% but the execution time
increased  drastically.
I would like to understand why the exceution time increases after the
metadata refresh .


Appreciate the help.

Thanks,
divya


On 17 August 2017 at 11:54, Padma Penumarthy <pp...@mapr.com> wrote:

> Refresh table metadata should  help reduce query planning time.
> It is odd that it went up after you did refresh table metadata.
> Did you check the logs to see what is happening ? You might have to
> turn on some debugs if needed.
> BTW, what version of Drill are you running ?
>
> Thanks,
> Padma
>
>
> > On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com>
> wrote:
> >
> > Hi,
> > I have data in parquet file format .
> > when I run the query the data and see the execution plan I could see
> > following
> > statistics
> >
> >> TOTAL FRAGMENTS: 1
> >>> DURATION: 01 min 0.233 sec
> >>> PLANNING: 59.818 sec
> >>> QUEUED: Not Available
> >>> EXECUTION: 0.415 sec
> >>
> >>
> >
> > As its a paquet file format I tried enabling refresh meta data
> > and run below command
> > REFRESH TABLE METADATA <path to table> ;
> > then run the same query again on the same table same data (no changes in
> > data)  and could find the statistics as show below :
> >
> > TOTAL FRAGMENTS: 13
> >>> DURATION: 14 min 14.604 sec
> >>> PLANNING: 33.087 sec
> >>> QUEUED: Not Available
> >>> EXECUTION: Not Available
> >>
> >>
> > The query is still running .
> >
> > Can somebody help me  understand why the query taking so long once I
> issue
> > the refresh metadata command.
> >
> > Aprreciate the help !
> >
> > Thanks,
> > Divya
>
>

Re: Query Optimization

Posted by Padma Penumarthy <pp...@mapr.com>.

Refresh table metadata should  help reduce query planning time.
It is odd that it went up after you did refresh table metadata.
Did you check the logs to see what is happening ? You might have to
turn on some debugs if needed.
BTW, what version of Drill are you running ?

Thanks,
Padma


> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <di...@gmail.com> wrote:
> 
> Hi,
> I have data in parquet file format .
> when I run the query the data and see the execution plan I could see
> following
> statistics
> 
>> TOTAL FRAGMENTS: 1
>>> DURATION: 01 min 0.233 sec
>>> PLANNING: 59.818 sec
>>> QUEUED: Not Available
>>> EXECUTION: 0.415 sec
>> 
>> 
> 
> As its a paquet file format I tried enabling refresh meta data
> and run below command
> REFRESH TABLE METADATA <path to table> ;
> then run the same query again on the same table same data (no changes in
> data)  and could find the statistics as show below :
> 
> TOTAL FRAGMENTS: 13
>>> DURATION: 14 min 14.604 sec
>>> PLANNING: 33.087 sec
>>> QUEUED: Not Available
>>> EXECUTION: Not Available
>> 
>> 
> The query is still running .
> 
> Can somebody help me  understand why the query taking so long once I issue
> the refresh metadata command.
> 
> Aprreciate the help !
> 
> Thanks,
> Divya