Posted to user@drill.apache.org by Avner Levy <av...@gmail.com> on 2020/06/04 18:43:14 UTC

Planning times

I'm running Apache Drill (1.18 master branch) in a docker with data stored
in Parquet files on S3.
When I run queries, even the simplest ones, such as:

select name from `parquet/data/data.parquet` limit 1

The "Planning" time is 0.7-1.5 sec while the "Execution" is only 0.112 sec.
These proportions are maintained even if I run the same query multiple
times in a row.
Since I'm trying to reduce query times to a minimum, I was wondering whether
such planning times (relative to execution) make sense, and whether there is
any way to reduce them (some plan-caching mechanism, for example).
Thanks,
  Avner
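
For anyone reproducing this, the planning phase can be inspected on its own
with EXPLAIN, which runs the planner without executing the query. A sketch;
the `s3` workspace name and path are assumptions:

```sql
-- EXPLAIN runs only the planner, so its wall time roughly isolates the
-- planning cost of the query. Workspace name and path are hypothetical.
EXPLAIN PLAN FOR
SELECT name FROM s3.`parquet/data/data.parquet` LIMIT 1;
```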

Re: Planning times

Posted by Avner Levy <av...@gmail.com>.
Thanks Rafael!
I've done as you suggested, and planning on local files takes 0.3 seconds,
compared to 1.5 seconds for the same query and files on S3.
I'll try to use some predefined schema and see if it helps to reduce query
planning times.
Thank you all for your help.
Avner


Re: Planning times

Posted by Rafael Jaimes III <ra...@gmail.com>.
Parquet files carry their own embedded schema, but I think that might be
going down the wrong path. Drill's schema discovery should be really quick
unless you have a really weird or inconsistent schema.

It may indeed have something to do with S3 or data/network access
latency. It's very straightforward to deploy a local embedded instance
of Drill by using the embedded drillbit on a Linux machine. You can
also use the Docker image from dockerhub if you are more comfortable
with that. It takes less than five minutes to go from zero to querying a
local parquet file you can use to test. You should see significantly
lower planning times on your local machine. Although the plugin is
called "dfs", it works just fine on a regular local drive and
filesystem.

Let us know if you see an improvement with a local query. If not, it
may indicate something sub-optimal with your Parquet file. If it does,
that would narrow the limitation down to S3.

I also recommend 1.17, unless there is something specific about 1.18
that you are using.
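
The local sanity check described above can be as small as copying the file
down and querying it through the dfs plugin in an embedded drillbit. A
sketch; the local path is hypothetical:

```sql
-- Assumes an embedded drillbit and a local copy of the parquet file;
-- the default dfs plugin points at the local filesystem.
SELECT name FROM dfs.`/tmp/data.parquet` LIMIT 1;
```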


Re: Planning times

Posted by Charles Givre <cg...@gmail.com>.
Hi Avner, 
So there is a way to define a schema both at query time and for a group of files. 

To add a schema at query time you can do the following:

SELECT * 
FROM table(dfs.`hdf5/non-scalar.h5` (type => 'hdf5', defaultPath => '/compound', schema => 'inline=(field_1 int not null, field_2 double not null, field_3 varchar not null, fixed_field int not null default `20`)')) 
WHERE field_1 < 5

I think this will work for parquet files, but I'm not 100% certain.  
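A parquet version of the same idea might look like the following. This is a
sketch only, untested per the caveat above; the column name and type are
assumptions about the example file:

```sql
-- Untested sketch: inline provided schema for a parquet file, mirroring
-- the hdf5 example above. Column list is hypothetical.
SELECT name
FROM table(dfs.`parquet/data.parquet` (type => 'parquet',
    schema => 'inline=(name varchar not null)'))
LIMIT 1;
```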
Secondly, there is also the Drill metastore.  You can read about that here:  http://drill.apache.org/docs/using-drill-metastore/ <http://drill.apache.org/docs/using-drill-metastore/>
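With the metastore enabled, metadata is collected per table with ANALYZE, so
later queries can plan from the metastore instead of re-reading file footers.
A minimal sketch, with a hypothetical table path:

```sql
-- Sketch: enable the metastore for the session and collect metadata once.
SET `metastore.enabled` = true;
ANALYZE TABLE dfs.`parquet/data` REFRESH METADATA;
```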

Lastly, there is a way to provide an actual schema file for your parquet files, but I'll have to dig that up. With all that said, I didn't realize you were doing this on S3. I'd concur with Rafael that you should try your local filesystem first and see what the query times look like. Generally speaking, for a small file like that, querying a parquet file should be nearly instantaneous, with or without the schema or metastore.
Good luck!
-- C 




Re: Planning times

Posted by Avner Levy <av...@gmail.com>.
Thanks Rafael,
I'm doing my first steps with Drill so I'm still trying to figure out all
the moving parts.
I started with the master branch (1.18) Drill docker and just added the S3
access plugin.
Then I just submit the trivial query asking for one field from one file
(a trivial parquet file with a few int/string fields) which has only 13 rows
(written with Parquet.Net) and one row group.
So I think I'm working directly against the dfs, as you suggested.
I've run the query several times.
That is why I'm surprised by the high planning times for such a
trivial scenario.
Does the planning phase include accessing the file and analyzing the
fields? (perhaps accessing S3 is taking the time?)
If Drill had the exact schema of the file (I thought this was the idea of
the metastore), would it reduce the planning phase?
Thanks,
 Avner
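
On the schema-discovery question: independent of the metastore, Drill also
has a parquet metadata cache that persists footer metadata alongside the
files; whether it helps for a single tiny file on S3 would need testing. A
sketch, with a hypothetical workspace and path:

```sql
-- Sketch: generate Drill's parquet metadata cache for a directory so
-- planning can read one cache file instead of each parquet footer.
REFRESH TABLE METADATA s3.`parquet/data`;
```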


Re: Planning times

Posted by Rafael Jaimes III <ra...@gmail.com>.
Hi Avner,

What do you mean by metastore? Are you running it through a Hive metastore
and plugin?

I would try to query against the dfs directly. I'm seeing much shorter
planning times than you with Drill 1.17 and no metastore. I don't usually
query a single file but I imagine that would be even faster.

What program made the parquet file? Do you know what the row group size is
and is it the same as your HDFS block size? They should be for best
performance. Is the schema consistent within the file or do you have nested
fields?


Re: Planning times

Posted by Avner Levy <av...@gmail.com>.
Hi Charles,
I'm using master branch (1.18.0-SNAPSHOT) docker.
I've enabled the metastore at the session level and run the same query
twice, but still got the following times.
Is there a way to pre-define the table's schema in a way that will reduce
the query time?
The query is:
*select name from `parquet/data.parquet` limit 1*

Any idea why planning takes so long on such a trivial query?
Does it include accessing the file for schema discovery?
I'm providing the specific filename in the queries in order to reduce the
file listing part.
Thanks for your help,
  Avner

Duration:
  Planning    Queued      Execution   Total
  0.683 sec   0.000 sec   0.090 sec   0.773 sec

Session options (Options Overview):
  Name                Value
  metastore.enabled   true
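
On pre-defining the table's schema: Drill 1.16+ has a provided-schema
command, though it is documented mainly for text formats, and whether it
shortens parquet planning is an open question. A sketch with a hypothetical
column list and path:

```sql
-- Sketch only: persist a provided schema next to the table.
-- Effect on parquet planning time is an assumption to verify.
CREATE OR REPLACE SCHEMA (name VARCHAR NOT NULL)
FOR TABLE dfs.`parquet/data.parquet`;
```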



Re: Planning times

Posted by Charles Givre <cg...@gmail.com>.
Hi Avner, 
Maybe you said this already but what version of Drill are you using and do you have the metastore enabled?
--C



> On Jun 4, 2020, at 9:02 PM, Avner Levy <av...@gmail.com> wrote:
> 
> Thanks Rafael for your answer.
> As I wrote in the previous email these planning times occur even when
> selecting one fields from one tiny file (60k) that I pass directly by full
> path (select name from `parquet/data/data.parquet` limit 1).
> Any idea what can influence the time in such a trivial scenario?
> In addition, doesn't Drill cache execution plans between similar queries
> executions?
> Best regards,
> Avner
> 
> 
> On Thu, Jun 4, 2020 at 2:55 PM Rafael Jaimes III <ra...@gmail.com>
> wrote:
> 
>> Hi Avner,
>> 
>> One way you might be able to optimize this is by modifying the size
>> and number of the parquet files. How many files do you have and how
>> big are they? Do you know what the row group size is? What is the HDFS
>> block size is on your storage?
>> 
>> There's probably a lot more intricate ways to improve performance with
>> the Drill settings, but I have not modified them.
>> 
>> - Rafael
>> 
>> On Thu, Jun 4, 2020 at 2:43 PM Avner Levy <av...@gmail.com> wrote:
>>> 
>>> I'm running Apache Drill (1.18 master branch) in a docker with data
>> stored
>>> in Parquet files on S3.
>>> When I run queries, even the most simple ones such as:
>>> 
>>> select name from `parquet/data/data.parquet` limit 1
>>> 
>>> The "Planning" time is 0.7-1.5 sec while the "Execution" is only 0.112
>> sec.
>>> These proportions are maintained even if I run the same query multiple
>>> times in a row.
>>> Since I'm trying to minimize query times to a minimum, I was wondering if
>>> such planning times (compared to execution) make sense and is there any
>> way
>>> to reduce it? (some plan caching mechanism)
>>> Thanks,
>>>  Avner
>> 


Re: Planning times

Posted by Avner Levy <av...@gmail.com>.
Thanks Rafael for your answer.
As I wrote in the previous email, these planning times occur even when
selecting one field from a tiny file (60 KB) that I pass directly by full
path (select name from `parquet/data/data.parquet` limit 1).
Any idea what can influence the time in such a trivial scenario?
In addition, doesn't Drill cache execution plans between similar queries
executions?
Best regards,
 Avner


On Thu, Jun 4, 2020 at 2:55 PM Rafael Jaimes III <ra...@gmail.com>
wrote:

> Hi Avner,
>
> One way you might be able to optimize this is by modifying the size
> and number of the parquet files. How many files do you have and how
> big are they? Do you know what the row group size is? What is the HDFS
> block size is on your storage?
>
> There's probably a lot more intricate ways to improve performance with
> the Drill settings, but I have not modified them.
>
> - Rafael
>
> On Thu, Jun 4, 2020 at 2:43 PM Avner Levy <av...@gmail.com> wrote:
> >
> > I'm running Apache Drill (1.18 master branch) in a docker with data
> stored
> > in Parquet files on S3.
> > When I run queries, even the most simple ones such as:
> >
> > select name from `parquet/data/data.parquet` limit 1
> >
> > The "Planning" time is 0.7-1.5 sec while the "Execution" is only 0.112
> sec.
> > These proportions are maintained even if I run the same query multiple
> > times in a row.
> > Since I'm trying to minimize query times to a minimum, I was wondering if
> > such planning times (compared to execution) make sense and is there any
> way
> > to reduce it? (some plan caching mechanism)
> > Thanks,
> >   Avner
>

Re: Planning times

Posted by Rafael Jaimes III <ra...@gmail.com>.
Hi Avner,

One way you might be able to optimize this is by modifying the size
and number of the parquet files. How many files do you have and how
big are they? Do you know what the row group size is? What is the HDFS
block size on your storage?

There are probably more intricate ways to improve performance with
the Drill settings, but I have not modified them.

- Rafael
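Part of the planning time over S3 is likely Drill fetching each file's parquet footer, which holds the schema and row-group metadata, so per-file round trips can matter more than file size. As a stdlib-only sketch of the framing involved (the synthetic blob below stands in for a real file; a real footer is Thrift-encoded and this only exercises the outer layout):

```python
import struct

def parquet_footer_length(buf: bytes) -> int:
    """Return the length of the Thrift-encoded footer in a parquet blob.

    Parquet layout: "PAR1" ... data ... footer ... <4-byte LE length> "PAR1".
    """
    if buf[:4] != b"PAR1" or buf[-4:] != b"PAR1":
        raise ValueError("not a parquet file")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return footer_len

# Synthetic stand-in for a real file: 123 placeholder footer bytes.
fake_footer = b"\x00" * 123
blob = (b"PAR1" + b"data" + fake_footer
        + struct.pack("<I", len(fake_footer)) + b"PAR1")
print(parquet_footer_length(blob))  # → 123
```

The planner must do this tail read (plus the footer fetch itself) for every file it plans against, which is one reason the metadata cache or metastore can help on high-latency storage.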

On Thu, Jun 4, 2020 at 2:43 PM Avner Levy <av...@gmail.com> wrote:
>
> I'm running Apache Drill (1.18 master branch) in a docker with data stored
> in Parquet files on S3.
> When I run queries, even the most simple ones such as:
>
> select name from `parquet/data/data.parquet` limit 1
>
> The "Planning" time is 0.7-1.5 sec while the "Execution" is only 0.112 sec.
> These proportions are maintained even if I run the same query multiple
> times in a row.
> Since I'm trying to minimize query times to a minimum, I was wondering if
> such planning times (compared to execution) make sense and is there any way
> to reduce it? (some plan caching mechanism)
> Thanks,
>   Avner