You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by PROJJWAL SAHA <pr...@gmail.com> on 2017/03/03 11:36:57 UTC

Minimise query plan time for dfs plugin for local file system on tsv file

Hello all,

I am quering select * from dfs.xxx where yyy (filter condition)

I am using dfs storage plugin that comes out of the box from drill on a 1GB
file, local to the drill cluster.
The 1GB file is split into 10 files of 100 MB each.
As expected I see 11 minor and 2 major fagments.
The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.

One observation is that the query plan time is more than 30 seconds. I ran
the explain plan query to validate this.
The query execution time is 2 secs.
total time taken is 32secs

I wanted to understand how can i minimise the query plan time. Suggestions ?
Is the time taken described above expected ?
Attached is result from explain plan query

Regards,
Projjwal

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by PROJJWAL SAHA <pr...@gmail.com>.

The files are copied to the same location in all the nodes of the cluster.
And all the nodes have equal access to the files.
The files are not located to a single shared file system.

On Fri, Mar 3, 2017 at 7:12 PM, John Omernik <jo...@omernik.com> wrote:

> Can you help me understand what "local to the cluster" means in the context
> of a 5 node cluster? In the plan, the files are all file:// Are the files
> replicated to each node? is it a common shared filesystem?  Do all 5 nodes
> have equal access to the 10 files? I wonder if using a local FS in a
> distributed cluster is having some effect on the planning...
>
> On Fri, Mar 3, 2017 at 6:08 AM, PROJJWAL SAHA <pr...@gmail.com> wrote:
>
> > I did not change the default values used by drill.
> > Are you talking of changing planner.memory_limit
> > and planner.memory.max_query_memory_per_node ?
> > If there are any other debug work that I can do, pls suggest
> >
> > Regards
> >
> > On Fri, Mar 3, 2017 at 5:14 PM, Nitin Pawar <ni...@gmail.com>
> > wrote:
> >
> > > how much memory have you set for planner ?
> > >
> > > On Fri, Mar 3, 2017 at 5:06 PM, PROJJWAL SAHA <pr...@gmail.com>
> > wrote:
> > >
> > > > Hello all,
> > > >
> > > > I am quering select * from dfs.xxx where yyy (filter condition)
> > > >
> > > > I am using dfs storage plugin that comes out of the box from drill
> on a
> > > > 1GB file, local to the drill cluster.
> > > > The 1GB file is split into 10 files of 100 MB each.
> > > > As expected I see 11 minor and 2 major fagments.
> > > > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
> > > >
> > > > One observation is that the query plan time is more than 30 seconds.
> I
> > > ran
> > > > the explain plan query to validate this.
> > > > The query execution time is 2 secs.
> > > > total time taken is 32secs
> > > >
> > > > I wanted to understand how can i minimise the query plan time.
> > > Suggestions
> > > > ?
> > > > Is the time taken described above expected ?
> > > > Attached is result from explain plan query
> > > >
> > > > Regards,
> > > > Projjwal
> > > >
> > > >
> > >
> > >
> > > --
> > > Nitin Pawar
> > >
> >
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by John Omernik <jo...@omernik.com>.

Can you help me understand what "local to the cluster" means in the context
of a 5 node cluster? In the plan, the files are all file:// Are the files
replicated to each node? is it a common shared filesystem?  Do all 5 nodes
have equal access to the 10 files? I wonder if using a local FS in a
distributed cluster is having some effect on the planning...

On Fri, Mar 3, 2017 at 6:08 AM, PROJJWAL SAHA <pr...@gmail.com> wrote:

> I did not change the default values used by drill.
> Are you talking of changing planner.memory_limit
> and planner.memory.max_query_memory_per_node ?
> If there are any other debug work that I can do, pls suggest
>
> Regards
>
> On Fri, Mar 3, 2017 at 5:14 PM, Nitin Pawar <ni...@gmail.com>
> wrote:
>
> > how much memory have you set for planner ?
> >
> > On Fri, Mar 3, 2017 at 5:06 PM, PROJJWAL SAHA <pr...@gmail.com>
> wrote:
> >
> > > Hello all,
> > >
> > > I am quering select * from dfs.xxx where yyy (filter condition)
> > >
> > > I am using dfs storage plugin that comes out of the box from drill on a
> > > 1GB file, local to the drill cluster.
> > > The 1GB file is split into 10 files of 100 MB each.
> > > As expected I see 11 minor and 2 major fagments.
> > > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
> > >
> > > One observation is that the query plan time is more than 30 seconds. I
> > ran
> > > the explain plan query to validate this.
> > > The query execution time is 2 secs.
> > > total time taken is 32secs
> > >
> > > I wanted to understand how can i minimise the query plan time.
> > Suggestions
> > > ?
> > > Is the time taken described above expected ?
> > > Attached is result from explain plan query
> > >
> > > Regards,
> > > Projjwal
> > >
> > >
> >
> >
> > --
> > Nitin Pawar
> >
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by PROJJWAL SAHA <pr...@gmail.com>.

I did not change the default values used by drill.
Are you talking of changing planner.memory_limit
and planner.memory.max_query_memory_per_node ?
If there are any other debug work that I can do, pls suggest

Regards

On Fri, Mar 3, 2017 at 5:14 PM, Nitin Pawar <ni...@gmail.com> wrote:

> how much memory have you set for planner ?
>
> On Fri, Mar 3, 2017 at 5:06 PM, PROJJWAL SAHA <pr...@gmail.com> wrote:
>
> > Hello all,
> >
> > I am quering select * from dfs.xxx where yyy (filter condition)
> >
> > I am using dfs storage plugin that comes out of the box from drill on a
> > 1GB file, local to the drill cluster.
> > The 1GB file is split into 10 files of 100 MB each.
> > As expected I see 11 minor and 2 major fagments.
> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
> >
> > One observation is that the query plan time is more than 30 seconds. I
> ran
> > the explain plan query to validate this.
> > The query execution time is 2 secs.
> > total time taken is 32secs
> >
> > I wanted to understand how can i minimise the query plan time.
> Suggestions
> > ?
> > Is the time taken described above expected ?
> > Attached is result from explain plan query
> >
> > Regards,
> > Projjwal
> >
> >
>
>
> --
> Nitin Pawar
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by Nitin Pawar <ni...@gmail.com>.

how much memory have you set for planner ?

On Fri, Mar 3, 2017 at 5:06 PM, PROJJWAL SAHA <pr...@gmail.com> wrote:

> Hello all,
>
> I am quering select * from dfs.xxx where yyy (filter condition)
>
> I am using dfs storage plugin that comes out of the box from drill on a
> 1GB file, local to the drill cluster.
> The 1GB file is split into 10 files of 100 MB each.
> As expected I see 11 minor and 2 major fagments.
> The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
>
> One observation is that the query plan time is more than 30 seconds. I ran
> the explain plan query to validate this.
> The query execution time is 2 secs.
> total time taken is 32secs
>
> I wanted to understand how can i minimise the query plan time. Suggestions
> ?
> Is the time taken described above expected ?
> Attached is result from explain plan query
>
> Regards,
> Projjwal
>
>


-- 
Nitin Pawar

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by Zelaine Fong <zf...@mapr.com>.

I believe Chunhui has a fix for DRILL-5089 but is waiting on the Calcite portions of the fix to be accepted.

-- Zelaine

On 3/9/17, 7:47 AM, "Jinfeng Ni" <jn...@apache.org> wrote:

    DRILL-5089 has hit many drill users for very long time; it was
    reported several times on this drill user list. It would make sense to
    bump it's priority and get it fixed in the next Drill release.
    
    
    On Wed, Mar 8, 2017 at 8:45 AM, Zelaine Fong <zf...@mapr.com> wrote:
    > The slowness you’re seeing when you have additional storage plugins that are not being used by your query is due to DRILL-5089.
    >
    > -- Zelaine
    >
    > On 3/7/17, 11:20 PM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:
    >
    >     Hi Kunal,
    >
    >     Good catch ! Thanks for this pointer.
    >     I enabled logging at org.apache.drill level and I found -
    >
    >     2017-03-08 01:39:01,822 [274058fa-79df-9c74-3219-3fcb83a04605:foreman] INFO
    >      o.a.drill.exec.work.foreman.Foreman - Query text for query id
    >     274058fa-79df-9c74-3219-3fcb83a04605: explain plan for select * from
    >     dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
    >     ORDER_ID='41' and CUSTOMER_ID='568'
    >     *2017-03-08 01:39:01,823 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
    >     DEBUG o.a.d.e.s.h.HBaseStoragePluginConfig - Initializing HBase
    >     StoragePlugin configuration with zookeeper quorum 'localhost', port '2181'.*
    >     *2017-03-08 01:39:16,038 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
    >     DEBUG o.a.drill.exec.store.SchemaFactory - Took 14214 ms to register
    >     schemas.*
    >
    >
    >     i am not sure why the hbase storage plugin comes in play as it is disabled.
    >     i then disabled all the other active plugins that i had and just kept the
    >     dfs plugin.
    >
    >     the planning time is now reduced to 0.9 secs
    >     and the query time for 1GB partitioned tsv data is taking 3.63 secs
    >
    >     is that a reasonable behaviour ?
    >
    >     Regards,
    >     Projjwal
    >
    >     On Wed, Mar 8, 2017 at 12:11 AM, Kunal Khatua <kk...@mapr.com> wrote:
    >
    >     >
    >     > Looking at the 1st two lines of the log shows that the bulk of time was
    >     > lost before the query even went into the real planning stage of the query:
    >     >
    >     >
    >     > 2017-03-07 06:27:28,074 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
    >     > INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
    >     > 274166de-f543-3fa7-ef9e-8e9e87d5d6a0: select columns[0] from
    >     > dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
    >     > columns[0] ='41' and columns[3] ='568'
    >     > 2017-03-07 06:28:00,775 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
    >     > INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses()
    >     > took 0 ms, numFiles: 1
    >     >
    >     >
    >     > More than 30 secs is unaccounted for. Can you turn on the root logger to
    >     > be at the debug level and retry the explain plan?
    >     >
    >     >
    >     > Kunal Khatua
    >     >
    >     >
    >     > ________________________________
    >     > From: rahul challapalli <ch...@gmail.com>
    >     > Sent: Tuesday, March 7, 2017 5:24:43 AM
    >     > To: user
    >     > Subject: Re: Minimise query plan time for dfs plugin for local file system
    >     > on tsv file
    >     >
    >     > I did not get a chance to review the log file.
    >     >
    >     > However the next thing I would try is to make your cluster a single node
    >     > cluster first and then run the same explain plan query separately on each
    >     > individual file.
    >     >
    >     >
    >     >
    >     > On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:
    >     >
    >     > > Hi Rahul,
    >     > >
    >     > > thanks for your suggestions. However, I am still not able to see any
    >     > > reduction in query planning time
    >     > > by explicit column names, removing extract headers and using
    >     > columns[index]
    >     > >
    >     > > As I said, I ran explain plan and its taking 30+ secs for me.
    >     > > My data is 1 GB tsv split into 20 files of 5 MB each.
    >     > > Each 5MB file has close to 50k records
    >     > > Its a 5 node cluster, and width per node is 4
    >     > > Therefore, total number of minor fragments for one major fragment is 20
    >     > > I have copied the source directory in all the drillbit nodes
    >     > >
    >     > > can you tell me a reasonable time estimate which I can expect drill to
    >     > > return result for query for the above described scenario.
    >     > > Query is - select columns[0] from dfs.root.`/scratch/localdisk/
    >     > drill/testdata/Cust_1G_20_tsv`
    >     > > where columns[0] ='41' and columns[3] ='568'
    >     > >
    >     > > attached is the json profile
    >     > > and the drillbit.log
    >     > >
    >     > > I also have the tracing enabled.
    >     > > org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
    >     > > org.apache.drill.exec.work.foreman.Foreman
    >     > > however i see the duration of various steps in the order of ms in the
    >     > logs.
    >     > > i am not sure where planning time of the order of 30 secs is consumed.
    >     > >
    >     > > Please help
    >     > >
    >     > > Regards,
    >     > > Projjwal
    >     > >
    >     > >
    >     > >
    >     > >
    >     > >
    >     > >
    >     > >
    >     > > On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
    >     > > challapallirahul@gmail.com> wrote:
    >     > >
    >     > >> You can try the below things. For each of the below check the planning
    >     > >> time
    >     > >> individually
    >     > >>
    >     > >> 1. Run explain plan for a simple "select * from `
    >     > >> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
    >     > >> 2. Replace the '*' in your query with explicit column names
    >     > >> 3. Remove the extract header from your storage plugin configuration and
    >     > >> from your data files? Rewrite your query to use, columns[0_based_index]
    >     > >> instead of explicit column names
    >     > >>
    >     > >> Also how many columns do you have in your text files and what is the
    >     > size
    >     > >> of each file? Like gautam suggested, it would be good to take a look at
    >     > >> drillbit.log file (from the foreman node where planning occurred) and
    >     > the
    >     > >> query profile as well.
    >     > >>
    >     > >> - Rahul
    >     > >>
    >     > >> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gp...@mapr.com> wrote:
    >     > >>
    >     > >> > Can you please provide the drillbit.log file?
    >     > >> >
    >     > >> >
    >     > >> > Gautam
    >     > >> >
    >     > >> > ________________________________
    >     > >> > From: PROJJWAL SAHA <pr...@gmail.com>
    >     > >> > Sent: Monday, March 6, 2017 1:45:38 AM
    >     > >> > To: user@drill.apache.org
    >     > >> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
    >     > >> > system on tsv file
    >     > >> >
    >     > >> > all, please help me in giving suggestions on what areas i can look
    >     > into
    >     > >> > why the query planning time is taking so long for files which are
    >     > local
    >     > >> to
    >     > >> > the drill machines. I have the same directory structure copied on all
    >     > >> the 5
    >     > >> > nodes of the cluster. I am accessing the source files using out of the
    >     > >> box
    >     > >> > dfs storage plugin.
    >     > >> >
    >     > >> > Query planning time is approx 30 secs
    >     > >> > Query execution time is apprx 1.5 secs
    >     > >> >
    >     > >> > Regards,
    >     > >> > Projjwal
    >     > >> >
    >     > >> > ---------- Forwarded message ----------
    >     > >> > From: PROJJWAL SAHA <pr...@gmail.com>>
    >     > >> > Date: Fri, Mar 3, 2017 at 5:06 PM
    >     > >> > Subject: Minimise query plan time for dfs plugin for local file system
    >     > >> on
    >     > >> > tsv file
    >     > >> > To: user@drill.apache.org<ma...@drill.apache.org>
    >     > >> >
    >     > >> >
    >     > >> > Hello all,
    >     > >> >
    >     > >> > I am quering select * from dfs.xxx where yyy (filter condition)
    >     > >> >
    >     > >> > I am using dfs storage plugin that comes out of the box from drill on
    >     > a
    >     > >> > 1GB file, local to the drill cluster.
    >     > >> > The 1GB file is split into 10 files of 100 MB each.
    >     > >> > As expected I see 11 minor and 2 major fagments.
    >     > >> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
    >     > >> >
    >     > >> > One observation is that the query plan time is more than 30 seconds. I
    >     > >> ran
    >     > >> > the explain plan query to validate this.
    >     > >> > The query execution time is 2 secs.
    >     > >> > total time taken is 32secs
    >     > >> >
    >     > >> > I wanted to understand how can i minimise the query plan time.
    >     > >> Suggestions
    >     > >> > ?
    >     > >> > Is the time taken described above expected ?
    >     > >> > Attached is result from explain plan query
    >     > >> >
    >     > >> > Regards,
    >     > >> > Projjwal
    >     > >> >
    >     > >> >
    >     > >> >
    >     > >>
    >     > >
    >     > >
    >     >
    >
    >

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by Jinfeng Ni <jn...@apache.org>.

DRILL-5089 has hit many drill users for very long time; it was
reported several times on this drill user list. It would make sense to
bump it's priority and get it fixed in the next Drill release.


On Wed, Mar 8, 2017 at 8:45 AM, Zelaine Fong <zf...@mapr.com> wrote:
> The slowness you’re seeing when you have additional storage plugins that are not being used by your query is due to DRILL-5089.
>
> -- Zelaine
>
> On 3/7/17, 11:20 PM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:
>
>     Hi Kunal,
>
>     Good catch ! Thanks for this pointer.
>     I enabled logging at org.apache.drill level and I found -
>
>     2017-03-08 01:39:01,822 [274058fa-79df-9c74-3219-3fcb83a04605:foreman] INFO
>      o.a.drill.exec.work.foreman.Foreman - Query text for query id
>     274058fa-79df-9c74-3219-3fcb83a04605: explain plan for select * from
>     dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
>     ORDER_ID='41' and CUSTOMER_ID='568'
>     *2017-03-08 01:39:01,823 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
>     DEBUG o.a.d.e.s.h.HBaseStoragePluginConfig - Initializing HBase
>     StoragePlugin configuration with zookeeper quorum 'localhost', port '2181'.*
>     *2017-03-08 01:39:16,038 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
>     DEBUG o.a.drill.exec.store.SchemaFactory - Took 14214 ms to register
>     schemas.*
>
>
>     i am not sure why the hbase storage plugin comes in play as it is disabled.
>     i then disabled all the other active plugins that i had and just kept the
>     dfs plugin.
>
>     the planning time is now reduced to 0.9 secs
>     and the query time for 1GB partitioned tsv data is taking 3.63 secs
>
>     is that a reasonable behaviour ?
>
>     Regards,
>     Projjwal
>
>     On Wed, Mar 8, 2017 at 12:11 AM, Kunal Khatua <kk...@mapr.com> wrote:
>
>     >
>     > Looking at the 1st two lines of the log shows that the bulk of time was
>     > lost before the query even went into the real planning stage of the query:
>     >
>     >
>     > 2017-03-07 06:27:28,074 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
>     > INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
>     > 274166de-f543-3fa7-ef9e-8e9e87d5d6a0: select columns[0] from
>     > dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
>     > columns[0] ='41' and columns[3] ='568'
>     > 2017-03-07 06:28:00,775 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
>     > INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses()
>     > took 0 ms, numFiles: 1
>     >
>     >
>     > More than 30 secs is unaccounted for. Can you turn on the root logger to
>     > be at the debug level and retry the explain plan?
>     >
>     >
>     > Kunal Khatua
>     >
>     >
>     > ________________________________
>     > From: rahul challapalli <ch...@gmail.com>
>     > Sent: Tuesday, March 7, 2017 5:24:43 AM
>     > To: user
>     > Subject: Re: Minimise query plan time for dfs plugin for local file system
>     > on tsv file
>     >
>     > I did not get a chance to review the log file.
>     >
>     > However the next thing I would try is to make your cluster a single node
>     > cluster first and then run the same explain plan query separately on each
>     > individual file.
>     >
>     >
>     >
>     > On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:
>     >
>     > > Hi Rahul,
>     > >
>     > > thanks for your suggestions. However, I am still not able to see any
>     > > reduction in query planning time
>     > > by explicit column names, removing extract headers and using
>     > columns[index]
>     > >
>     > > As I said, I ran explain plan and its taking 30+ secs for me.
>     > > My data is 1 GB tsv split into 20 files of 5 MB each.
>     > > Each 5MB file has close to 50k records
>     > > Its a 5 node cluster, and width per node is 4
>     > > Therefore, total number of minor fragments for one major fragment is 20
>     > > I have copied the source directory in all the drillbit nodes
>     > >
>     > > can you tell me a reasonable time estimate which I can expect drill to
>     > > return result for query for the above described scenario.
>     > > Query is - select columns[0] from dfs.root.`/scratch/localdisk/
>     > drill/testdata/Cust_1G_20_tsv`
>     > > where columns[0] ='41' and columns[3] ='568'
>     > >
>     > > attached is the json profile
>     > > and the drillbit.log
>     > >
>     > > I also have the tracing enabled.
>     > > org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
>     > > org.apache.drill.exec.work.foreman.Foreman
>     > > however i see the duration of various steps in the order of ms in the
>     > logs.
>     > > i am not sure where planning time of the order of 30 secs is consumed.
>     > >
>     > > Please help
>     > >
>     > > Regards,
>     > > Projjwal
>     > >
>     > >
>     > >
>     > >
>     > >
>     > >
>     > >
>     > > On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
>     > > challapallirahul@gmail.com> wrote:
>     > >
>     > >> You can try the below things. For each of the below check the planning
>     > >> time
>     > >> individually
>     > >>
>     > >> 1. Run explain plan for a simple "select * from `
>     > >> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
>     > >> 2. Replace the '*' in your query with explicit column names
>     > >> 3. Remove the extract header from your storage plugin configuration and
>     > >> from your data files? Rewrite your query to use, columns[0_based_index]
>     > >> instead of explicit column names
>     > >>
>     > >> Also how many columns do you have in your text files and what is the
>     > size
>     > >> of each file? Like gautam suggested, it would be good to take a look at
>     > >> drillbit.log file (from the foreman node where planning occurred) and
>     > the
>     > >> query profile as well.
>     > >>
>     > >> - Rahul
>     > >>
>     > >> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gp...@mapr.com> wrote:
>     > >>
>     > >> > Can you please provide the drillbit.log file?
>     > >> >
>     > >> >
>     > >> > Gautam
>     > >> >
>     > >> > ________________________________
>     > >> > From: PROJJWAL SAHA <pr...@gmail.com>
>     > >> > Sent: Monday, March 6, 2017 1:45:38 AM
>     > >> > To: user@drill.apache.org
>     > >> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
>     > >> > system on tsv file
>     > >> >
>     > >> > all, please help me in giving suggestions on what areas i can look
>     > into
>     > >> > why the query planning time is taking so long for files which are
>     > local
>     > >> to
>     > >> > the drill machines. I have the same directory structure copied on all
>     > >> the 5
>     > >> > nodes of the cluster. I am accessing the source files using out of the
>     > >> box
>     > >> > dfs storage plugin.
>     > >> >
>     > >> > Query planning time is approx 30 secs
>     > >> > Query execution time is apprx 1.5 secs
>     > >> >
>     > >> > Regards,
>     > >> > Projjwal
>     > >> >
>     > >> > ---------- Forwarded message ----------
>     > >> > From: PROJJWAL SAHA <pr...@gmail.com>>
>     > >> > Date: Fri, Mar 3, 2017 at 5:06 PM
>     > >> > Subject: Minimise query plan time for dfs plugin for local file system
>     > >> on
>     > >> > tsv file
>     > >> > To: user@drill.apache.org<ma...@drill.apache.org>
>     > >> >
>     > >> >
>     > >> > Hello all,
>     > >> >
>     > >> > I am quering select * from dfs.xxx where yyy (filter condition)
>     > >> >
>     > >> > I am using dfs storage plugin that comes out of the box from drill on
>     > a
>     > >> > 1GB file, local to the drill cluster.
>     > >> > The 1GB file is split into 10 files of 100 MB each.
>     > >> > As expected I see 11 minor and 2 major fagments.
>     > >> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
>     > >> >
>     > >> > One observation is that the query plan time is more than 30 seconds. I
>     > >> ran
>     > >> > the explain plan query to validate this.
>     > >> > The query execution time is 2 secs.
>     > >> > total time taken is 32secs
>     > >> >
>     > >> > I wanted to understand how can i minimise the query plan time.
>     > >> Suggestions
>     > >> > ?
>     > >> > Is the time taken described above expected ?
>     > >> > Attached is result from explain plan query
>     > >> >
>     > >> > Regards,
>     > >> > Projjwal
>     > >> >
>     > >> >
>     > >> >
>     > >>
>     > >
>     > >
>     >
>
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by Zelaine Fong <zf...@mapr.com>.

The slowness you’re seeing when you have additional storage plugins that are not being used by your query is due to DRILL-5089.

-- Zelaine

On 3/7/17, 11:20 PM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:

    Hi Kunal,
    
    Good catch ! Thanks for this pointer.
    I enabled logging at org.apache.drill level and I found -
    
    2017-03-08 01:39:01,822 [274058fa-79df-9c74-3219-3fcb83a04605:foreman] INFO
     o.a.drill.exec.work.foreman.Foreman - Query text for query id
    274058fa-79df-9c74-3219-3fcb83a04605: explain plan for select * from
    dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
    ORDER_ID='41' and CUSTOMER_ID='568'
    *2017-03-08 01:39:01,823 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
    DEBUG o.a.d.e.s.h.HBaseStoragePluginConfig - Initializing HBase
    StoragePlugin configuration with zookeeper quorum 'localhost', port '2181'.*
    *2017-03-08 01:39:16,038 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
    DEBUG o.a.drill.exec.store.SchemaFactory - Took 14214 ms to register
    schemas.*
    
    
    i am not sure why the hbase storage plugin comes in play as it is disabled.
    i then disabled all the other active plugins that i had and just kept the
    dfs plugin.
    
    the planning time is now reduced to 0.9 secs
    and the query time for 1GB partitioned tsv data is taking 3.63 secs
    
    is that a reasonable behaviour ?
    
    Regards,
    Projjwal
    
    On Wed, Mar 8, 2017 at 12:11 AM, Kunal Khatua <kk...@mapr.com> wrote:
    
    >
    > Looking at the 1st two lines of the log shows that the bulk of time was
    > lost before the query even went into the real planning stage of the query:
    >
    >
    > 2017-03-07 06:27:28,074 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
    > INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
    > 274166de-f543-3fa7-ef9e-8e9e87d5d6a0: select columns[0] from
    > dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
    > columns[0] ='41' and columns[3] ='568'
    > 2017-03-07 06:28:00,775 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
    > INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses()
    > took 0 ms, numFiles: 1
    >
    >
    > More than 30 secs is unaccounted for. Can you turn on the root logger to
    > be at the debug level and retry the explain plan?
    >
    >
    > Kunal Khatua
    >
    >
    > ________________________________
    > From: rahul challapalli <ch...@gmail.com>
    > Sent: Tuesday, March 7, 2017 5:24:43 AM
    > To: user
    > Subject: Re: Minimise query plan time for dfs plugin for local file system
    > on tsv file
    >
    > I did not get a chance to review the log file.
    >
    > However the next thing I would try is to make your cluster a single node
    > cluster first and then run the same explain plan query separately on each
    > individual file.
    >
    >
    >
    > On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:
    >
    > > Hi Rahul,
    > >
    > > thanks for your suggestions. However, I am still not able to see any
    > > reduction in query planning time
    > > by explicit column names, removing extract headers and using
    > columns[index]
    > >
    > > As I said, I ran explain plan and its taking 30+ secs for me.
    > > My data is 1 GB tsv split into 20 files of 5 MB each.
    > > Each 5MB file has close to 50k records
    > > Its a 5 node cluster, and width per node is 4
    > > Therefore, total number of minor fragments for one major fragment is 20
    > > I have copied the source directory in all the drillbit nodes
    > >
    > > can you tell me a reasonable time estimate which I can expect drill to
    > > return result for query for the above described scenario.
    > > Query is - select columns[0] from dfs.root.`/scratch/localdisk/
    > drill/testdata/Cust_1G_20_tsv`
    > > where columns[0] ='41' and columns[3] ='568'
    > >
    > > attached is the json profile
    > > and the drillbit.log
    > >
    > > I also have the tracing enabled.
    > > org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
    > > org.apache.drill.exec.work.foreman.Foreman
    > > however i see the duration of various steps in the order of ms in the
    > logs.
    > > i am not sure where planning time of the order of 30 secs is consumed.
    > >
    > > Please help
    > >
    > > Regards,
    > > Projjwal
    > >
    > >
    > >
    > >
    > >
    > >
    > >
    > > On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
    > > challapallirahul@gmail.com> wrote:
    > >
    > >> You can try the below things. For each of the below check the planning
    > >> time
    > >> individually
    > >>
    > >> 1. Run explain plan for a simple "select * from `
    > >> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
    > >> 2. Replace the '*' in your query with explicit column names
    > >> 3. Remove the extract header from your storage plugin configuration and
    > >> from your data files? Rewrite your query to use, columns[0_based_index]
    > >> instead of explicit column names
    > >>
    > >> Also how many columns do you have in your text files and what is the
    > size
    > >> of each file? Like gautam suggested, it would be good to take a look at
    > >> drillbit.log file (from the foreman node where planning occurred) and
    > the
    > >> query profile as well.
    > >>
    > >> - Rahul
    > >>
    > >> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gp...@mapr.com> wrote:
    > >>
    > >> > Can you please provide the drillbit.log file?
    > >> >
    > >> >
    > >> > Gautam
    > >> >
    > >> > ________________________________
    > >> > From: PROJJWAL SAHA <pr...@gmail.com>
    > >> > Sent: Monday, March 6, 2017 1:45:38 AM
    > >> > To: user@drill.apache.org
    > >> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
    > >> > system on tsv file
    > >> >
    > >> > all, please help me in giving suggestions on what areas i can look
    > into
    > >> > why the query planning time is taking so long for files which are
    > local
    > >> to
    > >> > the drill machines. I have the same directory structure copied on all
    > >> the 5
    > >> > nodes of the cluster. I am accessing the source files using out of the
    > >> box
    > >> > dfs storage plugin.
    > >> >
    > >> > Query planning time is approx 30 secs
    > >> > Query execution time is apprx 1.5 secs
    > >> >
    > >> > Regards,
    > >> > Projjwal
    > >> >
    > >> > ---------- Forwarded message ----------
    > >> > From: PROJJWAL SAHA <pr...@gmail.com>>
    > >> > Date: Fri, Mar 3, 2017 at 5:06 PM
    > >> > Subject: Minimise query plan time for dfs plugin for local file system
    > >> on
    > >> > tsv file
    > >> > To: user@drill.apache.org<ma...@drill.apache.org>
    > >> >
    > >> >
    > >> > Hello all,
    > >> >
    > >> > I am quering select * from dfs.xxx where yyy (filter condition)
    > >> >
    > >> > I am using dfs storage plugin that comes out of the box from drill on
    > a
    > >> > 1GB file, local to the drill cluster.
    > >> > The 1GB file is split into 10 files of 100 MB each.
    > >> > As expected I see 11 minor and 2 major fagments.
    > >> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
    > >> >
    > >> > One observation is that the query plan time is more than 30 seconds. I
    > >> ran
    > >> > the explain plan query to validate this.
    > >> > The query execution time is 2 secs.
    > >> > total time taken is 32secs
    > >> >
    > >> > I wanted to understand how can i minimise the query plan time.
    > >> Suggestions
    > >> > ?
    > >> > Is the time taken described above expected ?
    > >> > Attached is result from explain plan query
    > >> >
    > >> > Regards,
    > >> > Projjwal
    > >> >
    > >> >
    > >> >
    > >>
    > >
    > >
    >

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by PROJJWAL SAHA <pr...@gmail.com>.

Hi Kunal,

Good catch ! Thanks for this pointer.
I enabled logging at org.apache.drill level and I found -

2017-03-08 01:39:01,822 [274058fa-79df-9c74-3219-3fcb83a04605:foreman] INFO
 o.a.drill.exec.work.foreman.Foreman - Query text for query id
274058fa-79df-9c74-3219-3fcb83a04605: explain plan for select * from
dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
ORDER_ID='41' and CUSTOMER_ID='568'
*2017-03-08 01:39:01,823 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
DEBUG o.a.d.e.s.h.HBaseStoragePluginConfig - Initializing HBase
StoragePlugin configuration with zookeeper quorum 'localhost', port '2181'.*
*2017-03-08 01:39:16,038 [274058fa-79df-9c74-3219-3fcb83a04605:foreman]
DEBUG o.a.drill.exec.store.SchemaFactory - Took 14214 ms to register
schemas.*


i am not sure why the hbase storage plugin comes in play as it is disabled.
i then disabled all the other active plugins that i had and just kept the
dfs plugin.

the planning time is now reduced to 0.9 secs
and the query time for 1GB partitioned tsv data is taking 3.63 secs

is that a reasonable behaviour ?

Regards,
Projjwal

On Wed, Mar 8, 2017 at 12:11 AM, Kunal Khatua <kk...@mapr.com> wrote:

>
> Looking at the 1st two lines of the log shows that the bulk of time was
> lost before the query even went into the real planning stage of the query:
>
>
> 2017-03-07 06:27:28,074 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
> INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> 274166de-f543-3fa7-ef9e-8e9e87d5d6a0: select columns[0] from
> dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
> columns[0] ='41' and columns[3] ='568'
> 2017-03-07 06:28:00,775 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman]
> INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses()
> took 0 ms, numFiles: 1
>
>
> More than 30 secs is unaccounted for. Can you turn on the root logger to
> be at the debug level and retry the explain plan?
>
>
> Kunal Khatua
>
>
> ________________________________
> From: rahul challapalli <ch...@gmail.com>
> Sent: Tuesday, March 7, 2017 5:24:43 AM
> To: user
> Subject: Re: Minimise query plan time for dfs plugin for local file system
> on tsv file
>
> I did not get a chance to review the log file.
>
> However the next thing I would try is to make your cluster a single node
> cluster first and then run the same explain plan query separately on each
> individual file.
>
>
>
> On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:
>
> > Hi Rahul,
> >
> > thanks for your suggestions. However, I am still not able to see any
> > reduction in query planning time
> > by explicit column names, removing extract headers and using
> columns[index]
> >
> > As I said, I ran explain plan and its taking 30+ secs for me.
> > My data is 1 GB tsv split into 20 files of 5 MB each.
> > Each 5MB file has close to 50k records
> > Its a 5 node cluster, and width per node is 4
> > Therefore, total number of minor fragments for one major fragment is 20
> > I have copied the source directory in all the drillbit nodes
> >
> > can you tell me a reasonable time estimate which I can expect drill to
> > return result for query for the above described scenario.
> > Query is - select columns[0] from dfs.root.`/scratch/localdisk/
> drill/testdata/Cust_1G_20_tsv`
> > where columns[0] ='41' and columns[3] ='568'
> >
> > attached is the json profile
> > and the drillbit.log
> >
> > I also have the tracing enabled.
> > org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
> > org.apache.drill.exec.work.foreman.Foreman
> > however i see the duration of various steps in the order of ms in the
> logs.
> > i am not sure where planning time of the order of 30 secs is consumed.
> >
> > Please help
> >
> > Regards,
> > Projjwal
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> >> You can try the below things. For each of the below check the planning
> >> time
> >> individually
> >>
> >> 1. Run explain plan for a simple "select * from `
> >> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
> >> 2. Replace the '*' in your query with explicit column names
> >> 3. Remove the extract header from your storage plugin configuration and
> >> from your data files? Rewrite your query to use, columns[0_based_index]
> >> instead of explicit column names
> >>
> >> Also how many columns do you have in your text files and what is the
> size
> >> of each file? Like gautam suggested, it would be good to take a look at
> >> drillbit.log file (from the foreman node where planning occurred) and
> the
> >> query profile as well.
> >>
> >> - Rahul
> >>
> >> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gp...@mapr.com> wrote:
> >>
> >> > Can you please provide the drillbit.log file?
> >> >
> >> >
> >> > Gautam
> >> >
> >> > ________________________________
> >> > From: PROJJWAL SAHA <pr...@gmail.com>
> >> > Sent: Monday, March 6, 2017 1:45:38 AM
> >> > To: user@drill.apache.org
> >> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
> >> > system on tsv file
> >> >
> >> > all, please help me in giving suggestions on what areas i can look
> into
> >> > why the query planning time is taking so long for files which are
> local
> >> to
> >> > the drill machines. I have the same directory structure copied on all
> >> the 5
> >> > nodes of the cluster. I am accessing the source files using out of the
> >> box
> >> > dfs storage plugin.
> >> >
> >> > Query planning time is approx 30 secs
> >> > Query execution time is apprx 1.5 secs
> >> >
> >> > Regards,
> >> > Projjwal
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: PROJJWAL SAHA <pr...@gmail.com>>
> >> > Date: Fri, Mar 3, 2017 at 5:06 PM
> >> > Subject: Minimise query plan time for dfs plugin for local file system
> >> on
> >> > tsv file
> >> > To: user@drill.apache.org<ma...@drill.apache.org>
> >> >
> >> >
> >> > Hello all,
> >> >
> >> > I am quering select * from dfs.xxx where yyy (filter condition)
> >> >
> >> > I am using dfs storage plugin that comes out of the box from drill on
> a
> >> > 1GB file, local to the drill cluster.
> >> > The 1GB file is split into 10 files of 100 MB each.
> >> > As expected I see 11 minor and 2 major fagments.
> >> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
> >> >
> >> > One observation is that the query plan time is more than 30 seconds. I
> >> ran
> >> > the explain plan query to validate this.
> >> > The query execution time is 2 secs.
> >> > total time taken is 32secs
> >> >
> >> > I wanted to understand how can i minimise the query plan time.
> >> Suggestions
> >> > ?
> >> > Is the time taken described above expected ?
> >> > Attached is result from explain plan query
> >> >
> >> > Regards,
> >> > Projjwal
> >> >
> >> >
> >> >
> >>
> >
> >
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by Kunal Khatua <kk...@mapr.com>.

Looking at the 1st two lines of the log shows that the bulk of time was lost before the query even went into the real planning stage of the query:


2017-03-07 06:27:28,074 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman] INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id 274166de-f543-3fa7-ef9e-8e9e87d5d6a0: select columns[0] from dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where columns[0] ='41' and columns[3] ='568'
2017-03-07 06:28:00,775 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman] INFO  o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, numFiles: 1


More than 30 secs is unaccounted for. Can you turn on the root logger to be at the debug level and retry the explain plan?


Kunal Khatua


________________________________
From: rahul challapalli <ch...@gmail.com>
Sent: Tuesday, March 7, 2017 5:24:43 AM
To: user
Subject: Re: Minimise query plan time for dfs plugin for local file system on tsv file

I did not get a chance to review the log file.

However the next thing I would try is to make your cluster a single node
cluster first and then run the same explain plan query separately on each
individual file.



On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:

> Hi Rahul,
>
> thanks for your suggestions. However, I am still not able to see any
> reduction in query planning time
> by explicit column names, removing extract headers and using columns[index]
>
> As I said, I ran explain plan and its taking 30+ secs for me.
> My data is 1 GB tsv split into 20 files of 5 MB each.
> Each 5MB file has close to 50k records
> Its a 5 node cluster, and width per node is 4
> Therefore, total number of minor fragments for one major fragment is 20
> I have copied the source directory in all the drillbit nodes
>
> can you tell me a reasonable time estimate which I can expect drill to
> return result for query for the above described scenario.
> Query is - select columns[0] from dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv`
> where columns[0] ='41' and columns[3] ='568'
>
> attached is the json profile
> and the drillbit.log
>
> I also have the tracing enabled.
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
> org.apache.drill.exec.work.foreman.Foreman
> however i see the duration of various steps in the order of ms in the logs.
> i am not sure where planning time of the order of 30 secs is consumed.
>
> Please help
>
> Regards,
> Projjwal
>
>
>
>
>
>
>
> On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
>> You can try the below things. For each of the below check the planning
>> time
>> individually
>>
>> 1. Run explain plan for a simple "select * from `
>> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
>> 2. Replace the '*' in your query with explicit column names
>> 3. Remove the extract header from your storage plugin configuration and
>> from your data files? Rewrite your query to use, columns[0_based_index]
>> instead of explicit column names
>>
>> Also how many columns do you have in your text files and what is the size
>> of each file? Like gautam suggested, it would be good to take a look at
>> drillbit.log file (from the foreman node where planning occurred) and the
>> query profile as well.
>>
>> - Rahul
>>
>> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gp...@mapr.com> wrote:
>>
>> > Can you please provide the drillbit.log file?
>> >
>> >
>> > Gautam
>> >
>> > ________________________________
>> > From: PROJJWAL SAHA <pr...@gmail.com>
>> > Sent: Monday, March 6, 2017 1:45:38 AM
>> > To: user@drill.apache.org
>> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
>> > system on tsv file
>> >
>> > all, please help me in giving suggestions on what areas i can look into
>> > why the query planning time is taking so long for files which are local
>> to
>> > the drill machines. I have the same directory structure copied on all
>> the 5
>> > nodes of the cluster. I am accessing the source files using out of the
>> box
>> > dfs storage plugin.
>> >
>> > Query planning time is approx 30 secs
>> > Query execution time is apprx 1.5 secs
>> >
>> > Regards,
>> > Projjwal
>> >
>> > ---------- Forwarded message ----------
>> > From: PROJJWAL SAHA <pr...@gmail.com>>
>> > Date: Fri, Mar 3, 2017 at 5:06 PM
>> > Subject: Minimise query plan time for dfs plugin for local file system
>> on
>> > tsv file
>> > To: user@drill.apache.org<ma...@drill.apache.org>
>> >
>> >
>> > Hello all,
>> >
>> > I am quering select * from dfs.xxx where yyy (filter condition)
>> >
>> > I am using dfs storage plugin that comes out of the box from drill on a
>> > 1GB file, local to the drill cluster.
>> > The 1GB file is split into 10 files of 100 MB each.
>> > As expected I see 11 minor and 2 major fagments.
>> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
>> >
>> > One observation is that the query plan time is more than 30 seconds. I
>> ran
>> > the explain plan query to validate this.
>> > The query execution time is 2 secs.
>> > total time taken is 32secs
>> >
>> > I wanted to understand how can i minimise the query plan time.
>> Suggestions
>> > ?
>> > Is the time taken described above expected ?
>> > Attached is result from explain plan query
>> >
>> > Regards,
>> > Projjwal
>> >
>> >
>> >
>>
>
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by rahul challapalli <ch...@gmail.com>.

I did not get a chance to review the log file.

However the next thing I would try is to make your cluster a single node
cluster first and then run the same explain plan query separately on each
individual file.



On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA" <pr...@gmail.com> wrote:

> Hi Rahul,
>
> thanks for your suggestions. However, I am still not able to see any
> reduction in query planning time
> by explicit column names, removing extract headers and using columns[index]
>
> As I said, I ran explain plan and its taking 30+ secs for me.
> My data is 1 GB tsv split into 20 files of 5 MB each.
> Each 5MB file has close to 50k records
> Its a 5 node cluster, and width per node is 4
> Therefore, total number of minor fragments for one major fragment is 20
> I have copied the source directory in all the drillbit nodes
>
> can you tell me a reasonable time estimate which I can expect drill to
> return result for query for the above described scenario.
> Query is - select columns[0] from dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv`
> where columns[0] ='41' and columns[3] ='568'
>
> attached is the json profile
> and the drillbit.log
>
> I also have the tracing enabled.
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
> org.apache.drill.exec.work.foreman.Foreman
> however i see the duration of various steps in the order of ms in the logs.
> i am not sure where planning time of the order of 30 secs is consumed.
>
> Please help
>
> Regards,
> Projjwal
>
>
>
>
>
>
>
> On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
>> You can try the below things. For each of the below check the planning
>> time
>> individually
>>
>> 1. Run explain plan for a simple "select * from `
>> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
>> 2. Replace the '*' in your query with explicit column names
>> 3. Remove the extract header from your storage plugin configuration and
>> from your data files? Rewrite your query to use, columns[0_based_index]
>> instead of explicit column names
>>
>> Also how many columns do you have in your text files and what is the size
>> of each file? Like gautam suggested, it would be good to take a look at
>> drillbit.log file (from the foreman node where planning occurred) and the
>> query profile as well.
>>
>> - Rahul
>>
>> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gp...@mapr.com> wrote:
>>
>> > Can you please provide the drillbit.log file?
>> >
>> >
>> > Gautam
>> >
>> > ________________________________
>> > From: PROJJWAL SAHA <pr...@gmail.com>
>> > Sent: Monday, March 6, 2017 1:45:38 AM
>> > To: user@drill.apache.org
>> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
>> > system on tsv file
>> >
>> > all, please help me in giving suggestions on what areas i can look into
>> > why the query planning time is taking so long for files which are local
>> to
>> > the drill machines. I have the same directory structure copied on all
>> the 5
>> > nodes of the cluster. I am accessing the source files using out of the
>> box
>> > dfs storage plugin.
>> >
>> > Query planning time is approx 30 secs
>> > Query execution time is apprx 1.5 secs
>> >
>> > Regards,
>> > Projjwal
>> >
>> > ---------- Forwarded message ----------
>> > From: PROJJWAL SAHA <pr...@gmail.com>>
>> > Date: Fri, Mar 3, 2017 at 5:06 PM
>> > Subject: Minimise query plan time for dfs plugin for local file system
>> on
>> > tsv file
>> > To: user@drill.apache.org<ma...@drill.apache.org>
>> >
>> >
>> > Hello all,
>> >
>> > I am quering select * from dfs.xxx where yyy (filter condition)
>> >
>> > I am using dfs storage plugin that comes out of the box from drill on a
>> > 1GB file, local to the drill cluster.
>> > The 1GB file is split into 10 files of 100 MB each.
>> > As expected I see 11 minor and 2 major fagments.
>> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
>> >
>> > One observation is that the query plan time is more than 30 seconds. I
>> ran
>> > the explain plan query to validate this.
>> > The query execution time is 2 secs.
>> > total time taken is 32secs
>> >
>> > I wanted to understand how can i minimise the query plan time.
>> Suggestions
>> > ?
>> > Is the time taken described above expected ?
>> > Attached is result from explain plan query
>> >
>> > Regards,
>> > Projjwal
>> >
>> >
>> >
>>
>
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by PROJJWAL SAHA <pr...@gmail.com>.

Hi Rahul,

thanks for your suggestions. However, I am still not able to see any
reduction in query planning time
by explicit column names, removing extract headers and using columns[index]

As I said, I ran explain plan and its taking 30+ secs for me.
My data is 1 GB tsv split into 20 files of 5 MB each.
Each 5MB file has close to 50k records
Its a 5 node cluster, and width per node is 4
Therefore, total number of minor fragments for one major fragment is 20
I have copied the source directory in all the drillbit nodes

can you tell me a reasonable time estimate which I can expect drill to
return result for query for the above described scenario.
Query is - select columns[0] from
dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where
columns[0] ='41' and columns[3] ='568'

attached is the json profile
and the drillbit.log

I also have the tracing enabled.
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
org.apache.drill.exec.work.foreman.Foreman
however i see the duration of various steps in the order of ms in the logs.
i am not sure where planning time of the order of 30 secs is consumed.

Please help

Regards,
Projjwal







On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> You can try the below things. For each of the below check the planning time
> individually
>
> 1. Run explain plan for a simple "select * from `
> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
> 2. Replace the '*' in your query with explicit column names
> 3. Remove the extract header from your storage plugin configuration and
> from your data files? Rewrite your query to use, columns[0_based_index]
> instead of explicit column names
>
> Also how many columns do you have in your text files and what is the size
> of each file? Like gautam suggested, it would be good to take a look at
> drillbit.log file (from the foreman node where planning occurred) and the
> query profile as well.
>
> - Rahul
>
> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gp...@mapr.com> wrote:
>
> > Can you please provide the drillbit.log file?
> >
> >
> > Gautam
> >
> > ________________________________
> > From: PROJJWAL SAHA <pr...@gmail.com>
> > Sent: Monday, March 6, 2017 1:45:38 AM
> > To: user@drill.apache.org
> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
> > system on tsv file
> >
> > all, please help me in giving suggestions on what areas i can look into
> > why the query planning time is taking so long for files which are local
> to
> > the drill machines. I have the same directory structure copied on all
> the 5
> > nodes of the cluster. I am accessing the source files using out of the
> box
> > dfs storage plugin.
> >
> > Query planning time is approx 30 secs
> > Query execution time is apprx 1.5 secs
> >
> > Regards,
> > Projjwal
> >
> > ---------- Forwarded message ----------
> > From: PROJJWAL SAHA <pr...@gmail.com>>
> > Date: Fri, Mar 3, 2017 at 5:06 PM
> > Subject: Minimise query plan time for dfs plugin for local file system on
> > tsv file
> > To: user@drill.apache.org<ma...@drill.apache.org>
> >
> >
> > Hello all,
> >
> > I am quering select * from dfs.xxx where yyy (filter condition)
> >
> > I am using dfs storage plugin that comes out of the box from drill on a
> > 1GB file, local to the drill cluster.
> > The 1GB file is split into 10 files of 100 MB each.
> > As expected I see 11 minor and 2 major fagments.
> > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
> >
> > One observation is that the query plan time is more than 30 seconds. I
> ran
> > the explain plan query to validate this.
> > The query execution time is 2 secs.
> > total time taken is 32secs
> >
> > I wanted to understand how can i minimise the query plan time.
> Suggestions
> > ?
> > Is the time taken described above expected ?
> > Attached is result from explain plan query
> >
> > Regards,
> > Projjwal
> >
> >
> >
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by rahul challapalli <ch...@gmail.com>.

You can try the below things. For each of the below check the planning time
individually

1. Run explain plan for a simple "select * from `
/scratch/localdisk/drill/testdata/Cust_1G_tsv`"
2. Replace the '*' in your query with explicit column names
3. Remove the extract header from your storage plugin configuration and
from your data files? Rewrite your query to use, columns[0_based_index]
instead of explicit column names

Also how many columns do you have in your text files and what is the size
of each file? Like gautam suggested, it would be good to take a look at
drillbit.log file (from the foreman node where planning occurred) and the
query profile as well.

- Rahul

On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gp...@mapr.com> wrote:

> Can you please provide the drillbit.log file?
>
>
> Gautam
>
> ________________________________
> From: PROJJWAL SAHA <pr...@gmail.com>
> Sent: Monday, March 6, 2017 1:45:38 AM
> To: user@drill.apache.org
> Subject: Fwd: Minimise query plan time for dfs plugin for local file
> system on tsv file
>
> all, please help me in giving suggestions on what areas i can look into
> why the query planning time is taking so long for files which are local to
> the drill machines. I have the same directory structure copied on all the 5
> nodes of the cluster. I am accessing the source files using out of the box
> dfs storage plugin.
>
> Query planning time is approx 30 secs
> Query execution time is apprx 1.5 secs
>
> Regards,
> Projjwal
>
> ---------- Forwarded message ----------
> From: PROJJWAL SAHA <pr...@gmail.com>>
> Date: Fri, Mar 3, 2017 at 5:06 PM
> Subject: Minimise query plan time for dfs plugin for local file system on
> tsv file
> To: user@drill.apache.org<ma...@drill.apache.org>
>
>
> Hello all,
>
> I am quering select * from dfs.xxx where yyy (filter condition)
>
> I am using dfs storage plugin that comes out of the box from drill on a
> 1GB file, local to the drill cluster.
> The 1GB file is split into 10 files of 100 MB each.
> As expected I see 11 minor and 2 major fagments.
> The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
>
> One observation is that the query plan time is more than 30 seconds. I ran
> the explain plan query to validate this.
> The query execution time is 2 secs.
> total time taken is 32secs
>
> I wanted to understand how can i minimise the query plan time. Suggestions
> ?
> Is the time taken described above expected ?
> Attached is result from explain plan query
>
> Regards,
> Projjwal
>
>
>

Re: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by Gautam Parai <gp...@mapr.com>.

Can you please provide the drillbit.log file?

Gautam

________________________________
From: PROJJWAL SAHA <pr...@gmail.com>
Sent: Monday, March 6, 2017 1:45:38 AM
To: user@drill.apache.org
Subject: Fwd: Minimise query plan time for dfs plugin for local file system on tsv file

all, please help me in giving suggestions on what areas i can look into why the query planning time is taking so long for files which are local to the drill machines. I have the same directory structure copied on all the 5 nodes of the cluster. I am accessing the source files using out of the box dfs storage plugin.

Query planning time is approx 30 secs
Query execution time is apprx 1.5 secs

Regards,
Projjwal

---------- Forwarded message ----------
From: PROJJWAL SAHA <pr...@gmail.com>>
Date: Fri, Mar 3, 2017 at 5:06 PM
Subject: Minimise query plan time for dfs plugin for local file system on tsv file
To: user@drill.apache.org<ma...@drill.apache.org>

Hello all,

I am quering select * from dfs.xxx where yyy (filter condition)

I am using dfs storage plugin that comes out of the box from drill on a 1GB file, local to the drill cluster.
The 1GB file is split into 10 files of 100 MB each.
As expected I see 11 minor and 2 major fagments.
The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.

One observation is that the query plan time is more than 30 seconds. I ran the explain plan query to validate this.
The query execution time is 2 secs.
total time taken is 32secs

I wanted to understand how can i minimise the query plan time. Suggestions ?
Is the time taken described above expected ?
Attached is result from explain plan query

Regards,
Projjwal

Fwd: Minimise query plan time for dfs plugin for local file system on tsv file

Posted by PROJJWAL SAHA <pr...@gmail.com>.

all, please help me in giving suggestions on what areas i can look into why
the query planning time is taking so long for files which are local to the
drill machines. I have the same directory structure copied on all the 5
nodes of the cluster. I am accessing the source files using out of the box
dfs storage plugin.

Query planning time is approx 30 secs
Query execution time is apprx 1.5 secs

Regards,
Projjwal

---------- Forwarded message ----------
From: PROJJWAL SAHA <pr...@gmail.com>
Date: Fri, Mar 3, 2017 at 5:06 PM
Subject: Minimise query plan time for dfs plugin for local file system on
tsv file
To: user@drill.apache.org


Hello all,

I am quering select * from dfs.xxx where yyy (filter condition)

I am using dfs storage plugin that comes out of the box from drill on a 1GB
file, local to the drill cluster.
The 1GB file is split into 10 files of 100 MB each.
As expected I see 11 minor and 2 major fagments.
The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.

One observation is that the query plan time is more than 30 seconds. I ran
the explain plan query to validate this.
The query execution time is 2 secs.
total time taken is 32secs

I wanted to understand how can i minimise the query plan time. Suggestions ?
Is the time taken described above expected ?
Attached is result from explain plan query

Regards,
Projjwal