Posted to dev@drill.apache.org by Kunal Khatua <ku...@apache.org> on 2019/04/16 23:57:03 UTC

Drill Profile Management

Hi guys

I'm working on a draft PR to improve the management of Drill's query profiles. 
https://github.com/apache/drill/pull/1750

The design partitions existing profiles into sub-directories based on the structure 'yyyy/MM/dd' (customizable).
All new profiles are written directly into the partitioned directories.
For existing profiles in the `profiles` directory, the Drillbit will partition the k most recent profiles (k is configurable) into the sub-directories, but only once, during startup, so that the Drillbit doesn't spend too long starting up.
This substantially improves the response time for profile listing in the WebUI, especially when the number of profiles is in the hundreds of thousands.
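
A minimal sketch of the layout described above, not the code in the PR. It assumes the partition is derived from each file's modification time in UTC; the function names and the `%Y/%m/%d` default (the strftime equivalent of 'yyyy/MM/dd') are hypothetical.

```python
import os
import shutil
from datetime import datetime, timezone

# Hypothetical default; the strftime equivalent of 'yyyy/MM/dd'.
PARTITION_FORMAT = "%Y/%m/%d"

def partition_dir(root, mtime, fmt=PARTITION_FORMAT):
    """Return the sub-directory of root for a profile timestamped mtime (epoch seconds)."""
    day = datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(fmt)
    return os.path.join(root, *day.split("/"))

def partition_most_recent(root, k):
    """Move the k most recently modified files in root into date partitions.

    Only files directly under root are considered; existing sub-directories
    are left alone. Returns the number of files moved.
    """
    with os.scandir(root) as it:
        entries = [e for e in it if e.is_file()]
    # Newest first, so a small k always covers the most recent profiles.
    entries.sort(key=lambda e: e.stat().st_mtime, reverse=True)
    moved = 0
    for entry in entries[:k]:
        target = partition_dir(root, entry.stat().st_mtime)
        os.makedirs(target, exist_ok=True)
        shutil.move(entry.path, os.path.join(target, entry.name))
        moved += 1
    return moved
```

Calling `partition_most_recent(root, k)` once at startup bounds the startup cost by k, whatever the size of the backlog.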

However, I have the challenge of figuring out what to do for users who want to drop a profile into the same directory in order to render it in the WebUI.

I have two options at the moment (and open to others):

1. Create a thread that periodically checks whether there is a profile in the root of the `profiles` directory that needs to be 'indexed' into its correct partition.
2. Avoid the need for a thread by creating an unpartitioned sub-directory within the `profiles` directory that is meant only for hosting profiles for WebUI rendering.
For example, a developer would drop a profile into `profiles/tmp` and view it.
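
The two options can be sketched as follows. This is only an illustration under stated assumptions: `PeriodicIndexer`, `is_render_only`, and the `tmp` constant are hypothetical names, and the indexing work itself is passed in as a callable rather than implemented here.

```python
import os
import threading

# Hypothetical name, taken from the `profiles/tmp` example above.
RENDER_ONLY_DIR = "tmp"

class PeriodicIndexer(threading.Thread):
    """Option 1: call index_once every interval_seconds until stop() is called."""

    def __init__(self, index_once, interval_seconds):
        super().__init__(daemon=True)
        self._index_once = index_once
        self._interval_seconds = interval_seconds
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop_event.wait(self._interval_seconds):
            self._index_once()

    def stop(self):
        self._stop_event.set()

def is_render_only(root, path):
    """Option 2: anything under root/tmp is never moved into a partition."""
    relative = os.path.relpath(path, root)
    return relative.split(os.sep)[0] == RENDER_ONLY_DIR
```

Option 1 costs a long-lived thread but needs no convention from users; option 2 costs nothing at runtime but relies on users putting files in the right place.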

I'm inclined towards option #1 because it guarantees that all profiles will eventually be 'indexed' into their partitions, and that the partitioning doesn't happen only during startup.

With option #2, if, for example, I have 100,000 profiles and my Drillbit is configured to partition only the 1,000 most recent profiles at startup, I'll get all profiles partitioned only after 100 restarts!
However, #2 would ensure that profiles that are there only for rendering remain accessible (for sharing again) and do not get indexed. Plus, there is no need to add another thread to the Drillbit.
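
The restart arithmetic above is a ceiling division. The sketch below assumes the backlog does not grow between restarts, which holds here because new profiles are written straight into partitions; the function name is hypothetical.

```python
def restarts_to_partition_all(backlog, per_startup):
    """Number of startups needed to partition backlog profiles, per_startup at a time."""
    if per_startup <= 0:
        raise ValueError("per_startup must be positive")
    # Ceiling division using integers only.
    return -(-backlog // per_startup)

print(restarts_to_partition_all(100_000, 1_000))  # prints 100
```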

Which one should I go for? Or is there a third alternative?

Thanks in advance!

 ~ Kunal




Re: Drill Profile Management

Posted by Kunal Khatua <ku...@apache.org>.
Very good points, Jinfeng! 

However, short-lived queries can also provide valuable insight, so persisting them makes sense. For cases where the throughput of executed queries is high and we don't want to incur the penalty of writing profiles to disk, there is already an in-memory profile store option [1] that never touches the disk.

For profile auto-clean (rotation doesn't make sense, since we're dealing with multiple files and not one log), I would need a thread to do such janitor-type tasks. Apart from that, I don't think auto-cleanup is a must, because the partitioned approach already reduces the overhead significantly. System admins can then delete profiles themselves based on time.
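
As an illustration of time-based deletion by an admin, here is a sketch that assumes the default 'yyyy/MM/dd' layout under the profiles root. It is not part of Drill, and the function names are hypothetical. Non-numeric directories such as `tmp` are skipped.

```python
import os
import shutil
from datetime import date

def _numeric_subdirs(path):
    """Sorted names of sub-directories of path whose names are all digits."""
    return sorted(
        name for name in os.listdir(path)
        if name.isdigit() and os.path.isdir(os.path.join(path, name))
    )

def expired_partitions(root, cutoff):
    """Return root/yyyy/MM/dd directories dated strictly before cutoff."""
    expired = []
    for year in _numeric_subdirs(root):
        for month in _numeric_subdirs(os.path.join(root, year)):
            for day in _numeric_subdirs(os.path.join(root, year, month)):
                try:
                    partition_date = date(int(year), int(month), int(day))
                except ValueError:
                    continue  # not a valid calendar date; leave it alone
                if partition_date < cutoff:
                    expired.append(os.path.join(root, year, month, day))
    return expired

def delete_expired(root, cutoff):
    """Remove every expired day directory and return how many were removed."""
    paths = expired_partitions(root, cutoff)
    for path in paths:
        shutil.rmtree(path)
    return len(paths)
```

Because each day is one directory, the cleanup only has to list directories instead of inspecting every profile file.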

~ Kunal



[1] https://drill.apache.org/docs/persistent-configuration-storage/#storing-query-profiles-in-memory

On 4/16/2019 11:24:58 PM, Jinfeng Ni <jn...@apache.org> wrote:
Two things that might be worth considering.

1. Add an option to persist the profile to a file only if the query's elapsed
time exceeds a certain threshold (say, 10 seconds). Normally, people
look into profiles just to find performance bottlenecks, so it
probably makes sense to store profiles only for "slow" queries.
2. Profile auto clean/rotation, just like logs, so that the total number of
profiles does not exceed a certain limit.

In a scenario where Drill serves hundreds of queries per second, even with
the proposed partitioning, we will soon see the number of profiles grow
too high and cause various problems in the system.




Re: Drill Profile Management

Posted by Jinfeng Ni <jn...@apache.org>.
Two things that might be worth considering.

1. Add an option to persist the profile to a file only if the query's elapsed
time exceeds a certain threshold (say, 10 seconds). Normally, people
look into profiles just to find performance bottlenecks, so it
probably makes sense to store profiles only for "slow" queries.
2. Profile auto clean/rotation, just like logs, so that the total number of
profiles does not exceed a certain limit.

In a scenario where Drill serves hundreds of queries per second, even with
the proposed partitioning, we will soon see the number of profiles grow
too high and cause various problems in the system.
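
A rough sketch of the two suggestions, to make them concrete. Neither function corresponds to an existing Drill option; the names, the 10-second default, and the `(name, finished_at)` pair representation are assumptions for illustration.

```python
def should_persist(elapsed_seconds, threshold_seconds=10.0):
    """Suggestion 1: persist a profile only if the query ran longer than the threshold."""
    return elapsed_seconds > threshold_seconds

def profiles_to_evict(profiles, max_profiles):
    """Suggestion 2: given (name, finished_at) pairs, return the names of the
    oldest profiles that push the total over max_profiles."""
    excess = len(profiles) - max_profiles
    if excess <= 0:
        return []
    oldest_first = sorted(profiles, key=lambda pair: pair[1])
    return [name for name, _ in oldest_first[:excess]]
```

With a threshold in place, a workload of hundreds of sub-second queries per second would write no profiles at all, and the cap bounds whatever is left.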



On Tue, Apr 16, 2019 at 5:26 PM Aman Sinha <am...@gmail.com> wrote:

> This would be a great improvement (and long overdue). Thanks for working
> on it.
> I would lean towards option #2, and perhaps add a drillbit startup
> option that forces partitioning of all existing profiles
> (the default can be the 1000 profiles that you proposed).
> The option makes the user aware that this could take longer.
> A separate thread is not really needed since, once the initial
> partitioning is done, new profiles are written to the
> sub-directories anyway.
>
> Aman

Re: Drill Profile Management

Posted by Aman Sinha <am...@gmail.com>.
This would be a great improvement (and long overdue). Thanks for working
on it.
I would lean towards option #2, and perhaps add a drillbit startup
option that forces partitioning of all existing profiles
(the default can be the 1000 profiles that you proposed).
The option makes the user aware that this could take longer.
A separate thread is not really needed since, once the initial
partitioning is done, new profiles are written to the
sub-directories anyway.

Aman
