Posted to dev@hive.apache.org by Stamatis Zampetakis <za...@gmail.com> on 2022/01/31 09:50:49 UTC

[DISCUSS] Properties for scheduling compactions on specific queues

Hi all,

This email is an attempt to converge on which Hive/Tez/MR properties
someone should use in order to schedule a compaction on specific queues.
For those who are not familiar with how queues are used, the YARN capacity
scheduler documentation [1] gives the general idea.

Using specific queues for compaction jobs is necessary to be able to
efficiently allocate resources for maintenance tasks (compaction) and
production workloads. Hive provides various ways to control the queues used
by the compactor and there have been various tickets with improvements and
fixes in this area (see list below).

The granularity at which we can select queues for compactions (all tables
vs. per table) currently depends on which compactor is in use (MR vs.
query-based) and boils down to the following properties:

Global configuration:
* hive.compactor.job.queue
* mapred.job.queue.name
* tez.queue.name

Per table/statement configuration (table properties):
* compactor.mapred.job.queue.name (before HIVE-20723)
* compactor.hive.compactor.job.queue (after HIVE-20723)

It is currently unclear which properties someone should use to achieve the
desired result. Some changes, such as HIVE-20723, raise backward
compatibility concerns, and other changes seem to have a larger impact than
the one they were designed for. For example, after HIVE-25595, MapReduce
queue properties can have an impact on the compactor queues even when Tez
is in use.

In order to avoid confusion and ensure long term support of these queue
selection features we should clarify which of the above properties should
be used.

Given the current situation, I would propose to officially support only the
following:
* hive.compactor.job.queue
* compactor.hive.compactor.job.queue
and align the implementation based on these (if necessary). In other words,
Hive users should not set mapred.job.queue.name and tez.queue.name
explicitly, at least when it comes to the compactor. Hive should set them
transparently (as it happens now in various places) based on
[compactor.]hive.compactor.job.queue.
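
To make the proposed behaviour concrete, here is a minimal sketch in Python.
It is not Hive's actual code: the function names are hypothetical, and the
rule that the per-table property takes precedence over the global one is an
assumption made for illustration. It only shows the idea that the user sets
[compactor.]hive.compactor.job.queue and the engine-specific properties are
derived from it.

```python
from typing import Dict, Mapping, Optional

# Property names taken from the proposal above.
GLOBAL_KEY = "hive.compactor.job.queue"
TABLE_KEY = "compactor." + GLOBAL_KEY
# Engine-specific properties that users would no longer set themselves.
ENGINE_KEYS = ("mapred.job.queue.name", "tez.queue.name")


def resolve_compaction_queue(
    hive_conf: Mapping[str, str], table_props: Mapping[str, str]
) -> Optional[str]:
    """Pick the compaction queue (hypothetical helper).

    Assumption: a per-table value overrides the global one. Empty or
    missing values mean "no custom queue".
    """
    queue = table_props.get(TABLE_KEY) or hive_conf.get(GLOBAL_KEY)
    return queue or None


def engine_overrides(
    hive_conf: Mapping[str, str], table_props: Mapping[str, str]
) -> Dict[str, str]:
    """Derive the engine-specific queue properties for a compaction job.

    Any user-supplied mapred/tez queue values are ignored on purpose:
    only the resolved compactor queue is propagated.
    """
    queue = resolve_compaction_queue(hive_conf, table_props)
    if queue is None:
        return {}
    return {key: queue for key in ENGINE_KEYS}
```

With this shape, a value the user puts in tez.queue.name for regular queries
would not leak into compaction jobs, and the same two properties would work
for both the MR and the query-based compactor.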

What do people think? Are there other ideas?

Best,
Stamatis

[1]
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

HIVE-11997: Add ability to send Compaction Jobs to specific queue
HIVE-13354: Add ability to specify Compaction options per table and per
request
HIVE-20723: Allow per table specification of compaction yarn queue
HIVE-24781: Allow to use custom queue for query based compaction
HIVE-25801: Custom queue settings is not honoured by Query based compaction
StatsUpdater
HIVE-25595: Custom queue settings is not honoured by compaction StatsUpdater

Re: [DISCUSS] Properties for scheduling compactions on specific queues

Posted by Alessandro Solimando <al...@gmail.com>.
Hi Stamatis,
the proposal seems reasonable to me.

I think that setting the two properties you mention should lead to the same
result, regardless of the underlying execution engine in use.

In addition, I also agree that we should deprecate the per-execution engine
properties.

Best regards,
Alessandro

On Mon, 31 Jan 2022 at 10:51, Stamatis Zampetakis <za...@gmail.com> wrote:

> Hi all,
>
> This email is an attempt to converge on which Hive/Tez/MR properties
> someone should use in order to schedule a compaction on specific queues.
> For those who are not familiar with how queues are used the YARN capacity
> scheduler documentation [1] gives the general idea.
>
> Using specific queues for compaction jobs is necessary to be able to
> efficiently allocate resources for maintenance tasks (compaction) and
> production workloads. Hive provides various ways to control the queues used
> by the compactor and there have been various tickets with improvements and
> fixes in this area (see list below).
>
> The granularity we can select queues for compactions (all tables vs. per
> table) currently depends on which compactor is in use (MR vs Query based)
> and boils down to the following properties:
>
> Global configuration:
> * hive.compactor.job.queue
> * mapred.job.queue.name
> * tez.queue.name
>
> Per table/statement configuration (table properties):
> * compactor.mapred.job.queue.name (before HIVE-20723)
> * compactor.hive.compactor.job.queue (after HIVE-20723)
>
> Things are a bit blurred with respect to what properties someone should
> use to achieve the desired result. Some changes, such as HIVE-20723, raise
> backward compatibility concerns and other changes seem to have a larger
> impact than the one specifically designed for. For example, after
> HIVE-25595, map reduce queue properties can have an impact on the compactor
> queues even when Tez is in use.
>
> In order to avoid confusion and ensure long term support of these queue
> selection features we should clarify which of the above properties should
> be used.
>
> Given the current situation, I would propose to officially support only
> the following:
> * hive.compactor.job.queue
> * compactor.hive.compactor.job.queue
> and align the implementation based on these (if necessary). In other
> words, Hive users should not use mapred.job.queue.name and tez.queue.name
> explicitly at least when it comes to the compactor. Hive should set them
> transparently (as it happens now in various places) based on
> [compactor.]hive.compactor.job.queue.
>
> What do people think? Are there other ideas?
>
> Best,
> Stamatis
>
> [1]
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>
> HIVE-11997: Add ability to send Compaction Jobs to specific queue
> HIVE-13354: Add ability to specify Compaction options per table and per
> request
> HIVE-20723: Allow per table specification of compaction yarn queue
> HIVE-24781: Allow to use custom queue for query based compaction
> HIVE-25801: Custom queue settings is not honoured by Query based
> compaction StatsUpdater
> HIVE-25595: Custom queue settings is not honoured by compaction
> StatsUpdater
>
