Posted to dev@hive.apache.org by Stamatis Zampetakis <za...@gmail.com> on 2022/01/31 22:02:37 UTC

[DISCUSS] Compactor (Query vs MR) roadmap

Hi all,

In the current master, there are two approaches for performing compactions
of ACID tables [1]:
* using hard-coded MapReduce jobs (aka. CompactorMR [2]);
* using HiveQL queries (aka. QueryCompactor [3]) and delegating the
execution to the underlying engine (MR, Tez, other);

The motivation for introducing the query compactor was to make compaction
tasks engine independent, and potentially more efficient. In principle the
query based compaction should be able to completely replace the respective
MR jobs but it appears that it is not there yet.

At the moment of writing this email the two compactor modes are
complementary to each other. Compactions on insert-only tables (aka.
micromanaged tables) can only be done using the query compactor.
Moreover, query-based compactions on ACID tables work only when the
underlying engine is Tez (various bugs [4] seem to be blocking the use of
MR as an execution engine). The latter means that if someone is using MR as
the execution engine they cannot use the query based compactor. Certain
features (e.g., per-table selection of compaction queues [5]) exist for one
mode (and apparently are important for end users) but are not yet
implemented for the other.
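
The availability rules in the paragraph above can be sketched as a small
lookup. This is only an illustration of the summary in this email, not Hive
code: the function name and the string labels are hypothetical, and the email
does not say whether the engine restriction also applies to insert-only
tables, so the sketch ignores the engine in that case.

```python
from typing import Set


def usable_compactors(table_type: str, engine: str) -> Set[str]:
    """Return the compactor modes usable for a table, per the summary above.

    table_type: "insert_only" (micromanaged) or "full_acid"  (hypothetical labels)
    engine:     "tez" or "mr", the query execution engine    (hypothetical labels)
    """
    if table_type == "insert_only":
        # Insert-only tables can only be compacted by the query compactor.
        # Assumption: the engine is ignored here because the email does not
        # state an engine restriction for this table type.
        return {"query"}
    if table_type == "full_acid":
        # The hard-coded MapReduce compactor handles full ACID tables.
        modes = {"mr"}
        # Query-based compaction of ACID tables works only on Tez.
        if engine == "tez":
            modes.add("query")
        return modes
    raise ValueError("unknown table type: " + table_type)


if __name__ == "__main__":
    print(sorted(usable_compactors("full_acid", "tez")))
    print(sorted(usable_compactors("full_acid", "mr")))
```

A user running MR as the execution engine on a full ACID table is left with
the MR compactor only, which is the gap the questions below are about.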

Currently the query based compactor is not part of any Apache Hive release
but it would be nice if someone could shed some light on the roadmap around
this feature. I tried to summarize very briefly the state of this work
based on my understanding but I am sure people who have worked on these
areas of the code can provide much better insights. Some quick questions
that come to mind are the following:
* Is there going to be support for the MR based compactor in the next releases
of Hive?
* Is the query based compactor going to work with an engine other than Tez? Is
someone working on this?
* Are there benefits in using the MR based compactor when the query based
compactor is available?
* Are there major features that are not yet part of the query based compactor
(and need to be)?

Finally, I don't see any documentation around the "new" query based
compaction mode in the wiki [6]. I think it would be good if someone can
update the respective part of the documentation before releasing the next
Hive version.

Best,
Stamatis

[1] HIVE-5317: Implement insert, update, and delete in Hive with full ACID
support
[2] HIVE-6319: Add compactor for ACID tables (Apr, 2014)
[3] HIVE-20699: Query based compactor for full CRUD Acid tables (Feb, 2019)
[4] HIVE-24015: Disable query-based compaction on MR execution engine
(Karen Coppage, reviewed by Laszlo Pinter)
[5] HIVE-20723: Allow per table specification of compaction yarn queue
[6] https://cwiki.apache.org/confluence/display/hive/hive+transactions#HiveTransactions-Compactor

Re: [DISCUSS] Compactor (Query vs MR) roadmap

Posted by Stamatis Zampetakis <za...@gmail.com>.
Hi Karen,

Many thanks for joining the discussion.

The fact that there are two components with quite a bit of overlap in their
behavior is not something that can be easily maintained in the long term.
Additionally, I have the impression that some commercial offerings of Hive
are using the QB compactor by default. This, along with all the other things
you mentioned (MR deprecation, missing support for MM tables, etc.), suggests
that the MR compactor is reaching EOL.

I haven't worked enough on this part of the code to have a strong opinion
on the way to move forward but if we are moving towards deprecating the MR
compactor, which I find reasonable, it would be good to make this explicit
both for end users and Hive developers. I leave this decision to you and
the other people in the community who have contributed many fixes &
improvements to the compactor.

Best,
Stamatis


On Wed, Feb 2, 2022 at 11:58 AM Karen Coppage <kc...@gmail.com>
wrote:

> Hi Stamatis,
>
> Thanks for your questions. You bring up good points.
>
> A bit about the state of the two compaction implementations:
> MR compaction (uses class CompactorMR) is older and more stable. I have
> only seen a couple bugs in the past few years.
> QB (query-based) compaction is required when YARN is unavailable. And, as
> you mentioned, insert-only/MM tables use QB compaction (MM compaction has
> its own semi-separate implementation from QB compaction of full ACID
> tables). If we ever extend ACID to support a file format outside of ORC, I
> can see QB compaction as the easier way forward. And lastly, QB compaction
> has never been officially released, since it belongs in version 4.0.0… so
> if we really want to get rid of MR compaction, it would probably be best to
> deprecate it first and leave it available as a backup option for a while.
> One big question remains unanswered, which is: which implementation is
> more efficient? If MR compaction is, then we should keep it and it should
> be used when possible. Otherwise it can be deprecated. I don’t think
> anybody’s working on the testing that would be necessary to answer this
> question, mostly because there are many small fires to put out around other
> parts of compaction.
> Another thing – since QB compaction runs insert queries, it involves a few
> extra move steps, which is slow with object storage. The “direct insert”
> feature will mitigate slowness but (a) it still has quite a few rough edges
> and (b) I don’t think it’s enabled for compaction queries at all.
>
> Questions I didn’t answer above:
>
> > Is the query based compactor gonna work with an engine other than Tez?
> Is someone working on this?
>
> The MR execution engine has been deprecated since Hive 2 (2015).
> Worst-case scenario, users can just run MR compaction. But I hope that this
> is not the case!
>
> > Are there major features that are not yet part of the query based
> compactor (and they need to be)?
>
> I’m pretty sure QB compaction does not yet honor the “WITH OVERWRITE
> TBLPROPERTIES” clause in an ALTER TABLE… COMPACT… statement. This could be
> something to add in the future.
>
> I agree that documenting QB compaction is a must, thanks for pointing this
> out!
>
> Cheers,
> Karen
>
>

Re: [DISCUSS] Compactor (Query vs MR) roadmap

Posted by Karen Coppage <kc...@gmail.com>.
Hi Stamatis,

Thanks for your questions. You bring up good points.

A bit about the state of the two compaction implementations:
MR compaction (uses class CompactorMR) is older and more stable. I have only seen a couple bugs in the past few years.
QB (query-based) compaction is required when YARN is unavailable. And, as you mentioned, insert-only/MM tables use QB compaction (MM compaction has its own semi-separate implementation from QB compaction of full ACID tables). If we ever extend ACID to support a file format outside of ORC, I can see QB compaction as the easier way forward. And lastly, QB compaction has never been officially released, since it belongs in version 4.0.0… so if we really want to get rid of MR compaction, it would probably be best to deprecate it first and leave it available as a backup option for a while.
One big question remains unanswered, which is: which implementation is more efficient? If MR compaction is, then we should keep it and it should be used when possible. Otherwise it can be deprecated. I don’t think anybody’s working on the testing that would be necessary to answer this question, mostly because there are many small fires to put out around other parts of compaction. 
Another thing – since QB compaction runs insert queries, it involves a few extra move steps, which is slow with object storage. The “direct insert” feature will mitigate slowness but (a) it still has quite a few rough edges and (b) I don’t think it’s enabled for compaction queries at all.

Questions I didn’t answer above:

> Is the query based compactor gonna work with an engine other than Tez? Is someone working on this?

The MR execution engine has been deprecated since Hive 2 (2015). Worst-case scenario, users can just run MR compaction. But I hope that this is not the case!

> Are there major features that are not yet part of the query based compactor (and they need to be)?

I’m pretty sure QB compaction does not yet honor the “WITH OVERWRITE TBLPROPERTIES” clause in an ALTER TABLE… COMPACT… statement. This could be something to add in the future.
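
For readers unfamiliar with the clause, here is a minimal sketch that builds
such a statement as a string. The helper name and the table name "sales" are
hypothetical, and the property shown is, to the best of my recollection, the
one used in the example on the Hive Transactions wiki page; treat it as an
assumption and check it against your Hive version.

```python
from typing import Dict, Optional


def compact_statement(table: str, kind: str = "major",
                      overrides: Optional[Dict[str, str]] = None) -> str:
    """Build an ALTER TABLE ... COMPACT statement (hypothetical helper)."""
    if kind not in ("major", "minor"):
        raise ValueError("compaction type must be 'major' or 'minor'")
    stmt = "ALTER TABLE {} COMPACT '{}'".format(table, kind)
    if overrides:
        # Per-request overrides for this compaction only.
        props = ", ".join('"{}"="{}"'.format(k, v) for k, v in overrides.items())
        stmt += " WITH OVERWRITE TBLPROPERTIES ({})".format(props)
    return stmt + ";"


if __name__ == "__main__":
    # "sales" is a made-up table name; the property name is an assumption.
    print(compact_statement(
        "sales", "minor", {"compactor.mapreduce.map.memory.mb": "3072"}))
```

The property in this example only makes sense for the MR compactor, which may
be part of why the clause has no query-based counterpart yet.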

I agree that documenting QB compaction is a must, thanks for pointing this out!

Cheers,
Karen

