Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2020/02/28 18:19:48 UTC

[GitHub] [accumulo-website] milleruntime commented on a change in pull request #223: Adds a design document for external compactions

milleruntime commented on a change in pull request #223: Adds a design document for external compactions
URL: https://github.com/apache/accumulo-website/pull/223#discussion_r385845996

##########
File path: design/external-compactions.md
##########
@@ -0,0 +1,245 @@
+---
+title: External Compaction Design
+---
+
+## Definitions
+
+ * *External compaction:* compactions of a tablet execute in a process other than
+ the tablet server hosting that tablet.
+
+ * *Internal compaction:* compactions of a tablet execute in the tablet server
+ process hosting the tablet.
+
+ * *Compactor:* Accumulo process that runs external compactions.
+
+## Introduction
+
+Currently, Accumulo only supports internal compactions. This can lead to
+uneven load on a cluster. For example, a few tablet servers could have many
+tablets to compact while many tablet servers are idle. If Accumulo supported
+external compactions, then compaction work could be spread evenly across a
+cluster.
+
+Compactors could start with a command like:
+
+```
+ accumulo compactor <queue>
+```
+
+This would start a process that looks for compactions on the specified
+distributed queue and executes them. The command could easily run in a Docker
+container in something like Kubernetes. A compactor would need to continually
+do the following:
+
+ * Find tablets with files to compact in the queue
+ * Reserve files/work unit
+ * Compact files
+ * Commit compaction
+
+This document outlines an alternative design to the one outlined in [#1451].
+[#1451] proposes a pull+polling approach, early binding, leases and client side
+selection. This proposal has a distributed queue supporting late binding
+instead of the pull+polling approach. Selection is in the tablet server
+instead of an Accumulo client. ZooKeeper ephemeral nodes are used instead of
+leases.
+
+## Selection
+
+Accumulo needs a mechanism to select files for compaction. Currently this is
+done in two ways. The first way is by looking for tablets with too many files
+according to a logarithmic size ratio. The second way is when a user initiates
+a compaction with a specified file selection criterion. User-initiated
+compactions may also specify custom iterators that do things like filter out
+unwanted data.
+
+In both cases above, this selection code currently runs in the tablet server.
+Users can optionally pass configuration to the selection code running in the
+tablet servers. Going forward the selection code could run in three places.
+
+ * In the tablet server.
+ * In the compactor processes.
+ * In a user process that initiates a user compaction.
+
+Determining where selection code runs is an important consideration for the
+overall user experience. However, for this document it is assumed that selection
+code runs somewhere and queues compaction work. One possibility for selection
+is to use the approach outlined in [#564] with the additional capability of
+compaction managers to submit jobs to distributed external compaction queues
+(in addition to internal queues).
+
+## Queues
+
+A distributed queue for external compactions needs to support the following operations.
+
+ * Adding and removing compaction work for a tablet
+ * Prioritizing compaction work for a tablet
+ * Efficiently finding work
+
+One possibility to implement a distributed queue is a section in Accumulo’s
+metadata table for each queue. Giving each queue a unique row prefix within

Review comment:
Do we want external processes (possibly many) to constantly be scanning the metadata table? Would it be better to have a separate table for the compaction queue?
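
To make the trade-off concrete, here is a minimal in-memory sketch of the three queue operations the document lists (add/remove, prioritize, find work) over a single sorted key space. Everything in it is an assumption for illustration: the `~cq:` row prefix, the inverted-priority encoding, and the class and method names are hypothetical and are not taken from the design document or from the Accumulo API. A `TreeMap` stands in for a sorted table, so the same layout could apply to either the metadata table or a dedicated queue table.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical in-memory model of compaction queues kept in one sorted key space. */
class CompactionQueueSketch {
    // Assumed row layout (illustrative only): ~cq:<queue>:<inverted priority>:<tablet>
    // Queue names are assumed not to contain ':'.
    private static final String PREFIX = "~cq:";
    private static final int MAX_PRIORITY = 9999;

    // Stands in for a sorted table; keys sort lexicographically, as rows would.
    private final TreeMap<String, String> rows = new TreeMap<>();

    private static String row(String queue, int priority, String tablet) {
        if (priority < 0 || priority > MAX_PRIORITY) {
            throw new IllegalArgumentException("priority out of range: " + priority);
        }
        // Invert and zero-pad so that higher priorities sort first within a queue.
        return String.format(Locale.ROOT, "%s%s:%04d:%s",
                PREFIX, queue, MAX_PRIORITY - priority, tablet);
    }

    /** Adds compaction work for a tablet to a queue at the given priority. */
    void add(String queue, int priority, String tablet) {
        rows.put(row(queue, priority, tablet), tablet);
    }

    /** Removes previously queued work; returns false if it was not present. */
    boolean remove(String queue, int priority, String tablet) {
        return rows.remove(row(queue, priority, tablet)) != null;
    }

    /** Finds and removes the highest-priority work in one queue using a prefix range. */
    Optional<String> poll(String queue) {
        String start = PREFIX + queue + ":";
        String end = PREFIX + queue + ";"; // ';' is the character right after ':'
        Map.Entry<String, String> first =
                rows.subMap(start, true, end, false).pollFirstEntry();
        return first == null ? Optional.empty() : Optional.of(first.getValue());
    }

    public static void main(String[] args) {
        CompactionQueueSketch q = new CompactionQueueSketch();
        q.add("q1", 1, "tablet-a");
        q.add("q1", 7, "tablet-b");
        System.out.println(q.poll("q1").orElse("<empty>"));
    }
}
```

In this sketch, finding work for one queue only touches the key range under that queue's prefix. Whether that range lives in the metadata table or in a separate table does not change the layout, only which table absorbs the scan load from compactors.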
