Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/02/01 17:45:23 UTC

[GitHub] [accumulo-website] ctubbsii commented on a change in pull request #232: Update Compaction documentation apache/accumulo#1613

ctubbsii commented on a change in pull request #232:
URL: https://github.com/apache/accumulo-website/pull/232#discussion_r568017759



##########
File path: _docs-2/administration/compaction.md
##########
@@ -0,0 +1,119 @@
+---
+title: Compactions
+category: administration
+order: 6
+---
+
+In Accumulo each tablet has a list of files associated with it.  As data is
+written to Accumulo it is buffered in memory. The data buffered in memory is
+eventually written to files in DFS on a per tablet basis. Files can also be
+added to tablets directly by bulk import. In the background tablet servers run
+major compactions to merge multiple files into one. The tablet server has to
+decide which tablets to compact and which files within a tablet to compact.
+
+Within each tablet server there are one or more user configurable Compaction
+Services that compact tablets.  Each Accumulo table has a user configurable
+Compaction Dispatcher that decides which compaction services that table will
+use.  Accumulo generates metrics for each compaction service which enable users
+to adjust compaction service settings based on actual activity.
+
+Each compaction service has a compaction planner that decides which files to
+compact.  The default compaction planner uses the table property {% plink
+table.compaction.major.ratio %} to decide which files to compact.  The
+compaction ratio is a real number >= 1.0.  Assume LFS is the size of the largest
+file in a set, CR is the compaction ratio, and FSS is the sum of file sizes in
+a set. The default planner looks for file sets where LFS*CR <= FSS.  By only
+compacting sets of files that meet this requirement the amount of work done by
+compactions is O(N * log<sub>CR</sub>(N)).  Increasing the ratio will
+result in less compaction work and more files per tablet.  More files per
+tablet means higher query latency, so adjusting this ratio is a trade-off
+between ingest and query performance.
+
+When CR=1.0 this will result in a goal of a single file per tablet, but the
+amount of work is O(N<sup>2</sup>) so 1.0 should be used with caution.  For
+example if a tablet has a 1G file and a 1M file is added, then a compaction of
+the 1G and 1M file would be queued.
+
+Compaction services and dispatchers were introduced in Accumulo 2.1, so much
+of this documentation only applies to Accumulo 2.1 and later.  
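
For readers of this thread, the ratio rule quoted in the diff (LFS*CR <= FSS) can be illustrated with a small sketch. This is not Accumulo's planner code: the function name `find_compactable_set` and the strategy of checking the largest remaining file against all smaller files are assumptions made only to illustrate the inequality. File sizes are in MB, so the 1G file is written as 1024.

```python
def find_compactable_set(file_sizes, ratio):
    """Return the largest set of files satisfying LFS * CR <= FSS, or [].

    Hypothetical helper for illustration only; the real default planner
    in Accumulo applies additional limits and heuristics.
    """
    if ratio < 1.0:
        raise ValueError("compaction ratio must be >= 1.0")

    # Consider the biggest file together with every smaller file first,
    # then drop the biggest file and try again.
    sizes = sorted(file_sizes, reverse=True)
    remaining_sum = sum(sizes)  # FSS for the current candidate set

    for i, largest in enumerate(sizes):  # largest is LFS for this candidate
        candidate = sizes[i:]
        # A compaction needs at least two input files in this sketch.
        if len(candidate) >= 2 and largest * ratio <= remaining_sum:
            return candidate
        remaining_sum -= largest

    return []


# The 1G + 1M example from the doc: with CR=1.0 both files are selected,
# while a higher ratio such as 3.0 leaves them alone.
print(find_compactable_set([1024, 1], 1.0))
print(find_compactable_set([1024, 1], 3.0))
```

With a ratio of 1.0 the tiny 1M file is enough to trigger a rewrite of the 1G file, which is the quadratic behavior the doc warns about. With a ratio of 3.0 the same pair does not qualify, so the small file waits until enough similarly sized files accumulate.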

Review comment:
       > Is it worth splitting off a separate doc link, or are we better off sticking with that only for major releases?
   
   User-facing changes should generally be additive for minor releases, so it would be very redundant to have separate doc areas. I think we can include notes inline where we need to indicate that something applies to a specific minor release and later, rather than copy/paste docs and maintain them in separate paths, when they are 99% identical.



