Posted to commits@accumulo.apache.org by ct...@apache.org on 2021/02/09 23:56:16 UTC

[accumulo-website] branch next-release updated: Update Compaction documentation apache/accumulo#1613 (#232)

This is an automated email from the ASF dual-hosted git repository.

ctubbsii pushed a commit to branch next-release
in repository https://gitbox.apache.org/repos/asf/accumulo-website.git


The following commit(s) were added to refs/heads/next-release by this push:
     new 73f66ef  Update Compaction documentation apache/accumulo#1613 (#232)
73f66ef is described below

commit 73f66efaf21ac266fe154d9609da92a7e851cace
Author: Keith Turner <kt...@apache.org>
AuthorDate: Tue Feb 9 18:56:06 2021 -0500

    Update Compaction documentation apache/accumulo#1613 (#232)
    
    * Add throughput to example
---
 _docs-2/administration/compaction.md           | 119 +++++++++++++++++++++++++
 _docs-2/getting-started/table_configuration.md | 108 +---------------------
 css/accumulo.scss                              |   4 +-
 3 files changed, 122 insertions(+), 109 deletions(-)

diff --git a/_docs-2/administration/compaction.md b/_docs-2/administration/compaction.md
new file mode 100644
index 0000000..18786a6
--- /dev/null
+++ b/_docs-2/administration/compaction.md
@@ -0,0 +1,119 @@
+---
+title: Compactions
+category: administration
+order: 6
+---
+
+In Accumulo each tablet has a list of files associated with it.  As data is
+written to Accumulo it is buffered in memory. The data buffered in memory is
+eventually written to files in DFS on a per tablet basis. Files can also be
+added to tablets directly by bulk import. In the background tablet servers run
+major compactions to merge multiple files into one. The tablet server has to
+decide which tablets to compact and which files within a tablet to compact.
+
+Within each tablet server there are one or more user configurable Compaction
+Services that compact tablets.  Each Accumulo table has a user configurable
+Compaction Dispatcher that decides which compaction services that table will
+use.  Accumulo generates metrics for each compaction service which enable users
+to adjust compaction service settings based on actual activity.
+
+Each compaction service has a compaction planner that decides which files to
+compact.  The default compaction planner uses the table property {% plink
+table.compaction.major.ratio %} to decide which files to compact.  The
+compaction ratio is a real number >= 1.0.  Assume LFS is the size of the largest
+file in a set, CR is the compaction ratio,  and FSS is the sum of file sizes in
+a set. The default planner looks for file sets where LFS*CR <= FSS.  By only
+compacting sets of files that meet this requirement the amount of work done by
+compactions is O(N * log<sub>CR</sub>(N)).  Increasing the ratio will
+result in less compaction work and more files per tablet.  More files per
+tablet means higher query latency. So adjusting this ratio is a trade off
+between ingest and query performance.
+
+When CR=1.0 this will result in a goal of a single file per tablet, but the
+amount of work is O(N<sup>2</sup>) so 1.0 should be used with caution.  For
+example if a tablet has a 1G file and a 1M file is added, then a compaction of
+the 1G and 1M files would be queued.
+
+Compaction services and dispatchers were introduced in Accumulo 2.1, so much
+of this documentation only applies to Accumulo 2.1 and later.  
+
+## Configuration
+
+Below are some Accumulo shell commands that do the following:
+
+ * Create a compaction service named `cs1` that has three executors.  The first executor named `small` has 8 threads and runs compactions less than 16M.  The second executor `medium` runs compactions less than 128M with 4 threads.  The last executor `large` runs all other compactions.
+ * Create a compaction service named `cs2` that has three executors.  It has a similar configuration to `cs1`, but its executors have fewer threads.  It also limits the total I/O of all compactions within the service to 40MB/s.
+ * Configure table `ci` to use compaction service `cs1` for system compactions and service `cs2` for user compactions.
+
+```
+config -s tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
+config -s 'tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","maxSize":"16M","numThreads":8},{"name":"medium","maxSize":"128M","numThreads":4},{"name":"large","numThreads":2}]'
+config -s tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
+config -s 'tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","maxSize":"16M","numThreads":4},{"name":"medium","maxSize":"128M","numThreads":2},{"name":"large","numThreads":1}]'
+config -s tserver.compaction.major.service.cs2.throughput=40M
+config -t ci -s table.compaction.dispatcher=org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher
+config -t ci -s table.compaction.dispatcher.opts.service=cs1
+config -t ci -s table.compaction.dispatcher.opts.service.user=cs2
+```
+
+For more information see the javadoc for {% jlink org.apache.accumulo.core.spi.compaction %}, 
+{% jlink org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner %} and 
+{% jlink org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher %}.
+
+The names of the compaction services and executors are used for logging and metrics.
+
+## Logging
+
+The names of compaction services and executors are used in logging.  The log
+messages below are from a tserver with the configuration above with data being
+written to the `ci` table.  A compaction of the table was also forced from the
+shell.
+
+```
+2020-06-25T16:34:31,669 [tablet.files] DEBUG: Compacting 3;667;6 on cs1.small for SYSTEM from [C00001cm.rf, C00001a7.rf, F00001db.rf] size 15 MB
+2020-06-25T16:34:45,165 [tablet.files] DEBUG: Compacted 3;667;6 for SYSTEM created hdfs://localhost:8020/accumulo/tables/3/t-000006f/C00001de.rf from [C00001cm.rf, C00001a7.rf, F00001db.rf]
+2020-06-25T16:35:01,965 [tablet.files] DEBUG: Compacting 3;667;6 on cs1.medium for SYSTEM from [C00001de.rf, A000017v.rf, F00001e7.rf] size 33 MB
+2020-06-25T16:35:11,686 [tablet.files] DEBUG: Compacted 3;667;6 for SYSTEM created hdfs://localhost:8020/accumulo/tables/3/t-000006f/A00001er.rf from [C00001de.rf, A000017v.rf, F00001e7.rf]
+2020-06-25T16:37:12,521 [tablet.files] DEBUG: Compacting 3;667;6 on cs2.medium for USER from [F00001f8.rf, A00001er.rf] size 35 MB config []
+2020-06-25T16:37:17,917 [tablet.files] DEBUG: Compacted 3;667;6 for USER created hdfs://localhost:8020/accumulo/tables/3/t-000006f/A00001fr.rf from [F00001f8.rf, A00001er.rf]
+```
+
+## Metrics
+
+The numbers of major and minor compactions running and queued are visible on the
+Accumulo monitor page. This allows you to see if compactions are backing up
+and adjustments to the above settings are needed. When adjusting the number of
+threads available for compactions, consider the number of cores and other tasks
+running on the nodes.
+
+The numbers displayed on the Accumulo monitor are an aggregate of all
+compaction services and executors.  Accumulo emits metrics about the number of
+compactions queued and running on each compaction executor.  Accumulo also
+emits metrics about the number of files per tablet.  These metrics can be used
+to guide adjusting compaction ratios and compaction service configurations to ensure
+tablets do not have too many files.
+
+For example, if metrics show that some compaction executors within a compaction
+service are underutilized while others are overutilized, then the
+configuration for the compaction service may need to be adjusted.  If the metrics
+show that all compaction executors are fully utilized for long periods, then
+the compaction ratio on a table may need to be increased.
+
+## User compactions
+
+Compactions can be initiated manually for a table. To initiate a minor
+compaction, use the `flush` command in the shell. To initiate a major compaction,
+use the `compact` command in the shell:
+
+    user@myinstance mytable> compact -t mytable
+
+If needed, the compaction can be canceled using `compact --cancel -t mytable`.
+
+The `compact` command will compact all tablets in a table to one file. Even tablets
+with one file are compacted. This is useful for the case where a major compaction
+filter is configured for a table. In 1.4, the ability to compact a range of a table
+was added. To use this feature specify start and stop rows for the compact command.
+This will only compact tablets that overlap the given row range.
+
+
+
diff --git a/_docs-2/getting-started/table_configuration.md b/_docs-2/getting-started/table_configuration.md
index 09e384d..0044790 100644
--- a/_docs-2/getting-started/table_configuration.md
+++ b/_docs-2/getting-started/table_configuration.md
@@ -343,109 +343,7 @@ in reduced read latency. Read the [Caching] documentation to learn more.
 
 ## Compaction
 
-As data is written to Accumulo it is buffered in memory. The data buffered in
-memory is eventually written to HDFS on a per tablet basis. Files can also be
-added to tablets directly by bulk import. In the background tablet servers run
-major compactions to merge multiple files into one. The tablet server has to
-decide which tablets to compact and which files within a tablet to compact.
-This decision is made using the compaction ratio, which is configurable on a
-per table basis by the [table.compaction.major.ratio] property.
-
-Increasing this ratio will result in more files per tablet and less compaction
-work. More files per tablet means more higher query latency. So adjusting
-this ratio is a trade off between ingest and query performance. The ratio
-defaults to 3.
-
-The way the ratio works is that a set of files is compacted into one file if the
-sum of the sizes of the files in the set is larger than the ratio multiplied by
-the size of the largest file in the set. If this is not true for the set of all
-files in a tablet, the largest file is removed from consideration, and the
-remaining files are considered for compaction. This is repeated until a
-compaction is triggered or there are no files left to consider.
-
-The number of background threads tablet servers use to run major and minor
-compactions is configured by the [tserver.compaction.major.concurrent.max]
-and [tserver.compaction.minor.concurrent.max] properties respectively.
-
-The numbers of major and minor compactions running and queued is visible on the
-Accumulo monitor page. This allows you to see if compactions are backing up
-and adjustments to the above settings are needed. When adjusting the number of
-threads available for compactions, consider the number of cores and other tasks
-running on the nodes such as maps and reduces.
-
-If major compactions are not keeping up, then the number of files per tablet
-will grow to a point such that query performance starts to suffer. One way to
-handle this situation is to increase the compaction ratio. For example, if the
-compaction ratio were set to 1, then every new file added to a tablet by minor
-compaction would immediately queue the tablet for major compaction. So if a
-tablet has a 200M file and minor compaction writes a 1M file, then the major
-compaction will attempt to merge the 200M and 1M file. If the tablet server
-has lots of tablets trying to do this sort of thing, then major compactions
-will back up and the number of files per tablet will start to grow, assuming
-data is being continuously written. Increasing the compaction ratio will
-alleviate backups by lowering the amount of major compaction work that needs to
-be done.
-
-Another option to deal with the files per tablet growing too large is to adjust
-the [table.file.max] property. When a tablet reaches this number of files and needs
-to flush its in-memory data to disk, it will choose to do a merging minor compaction.
-A merging minor compaction will merge the tablet's smallest file with the data in memory at
-minor compaction time. Therefore the number of files will not grow beyond this
-limit. This will make minor compactions take longer, which will cause ingest
-performance to decrease. This can cause ingest to slow down until major
-compactions have enough time to catch up. When adjusting this property, also
-consider adjusting the compaction ratio. Ideally, merging minor compactions
-never need to occur and major compactions will keep up. It is possible to
-configure the file max and compaction ratio such that only merging minor
-compactions occur and major compactions never occur. This should be avoided
-because doing only merging minor compactions causes O(N<sup>2</sup>) work to be done.
-The amount of work done by major compactions is `O(N*log<sub>R</sub>(N))` where
-R is the compaction ratio.
-
-Compactions can be initiated manually for a table. To initiate a minor
-compaction, use the `flush` command in the shell. To initiate a major compaction,
-use the `compact` command in the shell:
-
-    user@myinstance mytable> compact -t mytable
-
-If needed, the compaction can be canceled using `compact --cancel -t mytable`.
-
-The `compact` command will compact all tablets in a table to one file. Even tablets
-with one file are compacted. This is useful for the case where a major compaction
-filter is configured for a table. In 1.4, the ability to compact a range of a table
-was added. To use this feature specify start and stop rows for the compact command.
-This will only compact tablets that overlap the given row range.
-
-### Compaction Strategies
-
-The default behavior of major compactions is defined in the class {% jlink org.apache.accumulo.tserver.compaction.DefaultCompactionStrategy %}.
-This behavior can be changed by overriding [table.majc.compaction.strategy] with a fully
-qualified class name.
-
-Custom compaction strategies can have additional properties that are specified with the
-{% plink table.majc.compaction.strategy.opts.\* %} prefix.
-
-Accumulo provides a few classes that can be used as an alternative compaction strategy. These classes are located in the 
-{% jlink -f org.apache.accumulo.tserver.compaction %} package. {% jlink org.apache.accumulo.tserver.compaction.EverythingCompactionStrategy %}
-will simply compact all files. This is the strategy used by the user `compact` command. 
-
-{% jlink org.apache.accumulo.tserver.compaction.strategies.BasicCompactionStrategy %} is
-a compaction strategy that supports a few options based on file size.  It
-supports filtering out large files from ever being included in a compaction.
-It also supports using a different compression algorithm for larger files.
-This allows frequent compactions of smaller files to use a fast algorithm and
-infrequent compactions of more data to use a slower algorithm.  Using this may
-enable an increase in throughput w/o using a lot more space.
-
-The following shell command configures a table to use snappy for small files,
-gzip for files over 100M, and avoid compacting any file larger than 250M.
-
-    config -t myTable -s table.file.compress.type=snappy
-    config -t myTable -s table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.BasicCompactionStrategy
-    config -t myTable -s table.majc.compaction.strategy.opts.filter.size=250M
-    config -t myTable -s table.majc.compaction.strategy.opts.large.compress.threshold=100M
-    config -t myTable -s table.majc.compaction.strategy.opts.large.compress.type=gzip
-
+See {% dlink administration/compaction %}.
 ## Pre-splitting tables
 
 Accumulo will balance and distribute tables across servers. Before a
@@ -719,9 +617,5 @@ preserved.
 [Scanner]: {% jurl org.apache.accumulo.core.client.Scanner %}
 [BatchScanner]: {% jurl org.apache.accumulo.core.client.BatchScanner %}
 [Caching]: {% durl administration/caching %}
-[table.compaction.major.ratio]: {% purl table.compaction.major.ratio %}
-[tserver.compaction.major.concurrent.max]: {% purl tserver.compaction.major.concurrent.max %}
-[tserver.compaction.minor.concurrent.max]: {% purl tserver.compaction.minor.concurrent.max %}
-[table.file.max]: {% purl table.file.max %}
 [table.bloom.enabled]: {% purl table.bloom.enabled %}
 [table.file.compress.type]: {% purl table.file.compress.type %}
diff --git a/css/accumulo.scss b/css/accumulo.scss
index 17609cb..6579cc9 100644
--- a/css/accumulo.scss
+++ b/css/accumulo.scss
@@ -43,13 +43,13 @@ body {
 
 pre code {
   font-size: 14px;
+  /* override nowrap in bootstrap */
+  white-space: pre;
 }
 
 code {
   background-color: #f5f5f5;
   color: #555;
-  /* override nowrap in bootstrap */
-  white-space: normal;
 }
 
 #nav-logo {