Posted to commits@accumulo.apache.org by jm...@apache.org on 2023/02/27 16:53:49 UTC

[accumulo-examples] branch 2.1 updated: Update Compaction Strategy example (#117)

This is an automated email from the ASF dual-hosted git repository.

jmark99 pushed a commit to branch 2.1
in repository https://gitbox.apache.org/repos/asf/accumulo-examples.git


The following commit(s) were added to refs/heads/2.1 by this push:
     new bc2d9aa  Update Compaction Strategy example (#117)
bc2d9aa is described below

commit bc2d9aa0a199aecd887f9e8df92c573253c63664
Author: Mark Owens <jm...@apache.org>
AuthorDate: Mon Feb 27 11:53:43 2023 -0500

    Update Compaction Strategy example (#117)
    
    * Update Compaction Strategy example
    
    Update the Compaction Strategy example to make use of the new Compaction Configuration API.
---
 docs/compactionStrategy.md | 161 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 117 insertions(+), 44 deletions(-)

diff --git a/docs/compactionStrategy.md b/docs/compactionStrategy.md
index b0be2fa..e003ed7 100644
--- a/docs/compactionStrategy.md
+++ b/docs/compactionStrategy.md
@@ -16,64 +16,137 @@ limitations under the License.
 -->
 # Apache Accumulo Customizing the Compaction Strategy
 
-This tutorial uses the following Java classes, which can be found in org.apache.accumulo.tserver.compaction: 
-
- * DefaultCompactionStrategy.java - determines which files to compact based on table.compaction.major.ratio and table.file.max
- * EverythingCompactionStrategy.java - compacts all files
- * SizeLimitCompactionStrategy.java - compacts files no bigger than table.majc.compaction.strategy.opts.sizeLimit
- * BasicCompactionStrategy.java - uses default compression table.majc.compaction.strategy.opts.filter.size to filter input 
-                                  files based on size set and table.majc.compaction.strategy.opts.large.compress.threshold
-                                  and table.majc.compaction.strategy.opts.file.large.compress.type for larger files.                            
-                                  
-
 This is an example of how to configure a compaction strategy. By default, Accumulo will always use the DefaultCompactionStrategy, unless 
 these steps are taken to change the configuration.  Use the strategy and settings that best fits your Accumulo setup. This example shows
-how to configure and test one of the more complicated strategies, the BasicCompactionStrategy. Note that this example requires hadoop
-native libraries built with snappy in order to use snappy compression.
+how to configure a non-default strategy. Note that this example requires Hadoop native libraries built with snappy in order to
+use snappy compression. Within this example, commands starting with `user@uno>` are run from within the Accumulo shell, whereas
+commands beginning with `$` are executed from the command line terminal.
 
-To begin, run the command to create a table for testing.
+Start by creating a table that will be used for the compactions.
 
-```bash
-$ accumulo shell -u <username> -p <password> -e "createnamespace examples"
-$ accumulo shell -u <username> -p <password> -e "createtable examples.test1"
-```
+    user@uno> createnamespace examples
+    user@uno> createtable examples.test1
+
+Take note of the TableID for examples.test1. This will be needed later. The TableID can be found by running:
+
+
+    user@uno> tables -l
+    accumulo.metadata    =>        !0
+    accumulo.replication =>      +rep
+    accumulo.root        =>        +r
+    examples.test1       =>         2
 
-The commands below will configure the BasicCompactionStrategy to:
- - Avoid compacting files over 250M
- - Compact files over 100M using gz
+The commands below will configure the desired compaction strategy. The goals are:
+
+ - Avoid compacting files over 250M.
+ - Compact files over 100M using gz.
  - Compact files less than 100M using snappy.
- 
-```bash
- $ accumulo shell -u <username> -p <password> -e "config -t examples.test1 -s table.file.compress.type=snappy"
- $ accumulo shell -u <username> -p <password> -e "config -t examples.test1 -s table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.BasicCompactionStrategy"
- $ accumulo shell -u <username> -p <password> -e "config -t examples.test1 -s table.majc.compaction.strategy.opts.filter.size=250M"
- $ accumulo shell -u <username> -p <password> -e "config -t examples.test1 -s table.majc.compaction.strategy.opts.large.compress.threshold=100M"
- $ accumulo shell -u <username> -p <password> -e "config -t examples.test1 -s table.majc.compaction.strategy.opts.large.compress.type=gz"
-```
+ - Limit the compaction throughput to 40MB/s.
+
+Create a compaction service named `cs1` that has three executors. The first executor, `small`, has
+8 threads and runs compactions smaller than 16M. The second executor, `medium`, runs compactions smaller than
+128M with 4 threads. The last executor, `large`, runs all other compactions with 2 threads.
+
+    user@uno> config -s tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
+    user@uno> config -s 'tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]'
+
+Create a compaction service named `cs2` that has three executors. It has a similar configuration to `cs1`, but its
+executors have fewer threads. For service `cs2`, files over 250M should not be compacted. It also limits
+the total I/O of all compactions within the service to 40MB/s.
+
+    user@uno> config -s tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
+    user@uno> config -s 'tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]' 
+    user@uno> config -s tserver.compaction.major.service.cs2.rate.limit=40M
+
+Configurations can be verified for correctness with the `check-compaction-config` tool in
+Accumulo. Place your compaction configuration into a file and run the tool. For example, if you create a file
+`myconfig` that contains the following:
+
+    tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
+    tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]
+    tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
+    tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]
+    tserver.compaction.major.service.cs2.rate.limit=40M
+
+The following command would check the configuration for errors:
+
+    $ accumulo check-compaction-config /path/to/myconfig
+
+
+With the compaction configuration set, configure table-specific properties.
+
+Configure the compression for table `examples.test1`. Files over 100M will be compressed using `gz`. All
+others will be compressed via `snappy`.
+
+    user@uno> config -t examples.test1 -s table.compaction.configurer=org.apache.accumulo.core.client.admin.compaction.CompressionConfigurer
+    user@uno> config -t examples.test1 -s table.compaction.configurer.opts.large.compress.threshold=100M
+    user@uno> config -t examples.test1 -s table.compaction.configurer.opts.large.compress.type=gz
+    user@uno> config -t examples.test1 -s table.file.compress.type=snappy
+    user@uno> config -t examples.test1 -s table.compaction.dispatcher=org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher
+
+Set table `examples.test1` to use compaction service `cs1` for system compactions and service `cs2`
+for user compactions.
+
+    user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service=cs1
+    user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service.user=cs2
+
+If needed, `chop` compactions can also be configured.
+    
+    user@uno> config -t examples.test1 -s table.compaction.dispatcher.opts.service.chop=cs2
 
 Generate some data and files in order to test the strategy:
 
-```bash
-$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 10000 --size 50
-$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
-$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 11000 --size 50
-$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
-$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 12000 --size 50
-$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
-$ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 13000 --size 50
-$ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
-```
+    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 1000 --size 50
+    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
 
-View the tserver log in <accumulo_home>/logs for the compaction and find the name of the `rfile` that was compacted for your table. Print info about this file using the PrintInfo tool:
+    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 2000 --size 50
+    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
+
+    $ accumulo shell -u <username> -p <password> -e "compact -t examples.test1 -w"
+
+View the `tserver` log in <accumulo_home>/logs for the compaction and find the name of the `rfile` that was
+compacted for your table. Print info about this file using the `rfile-info` tool. Replace the TableID in the path with
+the TableID noted above. Note that your filenames will differ from the ones in this example.
+
+    $ accumulo rfile-info hdfs:///accumulo/tables/2/default_tablet/A000000a.rf
 
-```bash
-$ accumulo rfile-info <rfile>
-```
 Details about the rfile will be printed. The compression type should match the type used in the compaction.
+In this case, `snappy` is used since the size is less than 100M.
 
 ```bash    
 Meta block     : RFile.index
-      Raw size             : 319 bytes
-      Compressed size      : 180 bytes
+      Raw size             : 168 bytes
+      Compressed size      : 127 bytes
+      Compression type     : snappy
+```
+
+Continue by adding more data.
+
+    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 0 --num 1000000 --size 50
+    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
+
+    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 1000000 --num 1000000 --size 50
+    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
+
+    $ ./bin/runex client.SequentialBatchWriter -t examples.test1 --start 2000000 --num 1000000 --size 50
+    $ accumulo shell -u <username> -p <password> -e "flush -t examples.test1"
+
+    $ accumulo shell -u <username> -p <password> -e "compact -t examples.test1 -w"
+
+Again, view the tserver log in <accumulo_home>/logs for the compaction and find the name of the `rfile` that was
+compacted for your table. Print info about this file using the `rfile-info` tool:
+
+    $ accumulo rfile-info hdfs:///accumulo/tables/2/default_tablet/A000000o.rf
+
+In this case, the compression type should be `gz`. 
+
+```bash    
+Meta block     : RFile.index
+      Raw size             : 56,044 bytes
+      Compressed size      : 21,460 bytes
       Compression type     : gz
 ```
+
+Examining the size of `A000000o.rf` within HDFS should confirm that the rfile is larger than 100M.
+
+    $ hdfs dfs -ls -h /accumulo/tables/2/default_tablet/A000000o.rf
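
Editor's note, not part of the commit: the `check-compaction-config` step described in the diff could be scripted roughly as follows. This is a minimal sketch, assuming the five properties shown in the example and the `org.apache.accumulo.core.spi.compaction` package used by the `config -s` commands. The `CONFIG_FILE` location and the `grep` pre-check are illustrative assumptions, not something the commit provides, and the pre-check does not replace the real tool.

```shell
#!/bin/sh
# Sketch only: write the example compaction properties to a file, run a
# simple pre-check, then hand the file to `accumulo check-compaction-config`
# if the accumulo command is available.
set -u

# Hypothetical location for the properties file; adjust as needed.
CONFIG_FILE="${CONFIG_FILE:-./myconfig}"

# Quoted heredoc delimiter keeps the JSON quotes literal.
cat > "$CONFIG_FILE" <<'EOF'
tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":8},{"name":"medium","type":"internal","maxSize":"128M","numThreads":4},{"name":"large","type":"internal","numThreads":2}]
tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","type":"internal","maxSize":"16M","numThreads":4},{"name":"medium","type":"internal","maxSize":"128M","numThreads":2},{"name":"large","type":"internal","maxSize":"250M","numThreads":1}]
tserver.compaction.major.service.cs2.rate.limit=40M
EOF

# Assumption for this example: every planner property names the same class.
# Count planner lines, then count those naming the expected class.
planner_lines=$(grep -c '\.planner=' "$CONFIG_FILE" || true)
good_lines=$(grep -c '\.planner=org\.apache\.accumulo\.core\.spi\.compaction\.DefaultCompactionPlanner$' "$CONFIG_FILE" || true)

if [ "$planner_lines" -ne "$good_lines" ]; then
  echo "unexpected planner class in $CONFIG_FILE" >&2
  exit 1
fi

# Run the real validation only when accumulo is on the PATH.
if command -v accumulo >/dev/null 2>&1; then
  accumulo check-compaction-config "$CONFIG_FILE"
fi
```

The `grep` pre-check only catches a mistyped planner class name before the tool is invoked. Problems in the executor JSON are left to `check-compaction-config` itself.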