You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hbase.apache.org by st...@apache.org on 2020/05/18 19:32:54 UTC
[hbase-operator-tools] branch master updated: HBASE-24397
[hbase-operator-tools] Tool to Report on row sizes and column counts
This is an automated email from the ASF dual-hosted git repository.
stack pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hbase-operator-tools.git
The following commit(s) were added to refs/heads/master by this push:
new 5b84d7a HBASE-24397 [hbase-operator-tools] Tool to Report on row sizes and column counts
5b84d7a is described below
commit 5b84d7a9a8e5b006afd1de7b040a9273ee297fa4
Author: stack <st...@apache.org>
AuthorDate: Mon May 18 12:28:28 2020 -0700
HBASE-24397 [hbase-operator-tools] Tool to Report on row sizes and column counts
Add tool that reads a table and produces histogram/quantiles of row size
and column count.
---
README.md | 1 +
hbase-table-reporter/README.md | 48 ++
hbase-table-reporter/pom.xml | 102 ++++
hbase-table-reporter/src/main/gnuplot/freq.p | 6 +
hbase-table-reporter/src/main/gnuplot/histo.p | 17 +
.../org/apache/hbase/reporter/TableReporter.java | 525 +++++++++++++++++++++
.../src/main/resources/log4j.properties | 68 +++
.../apache/hbase/reporter/TestTableReporter.java | 94 ++++
pom.xml | 1 +
9 files changed, 862 insertions(+)
diff --git a/README.md b/README.md
index bf593ea..2ff66db 100644
--- a/README.md
+++ b/README.md
@@ -22,3 +22,4 @@ Host for [Apache HBase™](https://hbase.apache.org)
operator tools including:
* [HBCK2](https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2), the hbase-2.x fix-it tool, the successor to hbase-1's _hbck_ (A.K.A _hbck1_).
+ * [TableReporter](https://github.com/apache/hbase-operator-tools/tree/master/hbase-table-reporter), a tool to generate a basic report on Table column counts and row sizes; use when no distributed execution available.
diff --git a/hbase-table-reporter/README.md b/hbase-table-reporter/README.md
new file mode 100644
index 0000000..b361621
--- /dev/null
+++ b/hbase-table-reporter/README.md
@@ -0,0 +1,48 @@
+# hbase-table-reporter
+Basic report on Table column counts and row sizes. Used when no distributed
+execution engine available... Runs a Scan of all data to size. Allows
+specifying subset of Regions-in-a-Table and even specifying a
+Single Region only.
+
+Here is how you run it to count over the first 20% of the
+Regions of the Table using an executor pool of 8 threads (to
+make it so we are scanning 8 regions at a time):
+
+```HBASE_CLASSPATH_PREFIX=./hbase-table-reporter-1.0-SNAPSHOT.jar hbase com.apple.hbase.Reporter -t 8 -f 0.2 GENIE2_modality_session```
+
+The output is a report per Region with a totals for all Regions output at the end. Here is
+what a single Region report looks like:
+
+```
+2020-03-06T20:50:42.473Z region=prod:example,\x0B\x89\xE8\x18Z\xE7RRg/\x05\x81\xE5\xF0.\xB0,1583452813195.d383921761f3ad7b9b1303ee672ff806., duration=PT0.023S
+rowSize quantiles [6424.0, 6424.0, 6424.0, 6424.0, 6424.0, 6424.0, 6424.0, 6424.0, 70000.0, 70000.0, 70000.0, 70000.0, 70000.0, 70000.0, 70000.0, 70000.0, 71720.0, 71720.0, 71720.0, 71720.0, 71720.0, 71720.0, 71720.0, 71720.0, 79712.0, 79712.0, 79712.0, 79712.0, 79712.0, 79712.0, 79712.0, 79840.0, 79840.0, 79840.0, 79840.0, 79840.0, 79840.0, 79840.0, 79840.0, 103248.0, 103248.0, 103248.0, 103248.0, 103248.0, 103248.0, 103248.0, 103248.0, 115464.0, 115464.0, 115464.0, 115464.0, 115464.0, [...]
+rowSize histo [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 8.0, 0.0]
+rowSizestats N=13, min=6424.0, max=237408.0
+columnCount quantiles [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, [...]
+columnCount histo [0.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
+columnCountstats N=13, min=2.0, max=3.0
+```
+It starts w/ the timestamp for when we wrote this Region report followed by the name of the
+Region this report is about and how long it took to generate this report. Then we report 100
+quantiles to show size distribution in the Region. The next line is an histogram of sizes
+reported and then a count of rows seen and smallest and largest sizes encountered.
+
+## Making Plots
+
+When the above reporter runs, on the end, it dumps the data into files that can be used as src plotting diagrams in gnuplot.
+See the tail of the output made by the hbase-table-reporter. Per table, we generate files into the tmp dir with a '''hbase-table-reporter'' prefix.
+There'll be one for the table's rowSize histogram, rowSize percentiles, and ditto for column count. There are gnuplot *.p files in
+this repo at `src/main/gnuplot` that you can can use plotting. Edit the '.p' files to reference the generated files using
+appropriate `histo.p` or `freq.p`.
+
+The first line of the generated files is commented out. It lists vitals like table name, min and max, counts, etc.
+
+Here's a bit of script that might help generating the `.p` files:
+```
+$ for i in $(ls histogram); do echo -n \"pwd/$i\"; x=$(head -1 $i | sed -e 's/^# //' | sed -e 's//\\\\/g'); echo " title \"$x\" with lines, \\"; done
+```
+
+Or if all the reporter files are in the current directory, something like this to put the `.p` file references into files in the tmp dir:
+```
+$ for z in rowSize.histo rowSize.per columnCount.histo columnCount.per; do for i in $(ls *$z*); do echo -n \"`pwd`/$i\"; x=$(head -1 $i | sed -e 's/^# //' | sed -e 's/_/\\\\_/g'); echo " title \"$x\" with lines, \\"; done > /tmp/$z.txt; done
+```
diff --git a/hbase-table-reporter/pom.xml b/hbase-table-reporter/pom.xml
new file mode 100644
index 0000000..2895550
--- /dev/null
+++ b/hbase-table-reporter/pom.xml
@@ -0,0 +1,102 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+ <parent>
+ <groupId>org.apache.hbase.operator.tools</groupId>
+ <artifactId>hbase-operator-tools</artifactId>
+ <version>1.1.0-SNAPSHOT</version>
+ <relativePath>..</relativePath>
+ </parent>
+ <artifactId>hbase-table-reporter</artifactId>
+ <name>Apache HBase - Table Reporter</name>
+ <description>HBase Table Report</description>
+ <properties>
+ <maven.compiler.version>3.6.1</maven.compiler.version>
+ </properties>
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-compiler-plugin</artifactId>
+ <version>${maven.compiler.version}</version>
+ <configuration>
+ <source>${compileSource}</source>
+ <target>${compileSource}</target>
+ <showWarnings>true</showWarnings>
+ <showDeprecation>false</showDeprecation>
+ <useIncrementalCompilation>false</useIncrementalCompilation>
+ <compilerArgument>-Xlint:-options</compilerArgument>
+ </configuration>
+ </plugin>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-shade-plugin</artifactId>
+ <version>3.1.1</version>
+ <executions>
+ <execution>
+ <phase>package</phase>
+ <goals>
+ <goal>shade</goal>
+ </goals>
+ <configuration>
+ <minimizeJar>true</minimizeJar>
+ <createDependencyReducedPom>true</createDependencyReducedPom>
+ <dependencyReducedPomLocation>${java.io.tmpdir}/dependency-reduced-pom.xml
+ </dependencyReducedPomLocation>
+ <artifactSet>
+ <excludes>
+ <exclude>org.apache.hbase:hbase-shaded-client</exclude>
+ <exclude>org.slf4j:slf4j-api</exclude>
+ <exclude>org.apache.htrace:htrace-core</exclude>
+ <exclude>org.apache.htrace:htrace-core4</exclude>
+ <exclude>org.slf4j:slf4j-log4j12</exclude>
+ <exclude>log4j:log4j:jar:</exclude>
+ <exclude>org.apache.yetus:audience-annotations</exclude>
+ <exclude>commons-logging:commons-logging</exclude>
+ <exclude>com.github.stephenc.findbugs:findbugs-annotation</exclude>
+ </excludes>
+ </artifactSet>
+ <transformers>
+ <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
+ <manifestEntries>
+ <addClasspath>true</addClasspath>
+ <mainClass>org.apache.hbase.reporter.TableReporter</mainClass>
+ </manifestEntries>
+ </transformer>
+ </transformers>
+ </configuration>
+ </execution>
+ </executions>
+ </plugin>
+ </plugins>
+ </build>
+ <dependencies>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-api</artifactId>
+ <version>1.7.25</version>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-log4j12</artifactId>
+ <version>1.7.25</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.hbase</groupId>
+ <artifactId>hbase-shaded-client</artifactId>
+ <version>2.1.1</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.datasketches</groupId>
+ <artifactId>datasketches-java</artifactId>
+ <version>1.3.0-incubating</version>
+ </dependency>
+ <dependency>
+ <groupId>commons-cli</groupId>
+ <artifactId>commons-cli</artifactId>
+ <version>1.4</version>
+ </dependency>
+ </dependencies>
+</project>
diff --git a/hbase-table-reporter/src/main/gnuplot/freq.p b/hbase-table-reporter/src/main/gnuplot/freq.p
new file mode 100644
index 0000000..663e80b
--- /dev/null
+++ b/hbase-table-reporter/src/main/gnuplot/freq.p
@@ -0,0 +1,6 @@
+set terminal svg dynamic
+set title 'Percentiles Row Sizes' font ",36"
+set key font ",5"
+set autoscale
+plot \
+"table.rowSize.percentiles.4986160210664496121.gnuplotdata" title "table regions=589, duration=PT38.262S, N=35751, min=1136.0, max=2.38644832E8" with lines
diff --git a/hbase-table-reporter/src/main/gnuplot/histo.p b/hbase-table-reporter/src/main/gnuplot/histo.p
new file mode 100644
index 0000000..d6fe3ee
--- /dev/null
+++ b/hbase-table-reporter/src/main/gnuplot/histo.p
@@ -0,0 +1,17 @@
+set terminal svg dynamic
+set title 'Histogram Row Sizes' font ",36"
+set key font ",5"
+set ylabel 'Log10'
+set logscale y
+set autoscale
+set style data histograms
+set style fill solid
+set style histogram clustered gap 1
+# This xtics list needs to match what is up in the code.
+set xtics axis out rotate by 45 right ("" -1, "<1" 0, "<5" 1, "<10" 2, \
+ "<15" 3, "<20" 4, "<25" 5, "<100" 6, \
+ "<1k" 7, "<5k" 8, "<10k" 9, "<20k" 10, \
+ "<50k" 11, "<100k" 12, \
+ "<1M" 13, ">=1M" 14, "" 15)
+plot \
+"table.rowSize.histograms.7171879522414641440.gnuplotdata" title "table regions=589, duration=PT38.262S, N=35751, min=1136.0, max=2.38644832E8" with lines
diff --git a/hbase-table-reporter/src/main/java/org/apache/hbase/reporter/TableReporter.java b/hbase-table-reporter/src/main/java/org/apache/hbase/reporter/TableReporter.java
new file mode 100644
index 0000000..d33d7b2
--- /dev/null
+++ b/hbase-table-reporter/src/main/java/org/apache/hbase/reporter/TableReporter.java
@@ -0,0 +1,525 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hbase.reporter;
+
+import org.apache.commons.cli.Options;
+import org.apache.commons.cli.ParseException;
+import org.apache.commons.cli.Option;
+import org.apache.commons.cli.CommandLineParser;
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.DefaultParser;
+import org.apache.datasketches.quantiles.DoublesSketch;
+import org.apache.datasketches.quantiles.DoublesUnion;
+import org.apache.datasketches.quantiles.UpdateDoublesSketch;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.ExtendedCell;
+import org.apache.hadoop.hbase.HBaseConfiguration;
+import org.apache.hadoop.hbase.HBaseIOException;
+import org.apache.hadoop.hbase.HRegionInfo;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.TableName;
+import org.apache.hadoop.hbase.client.*;
+import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.util.Bytes;
+
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.FileWriter;
+import java.io.IOException;
+import java.math.BigDecimal;
+import java.time.Duration;
+import java.time.Instant;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.ThreadFactory;
+import java.util.function.DoubleSupplier;
+import java.util.stream.Collectors;
+import java.util.stream.DoubleStream;
+
+/**
+ * Run a scan against a table reporting on row size, column size and count.
+ *
+ * So can run against cdh5, uses loads of deprecated API and copies some Cell sizing methods local.
+ */
+public class TableReporter {
+ private static String GNUPLOT_DATA_SUFFIX = ".gnuplotdata";
+
+ /**
+ * Quantile sketches. Has a print that dumps out sketches on stdout.
+ * To accumlate Sketches instances, see {@link AccumlatingSketch}
+ */
+ static class Sketches {
+ private static final DoubleSupplier IN_POINT_1_INC = new DoubleSupplier() {
+ private BigDecimal accumulator = new BigDecimal(0);
+ private final BigDecimal pointOhOne = new BigDecimal(0.01);
+
+ @Override
+ public double getAsDouble() {
+ double d = this.accumulator.doubleValue();
+ this.accumulator = this.accumulator.add(pointOhOne);
+ return d;
+ }
+ };
+
+ /**
+ * Make an array of 100 increasing numbers from 0-1.
+ */
+ static double [] NORMALIZED_RANKS = DoubleStream.generate(IN_POINT_1_INC).limit(100).toArray();
+
+ /**
+ * Bins that sort of make sense for the data we're seeing here. After some trial and error.
+ */
+ static double [] BINS = new double [] {1, 5, 10, 15, 20, 25, 100, 1024, 5120, 10240, 20480, 51200, 102400, 1048576};
+
+ /**
+ * Size of row.
+ */
+ final UpdateDoublesSketch rowSizeSketch;
+
+ /**
+ * Count of columns in row.
+ */
+ final UpdateDoublesSketch columnCountSketch;
+
+ Sketches() {
+ this(DoublesSketch.builder().setK(256).build(), DoublesSketch.builder().setK(256).build());
+ }
+
+ Sketches(UpdateDoublesSketch rowSizeSketch, UpdateDoublesSketch columnCountSketch) {
+ this.rowSizeSketch = rowSizeSketch;
+ this.columnCountSketch = columnCountSketch;
+ }
+
+ void print(String preamble) {
+ System.out.println(preamble);
+ print();
+ }
+
+ void print() {
+ print("rowSize", rowSizeSketch);
+ print("columnCount", columnCountSketch);
+ }
+
+ private static void print(String label, final DoublesSketch sketch) {
+ System.out.println(label + " quantiles " + Arrays.toString(sketch.getQuantiles(NORMALIZED_RANKS)));
+ double [] pmfs = sketch.getPMF(BINS);
+ // System.out.println(label + " pmfs " + Arrays.toString(pmfs));
+ System.out.println(label + " histo " +
+ (pmfs == null || pmfs.length == 0?
+ "null": Arrays.toString(Arrays.stream(pmfs).map(d -> d * sketch.getN()).toArray())));
+ System.out.println(label + "stats N=" + sketch.getN() + ", min=" + sketch.getMinValue() + ", max=" +
+ sketch.getMaxValue());
+ }
+ }
+
+ /**
+ * For aggregating {@link Sketches}
+ * To add sketches, need a DoublesUnion Sketch.
+ */
+ static class AccumlatingSketch {
+ DoublesUnion rowSizeUnion = DoublesUnion.builder().build();
+ DoublesUnion columnSizeUnion = DoublesUnion.builder().build();
+
+ void add(Sketches other) {
+ this.rowSizeUnion.update(other.rowSizeSketch);
+ this.columnSizeUnion.update(other.columnCountSketch);
+ }
+
+ /**
+ * @return A Sketches made of current state of aggregation.
+ */
+ Sketches get() {
+ return new Sketches(rowSizeUnion.getResult(), columnSizeUnion.getResult());
+ }
+ }
+
+ static void processRowResult(Result result, Sketches sketches) {
+ // System.out.println(result.toString());
+ long rowSize = 0;
+ int columnCount = 0;
+ for (Cell cell : result.rawCells()) {
+ rowSize += estimatedSizeOfCell(cell);
+ columnCount += 1;
+ }
+ sketches.rowSizeSketch.update(rowSize);
+ sketches.columnCountSketch.update(columnCount);
+ }
+
+ /**
+ * @return First <code>fraction</code> of Table's regions.
+ */
+ private static List<RegionInfo> getRegions(Connection connection, TableName tableName,
+ double fraction, String encodedRegionName) throws IOException {
+ try (Admin admin = connection.getAdmin()) {
+ // Use deprecated API because running against old hbase.
+ List<RegionInfo> regions = admin.getRegions(tableName);
+ if (regions.size() <= 0) {
+ throw new HBaseIOException("No regions found in " + tableName);
+ }
+ if (encodedRegionName != null) {
+ return regions.stream().filter(ri -> ri.getEncodedName().equals(encodedRegionName)).
+ collect(Collectors.toCollection(ArrayList::new));
+ }
+ return regions.subList(0, (int)(regions.size() * fraction)); // Rounds down.
+ }
+ }
+
+ /**
+ * Class that scans a Region to produce a Sketch.
+ */
+ static class SketchRegion implements Callable<SketchRegion> {
+ private final RegionInfo ri;
+ private final Connection connection;
+ private final TableName tableName;
+ private final int limit;
+ private Sketches sketches = new Sketches();
+ private volatile long duration;
+
+ SketchRegion(Connection connection, TableName tableName, RegionInfo ri, int limit) {
+ this.ri = ri;
+ this.connection = connection;
+ this.tableName = tableName;
+ this.limit = limit;
+ }
+
+ @Override
+ public SketchRegion call() {
+ try (Table table = this.connection.getTable(this.tableName)) {
+ Scan scan = new Scan();
+ scan.setStartRow(this.ri.getStartKey());
+ scan.setStopRow(this.ri.getEndKey());
+ scan.setAllowPartialResults(true);
+ long startTime = System.currentTimeMillis();
+ long count = 0;
+ try (ResultScanner resultScanner = table.getScanner(scan)) {
+ for (Result result : resultScanner) {
+ processRowResult(result, sketches);
+ count++;
+ if (this.limit >= 0 && count <= this.limit) {
+ break;
+ }
+ }
+ }
+ this.duration = System.currentTimeMillis() - startTime;
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
+ return this;
+ }
+
+ Sketches getSketches() {
+ return this.sketches;
+ }
+
+ RegionInfo getRegionInfo() {
+ return this.ri;
+ }
+
+ long getDuration() {
+ return this.duration;
+ }
+ }
+
+ private static void sketch(Configuration configuration, String tableNameAsStr, int limit,
+ double fraction, int threads, String isoNow, String encodedRegionName)
+ throws IOException, InterruptedException, ExecutionException {
+ TableName tableName = TableName.valueOf(tableNameAsStr);
+ AccumlatingSketch totalSketches = new AccumlatingSketch();
+ long startTime = System.currentTimeMillis();
+ int count = 0;
+ try (Connection connection = ConnectionFactory.createConnection(configuration)) {
+ // Get list of Regions. If 'fraction', get this fraction of all Regions. If
+ // encodedRegionName, then set fraction to 1.0 in case the returned set does not
+ // include the encodedRegionName we're looking for.
+ List<RegionInfo> regions = getRegions(connection, tableName, fraction, encodedRegionName);
+ count = regions.size();
+ if (count <= 0) {
+ throw new HBaseIOException("Empty regions list; fraction " + fraction +
+ " too severe or communication problems?");
+ } else {
+ System.out.println(Instant.now().toString() + " Scanning " + tableNameAsStr +
+ " regions=" + count + ", " + regions);
+ }
+ ExecutorService es =
+ Executors.newFixedThreadPool(threads, new ThreadFactory() {
+ @Override
+ public Thread newThread(Runnable r) {
+ Thread t = new Thread(r);
+ t.setDaemon(true);
+ return t;
+ }
+ });
+ try {
+ List<SketchRegion> srs = regions.stream().map(ri -> new SketchRegion(connection, tableName, ri, limit)).
+ collect(Collectors.toList());
+ List<Future<SketchRegion>> futures = new ArrayList<>(srs.size());
+ for (SketchRegion sr: srs) {
+ // Do submit rather than inokeall; invokeall blocks until all done. This way I get control back
+ // after all submitted.
+ futures.add(es.submit(sr));
+ }
+ // Avoid java.util.ConcurrentModificationException
+ List<Future<SketchRegion>> removals = new ArrayList<>();
+ while (!futures.isEmpty()) {
+ for (Future<SketchRegion> future: futures) {
+ if (future.isDone()) {
+ SketchRegion sr = future.get();
+ sr.getSketches().print(Instant.now().toString() +
+ " region=" + sr.getRegionInfo().getRegionNameAsString() + ", duration=" +
+ (Duration.ofMillis(sr.getDuration()).toString()));
+ totalSketches.add(sr.getSketches());
+ removals.add(future);
+ }
+ }
+ if (!removals.isEmpty()) {
+ futures.removeAll(removals);
+ removals.clear();
+ }
+ Thread.sleep(1000);
+ }
+ } finally {
+ es.shutdown();
+ }
+ }
+ Sketches sketches = totalSketches.get();
+ String isoDuration = Duration.ofMillis(System.currentTimeMillis() - startTime).toString();
+ sketches.print(Instant.now().toString() + " Totals for " + tableNameAsStr + " regions=" + count +
+ ", limit=" + limit + ", fraction=" + fraction + ", took=" + isoDuration);
+ // Dump out the gnuplot files. Saves time generating graphs.
+ dumpGnuplotDataFiles(isoNow, sketches, tableNameAsStr, count, isoDuration);
+ }
+
+ /**
+ * This is an estimate of the heap space occupied by a cell. When the cell is of type
+ * {@link HeapSize} we call {@link HeapSize#heapSize()} so cell can give a correct value. In other
+ * cases we just consider the bytes occupied by the cell components ie. row, CF, qualifier,
+ * timestamp, type, value and tags.
+ * Note that this can be the JVM heap space (on-heap) or the OS heap (off-heap)
+ * @return estimate of the heap space
+ */
+ public static long estimatedSizeOfCell(final Cell cell) {
+ if (cell instanceof HeapSize) {
+ return ((HeapSize) cell).heapSize();
+ }
+ // TODO: Add sizing of references that hold the row, family, etc., arrays.
+ return estimatedSerializedSizeOf(cell);
+ }
+
+ /**
+ * Estimate based on keyvalue's serialization format in the RPC layer. Note that there is an extra
+ * SIZEOF_INT added to the size here that indicates the actual length of the cell for cases where
+ * cell's are serialized in a contiguous format (For eg in RPCs).
+ * @return Estimate of the <code>cell</code> size in bytes plus an extra SIZEOF_INT indicating the
+ * actual cell length.
+ */
+ public static int estimatedSerializedSizeOf(final Cell cell) {
+ if (cell instanceof ExtendedCell) {
+ return ((ExtendedCell) cell).getSerializedSize(true) + Bytes.SIZEOF_INT;
+ }
+
+ return getSumOfCellElementLengths(cell) +
+ // Use the KeyValue's infrastructure size presuming that another implementation would have
+ // same basic cost.
+ KeyValue.ROW_LENGTH_SIZE + KeyValue.FAMILY_LENGTH_SIZE +
+ // Serialization is probably preceded by a length (it is in the KeyValueCodec at least).
+ Bytes.SIZEOF_INT;
+ }
+
+ /**
+ * @return Sum of the lengths of all the elements in a Cell; does not count in any infrastructure
+ */
+ private static int getSumOfCellElementLengths(final Cell cell) {
+ return getSumOfCellKeyElementLengths(cell) + cell.getValueLength() + cell.getTagsLength();
+ }
+
+ /**
+ * @return Sum of all elements that make up a key; does not include infrastructure, tags or
+ * values.
+ */
+ private static int getSumOfCellKeyElementLengths(final Cell cell) {
+ return cell.getRowLength() + cell.getFamilyLength() + cell.getQualifierLength()
+ + KeyValue.TIMESTAMP_TYPE_SIZE;
+ }
+
+ private static String getFileNamePrefix(String isoNow, String tableName, String sketchName) {
+ return "reporter." + isoNow + "." + tableName + "." + sketchName;
+ }
+
+ private static String getFileFirstLine(String tableName, int regions, String isoDuration, UpdateDoublesSketch sketch) {
+ return "# " + tableName + " regions=" + regions + ", duration=" + isoDuration + ", N=" + sketch.getN() +
+ ", min=" + sketch.getMinValue() + ", max=" + sketch.getMaxValue();
+ }
+
+ private static void dumpPercentilesFile(String prefix, String firstLine, UpdateDoublesSketch sketch)
+ throws IOException {
+ dumpFile(File.createTempFile(prefix + ".percentiles.", GNUPLOT_DATA_SUFFIX),
+ firstLine, sketch.getQuantiles(Sketches.NORMALIZED_RANKS));
+ }
+
+ private static void dumpHistogramFile(String prefix, String firstLine, UpdateDoublesSketch sketch)
+ throws IOException {
+ double [] pmfs = sketch.getPMF(Sketches.BINS);
+ double [] ds = Arrays.stream(pmfs).map(d -> d * sketch.getN()).toArray();
+ dumpFile(File.createTempFile(prefix + ".histograms.", GNUPLOT_DATA_SUFFIX),
+ firstLine, ds);
+ }
+
+ private static void dumpFile(File file, String firstLine, double [] ds) throws IOException {
+ try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
+ writer.write(firstLine);
+ writer.newLine();
+ for (double d : ds) {
+ writer.write(Double.toString(d));
+ writer.newLine();
+ }
+ }
+ System.out.println(Instant.now().toString() + " wrote " + file.toString());
+ }
+
+ private static void dumpFiles(String prefix, String firstLine, UpdateDoublesSketch sketch) throws IOException {
+ dumpPercentilesFile(prefix, firstLine, sketch);
+ dumpHistogramFile(prefix, firstLine, sketch);
+ }
+
+ /**
+ * Write four files, a histogram and percentiles, one each for each of the row size and column count sketches.
+ * Tie the four files with isoNow time.
+ */
+ private static void dumpGnuplotDataFiles(String isoNow, Sketches sketches, String tableName, int regions,
+ String isoDuration) throws IOException {
+ UpdateDoublesSketch sketch = sketches.columnCountSketch;
+ dumpFiles(getFileNamePrefix(isoNow, tableName, "columnCount"),
+ getFileFirstLine(tableName, regions, isoDuration, sketch), sketch);
+ sketch = sketches.rowSizeSketch;
+ dumpFiles(getFileNamePrefix(isoNow, tableName, "rowSize"),
+ getFileFirstLine(tableName, regions, isoDuration, sketch), sketch);
+ }
+
+ static void usage(Options options) {
+ usage(options, null);
+ }
+
+ static void usage(Options options, String error) {
+ if (error != null) {
+ System.out.println("ERROR: " + error);
+ }
+ // HelpFormatter can't output -Dproperty=value.
+ // Options doesn't know how to process -D one=two...i.e.
+ // with a space between -D and the property-value... so
+ // take control of the usage output and output what
+ // Options can parse.
+ System.out.println("Usage: reporter <OPTIONS> TABLENAME");
+ System.out.println("OPTIONS:");
+ System.out.println(" -h,--help Output this help message");
+ System.out.println(" -l,--limit Scan row limit (per thread): default none");
+ System.out.println(" -f,--fraction Fraction of table Regions to read; between 0 and 1: default 1.0 (all)");
+ System.out.println(" -r,--region Scan this Region only; pass encoded name; 'fraction' is ignored.");
+ System.out.println(" -t,--threads Concurrent thread count (thread per Region); default 1");
+ System.out.println(" -Dproperty=value Properties such as the zookeeper to connect to; e.g:");
+ System.out.println(" -Dhbase.zookeeper.quorum=ZK0.remote.cluster.example.org");
+ }
+
+ public static void main(String [] args)
+ throws ParseException, IOException, ExecutionException, InterruptedException {
+ Options options = new Options();
+ Option help = Option.builder("h").longOpt("help").
+ desc("output this help message").build();
+ options.addOption(help);
+ Option limitOption = Option.builder("l").longOpt("limit").hasArg().build();
+ options.addOption(limitOption);
+ Option fractionOption = Option.builder("f").longOpt("fraction").hasArg().build();
+ options.addOption(fractionOption);
+ Option regionOption = Option.builder("r").longOpt("region").hasArg().build();
+ options.addOption(regionOption);
+ Option threadsOption = Option.builder("t").longOpt("threads").hasArg().build();
+ options.addOption(threadsOption);
+ Option configOption = Option.builder("D").valueSeparator().argName("property=value").
+ hasArgs().build();
+ options.addOption(configOption);
+ // Parse command-line.
+ CommandLineParser parser = new DefaultParser();
+ CommandLine commandLine = parser.parse(options, args);
+
+ // Process general options.
+ if (commandLine.hasOption(help.getOpt()) || commandLine.getArgList().isEmpty()) {
+ usage(options);
+ System.exit(0);
+ }
+
+ int limit = -1;
+ String opt = limitOption.getOpt();
+ if (commandLine.hasOption(opt)) {
+ limit = Integer.parseInt(commandLine.getOptionValue(opt));
+ }
+ double fraction = 1.0;
+ opt = fractionOption.getOpt();
+ if (commandLine.hasOption(opt)) {
+ fraction = Double.parseDouble(commandLine.getOptionValue(opt));
+ if (fraction > 1 || fraction <= 0) {
+ usage(options, "Bad fraction: " + fraction + "; fraction must be > 0 and < 1");
+ System.exit(0);
+ }
+ }
+ int threads = 1;
+ opt = threadsOption.getOpt();
+ if (commandLine.hasOption(opt)) {
+ threads = Integer.parseInt(commandLine.getOptionValue(opt));
+ if (threads > 1000 || threads <= 0) {
+ usage(options, "Bad thread count: " + threads + "; must be > 0 and < 1000");
+ System.exit(0);
+ }
+ }
+
+ String encodedRegionName = null;
+ opt = regionOption.getOpt();
+ if (commandLine.hasOption(opt)) {
+ encodedRegionName = commandLine.getOptionValue(opt);
+ }
+
+ Configuration configuration = HBaseConfiguration.create();
+ opt = configOption.getOpt();
+ if (commandLine.hasOption(opt)) {
+ // If many options, they all show up here in the keyValues
+ // array, one after the other.
+ String [] keyValues = commandLine.getOptionValues(opt);
+ for (int i = 0; i < keyValues.length;) {
+ configuration.set(keyValues[i], keyValues[i + 1]);
+ i += 2; // Skip over this key and value to next one.
+ }
+ }
+
+ // Now process commands.
+ String [] commands = commandLine.getArgs();
+ if (commands.length < 1) {
+ usage(options, "No TABLENAME: " + Arrays.toString(commands));
+ System.exit(1);
+ }
+
+ String now = Instant.now().toString();
+ for (String command : commands) {
+ sketch(configuration, command, limit, fraction, threads, now, encodedRegionName);
+ }
+ }
+}
diff --git a/hbase-table-reporter/src/main/resources/log4j.properties b/hbase-table-reporter/src/main/resources/log4j.properties
new file mode 100644
index 0000000..c322699
--- /dev/null
+++ b/hbase-table-reporter/src/main/resources/log4j.properties
@@ -0,0 +1,68 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Define some default values that can be overridden by system properties
+hbase.root.logger=INFO,console
+hbase.log.dir=.
+hbase.log.file=hbase.log
+
+# Define the root logger to the system property "hbase.root.logger".
+log4j.rootLogger=${hbase.root.logger}
+
+# Logging Threshold
+log4j.threshold=ALL
+
+#
+# Daily Rolling File Appender
+#
+log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
+log4j.appender.DRFA.File=${hbase.log.dir}/${hbase.log.file}
+
+# Rollver at midnight
+log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
+
+# 30-day backup
+#log4j.appender.DRFA.MaxBackupIndex=30
+log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
+# Debugging Pattern format
+log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %C{2}(%L): %m%n
+
+
+#
+# console
+# Add "console" to rootlogger above if you want to use this
+#
+log4j.appender.console=org.apache.log4j.ConsoleAppender
+log4j.appender.console.target=System.err
+log4j.appender.console.layout=org.apache.log4j.PatternLayout
+log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %C{2}(%L): %m%n
+
+# Custom Logging levels
+
+#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG
+
+log4j.logger.org.apache.hadoop=WARN
+log4j.logger.org.apache.zookeeper=ERROR
+log4j.logger.org.apache.hadoop.hbase=DEBUG
+
+#These settings are workarounds against spurious logs from the minicluster.
+#See HBASE-4709
+log4j.logger.org.apache.hadoop.metrics2.impl.MetricsConfig=WARN
+log4j.logger.org.apache.hadoop.metrics2.impl.MetricsSinkAdapter=WARN
+log4j.logger.org.apache.hadoop.metrics2.impl.MetricsSystemImpl=WARN
+log4j.logger.org.apache.hadoop.metrics2.util.MBeans=WARN
+# Enable this to get detailed connection error/retry logging.
+# log4j.logger.org.apache.hadoop.hbase.client.ConnectionImplementation=TRACE
diff --git a/hbase-table-reporter/src/test/java/org/apache/hbase/reporter/TestTableReporter.java b/hbase-table-reporter/src/test/java/org/apache/hbase/reporter/TestTableReporter.java
new file mode 100644
index 0000000..16e46c8
--- /dev/null
+++ b/hbase-table-reporter/src/test/java/org/apache/hbase/reporter/TestTableReporter.java
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hbase.reporter;
+
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.CellBuilderFactory;
+import org.apache.hadoop.hbase.CellBuilderType;
+import org.apache.hadoop.hbase.client.Result;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import static org.apache.hadoop.hbase.shaded.junit.framework.TestCase.assertEquals;
+
+public class TestTableReporter {
+ private static final byte [] CF = Bytes.toBytes("cf");
+ private static final byte [] Q = Bytes.toBytes("q");
+
+ private List<Cell> makeCells(byte [] row, int columns, int versions) {
+ List<Cell> cells = new ArrayList<Cell>(columns);
+ for (int j = 0; j < columns; j++) {
+ for (int k = versions; k > 0; k--) {
+ Cell cell = CellBuilderFactory.create(CellBuilderType.DEEP_COPY).
+ setRow(row).setFamily(CF).
+ setQualifier(Bytes.toBytes(j)).
+ setType(Cell.Type.Put).
+ setTimestamp(k).
+ setValue(row).build();
+ cells.add(cell);
+ }
+ }
+ return cells;
+ }
+
+ @Test
+ public void testSimpleSketching() {
+ TableReporter.Sketches sketches = new TableReporter.Sketches();
+ final int rows = 10;
+ final int columns = 3;
+ final int versions = 2;
+ for (int i = 0; i < rows; i++) {
+ TableReporter.processRowResult(Result.create(makeCells(Bytes.toBytes(i), columns, versions)), sketches);
+ }
+ sketches.print();
+ // Just check the column counts. Should be 2.
+ double [] columnCounts = sketches.columnCountSketch.getQuantiles(new double [] {1});
+ assertEquals(columnCounts.length, 1);
+ assertEquals((int)columnCounts[0], columns * versions);
+ }
+
+ @Test
+ public void testAddSketches() {
+ TableReporter.Sketches sketches = new TableReporter.Sketches();
+ final int rows = 10;
+ final int columns = 3;
+ final int versions = 2;
+ for (int i = 0; i < rows; i++) {
+ TableReporter.processRowResult(Result.create(makeCells(Bytes.toBytes(i), columns, versions)), sketches);
+ }
+ sketches.print();
+ TableReporter.Sketches sketches2 = new TableReporter.Sketches();
+ for (int i = 0; i < rows; i++) {
+ TableReporter.processRowResult(Result.create(makeCells(Bytes.toBytes(i), columns, versions)), sketches2);
+ }
+ sketches2.print();
+ TableReporter.AccumlatingSketch accumlator = new TableReporter.AccumlatingSketch();
+ accumlator.add(sketches);
+ accumlator.add(sketches2);
+ TableReporter.Sketches sum = accumlator.get();
+
+ // Just check the column counts. Should be 2.
+ double [] columnCounts = sum.columnCountSketch.getQuantiles(new double [] {1});
+ assertEquals(columnCounts.length, 1);
+ assertEquals((int)columnCounts[0], columns * versions);
+ sum.print();
+ }
+}
\ No newline at end of file
diff --git a/pom.xml b/pom.xml
index 74c6403..1c31a01 100644
--- a/pom.xml
+++ b/pom.xml
@@ -54,6 +54,7 @@
</license>
</licenses>
<modules>
+ <module>hbase-table-reporter</module>
<module>hbase-hbck2</module>
<!--Add an assembly module because of http://maven.apache.org/plugins/maven-assembly-plugin/faq.html#module-binaries
-->