Posted to commits@hbase.apache.org by st...@apache.org on 2020/05/18 19:32:54 UTC

[hbase-operator-tools] branch master updated: HBASE-24397 [hbase-operator-tools] Tool to Report on row sizes and column counts

This is an automated email from the ASF dual-hosted git repository.

stack pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hbase-operator-tools.git


The following commit(s) were added to refs/heads/master by this push:
     new 5b84d7a  HBASE-24397 [hbase-operator-tools] Tool to Report on row sizes and column counts
5b84d7a is described below

commit 5b84d7a9a8e5b006afd1de7b040a9273ee297fa4
Author: stack <st...@apache.org>
AuthorDate: Mon May 18 12:28:28 2020 -0700

    HBASE-24397 [hbase-operator-tools] Tool to Report on row sizes and column counts
    
    Add tool that reads a table and produces histogram/quantiles of row size
    and column count.
---
 README.md                                          |   1 +
 hbase-table-reporter/README.md                     |  48 ++
 hbase-table-reporter/pom.xml                       | 102 ++++
 hbase-table-reporter/src/main/gnuplot/freq.p       |   6 +
 hbase-table-reporter/src/main/gnuplot/histo.p      |  17 +
 .../org/apache/hbase/reporter/TableReporter.java   | 525 +++++++++++++++++++++
 .../src/main/resources/log4j.properties            |  68 +++
 .../apache/hbase/reporter/TestTableReporter.java   |  94 ++++
 pom.xml                                            |   1 +
 9 files changed, 862 insertions(+)

diff --git a/README.md b/README.md
index bf593ea..2ff66db 100644
--- a/README.md
+++ b/README.md
@@ -22,3 +22,4 @@ Host for [Apache HBase&trade;](https://hbase.apache.org)
 operator tools including:
 
  * [HBCK2](https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2), the hbase-2.x fix-it tool, the successor to hbase-1's _hbck_ (A.K.A _hbck1_).
+ * [TableReporter](https://github.com/apache/hbase-operator-tools/tree/master/hbase-table-reporter), a tool to generate a basic report on Table column counts and row sizes; use it when no distributed execution engine is available.
diff --git a/hbase-table-reporter/README.md b/hbase-table-reporter/README.md
new file mode 100644
index 0000000..b361621
--- /dev/null
+++ b/hbase-table-reporter/README.md
@@ -0,0 +1,48 @@
+# hbase-table-reporter
+Basic report on Table column counts and row sizes. Use it when no distributed
+execution engine is available. It runs a Scan over all the data to compute
+sizes, and allows specifying a subset of the Regions in a Table, or even a
+single Region only.
+
+Here is how you run it to count over the first 20% of the
+Regions of table `TABLENAME` using an executor pool of 8 threads (so
+that we scan 8 Regions at a time):
+
+```HBASE_CLASSPATH_PREFIX=./hbase-table-reporter-1.1.0-SNAPSHOT.jar hbase org.apache.hbase.reporter.TableReporter -t 8 -f 0.2 TABLENAME```
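+
+To scan just one Region, pass its encoded name with `-r` (the `fraction` option is then
+ignored); for example, with a made-up encoded name:
+
+```HBASE_CLASSPATH_PREFIX=./hbase-table-reporter-1.1.0-SNAPSHOT.jar hbase org.apache.hbase.reporter.TableReporter -r d383921761f3ad7b9b1303ee672ff806 TABLENAME```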
+
+The output is a report per Region, with totals for all Regions output at the end. Here is
+what a single Region report looks like:
+
+```
+2020-03-06T20:50:42.473Z region=prod:example,\x0B\x89\xE8\x18Z\xE7RRg/\x05\x81\xE5\xF0.\xB0,1583452813195.d383921761f3ad7b9b1303ee672ff806., duration=PT0.023S
+rowSize quantiles [6424.0, 6424.0, 6424.0, 6424.0, 6424.0, 6424.0, 6424.0, 6424.0, 70000.0, 70000.0, 70000.0, 70000.0, 70000.0, 70000.0, 70000.0, 70000.0, 71720.0, 71720.0, 71720.0, 71720.0, 71720.0, 71720.0, 71720.0, 71720.0, 79712.0, 79712.0, 79712.0, 79712.0, 79712.0, 79712.0, 79712.0, 79840.0, 79840.0, 79840.0, 79840.0, 79840.0, 79840.0, 79840.0, 79840.0, 103248.0, 103248.0, 103248.0, 103248.0, 103248.0, 103248.0, 103248.0, 103248.0, 115464.0, 115464.0, 115464.0, 115464.0, 115464.0,  [...]
+rowSize histo [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 8.0, 0.0]
+rowSizestats N=13, min=6424.0, max=237408.0
+columnCount quantiles [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, [...]
+columnCount histo [0.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
+columnCountstats N=13, min=2.0, max=3.0
+```
+Each Region report starts with the timestamp at which it was written, followed by the name of the
+Region the report is about and how long it took to generate. Then we report 100
+quantiles to show the size distribution within the Region. The next line is a histogram of the sizes
+reported, and then a count of rows seen plus the smallest and largest sizes encountered.
+
+## Making Plots
+
+When the reporter finishes, it dumps its data into files that can be used as sources for plotting diagrams in gnuplot.
+See the tail of the output made by the hbase-table-reporter. Per table, we generate files into the tmp dir with a `reporter.` prefix.
+There'll be one file for the table's rowSize histogram, one for its rowSize percentiles, and the same pair for column count.
+There are gnuplot `*.p` files in this repo at `src/main/gnuplot` that you can use for plotting. Edit `histo.p` (histograms)
+or `freq.p` (percentiles) to reference the generated files.
+
+The first line of each generated file is a comment. It lists vitals like the table name, min and max, counts, etc.
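+For example (taken from the sample `.p` files in `src/main/gnuplot`):
+```
+# table regions=589, duration=PT38.262S, N=35751, min=1136.0, max=2.38644832E8
+```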
+
+Here's a bit of script that might help generate the plot lines for the `.p` files:
+```
+$ for i in $(ls histogram); do  echo -n \"`pwd`/$i\"; x=$(head -1 $i | sed -e 's/^# //' | sed -e 's/_/\\\\_/g'); echo " title \"$x\" with lines, \\";  done
+```
+
+Or, if all the reporter files are in the current directory, something like this will put the `.p` plot references into files in the tmp dir:
+```
+$ for z in rowSize.histo rowSize.per columnCount.histo columnCount.per; do for i in $(ls *$z*); do  echo -n \"`pwd`/$i\"; x=$(head -1 $i | sed -e 's/^# //' | sed -e 's/_/\\\\_/g'); echo " title \"$x\" with lines, \\";  done > /tmp/$z.txt; done 
+```
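+
+Once a `.p` file references your generated data files, render it with gnuplot. The scripts set
+`terminal svg` without an output file, so the svg is written to stdout; something like this
+(the output file names here are just examples):
+```
+$ gnuplot histo.p > rowSize.histo.svg
+$ gnuplot freq.p > rowSize.percentiles.svg
+```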
diff --git a/hbase-table-reporter/pom.xml b/hbase-table-reporter/pom.xml
new file mode 100644
index 0000000..2895550
--- /dev/null
+++ b/hbase-table-reporter/pom.xml
@@ -0,0 +1,102 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <parent>
+    <groupId>org.apache.hbase.operator.tools</groupId>
+    <artifactId>hbase-operator-tools</artifactId>
+    <version>1.1.0-SNAPSHOT</version>
+    <relativePath>..</relativePath>
+  </parent>
+  <artifactId>hbase-table-reporter</artifactId>
+  <name>Apache HBase - Table Reporter</name>
+  <description>HBase Table Report</description>
+  <properties>
+    <maven.compiler.version>3.6.1</maven.compiler.version>
+  </properties>
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-compiler-plugin</artifactId>
+        <version>${maven.compiler.version}</version>
+        <configuration>
+          <source>${compileSource}</source>
+          <target>${compileSource}</target>
+          <showWarnings>true</showWarnings>
+          <showDeprecation>false</showDeprecation>
+          <useIncrementalCompilation>false</useIncrementalCompilation>
+          <compilerArgument>-Xlint:-options</compilerArgument>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-shade-plugin</artifactId>
+        <version>3.1.1</version>
+        <executions>
+          <execution>
+            <phase>package</phase>
+            <goals>
+              <goal>shade</goal>
+            </goals>
+            <configuration>
+              <minimizeJar>true</minimizeJar>
+              <createDependencyReducedPom>true</createDependencyReducedPom>
+              <dependencyReducedPomLocation>${java.io.tmpdir}/dependency-reduced-pom.xml
+                </dependencyReducedPomLocation>
+              <artifactSet>
+                <excludes>
+                  <exclude>org.apache.hbase:hbase-shaded-client</exclude>
+                  <exclude>org.slf4j:slf4j-api</exclude>
+                  <exclude>org.apache.htrace:htrace-core</exclude>
+                  <exclude>org.apache.htrace:htrace-core4</exclude>
+                  <exclude>org.slf4j:slf4j-log4j12</exclude>
+                  <exclude>log4j:log4j</exclude>
+                  <exclude>org.apache.yetus:audience-annotations</exclude>
+                  <exclude>commons-logging:commons-logging</exclude>
+                  <exclude>com.github.stephenc.findbugs:findbugs-annotations</exclude>
+                </excludes>
+              </artifactSet>
+              <transformers>
+                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
+                  <manifestEntries>
+                    <addClasspath>true</addClasspath>
+                    <mainClass>org.apache.hbase.reporter.TableReporter</mainClass>
+                  </manifestEntries>
+                </transformer>
+              </transformers>
+            </configuration>
+          </execution>
+        </executions>
+      </plugin>
+    </plugins>
+  </build>
+  <dependencies>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-api</artifactId>
+      <version>1.7.25</version>
+    </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-log4j12</artifactId>
+      <version>1.7.25</version>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hbase</groupId>
+      <artifactId>hbase-shaded-client</artifactId>
+      <version>2.1.1</version>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.datasketches</groupId>
+      <artifactId>datasketches-java</artifactId>
+      <version>1.3.0-incubating</version>
+    </dependency>
+    <dependency>
+      <groupId>commons-cli</groupId>
+      <artifactId>commons-cli</artifactId>
+      <version>1.4</version>
+    </dependency>
+  </dependencies>
+</project>
diff --git a/hbase-table-reporter/src/main/gnuplot/freq.p b/hbase-table-reporter/src/main/gnuplot/freq.p
new file mode 100644
index 0000000..663e80b
--- /dev/null
+++ b/hbase-table-reporter/src/main/gnuplot/freq.p
@@ -0,0 +1,6 @@
+set terminal svg dynamic
+set title 'Percentiles Row Sizes' font ",36"
+set key font ",5"
+set autoscale
+plot \
+"table.rowSize.percentiles.4986160210664496121.gnuplotdata" title "table regions=589, duration=PT38.262S, N=35751, min=1136.0, max=2.38644832E8" with lines
diff --git a/hbase-table-reporter/src/main/gnuplot/histo.p b/hbase-table-reporter/src/main/gnuplot/histo.p
new file mode 100644
index 0000000..d6fe3ee
--- /dev/null
+++ b/hbase-table-reporter/src/main/gnuplot/histo.p
@@ -0,0 +1,17 @@
+set terminal svg dynamic
+set title 'Histogram Row Sizes' font ",36"
+set key font ",5"
+set ylabel 'Log10'
+set logscale y
+set autoscale
+set style data histograms
+set style fill solid
+set style histogram clustered gap 1
+# This xtics list needs to match what is up in the code.
+set xtics axis out rotate by 45 right ("" -1, "<1" 0, "<5" 1, "<10" 2, \
+ "<15" 3, "<20"  4, "<25" 5, "<100" 6, \
+ "<1k" 7, "<5k" 8, "<10k" 9, "<20k" 10, \
+ "<50k" 11, "<100k" 12, \
+ "<1M" 13, ">=1M" 14, "" 15)
+plot \
+"table.rowSize.histograms.7171879522414641440.gnuplotdata" title "table regions=589, duration=PT38.262S, N=35751, min=1136.0, max=2.38644832E8" with lines
diff --git a/hbase-table-reporter/src/main/java/org/apache/hbase/reporter/TableReporter.java b/hbase-table-reporter/src/main/java/org/apache/hbase/reporter/TableReporter.java
new file mode 100644
index 0000000..d33d7b2
--- /dev/null
+++ b/hbase-table-reporter/src/main/java/org/apache/hbase/reporter/TableReporter.java
@@ -0,0 +1,525 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hbase.reporter;
+
+import org.apache.commons.cli.Options;
+import org.apache.commons.cli.ParseException;
+import org.apache.commons.cli.Option;
+import org.apache.commons.cli.CommandLineParser;
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.DefaultParser;
+import org.apache.datasketches.quantiles.DoublesSketch;
+import org.apache.datasketches.quantiles.DoublesUnion;
+import org.apache.datasketches.quantiles.UpdateDoublesSketch;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.ExtendedCell;
+import org.apache.hadoop.hbase.HBaseConfiguration;
+import org.apache.hadoop.hbase.HBaseIOException;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.TableName;
+import org.apache.hadoop.hbase.client.Admin;
+import org.apache.hadoop.hbase.client.Connection;
+import org.apache.hadoop.hbase.client.ConnectionFactory;
+import org.apache.hadoop.hbase.client.RegionInfo;
+import org.apache.hadoop.hbase.client.Result;
+import org.apache.hadoop.hbase.client.ResultScanner;
+import org.apache.hadoop.hbase.client.Scan;
+import org.apache.hadoop.hbase.client.Table;
+import org.apache.hadoop.hbase.io.HeapSize;
+import org.apache.hadoop.hbase.util.Bytes;
+
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.FileWriter;
+import java.io.IOException;
+import java.math.BigDecimal;
+import java.time.Duration;
+import java.time.Instant;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.concurrent.Callable;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.ThreadFactory;
+import java.util.function.DoubleSupplier;
+import java.util.stream.Collectors;
+import java.util.stream.DoubleStream;
+
+/**
+ * Run a scan against a table reporting on row size, column size and count.
+ *
+ * So it can run against cdh5, it uses lots of deprecated API and copies some Cell sizing methods locally.
+ */
+public class TableReporter {
+  private static final String GNUPLOT_DATA_SUFFIX = ".gnuplotdata";
+
+  /**
+   * Quantile sketches. Has a print that dumps the sketches to stdout.
+   * To accumulate Sketches instances, see {@link AccumulatingSketch}.
+   */
+  static class Sketches {
+    private static final DoubleSupplier IN_POINT_1_INC = new DoubleSupplier() {
+      private BigDecimal accumulator = new BigDecimal(0);
+      private final BigDecimal pointOhOne = new BigDecimal("0.01");
+
+      @Override
+      public double getAsDouble() {
+        double d = this.accumulator.doubleValue();
+        this.accumulator = this.accumulator.add(pointOhOne);
+        return d;
+      }
+    };
+
+    /**
+     * Make an array of 100 increasing numbers from 0.00 to 0.99, in steps of 0.01.
+     */
+    static double [] NORMALIZED_RANKS = DoubleStream.generate(IN_POINT_1_INC).limit(100).toArray();
+
+    /**
+     * Bins that roughly make sense for the data we're seeing here, arrived at after some trial and error.
+     */
+    static double [] BINS = new double [] {1, 5, 10, 15, 20, 25, 100, 1024, 5120, 10240, 20480, 51200, 102400, 1048576};
+
+    /**
+     * Size of row.
+     */
+    final UpdateDoublesSketch rowSizeSketch;
+
+    /**
+     * Count of columns in row.
+     */
+    final UpdateDoublesSketch columnCountSketch;
+
+    Sketches() {
+      this(DoublesSketch.builder().setK(256).build(), DoublesSketch.builder().setK(256).build());
+    }
+
+    Sketches(UpdateDoublesSketch rowSizeSketch, UpdateDoublesSketch columnCountSketch) {
+      this.rowSizeSketch = rowSizeSketch;
+      this.columnCountSketch = columnCountSketch;
+    }
+
+    void print(String preamble) {
+      System.out.println(preamble);
+      print();
+    }
+
+    void print() {
+      print("rowSize", rowSizeSketch);
+      print("columnCount", columnCountSketch);
+    }
+
+    private static void print(String label, final DoublesSketch sketch) {
+      System.out.println(label + " quantiles " + Arrays.toString(sketch.getQuantiles(NORMALIZED_RANKS)));
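+      // getPMF returns a normalized probability mass per bin (the masses sum to 1), so we
+      // multiply by N below to recover approximate per-bin counts.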
+      double [] pmfs = sketch.getPMF(BINS);
+      // System.out.println(label + " pmfs " + Arrays.toString(pmfs));
+      System.out.println(label + " histo " +
+          (pmfs == null || pmfs.length == 0?
+              "null": Arrays.toString(Arrays.stream(pmfs).map(d -> d * sketch.getN()).toArray())));
+      System.out.println(label + "stats N=" + sketch.getN() + ", min=" + sketch.getMinValue() + ", max=" +
+          sketch.getMaxValue());
+    }
+  }
+
+  /**
+   * For aggregating {@link Sketches} instances.
+   * To add sketches together, we need a DoublesUnion sketch.
+   */
+  static class AccumulatingSketch {
+    DoublesUnion rowSizeUnion = DoublesUnion.builder().build();
+    DoublesUnion columnSizeUnion = DoublesUnion.builder().build();
+
+    void add(Sketches other) {
+      this.rowSizeUnion.update(other.rowSizeSketch);
+      this.columnSizeUnion.update(other.columnCountSketch);
+    }
+
+    /**
+     * @return A Sketches made of current state of aggregation.
+     */
+    Sketches get() {
+      return new Sketches(rowSizeUnion.getResult(), columnSizeUnion.getResult());
+    }
+  }
+
+  static void processRowResult(Result result, Sketches sketches) {
+    // System.out.println(result.toString());
+    long rowSize = 0;
+    int columnCount = 0;
+    for (Cell cell : result.rawCells()) {
+      rowSize += estimatedSizeOfCell(cell);
+      columnCount += 1;
+    }
+    sketches.rowSizeSketch.update(rowSize);
+    sketches.columnCountSketch.update(columnCount);
+  }
+
+  /**
+   * @return First <code>fraction</code> of Table's regions.
+   */
+  private static List<RegionInfo> getRegions(Connection connection, TableName tableName,
+      double fraction, String encodedRegionName) throws IOException {
+    try (Admin admin = connection.getAdmin()) {
+      // Use deprecated API because running against old hbase.
+      List<RegionInfo> regions = admin.getRegions(tableName);
+      if (regions.size() <= 0) {
+        throw new HBaseIOException("No regions found in " + tableName);
+      }
+      if (encodedRegionName != null) {
+        return regions.stream().filter(ri -> ri.getEncodedName().equals(encodedRegionName)).
+            collect(Collectors.toCollection(ArrayList::new));
+      }
+      return regions.subList(0, (int)(regions.size() * fraction)); // Rounds down.
+    }
+  }
+
+  /**
+   * Class that scans a Region to produce a Sketch.
+   */
+  static class SketchRegion implements Callable<SketchRegion> {
+    private final RegionInfo ri;
+    private final Connection connection;
+    private final TableName tableName;
+    private final int limit;
+    private Sketches sketches = new Sketches();
+    private volatile long duration;
+
+    SketchRegion(Connection connection, TableName tableName, RegionInfo ri, int limit) {
+      this.ri = ri;
+      this.connection = connection;
+      this.tableName = tableName;
+      this.limit = limit;
+    }
+
+    @Override
+    public SketchRegion call() {
+      try (Table table = this.connection.getTable(this.tableName)) {
+        Scan scan = new Scan();
+        scan.setStartRow(this.ri.getStartKey());
+        scan.setStopRow(this.ri.getEndKey());
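+        // Note: with partial results allowed, a very wide row can come back as several
+        // Results and so be sketched as several "rows"; this trades some accuracy for
+        // not having to buffer huge rows whole.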
+        scan.setAllowPartialResults(true);
+        long startTime = System.currentTimeMillis();
+        long count = 0;
+        try (ResultScanner resultScanner = table.getScanner(scan)) {
+          for (Result result : resultScanner) {
+            processRowResult(result, sketches);
+            count++;
+            if (this.limit >= 0 && count >= this.limit) {
+              break;
+            }
+          }
+        }
+        this.duration = System.currentTimeMillis() - startTime;
+      } catch (IOException e) {
+        e.printStackTrace();
+      }
+      return this;
+    }
+
+    Sketches getSketches() {
+      return this.sketches;
+    }
+
+    RegionInfo getRegionInfo() {
+      return this.ri;
+    }
+
+    long getDuration() {
+      return this.duration;
+    }
+  }
+
+  private static void sketch(Configuration configuration, String tableNameAsStr, int limit,
+    double fraction, int threads, String isoNow, String encodedRegionName)
+  throws IOException, InterruptedException, ExecutionException {
+    TableName tableName = TableName.valueOf(tableNameAsStr);
+    AccumulatingSketch totalSketches = new AccumulatingSketch();
+    long startTime = System.currentTimeMillis();
+    int count = 0;
+    try (Connection connection = ConnectionFactory.createConnection(configuration)) {
+      // Get the list of Regions. If 'fraction', get that fraction of all Regions. If
+      // 'encodedRegionName' is set, scan just that Region; 'fraction' is then
+      // ignored.
+      List<RegionInfo> regions = getRegions(connection, tableName, fraction, encodedRegionName);
+      count = regions.size();
+      if (count <= 0) {
+        throw new HBaseIOException("Empty regions list; fraction " + fraction +
+            " too severe or communication problems?");
+      } else {
+        System.out.println(Instant.now().toString() + " Scanning " + tableNameAsStr +
+            " regions=" + count + ", " + regions);
+      }
+      ExecutorService es =
+          Executors.newFixedThreadPool(threads, new ThreadFactory() {
+            @Override
+            public Thread newThread(Runnable r) {
+              Thread t = new Thread(r);
+              t.setDaemon(true);
+              return t;
+            }
+          });
+      try {
+        List<SketchRegion> srs = regions.stream().map(ri -> new SketchRegion(connection, tableName, ri, limit)).
+            collect(Collectors.toList());
+        List<Future<SketchRegion>> futures = new ArrayList<>(srs.size());
+        for (SketchRegion sr: srs) {
+          // Use submit rather than invokeAll; invokeAll blocks until all are done. This way we get
+          // control back once all tasks are submitted.
+          futures.add(es.submit(sr));
+        }
+        // Avoid java.util.ConcurrentModificationException
+        List<Future<SketchRegion>> removals = new ArrayList<>();
+        while (!futures.isEmpty()) {
+          for (Future<SketchRegion> future: futures) {
+            if (future.isDone()) {
+              SketchRegion sr = future.get();
+              sr.getSketches().print(Instant.now().toString() +
+                  " region=" + sr.getRegionInfo().getRegionNameAsString() + ", duration=" +
+                  (Duration.ofMillis(sr.getDuration()).toString()));
+              totalSketches.add(sr.getSketches());
+              removals.add(future);
+            }
+          }
+          if (!removals.isEmpty()) {
+            futures.removeAll(removals);
+            removals.clear();
+          }
+          Thread.sleep(1000);
+        }
+      } finally {
+        es.shutdown();
+      }
+    }
+    Sketches sketches = totalSketches.get();
+    String isoDuration = Duration.ofMillis(System.currentTimeMillis() - startTime).toString();
+    sketches.print(Instant.now().toString() + " Totals for " + tableNameAsStr + " regions=" + count +
+        ", limit=" + limit + ", fraction=" + fraction + ", took=" + isoDuration);
+    // Dump out the gnuplot files. Saves time generating graphs.
+    dumpGnuplotDataFiles(isoNow, sketches, tableNameAsStr, count, isoDuration);
+  }
+
+  /**
+   * This is an estimate of the heap space occupied by a cell. When the cell is of type
+ * {@link HeapSize} we call {@link HeapSize#heapSize()} so the cell can give a correct value. In other
+ * cases we just consider the bytes occupied by the cell components, i.e. row, CF, qualifier,
+ * timestamp, type, value and tags.
+ * Note that this can be the JVM heap space (on-heap) or the OS heap (off-heap).
+   * @return estimate of the heap space
+   */
+  public static long estimatedSizeOfCell(final Cell cell) {
+    if (cell instanceof HeapSize) {
+      return ((HeapSize) cell).heapSize();
+    }
+    // TODO: Add sizing of references that hold the row, family, etc., arrays.
+    return estimatedSerializedSizeOf(cell);
+  }
+
+  /**
+   * Estimate based on keyvalue's serialization format in the RPC layer. Note that there is an extra
+   * SIZEOF_INT added to the size here that indicates the actual length of the cell for cases where
+ * cells are serialized in a contiguous format (e.g. in RPCs).
+   * @return Estimate of the <code>cell</code> size in bytes plus an extra SIZEOF_INT indicating the
+   *         actual cell length.
+   */
+  public static int estimatedSerializedSizeOf(final Cell cell) {
+    if (cell instanceof ExtendedCell) {
+      return ((ExtendedCell) cell).getSerializedSize(true) + Bytes.SIZEOF_INT;
+    }
+
+    return getSumOfCellElementLengths(cell) +
+        // Use the KeyValue's infrastructure size presuming that another implementation would have
+        // same basic cost.
+        KeyValue.ROW_LENGTH_SIZE + KeyValue.FAMILY_LENGTH_SIZE +
+        // Serialization is probably preceded by a length (it is in the KeyValueCodec at least).
+        Bytes.SIZEOF_INT;
+  }
+
+  /**
+   * @return Sum of the lengths of all the elements in a Cell; does not count in any infrastructure
+   */
+  private static int getSumOfCellElementLengths(final Cell cell) {
+    return getSumOfCellKeyElementLengths(cell) + cell.getValueLength() + cell.getTagsLength();
+  }
+
+  /**
+   * @return Sum of all elements that make up a key; does not include infrastructure, tags or
+   *         values.
+   */
+  private static int getSumOfCellKeyElementLengths(final Cell cell) {
+    return cell.getRowLength() + cell.getFamilyLength() + cell.getQualifierLength()
+        + KeyValue.TIMESTAMP_TYPE_SIZE;
+  }
+
+  private static String getFileNamePrefix(String isoNow, String tableName, String sketchName) {
+    return "reporter." + isoNow + "." + tableName + "." + sketchName;
+  }
+
+  private static String getFileFirstLine(String tableName, int regions, String isoDuration, UpdateDoublesSketch sketch) {
+    return "# " + tableName + " regions=" + regions + ", duration=" + isoDuration + ", N=" + sketch.getN() +
+        ", min=" + sketch.getMinValue() + ", max=" + sketch.getMaxValue();
+  }
+
+  private static void dumpPercentilesFile(String prefix, String firstLine, UpdateDoublesSketch sketch)
+      throws IOException {
+    dumpFile(File.createTempFile(prefix + ".percentiles.", GNUPLOT_DATA_SUFFIX),
+        firstLine, sketch.getQuantiles(Sketches.NORMALIZED_RANKS));
+  }
+
+  private static void dumpHistogramFile(String prefix, String firstLine, UpdateDoublesSketch sketch)
+      throws IOException {
+    double [] pmfs = sketch.getPMF(Sketches.BINS);
+    double [] ds = Arrays.stream(pmfs).map(d -> d * sketch.getN()).toArray();
+    dumpFile(File.createTempFile(prefix + ".histograms.", GNUPLOT_DATA_SUFFIX),
+        firstLine, ds);
+  }
+
+  private static void dumpFile(File file, String firstLine, double [] ds) throws IOException {
+    try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
+      writer.write(firstLine);
+      writer.newLine();
+      for (double d : ds) {
+        writer.write(Double.toString(d));
+        writer.newLine();
+      }
+    }
+    System.out.println(Instant.now().toString() + " wrote " + file.toString());
+  }
+
+  private static void dumpFiles(String prefix, String firstLine, UpdateDoublesSketch sketch) throws IOException {
+    dumpPercentilesFile(prefix, firstLine, sketch);
+    dumpHistogramFile(prefix, firstLine, sketch);
+  }
+
+  /**
+   * Write four files, a histogram and percentiles, one each for each of the row size and column count sketches.
+   * Tie the four files with isoNow time.
+   */
+  private static void dumpGnuplotDataFiles(String isoNow, Sketches sketches, String tableName, int regions,
+      String isoDuration) throws IOException {
+    UpdateDoublesSketch sketch = sketches.columnCountSketch;
+    dumpFiles(getFileNamePrefix(isoNow, tableName, "columnCount"),
+        getFileFirstLine(tableName, regions, isoDuration, sketch), sketch);
+    sketch = sketches.rowSizeSketch;
+    dumpFiles(getFileNamePrefix(isoNow, tableName, "rowSize"),
+        getFileFirstLine(tableName, regions, isoDuration, sketch), sketch);
+  }
+
+  static void usage(Options options) {
+    usage(options, null);
+  }
+
+  static void usage(Options options, String error) {
+    if (error != null) {
+      System.out.println("ERROR: " + error);
+    }
+    // HelpFormatter can't output -Dproperty=value.
+    // Options doesn't know how to process -D one=two...i.e.
+    // with a space between -D and the property-value... so
+    // take control of the usage output and output what
+    // Options can parse.
+    System.out.println("Usage: reporter <OPTIONS> TABLENAME");
+    System.out.println("OPTIONS:");
+    System.out.println(" -h,--help        Output this help message");
+    System.out.println(" -l,--limit       Scan row limit (per Region): default none");
+    System.out.println(" -f,--fraction    Fraction of table Regions to read; between 0 and 1: default 1.0 (all)");
+    System.out.println(" -r,--region      Scan this Region only; pass encoded name; 'fraction' is ignored.");
+    System.out.println(" -t,--threads     Concurrent thread count (thread per Region); default 1");
+    System.out.println(" -Dproperty=value Properties such as the zookeeper to connect to; e.g:");
+    System.out.println("                  -Dhbase.zookeeper.quorum=ZK0.remote.cluster.example.org");
+  }
+
+  public static void main(String [] args)
+  throws ParseException, IOException, ExecutionException, InterruptedException {
+    Options options = new Options();
+    Option help = Option.builder("h").longOpt("help").
+        desc("output this help message").build();
+    options.addOption(help);
+    Option limitOption = Option.builder("l").longOpt("limit").hasArg().build();
+    options.addOption(limitOption);
+    Option fractionOption = Option.builder("f").longOpt("fraction").hasArg().build();
+    options.addOption(fractionOption);
+    Option regionOption = Option.builder("r").longOpt("region").hasArg().build();
+    options.addOption(regionOption);
+    Option threadsOption = Option.builder("t").longOpt("threads").hasArg().build();
+    options.addOption(threadsOption);
+    Option configOption = Option.builder("D").valueSeparator().argName("property=value").
+        hasArgs().build();
+    options.addOption(configOption);
+    // Parse command-line.
+    CommandLineParser parser = new DefaultParser();
+    CommandLine commandLine = parser.parse(options, args);
+
+    // Process general options.
+    if (commandLine.hasOption(help.getOpt()) || commandLine.getArgList().isEmpty()) {
+      usage(options);
+      System.exit(0);
+    }
+
+    int limit = -1;
+    String opt = limitOption.getOpt();
+    if (commandLine.hasOption(opt)) {
+      limit = Integer.parseInt(commandLine.getOptionValue(opt));
+    }
+    double fraction = 1.0;
+    opt = fractionOption.getOpt();
+    if (commandLine.hasOption(opt)) {
+      fraction = Double.parseDouble(commandLine.getOptionValue(opt));
+      if (fraction > 1 || fraction <= 0) {
+        usage(options, "Bad fraction: " + fraction + "; fraction must be > 0 and <= 1");
+        System.exit(1);
+      }
+    }
+    int threads = 1;
+    opt = threadsOption.getOpt();
+    if (commandLine.hasOption(opt)) {
+      threads = Integer.parseInt(commandLine.getOptionValue(opt));
+      if (threads > 1000 || threads <= 0) {
+        usage(options, "Bad thread count: " + threads + "; must be > 0 and <= 1000");
+        System.exit(1);
+      }
+    }
+
+    String encodedRegionName = null;
+    opt = regionOption.getOpt();
+    if (commandLine.hasOption(opt)) {
+      encodedRegionName = commandLine.getOptionValue(opt);
+    }
+
+    Configuration configuration = HBaseConfiguration.create();
+    opt = configOption.getOpt();
+    if (commandLine.hasOption(opt)) {
+      // If many options, they all show up here in the keyValues
+      // array, one after the other.
+      String [] keyValues = commandLine.getOptionValues(opt);
+      for (int i = 0; i < keyValues.length;) {
+        configuration.set(keyValues[i], keyValues[i + 1]);
+        i += 2; // Skip over this key and value to next one.
+      }
+    }
+
+    // Now process commands.
+    String [] commands = commandLine.getArgs();
+    if (commands.length < 1) {
+      usage(options, "No TABLENAME: " + Arrays.toString(commands));
+      System.exit(1);
+    }
+
+    String now = Instant.now().toString();
+    for (String command : commands) {
+      sketch(configuration, command, limit, fraction, threads, now, encodedRegionName);
+    }
+  }
+}
diff --git a/hbase-table-reporter/src/main/resources/log4j.properties b/hbase-table-reporter/src/main/resources/log4j.properties
new file mode 100644
index 0000000..c322699
--- /dev/null
+++ b/hbase-table-reporter/src/main/resources/log4j.properties
@@ -0,0 +1,68 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Define some default values that can be overridden by system properties
+hbase.root.logger=INFO,console
+hbase.log.dir=.
+hbase.log.file=hbase.log
+
+# Define the root logger to the system property "hbase.root.logger".
+log4j.rootLogger=${hbase.root.logger}
+
+# Logging Threshold
+log4j.threshold=ALL
+
+#
+# Daily Rolling File Appender
+#
+log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
+log4j.appender.DRFA.File=${hbase.log.dir}/${hbase.log.file}
+
+# Rollover at midnight
+log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
+
+# 30-day backup
+#log4j.appender.DRFA.MaxBackupIndex=30
+log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
+# Debugging Pattern format
+log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %C{2}(%L): %m%n
+
+
+#
+# console
+# Add "console" to rootlogger above if you want to use this
+#
+log4j.appender.console=org.apache.log4j.ConsoleAppender
+log4j.appender.console.target=System.err
+log4j.appender.console.layout=org.apache.log4j.PatternLayout
+log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %C{2}(%L): %m%n
+
+# Custom Logging levels
+
+#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG
+
+log4j.logger.org.apache.hadoop=WARN
+log4j.logger.org.apache.zookeeper=ERROR
+log4j.logger.org.apache.hadoop.hbase=DEBUG
+
+#These settings are workarounds against spurious logs from the minicluster.
+#See HBASE-4709
+log4j.logger.org.apache.hadoop.metrics2.impl.MetricsConfig=WARN
+log4j.logger.org.apache.hadoop.metrics2.impl.MetricsSinkAdapter=WARN
+log4j.logger.org.apache.hadoop.metrics2.impl.MetricsSystemImpl=WARN
+log4j.logger.org.apache.hadoop.metrics2.util.MBeans=WARN
+# Enable this to get detailed connection error/retry logging.
+# log4j.logger.org.apache.hadoop.hbase.client.ConnectionImplementation=TRACE
diff --git a/hbase-table-reporter/src/test/java/org/apache/hbase/reporter/TestTableReporter.java b/hbase-table-reporter/src/test/java/org/apache/hbase/reporter/TestTableReporter.java
new file mode 100644
index 0000000..16e46c8
--- /dev/null
+++ b/hbase-table-reporter/src/test/java/org/apache/hbase/reporter/TestTableReporter.java
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hbase.reporter;
+
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.CellBuilderFactory;
+import org.apache.hadoop.hbase.CellBuilderType;
+import org.apache.hadoop.hbase.client.Result;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+
+public class TestTableReporter {
+  private static final byte [] CF = Bytes.toBytes("cf");
+  private static final byte [] Q = Bytes.toBytes("q");
+
+  private List<Cell> makeCells(byte [] row, int columns, int versions) {
+    List<Cell> cells = new ArrayList<Cell>(columns);
+    for (int j = 0; j < columns; j++) {
+      for (int k = versions; k > 0; k--) {
+        Cell cell = CellBuilderFactory.create(CellBuilderType.DEEP_COPY).
+            setRow(row).setFamily(CF).
+            setQualifier(Bytes.toBytes(j)).
+            setType(Cell.Type.Put).
+            setTimestamp(k).
+            setValue(row).build();
+        cells.add(cell);
+      }
+    }
+    return cells;
+  }
+
+  @Test
+  public void testSimpleSketching() {
+    TableReporter.Sketches sketches = new TableReporter.Sketches();
+    final int rows = 10;
+    final int columns = 3;
+    final int versions = 2;
+    for (int i = 0; i < rows; i++) {
+      TableReporter.processRowResult(Result.create(makeCells(Bytes.toBytes(i), columns, versions)), sketches);
+    }
+    sketches.print();
+    // Just check the column counts. The max quantile should equal columns * versions.
+    double [] columnCounts = sketches.columnCountSketch.getQuantiles(new double [] {1});
+    assertEquals(columnCounts.length, 1);
+    assertEquals((int)columnCounts[0], columns * versions);
+  }
+
+  @Test
+  public void testAddSketches() {
+    TableReporter.Sketches sketches = new TableReporter.Sketches();
+    final int rows = 10;
+    final int columns = 3;
+    final int versions = 2;
+    for (int i = 0; i < rows; i++) {
+      TableReporter.processRowResult(Result.create(makeCells(Bytes.toBytes(i), columns, versions)), sketches);
+    }
+    sketches.print();
+    TableReporter.Sketches sketches2 = new TableReporter.Sketches();
+    for (int i = 0; i < rows; i++) {
+      TableReporter.processRowResult(Result.create(makeCells(Bytes.toBytes(i), columns, versions)), sketches2);
+    }
+    sketches2.print();
+    TableReporter.AccumulatingSketch accumulator = new TableReporter.AccumulatingSketch();
+    accumulator.add(sketches);
+    accumulator.add(sketches2);
+    TableReporter.Sketches sum = accumulator.get();
+
+    // Just check the column counts. The max quantile should equal columns * versions.
+    double [] columnCounts = sum.columnCountSketch.getQuantiles(new double [] {1});
+    assertEquals(columnCounts.length, 1);
+    assertEquals((int)columnCounts[0], columns * versions);
+    sum.print();
+  }
+}
\ No newline at end of file
diff --git a/pom.xml b/pom.xml
index 74c6403..1c31a01 100644
--- a/pom.xml
+++ b/pom.xml
@@ -54,6 +54,7 @@
     </license>
   </licenses>
   <modules>
+    <module>hbase-table-reporter</module>
     <module>hbase-hbck2</module>
     <!--Add an assembly module because of http://maven.apache.org/plugins/maven-assembly-plugin/faq.html#module-binaries
          -->