You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by bo...@apache.org on 2019/11/11 21:30:38 UTC

[beam] branch master updated: Add HllCount to Java transform catalog

This is an automated email from the ASF dual-hosted git repository.

boyuanz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 043fecd  Add HllCount to Java transform catalog
     new 063ee99  Merge pull request #9793 from robinyqiu/tmp
043fecd is described below

commit 043fecd61b72977902baaaa9f2f401a19936190f
Author: Yueyang Qiu <ro...@gmail.com>
AuthorDate: Mon Oct 14 15:37:55 2019 -0700

    Add HllCount to Java transform catalog
---
 .../src/_includes/section-menu/documentation.html  |  1 +
 .../java/aggregation/approximateunique.md          |  2 +
 .../transforms/java/aggregation/hllcount.md        | 77 ++++++++++++++++++++++
 website/src/documentation/transforms/java/index.md |  1 +
 4 files changed, 81 insertions(+)

diff --git a/website/src/_includes/section-menu/documentation.html b/website/src/_includes/section-menu/documentation.html
index 529c264e..ed776ea 100644
--- a/website/src/_includes/section-menu/documentation.html
+++ b/website/src/_includes/section-menu/documentation.html
@@ -215,6 +215,7 @@
           <li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/distinct/">Distinct</a></li>
           <li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/groupbykey/">GroupByKey</a></li>
           <li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/groupintobatches/">GroupIntoBatches</a></li>
+          <li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/hllcount/">HllCount</a></li>
           <li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/latest/">Latest</a></li>
           <li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/max/">Max</a></li>
           <li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/mean/">Mean</a></li>
diff --git a/website/src/documentation/transforms/java/aggregation/approximateunique.md b/website/src/documentation/transforms/java/aggregation/approximateunique.md
index 9b3e6d0..448c0ee 100644
--- a/website/src/documentation/transforms/java/aggregation/approximateunique.md
+++ b/website/src/documentation/transforms/java/aggregation/approximateunique.md
@@ -35,6 +35,8 @@ of key-value pairs.
 See [BEAM-7703](https://issues.apache.org/jira/browse/BEAM-7703) for updates.
 
 ## Related transforms 
+* [HllCount]({{ site.baseurl }}/documentation/transforms/java/aggregation/hllcount)
+  estimates the number of distinct elements and creates re-aggregatable sketches using the HyperLogLog++ algorithm.
 * [Count]({{ site.baseurl }}/documentation/transforms/java/aggregation/count)
   counts the number of elements within each aggregation.
 * [Distinct]({{ site.baseurl }}/documentation/transforms/java/aggregation/distinct)
\ No newline at end of file
diff --git a/website/src/documentation/transforms/java/aggregation/hllcount.md b/website/src/documentation/transforms/java/aggregation/hllcount.md
new file mode 100644
index 0000000..506a8dc
--- /dev/null
+++ b/website/src/documentation/transforms/java/aggregation/hllcount.md
@@ -0,0 +1,77 @@
+---
+layout: section
+title: "HllCount"
+permalink: /documentation/transforms/java/aggregation/hllcount/
+section_menu: section-menu/documentation.html
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Latest
+<table align="left">
+    <a target="_blank" class="button"
+        href="https://beam.apache.org/releases/javadoc/current/index.html?org/apache/beam/sdk/extensions/zetasketch/HllCount.html">
+      <img src="https://beam.apache.org/images/logos/sdks/java.png" width="20px" height="20px"
+           alt="Javadoc" />
+     Javadoc
+    </a>
+</table>
+<br>
+
+Estimates the number of distinct elements in a data stream using the
+[HyperLogLog++ algorithm](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf).
+The respective transforms to create and merge sketches, and to extract from them, are:
+
+* `HllCount.Init` aggregates inputs into HLL++ sketches.
+* `HllCount.MergePartial` merges HLL++ sketches into a new sketch.
+* `HllCount.Extract` extracts the estimated count of distinct elements from HLL++ sketches.
+
+You can read more about what a sketch is at https://github.com/google/zetasketch.
+
+## Examples
+**Example 1**: creates a long-type sketch for a `PCollection<Long>` with a custom precision:
+```java
+ PCollection<Long> input = ...;
+ int p = ...;
+ PCollection<byte[]> sketch = input.apply(HllCount.Init.forLongs().withPrecision(p).globally());
+```
+
+**Example 2**: creates a bytes-type sketch for a `PCollection<KV<String, byte[]>>`:
+```java
+ PCollection<KV<String, byte[]>> input = ...;
+ PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.forBytes().perKey());
+```
+
+**Example 3**: merges existing sketches in a `PCollection<byte[]>` into a new sketch,
+which summarizes the union of the inputs that were aggregated in the merged sketches:
+```java
+ PCollection<byte[]> sketches = ...;
+ PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());
+```
+
+**Example 4**: estimates the count of distinct elements in a `PCollection<String>`:
+```java
+ PCollection<String> input = ...;
+ PCollection<Long> countDistinct =
+     input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());
+```
+
+**Example 5**: extracts the count distinct estimate from an existing sketch:
+```java
+ PCollection<byte[]> sketch = ...;
+ PCollection<Long> countDistinct = sketch.apply(HllCount.Extract.globally());
+```
+
+## Related transforms
+* [ApproximateUnique]({{ site.baseurl }}/documentation/transforms/java/aggregation/approximateunique)
+  estimates the number of distinct elements or values in key-value pairs (but does not expose sketches; also less accurate than `HllCount`).
\ No newline at end of file
diff --git a/website/src/documentation/transforms/java/index.md b/website/src/documentation/transforms/java/index.md
index b36e305..71b3721 100644
--- a/website/src/documentation/transforms/java/index.md
+++ b/website/src/documentation/transforms/java/index.md
@@ -58,6 +58,7 @@ limitations under the License.
   <tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/groupbykey">GroupByKey</a></td><td>Takes a keyed collection of elements and produces a collection where each element 
   consists of a key and all values associated with that key.</td></tr>
   <tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/groupintobatches">GroupIntoBatches</a></td><td>Batches values associated with keys into <code>Iterable</code> batches of some size. Each batch contains elements associated with a specific key.</td></tr>
+  <tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/hllcount">HllCount</a></td><td>Estimates the number of distinct elements and creates re-aggregatable sketches using the HyperLogLog++ algorithm.</td></tr>
   <tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/latest">Latest</a></td><td>Selects the latest element within each aggregation according to the implicit timestamp.</td></tr>
   <tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/max">Max</a></td><td>Outputs the maximum element within each aggregation.</td></tr>  
   <tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/mean">Mean</a></td><td>Computes the average within each aggregation.</td></tr>