Posted to commits@beam.apache.org by bo...@apache.org on 2019/11/11 21:30:38 UTC
[beam] branch master updated: Add HllCount to Java transform catalog
This is an automated email from the ASF dual-hosted git repository.
boyuanz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 043fecd Add HllCount to Java transform catalog
new 063ee99 Merge pull request #9793 from robinyqiu/tmp
043fecd is described below
commit 043fecd61b72977902baaaa9f2f401a19936190f
Author: Yueyang Qiu <ro...@gmail.com>
AuthorDate: Mon Oct 14 15:37:55 2019 -0700
Add HllCount to Java transform catalog
---
.../src/_includes/section-menu/documentation.html | 1 +
.../java/aggregation/approximateunique.md | 2 +
.../transforms/java/aggregation/hllcount.md | 77 ++++++++++++++++++++++
website/src/documentation/transforms/java/index.md | 1 +
4 files changed, 81 insertions(+)
diff --git a/website/src/_includes/section-menu/documentation.html b/website/src/_includes/section-menu/documentation.html
index 529c264e..ed776ea 100644
--- a/website/src/_includes/section-menu/documentation.html
+++ b/website/src/_includes/section-menu/documentation.html
@@ -215,6 +215,7 @@
<li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/distinct/">Distinct</a></li>
<li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/groupbykey/">GroupByKey</a></li>
<li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/groupintobatches/">GroupIntoBatches</a></li>
+ <li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/hllcount/">HllCount</a></li>
<li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/latest/">Latest</a></li>
<li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/max/">Max</a></li>
<li><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/mean/">Mean</a></li>
diff --git a/website/src/documentation/transforms/java/aggregation/approximateunique.md b/website/src/documentation/transforms/java/aggregation/approximateunique.md
index 9b3e6d0..448c0ee 100644
--- a/website/src/documentation/transforms/java/aggregation/approximateunique.md
+++ b/website/src/documentation/transforms/java/aggregation/approximateunique.md
@@ -35,6 +35,8 @@ of key-value pairs.
See [BEAM-7703](https://issues.apache.org/jira/browse/BEAM-7703) for updates.
## Related transforms
+* [HllCount]({{ site.baseurl }}/documentation/transforms/java/aggregation/hllcount)
+ estimates the number of distinct elements and creates re-aggregatable sketches using the HyperLogLog++ algorithm.
* [Count]({{ site.baseurl }}/documentation/transforms/java/aggregation/count)
counts the number of elements within each aggregation.
* [Distinct]({{ site.baseurl }}/documentation/transforms/java/aggregation/distinct)
\ No newline at end of file
diff --git a/website/src/documentation/transforms/java/aggregation/hllcount.md b/website/src/documentation/transforms/java/aggregation/hllcount.md
new file mode 100644
index 0000000..506a8dc
--- /dev/null
+++ b/website/src/documentation/transforms/java/aggregation/hllcount.md
@@ -0,0 +1,77 @@
+---
+layout: section
+title: "HllCount"
+permalink: /documentation/transforms/java/aggregation/hllcount/
+section_menu: section-menu/documentation.html
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# HllCount
+<table align="left">
+ <a target="_blank" class="button"
+ href="https://beam.apache.org/releases/javadoc/current/index.html?org/apache/beam/sdk/extensions/zetasketch/HllCount.html">
+ <img src="https://beam.apache.org/images/logos/sdks/java.png" width="20px" height="20px"
+ alt="Javadoc" />
+ Javadoc
+ </a>
+</table>
+<br>
+
+Estimates the number of distinct elements in a data stream using the
+[HyperLogLog++ algorithm](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf).
+The respective transforms to create and merge sketches, and to extract from them, are:
+
+* `HllCount.Init` aggregates inputs into HLL++ sketches.
+* `HllCount.MergePartial` merges HLL++ sketches into a new sketch.
+* `HllCount.Extract` extracts the estimated count of distinct elements from HLL++ sketches.
+
+You can read more about what a sketch is at https://github.com/google/zetasketch.
+
+## Examples
+**Example 1**: creates a long-type sketch for a `PCollection<Long>` with a custom precision:
+```java
+ PCollection<Long> input = ...;
+ int p = ...;
+ PCollection<byte[]> sketch = input.apply(HllCount.Init.forLongs().withPrecision(p).globally());
+```
+
+**Example 2**: creates a bytes-type sketch for a `PCollection<KV<String, byte[]>>`:
+```java
+ PCollection<KV<String, byte[]>> input = ...;
+ PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.forBytes().perKey());
+```
+
+**Example 3**: merges existing sketches in a `PCollection<byte[]>` into a new sketch,
+which summarizes the union of the inputs that were aggregated in the merged sketches:
+```java
+ PCollection<byte[]> sketches = ...;
+ PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());
+```
+
+**Example 4**: estimates the count of distinct elements in a `PCollection<String>`:
+```java
+ PCollection<String> input = ...;
+ PCollection<Long> countDistinct =
+     input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());
+```
+
+**Example 5**: extracts the count distinct estimate from an existing sketch:
+```java
+ PCollection<byte[]> sketch = ...;
+ PCollection<Long> countDistinct = sketch.apply(HllCount.Extract.globally());
+```
+
+## Related transforms
+* [ApproximateUnique]({{ site.baseurl }}/documentation/transforms/java/aggregation/approximateunique)
+ estimates the number of distinct elements or values in key-value pairs (but does not expose sketches; also less accurate than `HllCount`).
\ No newline at end of file
diff --git a/website/src/documentation/transforms/java/index.md b/website/src/documentation/transforms/java/index.md
index b36e305..71b3721 100644
--- a/website/src/documentation/transforms/java/index.md
+++ b/website/src/documentation/transforms/java/index.md
@@ -58,6 +58,7 @@ limitations under the License.
<tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/groupbykey">GroupByKey</a></td><td>Takes a keyed collection of elements and produces a collection where each element
consists of a key and all values associated with that key.</td></tr>
<tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/groupintobatches">GroupIntoBatches</a></td><td>Batches values associated with keys into <code>Iterable</code> batches of some size. Each batch contains elements associated with a specific key.</td></tr>
+ <tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/hllcount">HllCount</a></td><td>Estimates the number of distinct elements and creates re-aggregatable sketches using the HyperLogLog++ algorithm.</td></tr>
<tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/latest">Latest</a></td><td>Selects the latest element within each aggregation according to the implicit timestamp.</td></tr>
<tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/max">Max</a></td><td>Outputs the maximum element within each aggregation.</td></tr>
<tr><td><a href="{{ site.baseurl }}/documentation/transforms/java/aggregation/mean">Mean</a></td><td>Computes the average within each aggregation.</td></tr>