You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by le...@apache.org on 2020/02/24 06:44:27 UTC
[incubator-datasketches-website] 05/06: Update ToC
This is an automated email from the ASF dual-hosted git repository.
leerho pushed a commit to branch Update
in repository https://gitbox.apache.org/repos/asf/incubator-datasketches-website.git
commit 07979a343884452b16eba75e3ae9fd3100780dba
Author: Lee Rhodes <le...@users.noreply.github.com>
AuthorDate: Sun Feb 23 21:55:11 2020 -0800
Update ToC
---
docs/Architecture/KeyFeatures.md | 27 ++---
docs/CommandLine/CommandLine-old.md | 198 ------------------------------------
docs/CommandLine/CommandLine.md | 23 -----
src/main/resources/docgen/toc.json | 19 +---
4 files changed, 19 insertions(+), 248 deletions(-)
diff --git a/docs/Architecture/KeyFeatures.md b/docs/Architecture/KeyFeatures.md
index 3fe4d3f..3f52935 100644
--- a/docs/Architecture/KeyFeatures.md
+++ b/docs/Architecture/KeyFeatures.md
@@ -19,11 +19,13 @@ layout: doc_page
specific language governing permissions and limitations
under the License.
-->
-<h2>Key Features</h2>
+## Key Features
-<h3>Common Sketch Properties</h3>
+### [Sketch Features Matrix]({{site.docs_dir}}/Architecture/FeatureMatrix)
- * Please refer to the [Sketch Criteria]({{site.docs_dir}}/Architecture/SketchCriteria.html) for all sketches in the library.
+### Common Sketch Properties
+
+ * Please refer to the [Sketch Criteria]({{site.docs_dir}}/Architecture/SketchCriteria.html) for the criteria for sketches to be included in the library.
* Query results are <b>approximate</b> but within well defined error bounds that are user
configurable by trading off sketch size with accuracy.
* Designed for <a href="{{site.docs_dir}}/LargeScale.html">Large-scale</a> computing environments
@@ -39,7 +41,7 @@ and are heavily used within Yahoo / Verizon-Media.
* Comprehensive <b>unit tests</b> and testing tools are provided.
* Extensive documentation with the systems developer in mind.
-<h3>Built-In, General Purpose Functions</h3>
+### Built-In, General Purpose Functions
* General purpose <a href="{{site.docs_dir}}/Memory/MemoryPackage.html">Memory Package</a> for managing data off the Java Heap.
This enables systems designers the ability to manage their own large data heaps with
@@ -48,7 +50,7 @@ its garbage collection.
* General purpose implementaion of Austin Appleby's 128-bit MurmurHash3 algorithm,
with a number of useful extensions.
-<h3>Robust, High Quality Implementations.</h3>
+### Robust, High Quality Implementations.
* Extensive test code leveraging <a href="https://testng.org">TestNG</a>.
* Speed and accuracy performance characterization testing code
@@ -63,17 +65,16 @@ its garbage collection.
<a href="https://www.oracle.com/technetwork/java/index.html">Java JDK8</a> standards.
* Suitable for production environments.
-<h3>Opportunities to Extend</h3>
+### Opportunities to Extend
* There is ample opportunity for interested parties to contribute additional algorithms in this exciting area.
+## Key Algorithms
-<h2>Key Algorithms</h2>
-
-<h3>Count Distinct / Count Unique</h3>
+### Count Distinct / Count Unique
-<h4>Solves Computational Challenges Associated with Unique Identifiers</h4>
+#### Solves Computational Challenges Associated with Unique Identifiers
* <b>Estimating cardinality</b> of a stream with many duplicates
* Performing <a href="{{site.docs_dir}}/Theta/ThetaSketchSetOps.html">set operations</a> (e.g., Union, Intersection,
@@ -84,17 +85,17 @@ its garbage collection.
for operation on the java heap or off-heap.
* <a href="{{site.docs_dir}}/HLL/HLL.html">The Hyper-Log Log algorithms</a> when sketch size is of utmost concern.
-<h3>Quantiles</h3>
+### Quantiles
* Get normal or inverse PDFs or CDFs of the distributions of any numeric value from your raw data in a
single pass.
* Well defined error bounds on the result.
-<h3>Frequent Items</h3>
+### Frequent Items
* Get the most frequent items from a stream of items.
-<h3>Tuple Sketch</h3>
+### Tuple Sketch
* Associative sketches that are useful for performing approximate join operations and
extracting other kinds of behavior associated with unique identifiers.
diff --git a/docs/CommandLine/CommandLine-old.md b/docs/CommandLine/CommandLine-old.md
deleted file mode 100644
index e693f2e..0000000
--- a/docs/CommandLine/CommandLine-old.md
+++ /dev/null
@@ -1,198 +0,0 @@
----
-layout: doc_page
----
-<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-<b>Note that these instructions work on unix-based systems including macs. Windows systems will
-need something similar.</b>
-
-<h2>Creating the command-line <i>sketch</i></h2>
-
-<h3>Clone and install <i>sketches-core</i> and <i>sketches-misc</i></h3>
- * Clone sketches-core and sketches-misc repositories into separate directories on your system.
- * Run <i>mvn install -DskipTests -Dgpg.skip</i> on both directories. This will download JAR
- files into your local .m2/repository. Take note of the version numbers for the
- sketches-core-X.Y.Z.jar and the sketches-misc-X.Y.Z.jar
-
-<h3>Create <i>sketch</i> executable</h3>
-
-Place the following in an empty text file called "sketch" and update the version numbers and the
-path to your local <i>.m2/repository</i> directory:
-
- #!/bin/bash
- # Update version numbers and the path to your local .m2/repository as necessary
-
- COREVER="0.5.2"
- MISCVER="0.1.0"
- M2PATH="/path/to/.m2/repository"
-
- COREPATH="$M2PATH/com/yahoo/datasketches/sketches-core/$COREVER/sketches-core-$COREVER.jar"
- MISCPATH="$M2PATH/com/yahoo/datasketches/sketches-misc/$MISCVER/sketches-misc-$MISCVER.jar"
- CLSPATH="$COREPATH:$MISCPATH"
-
- java -cp $CLSPATH org.apache.datasketches.cmd.CommandLine $@
-
-Move this <i>sketch</i> file to a local system directory accessible from anywhere in your system,
-and make it executable.
-
- cp sketch /usr/local/bin/sketch
- chmod +x /usr/local/bin/sketch
-
-Test your executable. You should see something like the following:
-
- sketch
-
- NAME
- sketch - sketch Uniques, Quantiles, Histograms, or Frequent Items.
- SYNOPSIS
- sketch (this help)
- sketch TYPE help
- sketch TYPE [SIZE] [FILE]
- DESCRIPTION
- Write a sketch(TYPE, SIZE) of FILE to standard output.
- TYPE is required.
- If SIZE is omitted, internal defaults are used.
- If FILE is omitted, Standard In is assumed.
- TYPE DESCRIPTION
- sketch uniq : Sketch the unique string items of a stream.
- sketch rank : Sketch the rank-value distribution of a numeric value stream.
- sketch hist : Sketch the linear-axis value-frequency distribution of numeric value stream.
- sketch loghist : Sketch the log-axis value-frequency distribution of numeric value stream.
- sketch freq : Sketch the Heavy Hitters of a string item stream.
-
- UNIQ SYNOPSIS
- sketch uniq help
- sketch uniq [SIZE] [FILE]
-
- RANK SYNOPSIS
- sketch rank help
- sketch rank [SIZE] [FILE]
-
- HIST SYNOPSIS
- sketch hist help
- sketch hist [SIZE] [FILE]
-
- LOGHIST SYNOPSIS
- sketch loghist help
- sketch loghist [SIZE] [FILE]
-
- FREQ SYNOPSIS
- sketch freq help
- sketch freq [SIZE] [FILE]
-
-You can create a test data file, with duplicate values, like this:
-
- $ python -c "exec(\"import random\\nfor _ in range(10000000): print random.randint(1,10000000)\")" > manyNumbers.txt
-
-Now you can do either something like this:
-
- $ cat manyNumbers.txt | sketch uniq
- or
- $ cat manyNumbers.txt | sketch uniq 16000
-
-or like this:
-
- $ sketch uniq manyNumbers.txt
- or
- $ sketch uniq 16000 manyNumbers.txt
-
-Providing the size allows you to tune the accuracy.
-
-Be sure to compare the speed of the above to the conventional method:
-
- $ cat manyNumbers.txt | sort | uniq | wc -l
-
-<h2>Creating the command-line <i>demo</i></h2>
-
-If you haven't already, clone and install <i>sketches-core</i> and <i>sketches-misc</i> as in the
-previous example.
-
-Create the <i>demo</i> executable with the same content as the <i>sketch</i> executable except
-for the last line:
-
- java -cp $CLSPATH org.apache.datasketches.demo.ExactVsSketchDemo $@
-
-Move this <i>demo</i> file to a local system directory accessible from anywhere in your system,
-and make it executable.
-
- cp demo /usr/local/bin/demo
- chmod +x /usr/local/bin/demo
-
-When run, the output should look something like this:
-
- demo
-
- # COMPUTE DISTINCT COUNT EXACTLY:
- ## BUILD FILE:
- Time Min:Sec.mSec = 0:17.569
- Total Values: 100,000,000
- Build Rate: 175 nSec/Value
- Exact Uniques: 50,002,776
- File Size Bytes: 1,693,331,301
-
- ## SORT & REMOVE DUPLICATES
- Unix cmd: sort -u -o tmp/sorted.txt tmp/test.txt
- Time Min:Sec.mSec = 1:49.571
-
- ## LINE COUNT
- Unix cmd: wc -l tmp/sorted.txt
- Time Min:Sec.mSec = 0:00.900
- Output from wc command:
- 50002776 tmp/sorted.txt
-
- Total Exact Time Min:Sec.mSec = 2:08.040
-
-
- # COMPUTE DISTINCT COUNT USING SKETCHES
- ## USING THETA SKETCH
- Time Min:Sec.mSec = 0:00.614
- Total Values: 100,000,000
- Build Rate: 6 nSec/Value
- Exact Uniques: 50,002,776
- ## SKETCH STATS
- Sketch Estimate of Uniques: 50,098,990
- Sketch Actual Relative Error: 0.192%
- Sketch 95%ile Error Bounds : +/- 1.563%
- Max Sketch Size Bytes: 262,144
- Speedup Factor 208.5
-
- ## USING HLL SKETCH
- Time Min:Sec.mSec = 0:02.212
- Total Values: 100,000,000
- Build Rate: 22 nSec/Value
- Exact Uniques: 50,002,776
- ## SKETCH STATS
- Sketch Estimate of Uniques: 49,784,556
- Sketch Actual Relative Error: -0.436%
- Sketch 95%ile Error Bounds : +/- 1.306%
- Max Sketch Size Bytes: 8,192
- Speedup Factor 57.9
-
-The first part builds a file, separately timed, of 100M numbers with roughly 50% duplicates.
-The second part sorts and removes duplicates using the unix <i>sort -u</i> command and may take
-several minutes to run, so be patient. The third part does a line count using the unix <i>wc -l</i>
-command.
-
-After that, two different sketch trials are run, one with a <i>Theta Sketch</i> and the
-other with a compact implementation of Flajolet's <i>HLL Sketch</i>. Sketches do not require a
-pre-built file. They run in true streaming mode with the random values generated on the fly.
-
-Check out the statistics!
-
-Enjoy!
diff --git a/docs/CommandLine/CommandLine.md b/docs/CommandLine/CommandLine.md
deleted file mode 100644
index 46f2a64..0000000
--- a/docs/CommandLine/CommandLine.md
+++ /dev/null
@@ -1,23 +0,0 @@
----
-layout: doc_page
----
-<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
--->
-<b>Note: The command-line utility is obsolete and has been removed. A completely new CL utility is in development and should be released soon.</b>
-
diff --git a/src/main/resources/docgen/toc.json b/src/main/resources/docgen/toc.json
index 1c99ccf..6b0659e 100644
--- a/src/main/resources/docgen/toc.json
+++ b/src/main/resources/docgen/toc.json
@@ -8,10 +8,13 @@
{"class":"Doc", "desc" : "Sketch Elements", "dir" : "", "file": "SketchElements" },
{"class":"Doc", "desc" : "Large Scale Computing", "dir" : "", "file": "LargeScale" },
+ {"class":"Doc", "desc" : "Overview Slide Deck", "dir" : "", "file": "DataSketches_deck", "pdf":"true" },
+ {"class":"Doc", "desc" : "Who Uses", "dir" : "", "file": "WhoUses" },
- { "class":"Dropdown", "desc" : "Architecture", "array":
+ { "class":"Dropdown", "desc" : "Architecture & Design", "array":
[
- {"class":"Doc", "desc" : "Key Features", "dir" : "Architecture", "file": "KeyFeatures" },
+ {"class":"Doc", "desc" : "Key Features", "dir" : "Architecture", "file": "KeyFeatures" },
+ {"class":"Doc", "desc" : "Sketch Feature Matrix", "dir" : "Architecture", "file": "FeatureMatrix" },
{"class":"Doc", "desc" : "Components", "dir" : "Architecture", "file": "Components" },
{"class":"Doc", "desc" : "Sketches by Component", "dir" : "Architecture", "file": "SketchesByComponent" },
{"class":"Doc", "desc" : "Sketch Criteria", "dir" : "Architecture", "file": "SketchCriteria" },
@@ -19,7 +22,6 @@
{"class":"Doc", "desc" : "Notes on Concurrency", "dir" : "Architecture", "file": "Concurrency" },
]
},
- {"class":"Doc", "desc" : "Overview Slide Deck", "dir" : "", "file": "DataSketches_deck", "pdf":"true" },
]
},
{ "class":"Dropdown", "desc" : "Community", "array":
@@ -230,17 +232,6 @@
]
},
- { "class":"Dropdown", "desc" : "Command Line", "array":
- [
- {"class":"Doc", "desc" : "Creating Command Line Executables", "dir" : "CommandLine", "file": "CommandLine" },
- ]
- },
-
- { "class":"Dropdown", "desc" : "Who Uses", "array":
- [
- {"class":"Doc", "desc" : "Who Uses", "dir" : "", "file": "WhoUses" },
- ]
- },
]
}
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org