Posted to commits@accumulo.apache.org by mm...@apache.org on 2018/04/26 17:54:13 UTC
[accumulo-examples] branch master updated: Update bloom filters
example (#25)
This is an automated email from the ASF dual-hosted git repository.
mmiller pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/accumulo-examples.git
The following commit(s) were added to refs/heads/master by this push:
new 114e507 Update bloom filters example (#25)
114e507 is described below
commit 114e50730276c092e2ffa2bc1abcebb96420ef46
Author: Mike Miller <mm...@apache.org>
AuthorDate: Thu Apr 26 13:54:11 2018 -0400
Update bloom filters example (#25)
---
docs/bloom.md | 217 ++++++---------------
.../accumulo/examples/bloom/BloomBatchScanner.java | 92 +++++++++
.../accumulo/examples/bloom/BloomFilters.java | 77 ++++++++
.../examples/bloom/BloomFiltersNotFound.java | 47 +++++
4 files changed, 271 insertions(+), 162 deletions(-)
diff --git a/docs/bloom.md b/docs/bloom.md
index c5549b0..528bbb5 100644
--- a/docs/bloom.md
+++ b/docs/bloom.md
@@ -16,183 +16,59 @@ limitations under the License.
-->
# Apache Accumulo Bloom Filter Example
-This example shows how to create a table with bloom filters enabled. It also
+This example shows how to create a table with bloom filters enabled. The second part
shows how bloom filters increase query performance when looking for values that
do not exist in a table.
-Below table named bloom_test is created and bloom filters are enabled.
-
- $ accumulo shell -u username -p password
- Shell - Apache Accumulo Interactive Shell
- - version: 1.5.0
- - instance name: instance
- - instance id: 00000000-0000-0000-0000-000000000000
- -
- - type 'help' for a list of available commands
- -
- username@instance> setauths -u username -s exampleVis
- username@instance> createtable bloom_test
- username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
- username@instance bloom_test> exit
-
-Below 1 million random values are inserted into accumulo. The randomly
-generated rows range between 0 and 1 billion. The random number generator is
-initialized with the seed 7.
-
- $ ./bin/runex client.RandomBatchWriter --seed 7 -c ./examples.conf -t bloom_test --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60 --batchThreads 3 --vis exampleVis
-
-Below the table is flushed:
-
- $ accumulo shell -u username -p password -e 'flush -t bloom_test -w'
- 05 10:40:06,069 [shell.Shell] INFO : Flush of table bloom_test completed.
-
-After the flush completes, 500 random queries are done against the table. The
-same seed is used to generate the queries, therefore everything is found in the
-table.
-
- $ ./bin/runex client.RandomBatchScanner --seed 7 -c ./examples.conf -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
- Generating 500 random queries...finished
- 96.19 lookups/sec 5.20 secs
- num results : 500
- Generating 500 random queries...finished
- 102.35 lookups/sec 4.89 secs
- num results : 500
-
-Below another 500 queries are performed, using a different seed which results
-in nothing being found. In this case the lookups are much faster because of
-the bloom filters.
-
- $ ./bin/runex client.RandomBatchScanner --seed 8 -c ./examples.conf -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 -batchThreads 20 -auths exampleVis
- Generating 500 random queries...finished
- 2212.39 lookups/sec 0.23 secs
- num results : 0
- Did not find 500 rows
- Generating 500 random queries...finished
- 4464.29 lookups/sec 0.11 secs
- num results : 0
- Did not find 500 rows
-
-********************************************************************************
-
-Bloom filters can also speed up lookups for entries that exist. In accumulo
-data is divided into tablets and each tablet has multiple map files. Every
-lookup in accumulo goes to a specific tablet where a lookup is done on each
-map file in the tablet. So if a tablet has three map files, lookup performance
-can be three times slower than a tablet with one map file. However if the map
-files contain unique sets of data, then bloom filters can help eliminate map
-files that do not contain the row being looked up. To illustrate this two
-identical tables were created using the following process. One table had bloom
-filters, the other did not. Also the major compaction ratio was increased to
-prevent the files from being compacted into one file.
-
- * Insert 1 million entries using RandomBatchWriter with a seed of 7
- * Flush the table using the shell
- * Insert 1 million entries using RandomBatchWriter with a seed of 8
- * Flush the table using the shell
- * Insert 1 million entries using RandomBatchWriter with a seed of 9
- * Flush the table using the shell
-
-After following the above steps, each table will have a tablet with three map
-files. Flushing the table after each batch of inserts will create a map file.
-Each map file will contain 1 million entries generated with a different seed.
-This is assuming that Accumulo is configured with enough memory to hold 1
-million inserts. If not, then more map files will be created.
-
-The commands for creating the first table without bloom filters are below.
-
- $ accumulo shell -u username -p password
- Shell - Apache Accumulo Interactive Shell
- - version: 1.5.0
- - instance name: instance
- - instance id: 00000000-0000-0000-0000-000000000000
- -
- - type 'help' for a list of available commands
- -
- username@instance> setauths -u username -s exampleVis
- username@instance> createtable bloom_test1
- username@instance bloom_test1> config -t bloom_test1 -s table.compaction.major.ratio=7
- username@instance bloom_test1> exit
-
- $ ARGS="-c ./examples.conf -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60 --batchThreads 3 --vis exampleVis"
- $ ./bin/runex client.RandomBatchWriter --seed 7 $ARGS
- $ accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
- $ ./bin/runex client.RandomBatchWriter --seed 8 $ARGS
- $ accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
- $ ./bin/runex client.RandomBatchWriter --seed 9 $ARGS
- $ accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
-
-The commands for creating the second table with bloom filers are below.
-
- $ accumulo shell -u username -p password
- Shell - Apache Accumulo Interactive Shell
- - version: 1.5.0
- - instance name: instance
- - instance id: 00000000-0000-0000-0000-000000000000
- -
- - type 'help' for a list of available commands
- -
- username@instance> setauths -u username -s exampleVis
- username@instance> createtable bloom_test2
- username@instance bloom_test2> config -t bloom_test2 -s table.compaction.major.ratio=7
- username@instance bloom_test2> config -t bloom_test2 -s table.bloom.enabled=true
- username@instance bloom_test2> exit
-
- $ ARGS="-c ./examples.conf -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60 --batchThreads 3 --vis exampleVis"
- $ ./bin/runex client.RandomBatchWriter --seed 7 $ARGS
- $ accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
- $ ./bin/runex client.RandomBatchWriter --seed 8 $ARGS
- $ accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
- $ ./bin/runex client.RandomBatchWriter --seed 9 $ARGS
- $ accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
-
-Below 500 lookups are done against the table without bloom filters using random
-NG seed 7. Even though only one map file will likely contain entries for this
-seed, all map files will be interrogated.
-
- $ ./bin/runex client.RandomBatchScanner --seed 7 -c ./examples.conf -t bloom_test1 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
- Generating 500 random queries...finished
- 35.09 lookups/sec 14.25 secs
- num results : 500
- Generating 500 random queries...finished
- 35.33 lookups/sec 14.15 secs
- num results : 500
-
-Below the same lookups are done against the table with bloom filters. The
-lookups were 2.86 times faster because only one map file was used, even though three
-map files existed.
-
- $ ./bin/runex client.RandomBatchScanner --seed 7 -c ./examples.conf -t bloom_test2 --num 500 --min 0 --max 1000000000 --size 50 -scanThreads 20 --auths exampleVis
- Generating 500 random queries...finished
- 99.03 lookups/sec 5.05 secs
- num results : 500
- Generating 500 random queries...finished
- 101.15 lookups/sec 4.94 secs
- num results : 500
-
-You can verify the table has three files by looking in HDFS. To look in HDFS
-you will need the table ID, because this is used in HDFS instead of the table
-name. The following command will show table ids.
+## Bloom Filters Enabled
+
+Accumulo data is divided into tablets, and each tablet has multiple r-files.
+Lookups against a tablet with 3 r-files can be 3 times slower than against
+a tablet with one r-file. However, if the files contain unique sets of data,
+then bloom filters can help eliminate r-files that cannot contain the row being looked up.
+
+Run the example below to create two identical tables. One table has bloom
+filters enabled; the other does not. The major compaction ratio is increased to
+prevent the files from being compacted into one. If Accumulo is not configured
+with enough memory to hold 1 million rows, more r-files will be created.
+
+ $ ./bin/runex bloom.BloomFilters
+
+Run the example below to perform 500 lookups against each table. Even though only one r-file will
+likely contain entries for the query, all files will be interrogated.
+
+ $ ./bin/runex bloom.BloomBatchScanner
+
+ Scanning bloom_test1 with seed 7
+ Scan finished! 282.49 lookups/sec, 1.77 secs, 500 results
+ All expected rows were scanned
+ Scanning bloom_test2 with seed 7
+ Scan finished! 704.23 lookups/sec, 0.71 secs, 500 results
+ All expected rows were scanned
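The file-skipping behavior behind these numbers can be sketched in miniature. This is a hypothetical illustration, not Accumulo's r-file format or its actual filter implementation: each "file" gets a small bit-set filter, and a lookup only probes files whose filter reports a possible match.

```java
import java.util.BitSet;

// Hypothetical sketch (NOT Accumulo's implementation): one small bloom
// filter per "file" lets a lookup skip files that cannot contain a row.
public class BloomSketch {
    static final int BITS = 1 << 16;

    // k = 3 bit positions derived from the key's hash
    static int[] positions(String key) {
        int h = key.hashCode();
        int[] pos = new int[3];
        for (int i = 0; i < 3; i++) {
            h = h * 31 + 17;               // cheap re-hash per position
            pos[i] = Math.abs(h % BITS);
        }
        return pos;
    }

    static void add(BitSet filter, String key) {
        for (int p : positions(key)) filter.set(p);
    }

    // May report false positives, never false negatives
    static boolean mightContain(BitSet filter, String key) {
        for (int p : positions(key)) {
            if (!filter.get(p)) return false;
        }
        return true;
    }

    // A lookup only reads the files whose filter says "maybe"
    static int filesProbed(BitSet[] filters, String row) {
        int probed = 0;
        for (BitSet f : filters) {
            if (mightContain(f, row)) probed++;
        }
        return probed;
    }

    public static void main(String[] args) {
        // Three "files", each holding a disjoint set of rows, like the
        // three per-seed flushes in this example
        BitSet[] files = new BitSet[3];
        for (int f = 0; f < 3; f++) {
            files[f] = new BitSet(BITS);
            for (int r = 0; r < 1000; r++) {
                add(files[f], String.format("row_%010d", f * 1_000_000 + r));
            }
        }
        String row = "row_0000000042";     // written to file 0 above
        System.out.println("files probed: " + filesProbed(files, row) + " of 3");
    }
}
```

Without filters every lookup reads all three files; with them it typically reads only one, apart from the occasional false positive.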
+
+You can verify the table has three or more r-files by looking in HDFS. To look in HDFS
+you will need the table ID, which can be found with the following shell command.
$ accumulo shell -u username -p password -e 'tables -l'
accumulo.metadata => !0
accumulo.root => +r
- bloom_test1 => o7
- bloom_test2 => o8
+ bloom_test1 => 2
+ bloom_test2 => 3
trace => 1
-So the table id for bloom_test2 is o8. The command below shows what files this
+So the table ID for bloom_test2 is 3. The command below shows what files this
table has in HDFS. This assumes Accumulo is at the default location in HDFS.
- $ hadoop fs -lsr /accumulo/tables/o8
- drwxr-xr-x - username supergroup 0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet
- -rw-r--r-- 3 username supergroup 52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf
- -rw-r--r-- 3 username supergroup 52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf
- -rw-r--r-- 3 username supergroup 52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf
+ $ hdfs dfs -ls -R /accumulo/tables/3
+ drwxr-xr-x - username supergroup 0 2012-01-10 14:02 /accumulo/tables/3/default_tablet
+ -rw-r--r-- 3 username supergroup 52672650 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dj.rf
+ -rw-r--r-- 3 username supergroup 52436176 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dk.rf
+ -rw-r--r-- 3 username supergroup 52850173 2012-01-10 14:02 /accumulo/tables/3/default_tablet/F00000dl.rf
Running the rfile-info command shows that one of the files has a bloom filter
and that it's 1.5 MB.
- $ accumulo rfile-info /accumulo/tables/o8/default_tablet/F00000dj.rf
+ $ accumulo rfile-info /accumulo/tables/3/default_tablet/F00000dj.rf
Locality group : <DEFAULT>
Start block : 0
Num blocks : 752
@@ -217,3 +93,20 @@ and its 1.5MB.
Compressed size : 1,433,115 bytes
Compression type : gz
+## Bloom Filters When Data Is Not Found
+
+Run the example below to create two tables, one with bloom filters enabled.
+
+ $ ./bin/runex bloom.BloomFiltersNotFound
+
+One million random values generated with seed 7 are inserted into each table.
+Once the flushes complete, 500 random queries are done against each table, but with a different seed.
+Even when nothing is found, the lookups are faster against the table with bloom filters enabled.
+
+ Writing data to bloom_test3 and bloom_test4 (bloom filters enabled)
+ Scanning bloom_test3 with seed 8
+ Scan finished! 780.03 lookups/sec, 0.64 secs, 0 results
+ Did not find 500
+ Scanning bloom_test4 with seed 8
+ Scan finished! 1736.11 lookups/sec, 0.29 secs, 0 results
+ Did not find 500
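The seed mechanics behind these results: `java.util.Random` is deterministic, so scanning with the writer's seed (7) regenerates exactly the row IDs that were written, while seed 8 draws a practically disjoint set of IDs out of a billion possibilities. A minimal sketch (`Math.abs` stands in here for `RandomBatchWriter.abs`, which presumably guards the `Long.MIN_VALUE` edge case):

```java
import java.util.Random;

// Same seed => same row ids; a different seed => (almost surely) different ids.
public class SeededRows {
    // Mirrors the example's row id scheme:
    // abs(nextLong()) % 1_000_000_000, formatted as row_%010d
    static String nextRow(Random r) {
        long id = Math.abs(r.nextLong()) % 1_000_000_000;
        return String.format("row_%010d", id);
    }

    public static void main(String[] args) {
        Random writer = new Random(7);
        Random scanner = new Random(7);
        // The scanner regenerates exactly the row the writer produced
        System.out.println(nextRow(writer).equals(nextRow(scanner)));  // true
    }
}
```

This is why the seed-7 scans find all 500 rows while the seed-8 scans find none.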
diff --git a/src/main/java/org/apache/accumulo/examples/bloom/BloomBatchScanner.java b/src/main/java/org/apache/accumulo/examples/bloom/BloomBatchScanner.java
new file mode 100644
index 0000000..8aeaf12
--- /dev/null
+++ b/src/main/java/org/apache/accumulo/examples/bloom/BloomBatchScanner.java
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.accumulo.examples.bloom;
+
+import static org.apache.accumulo.examples.client.RandomBatchWriter.abs;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map.Entry;
+import java.util.Random;
+
+import org.apache.accumulo.core.client.AccumuloException;
+import org.apache.accumulo.core.client.AccumuloSecurityException;
+import org.apache.accumulo.core.client.BatchScanner;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.TableNotFoundException;
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Range;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.security.Authorizations;
+
+/**
+ * Simple example for reading random batches of data from Accumulo.
+ */
+public class BloomBatchScanner {
+
+ public static void main(String[] args) throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
+ Connector connector = Connector.builder().usingProperties("conf/accumulo-client.properties").build();
+
+ scan(connector, "bloom_test1", 7);
+ scan(connector, "bloom_test2", 7);
+ }
+
+ static void scan(Connector connector, String tableName, int seed) throws TableNotFoundException {
+ Random r = new Random(seed);
+ HashSet<Range> ranges = new HashSet<>();
+ HashMap<String,Boolean> expectedRows = new HashMap<>();
+ while (ranges.size() < 500) {
+ long rowId = abs(r.nextLong()) % 1_000_000_000;
+ String row = String.format("row_%010d", rowId);
+ ranges.add(new Range(row));
+ expectedRows.put(row, false);
+ }
+
+ long t1 = System.currentTimeMillis();
+ long results = 0;
+ long lookups = ranges.size();
+
+ System.out.println("Scanning " + tableName + " with seed " + seed);
+ try (BatchScanner scan = connector.createBatchScanner(tableName, Authorizations.EMPTY, 20)) {
+ scan.setRanges(ranges);
+ for (Entry<Key, Value> entry : scan) {
+ Key key = entry.getKey();
+ if (!expectedRows.containsKey(key.getRow().toString())) {
+ System.out.println("Encountered unexpected key: " + key);
+ } else {
+ expectedRows.put(key.getRow().toString(), true);
+ }
+ results++;
+ }
+ }
+
+ long t2 = System.currentTimeMillis();
+ System.out.println(String.format("Scan finished! %6.2f lookups/sec, %.2f secs, %d results",
+ lookups / ((t2 - t1) / 1000.0), ((t2 - t1) / 1000.0), results));
+
+ int count = 0;
+ for (Entry<String,Boolean> entry : expectedRows.entrySet()) {
+ if (!entry.getValue()) {
+ count++;
+ }
+ }
+ if (count > 0)
+ System.out.println("Did not find " + count);
+ else
+ System.out.println("All expected rows were scanned");
+ }
+}
diff --git a/src/main/java/org/apache/accumulo/examples/bloom/BloomFilters.java b/src/main/java/org/apache/accumulo/examples/bloom/BloomFilters.java
new file mode 100644
index 0000000..4dd2ed8
--- /dev/null
+++ b/src/main/java/org/apache/accumulo/examples/bloom/BloomFilters.java
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.accumulo.examples.bloom;
+
+import java.util.Random;
+
+import org.apache.accumulo.core.client.AccumuloException;
+import org.apache.accumulo.core.client.AccumuloSecurityException;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.MutationsRejectedException;
+import org.apache.accumulo.core.client.TableExistsException;
+import org.apache.accumulo.core.client.TableNotFoundException;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.security.ColumnVisibility;
+import org.apache.accumulo.examples.client.RandomBatchWriter;
+
+public class BloomFilters {
+
+ public static void main(String[] args) throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
+ Connector connector = Connector.builder().usingProperties("conf/accumulo-client.properties").build();
+ try {
+ System.out.println("Creating bloom_test1 and bloom_test2");
+ connector.tableOperations().create("bloom_test1");
+ connector.tableOperations().setProperty("bloom_test1", "table.compaction.major.ratio", "7");
+ connector.tableOperations().create("bloom_test2");
+ connector.tableOperations().setProperty("bloom_test2", "table.bloom.enabled", "true");
+ connector.tableOperations().setProperty("bloom_test2", "table.compaction.major.ratio", "7");
+ } catch (TableExistsException e) {
+ // ignore
+ }
+
+ // Write a million rows 3 times flushing files to disk separately
+ System.out.println("Writing data to bloom_test1");
+ writeData(connector, "bloom_test1", 7);
+ connector.tableOperations().flush("bloom_test1", null, null, true);
+ writeData(connector, "bloom_test1", 8);
+ connector.tableOperations().flush("bloom_test1", null, null, true);
+ writeData(connector, "bloom_test1", 9);
+ connector.tableOperations().flush("bloom_test1", null, null, true);
+
+ System.out.println("Writing data to bloom_test2");
+ writeData(connector, "bloom_test2", 7);
+ connector.tableOperations().flush("bloom_test2", null, null, true);
+ writeData(connector, "bloom_test2", 8);
+ connector.tableOperations().flush("bloom_test2", null, null, true);
+ writeData(connector, "bloom_test2", 9);
+ connector.tableOperations().flush("bloom_test2", null, null, true);
+ }
+
+ // write a million random rows
+ static void writeData(Connector connector, String tableName, int seed) throws TableNotFoundException,
+      MutationsRejectedException {
+ Random r = new Random(seed);
+ try (BatchWriter bw = connector.createBatchWriter(tableName)) {
+ for (int x = 0; x < 1_000_000; x++) {
+        long rowId = RandomBatchWriter.abs(r.nextLong()) % 1_000_000_000;
+ Mutation m = RandomBatchWriter.createMutation(rowId, 50, new ColumnVisibility());
+ bw.addMutation(m);
+ }
+ }
+ }
+}
diff --git a/src/main/java/org/apache/accumulo/examples/bloom/BloomFiltersNotFound.java b/src/main/java/org/apache/accumulo/examples/bloom/BloomFiltersNotFound.java
new file mode 100644
index 0000000..21a8738
--- /dev/null
+++ b/src/main/java/org/apache/accumulo/examples/bloom/BloomFiltersNotFound.java
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.accumulo.examples.bloom;
+
+import static org.apache.accumulo.examples.bloom.BloomFilters.writeData;
+
+import org.apache.accumulo.core.client.AccumuloException;
+import org.apache.accumulo.core.client.AccumuloSecurityException;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.TableExistsException;
+import org.apache.accumulo.core.client.TableNotFoundException;
+
+public class BloomFiltersNotFound {
+
+ public static void main(String[] args) throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
+ Connector connector = Connector.builder().usingProperties("conf/accumulo-client.properties").build();
+ try {
+ connector.tableOperations().create("bloom_test3");
+ connector.tableOperations().create("bloom_test4");
+ connector.tableOperations().setProperty("bloom_test4", "table.bloom.enabled", "true");
+ } catch (TableExistsException e) {
+ // ignore
+ }
+ System.out.println("Writing data to bloom_test3 and bloom_test4 (bloom filters enabled)");
+ writeData(connector, "bloom_test3", 7);
+ connector.tableOperations().flush("bloom_test3", null, null, true);
+ writeData(connector, "bloom_test4", 7);
+ connector.tableOperations().flush("bloom_test4", null, null, true);
+
+ BloomBatchScanner.scan(connector, "bloom_test3", 8);
+ BloomBatchScanner.scan(connector, "bloom_test4", 8);
+ }
+}
--
To stop receiving notification emails like this one, please contact
mmiller@apache.org.