Posted to commits@accumulo.apache.org by mm...@apache.org on 2018/04/26 17:54:13 UTC

[accumulo-examples] branch master updated: Update bloom filters example (#25)

This is an automated email from the ASF dual-hosted git repository.

mmiller pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/accumulo-examples.git


The following commit(s) were added to refs/heads/master by this push:
     new 114e507  Update bloom filters example (#25)
114e507 is described below

commit 114e50730276c092e2ffa2bc1abcebb96420ef46
Author: Mike Miller <mm...@apache.org>
AuthorDate: Thu Apr 26 13:54:11 2018 -0400

    Update bloom filters example (#25)
---
 docs/bloom.md                                      | 217 ++++++---------------
 .../accumulo/examples/bloom/BloomBatchScanner.java |  92 +++++++++
 .../accumulo/examples/bloom/BloomFilters.java      |  77 ++++++++
 .../examples/bloom/BloomFiltersNotFound.java       |  47 +++++
 4 files changed, 271 insertions(+), 162 deletions(-)

diff --git a/docs/bloom.md b/docs/bloom.md
index c5549b0..528bbb5 100644
--- a/docs/bloom.md
+++ b/docs/bloom.md
@@ -16,183 +16,59 @@ limitations under the License.
 -->
 # Apache Accumulo Bloom Filter Example
 
-This example shows how to create a table with bloom filters enabled.  It also
+This example shows how to create a table with bloom filters enabled.  The second part
 shows how bloom filters increase query performance when looking for values that
 do not exist in a table.
 
-Below table named bloom_test is created and bloom filters are enabled.
-
-    $ accumulo shell -u username -p password
-    Shell - Apache Accumulo Interactive Shell
-    - version: 1.5.0
-    - instance name: instance
-    - instance id: 00000000-0000-0000-0000-000000000000
-    -
-    - type 'help' for a list of available commands
-    -
-    username@instance> setauths -u username -s exampleVis
-    username@instance> createtable bloom_test
-    username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
-    username@instance bloom_test> exit
-
-Below 1 million random values are inserted into accumulo. The randomly
-generated rows range between 0 and 1 billion. The random number generator is
-initialized with the seed 7.
-
-    $ ./bin/runex client.RandomBatchWriter --seed 7 -c ./examples.conf -t bloom_test --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60 --batchThreads 3 --vis exampleVis
-
-Below the table is flushed:
-
-    $ accumulo shell -u username -p password -e 'flush -t bloom_test -w'
-    05 10:40:06,069 [shell.Shell] INFO : Flush of table bloom_test completed.
-
-After the flush completes, 500 random queries are done against the table. The
-same seed is used to generate the queries, therefore everything is found in the
-table.
-
-    $ ./bin/runex client.RandomBatchScanner --seed 7 -c ./examples.conf -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
-    Generating 500 random queries...finished
-    96.19 lookups/sec   5.20 secs
-    num results : 500
-    Generating 500 random queries...finished
-    102.35 lookups/sec   4.89 secs
-    num results : 500
-
-Below another 500 queries are performed, using a different seed which results
-in nothing being found. In this case the lookups are much faster because of
-the bloom filters.
-
-    $ ./bin/runex client.RandomBatchScanner --seed 8 -c ./examples.conf -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 -batchThreads 20 -auths exampleVis
-    Generating 500 random queries...finished
-    2212.39 lookups/sec   0.23 secs
-    num results : 0
-    Did not find 500 rows
-    Generating 500 random queries...finished
-    4464.29 lookups/sec   0.11 secs
-    num results : 0
-    Did not find 500 rows
-
-********************************************************************************
-
-Bloom filters can also speed up lookups for entries that exist. In accumulo
-data is divided into tablets and each tablet has multiple map files. Every
-lookup in accumulo goes to a specific tablet where a lookup is done on each
-map file in the tablet. So if a tablet has three map files, lookup performance
-can be three times slower than a tablet with one map file. However if the map
-files contain unique sets of data, then bloom filters can help eliminate map
-files that do not contain the row being looked up. To illustrate this two
-identical tables were created using the following process. One table had bloom
-filters, the other did not. Also the major compaction ratio was increased to
-prevent the files from being compacted into one file.
-
- * Insert 1 million entries using  RandomBatchWriter with a seed of 7
- * Flush the table using the shell
- * Insert 1 million entries using  RandomBatchWriter with a seed of 8
- * Flush the table using the shell
- * Insert 1 million entries using  RandomBatchWriter with a seed of 9
- * Flush the table using the shell
-
-After following the above steps, each table will have a tablet with three map
-files. Flushing the table after each batch of inserts will create a map file.
-Each map file will contain 1 million entries generated with a different seed.
-This is assuming that Accumulo is configured with enough memory to hold 1
-million inserts. If not, then more map files will be created.
-
-The commands for creating the first table without bloom filters are below.
-
-    $ accumulo shell -u username -p password
-    Shell - Apache Accumulo Interactive Shell
-    - version: 1.5.0
-    - instance name: instance
-    - instance id: 00000000-0000-0000-0000-000000000000
-    -
-    - type 'help' for a list of available commands
-    -
-    username@instance> setauths -u username -s exampleVis
-    username@instance> createtable bloom_test1
-    username@instance bloom_test1> config -t bloom_test1 -s table.compaction.major.ratio=7
-    username@instance bloom_test1> exit
-
-    $ ARGS="-c ./examples.conf -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60 --batchThreads 3 --vis exampleVis"
-    $ ./bin/runex client.RandomBatchWriter --seed 7 $ARGS
-    $ accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
-    $ ./bin/runex client.RandomBatchWriter --seed 8 $ARGS
-    $ accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
-    $ ./bin/runex client.RandomBatchWriter --seed 9 $ARGS
-    $ accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
-
-The commands for creating the second table with bloom filers are below.
-
-    $ accumulo shell -u username -p password
-    Shell - Apache Accumulo Interactive Shell
-    - version: 1.5.0
-    - instance name: instance
-    - instance id: 00000000-0000-0000-0000-000000000000
-    -
-    - type 'help' for a list of available commands
-    -
-    username@instance> setauths -u username -s exampleVis
-    username@instance> createtable bloom_test2
-    username@instance bloom_test2> config -t bloom_test2 -s table.compaction.major.ratio=7
-    username@instance bloom_test2> config -t bloom_test2 -s table.bloom.enabled=true
-    username@instance bloom_test2> exit
-
-    $ ARGS="-c ./examples.conf -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60 --batchThreads 3 --vis exampleVis"
-    $ ./bin/runex client.RandomBatchWriter --seed 7 $ARGS
-    $ accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
-    $ ./bin/runex client.RandomBatchWriter --seed 8 $ARGS
-    $ accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
-    $ ./bin/runex client.RandomBatchWriter --seed 9 $ARGS
-    $ accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
-
-Below 500 lookups are done against the table without bloom filters using random
-NG seed 7. Even though only one map file will likely contain entries for this
-seed, all map files will be interrogated.
-
-    $ ./bin/runex client.RandomBatchScanner --seed 7 -c ./examples.conf -t bloom_test1 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
-    Generating 500 random queries...finished
-    35.09 lookups/sec  14.25 secs
-    num results : 500
-    Generating 500 random queries...finished
-    35.33 lookups/sec  14.15 secs
-    num results : 500
-
-Below the same lookups are done against the table with bloom filters. The
-lookups were 2.86 times faster because only one map file was used, even though three
-map files existed.
-
-    $ ./bin/runex client.RandomBatchScanner --seed 7 -c ./examples.conf -t bloom_test2 --num 500 --min 0 --max 1000000000 --size 50 -scanThreads 20 --auths exampleVis
-    Generating 500 random queries...finished
-    99.03 lookups/sec   5.05 secs
-    num results : 500
-    Generating 500 random queries...finished
-    101.15 lookups/sec   4.94 secs
-    num results : 500
-
-You can verify the table has three files by looking in HDFS. To look in HDFS
-you will need the table ID, because this is used in HDFS instead of the table
-name. The following command will show table ids.
+## Bloom Filters Enabled
+
+Accumulo data is divided into tablets, and each tablet has multiple r-files.
+Lookups on a tablet with three r-files can be three times slower than lookups
+on a tablet with one r-file. However, if the r-files contain disjoint sets of
+data, bloom filters can eliminate the files that do not contain the row being
+looked up.
+
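+For reference, bloom filters are controlled by a single table property, which can
+be set from the Accumulo shell (the table name below is just a placeholder):
+
+    username@instance> config -t mytable -s table.bloom.enabled=true
+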
+Run the example below to create two tables that differ only in whether bloom
+filters are enabled. The example also raises the major compaction ratio on both
+tables to prevent their files from being compacted into one file. If Accumulo is
+not configured with enough memory to hold 1 million rows, then more r-files will
+be created.
+
+    $ ./bin/runex bloom.BloomFilters
+
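+The shell equivalent of the table configuration applied by BloomFilters.java
+(included later in this commit) is roughly:
+
+    username@instance> config -t bloom_test1 -s table.compaction.major.ratio=7
+    username@instance> config -t bloom_test2 -s table.compaction.major.ratio=7
+    username@instance> config -t bloom_test2 -s table.bloom.enabled=true
+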
+Run the example below to perform 500 lookups against each table. Even though only
+one r-file will likely contain entries for the query, all files will be interrogated.
+
+    $ ./bin/runex bloom.BloomBatchScanner
+
+    Scanning bloom_test1 with seed 7
+    Scan finished! 282.49 lookups/sec, 1.77 secs, 500 results
+    All expected rows were scanned
+    Scanning bloom_test2 with seed 7
+    Scan finished! 704.23 lookups/sec, 0.71 secs, 500 results
+    All expected rows were scanned
+
+You can verify that the table has three or more r-files by looking in HDFS. To look
+in HDFS you will need the table ID, because HDFS paths use the ID rather than the
+table name. The ID can be found with the following shell command.
 
     $ accumulo shell -u username -p password -e 'tables -l'
     accumulo.metadata    =>        !0
     accumulo.root        =>        +r
-    bloom_test1          =>        o7
-    bloom_test2          =>        o8
+    bloom_test1          =>         2
+    bloom_test2          =>         3
     trace                =>         1
 
-So the table id for bloom_test2 is o8. The command below shows what files this
+So the table ID for bloom_test2 is 3. The command below shows what files this
 table has in HDFS. This assumes Accumulo is at the default location in HDFS.
 
-    $ hadoop fs -lsr /accumulo/tables/o8
-    drwxr-xr-x   - username supergroup          0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet
-    -rw-r--r--   3 username supergroup   52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf
-    -rw-r--r--   3 username supergroup   52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf
-    -rw-r--r--   3 username supergroup   52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf
+    $ hdfs dfs -ls -R /accumulo/tables/3
+    drwxr-xr-x   - username supergroup          0 2012-01-10 14:02 /accumulo/tables/3/default_tablet
+    -rw-r--r--   3 username supergroup   52672650 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dj.rf
+    -rw-r--r--   3 username supergroup   52436176 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dk.rf
+    -rw-r--r--   3 username supergroup   52850173 2012-01-10 14:02 /accumulo/tables/3/default_tablet/F00000dl.rf
 
 Running the rfile-info command shows that one of the files has a bloom filter
 and that it is 1.5 MB.
 
-    $ accumulo rfile-info /accumulo/tables/o8/default_tablet/F00000dj.rf
+    $ accumulo rfile-info /accumulo/tables/3/default_tablet/F00000dj.rf
     Locality group         : <DEFAULT>
 	Start block          : 0
 	Num   blocks         : 752
@@ -217,3 +93,20 @@ and its 1.5MB.
       Compressed size      : 1,433,115 bytes
       Compression type     : gz
 
+## Bloom Filters When Data Is Not Found
+
+Run the example below to create two tables, one with bloom filters enabled.
+
+    $ ./bin/runex bloom.BloomFiltersNotFound
+
+One million random values generated with seed 7 are inserted into each table.
+Once the flushes complete, 500 random queries generated with a different seed are
+run against each table. Even though nothing is found, the lookups are faster
+against the table with bloom filters.
+
+    Writing data to bloom_test3 and bloom_test4 (bloom filters enabled)
+    Scanning bloom_test3 with seed 8
+    Scan finished! 780.03 lookups/sec, 0.64 secs, 0 results
+    Did not find 500
+    Scanning bloom_test4 with seed 8
+    Scan finished! 1736.11 lookups/sec, 0.29 secs, 0 results
+    Did not find 500
diff --git a/src/main/java/org/apache/accumulo/examples/bloom/BloomBatchScanner.java b/src/main/java/org/apache/accumulo/examples/bloom/BloomBatchScanner.java
new file mode 100644
index 0000000..8aeaf12
--- /dev/null
+++ b/src/main/java/org/apache/accumulo/examples/bloom/BloomBatchScanner.java
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.accumulo.examples.bloom;
+
+import static org.apache.accumulo.examples.client.RandomBatchWriter.abs;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map.Entry;
+import java.util.Random;
+
+import org.apache.accumulo.core.client.AccumuloException;
+import org.apache.accumulo.core.client.AccumuloSecurityException;
+import org.apache.accumulo.core.client.BatchScanner;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.TableNotFoundException;
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Range;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.core.security.Authorizations;
+
+/**
+ * Looks up 500 random rows in a table to show how bloom filters affect lookup
+ * performance.
+ */
+public class BloomBatchScanner {
+
+  public static void main(String[] args) throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
+    Connector connector = Connector.builder().usingProperties("conf/accumulo-client.properties").build();
+
+    scan(connector, "bloom_test1", 7);
+    scan(connector, "bloom_test2", 7);
+  }
+
+  static void scan(Connector connector, String tableName, int seed) throws TableNotFoundException {
+    Random r = new Random(seed);
+    HashSet<Range> ranges = new HashSet<>();
+    HashMap<String,Boolean> expectedRows = new HashMap<>();
+    while (ranges.size() < 500) {
+      long rowId = abs(r.nextLong()) % 1_000_000_000;
+      String row = String.format("row_%010d", rowId);
+      ranges.add(new Range(row));
+      expectedRows.put(row, false);
+    }
+
+    long t1 = System.currentTimeMillis();
+    long results = 0;
+    long lookups = ranges.size();
+
+    System.out.println("Scanning " + tableName + " with seed " + seed);
+    try (BatchScanner scan = connector.createBatchScanner(tableName, Authorizations.EMPTY, 20)) {
+      scan.setRanges(ranges);
+      for (Entry<Key, Value> entry : scan) {
+        Key key = entry.getKey();
+        if (!expectedRows.containsKey(key.getRow().toString())) {
+          System.out.println("Encountered unexpected key: " + key);
+        } else {
+          expectedRows.put(key.getRow().toString(), true);
+        }
+        results++;
+      }
+    }
+
+    long t2 = System.currentTimeMillis();
+    System.out.println(String.format("Scan finished! %6.2f lookups/sec, %.2f secs, %d results",
+            lookups / ((t2 - t1) / 1000.0), ((t2 - t1) / 1000.0), results));
+
+    int count = 0;
+    for (Entry<String,Boolean> entry : expectedRows.entrySet()) {
+      if (!entry.getValue()) {
+        count++;
+      }
+    }
+    if (count > 0)
+      System.out.println("Did not find " + count);
+    else
+      System.out.println("All expected rows were scanned");
+  }
+}
diff --git a/src/main/java/org/apache/accumulo/examples/bloom/BloomFilters.java b/src/main/java/org/apache/accumulo/examples/bloom/BloomFilters.java
new file mode 100644
index 0000000..4dd2ed8
--- /dev/null
+++ b/src/main/java/org/apache/accumulo/examples/bloom/BloomFilters.java
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.accumulo.examples.bloom;
+
+import java.util.Random;
+
+import org.apache.accumulo.core.client.AccumuloException;
+import org.apache.accumulo.core.client.AccumuloSecurityException;
+import org.apache.accumulo.core.client.BatchWriter;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.MutationsRejectedException;
+import org.apache.accumulo.core.client.TableExistsException;
+import org.apache.accumulo.core.client.TableNotFoundException;
+import org.apache.accumulo.core.data.Mutation;
+import org.apache.accumulo.core.security.ColumnVisibility;
+import org.apache.accumulo.examples.client.RandomBatchWriter;
+
+public class BloomFilters {
+
+  public static void main(String[] args) throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
+    Connector connector = Connector.builder().usingProperties("conf/accumulo-client.properties").build();
+    try {
+      System.out.println("Creating bloom_test1 and bloom_test2");
+      connector.tableOperations().create("bloom_test1");
+      connector.tableOperations().setProperty("bloom_test1", "table.compaction.major.ratio", "7");
+      connector.tableOperations().create("bloom_test2");
+      connector.tableOperations().setProperty("bloom_test2", "table.bloom.enabled", "true");
+      connector.tableOperations().setProperty("bloom_test2", "table.compaction.major.ratio", "7");
+    } catch (TableExistsException e) {
+      // ignore
+    }
+
+    // Write a million rows 3 times flushing files to disk separately
+    System.out.println("Writing data to bloom_test1");
+    writeData(connector, "bloom_test1", 7);
+    connector.tableOperations().flush("bloom_test1", null, null, true);
+    writeData(connector, "bloom_test1", 8);
+    connector.tableOperations().flush("bloom_test1", null, null, true);
+    writeData(connector, "bloom_test1", 9);
+    connector.tableOperations().flush("bloom_test1", null, null, true);
+
+    System.out.println("Writing data to bloom_test2");
+    writeData(connector, "bloom_test2", 7);
+    connector.tableOperations().flush("bloom_test2", null, null, true);
+    writeData(connector, "bloom_test2", 8);
+    connector.tableOperations().flush("bloom_test2", null, null, true);
+    writeData(connector, "bloom_test2", 9);
+    connector.tableOperations().flush("bloom_test2", null, null, true);
+  }
+
+  // write a million random rows
+  static void writeData(Connector connector, String tableName, int seed) throws TableNotFoundException,
+        MutationsRejectedException {
+    Random r = new Random(seed);
+    try (BatchWriter bw = connector.createBatchWriter(tableName)) {
+      for (int x = 0; x < 1_000_000; x++) {
+        Long rowId = RandomBatchWriter.abs(r.nextLong()) % 1_000_000_000;
+        Mutation m = RandomBatchWriter.createMutation(rowId, 50, new ColumnVisibility());
+        bw.addMutation(m);
+      }
+    }
+  }
+}
diff --git a/src/main/java/org/apache/accumulo/examples/bloom/BloomFiltersNotFound.java b/src/main/java/org/apache/accumulo/examples/bloom/BloomFiltersNotFound.java
new file mode 100644
index 0000000..21a8738
--- /dev/null
+++ b/src/main/java/org/apache/accumulo/examples/bloom/BloomFiltersNotFound.java
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.accumulo.examples.bloom;
+
+import static org.apache.accumulo.examples.bloom.BloomFilters.writeData;
+
+import org.apache.accumulo.core.client.AccumuloException;
+import org.apache.accumulo.core.client.AccumuloSecurityException;
+import org.apache.accumulo.core.client.Connector;
+import org.apache.accumulo.core.client.TableExistsException;
+import org.apache.accumulo.core.client.TableNotFoundException;
+
+public class BloomFiltersNotFound {
+
+  public static void main(String[] args) throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
+    Connector connector = Connector.builder().usingProperties("conf/accumulo-client.properties").build();
+    try {
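+      // bloom_test3 is the baseline table; bloom_test4 has bloom filters enabled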
+      connector.tableOperations().create("bloom_test3");
+      connector.tableOperations().create("bloom_test4");
+      connector.tableOperations().setProperty("bloom_test4", "table.bloom.enabled", "true");
+    } catch (TableExistsException e) {
+      // ignore
+    }
+    System.out.println("Writing data to bloom_test3 and bloom_test4 (bloom filters enabled)");
+    writeData(connector, "bloom_test3", 7);
+    connector.tableOperations().flush("bloom_test3", null, null, true);
+    writeData(connector, "bloom_test4", 7);
+    connector.tableOperations().flush("bloom_test4", null, null, true);
+
+    BloomBatchScanner.scan(connector, "bloom_test3", 8);
+    BloomBatchScanner.scan(connector, "bloom_test4", 8);
+  }
+}

-- 
To stop receiving notification emails like this one, please contact
mmiller@apache.org.