Posted to commits@pig.apache.org by da...@apache.org on 2015/10/31 05:30:24 UTC

svn commit: r1711573 - in /pig/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/func.xml

Author: daijy
Date: Sat Oct 31 04:30:24 2015
New Revision: 1711573

URL: http://svn.apache.org/viewvc?rev=1711573&view=rev
Log:
PIG-4713: Document Bloom UDF

Modified:
    pig/trunk/CHANGES.txt
    pig/trunk/src/docs/src/documentation/content/xdocs/func.xml

Modified: pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/pig/trunk/CHANGES.txt?rev=1711573&r1=1711572&r2=1711573&view=diff
==============================================================================
--- pig/trunk/CHANGES.txt (original)
+++ pig/trunk/CHANGES.txt Sat Oct 31 04:30:24 2015
@@ -24,6 +24,8 @@ INCOMPATIBLE CHANGES
 
 IMPROVEMENTS
 
+PIG-4713: Document Bloom UDF (gliptak via daijy)
+
 PIG-3251: Bzip2TextInputFormat requires double the memory of maximum record size (knoguchi)
 
 PIG-4704: Customizable Error Handling for Storers in Pig (siddhimehta via daijy)

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/func.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/func.xml?rev=1711573&r1=1711572&r2=1711573&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/func.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/func.xml Sat Oct 31 04:30:24 2015
@@ -294,6 +294,87 @@ team_parkyearslist = FOREACH (GROUP team
   </section>
 </section>
 
+<section id="bloom">
+  <title>Bloom</title>
+  <p>Bloom filters are a common way to select a limited set of records before
+    moving data for a join or other heavyweight operation.</p>
+
+  <section>
+    <title>Syntax</title>
+    <table>
+      <tr>
+        <td>
+          <p>BuildBloom(String hashType, String mode, String vectorSize, String nbHash)</p>
+        </td>
+      </tr>
+      <tr>
+        <td>
+          <p>Bloom(String filename)</p>
+        </td>
+      </tr>
+  </table></section>
+
+  <section>
+    <title>Terms</title>
+    <table>
+      <tr>
+        <td><p>hashType</p></td>
+        <td><p>The type of hash function to use. Valid values are 'jenkins' and 'murmur'.</p></td>
+      </tr>
+      <tr>
+        <td><p>mode</p></td>
+        <td><p>Currently ignored, though by convention it should be "fixed" or "fixedsize".</p></td>
+      </tr>
+      <tr>
+        <td><p>vectorSize</p></td>
+        <td><p>The number of bits in the bloom filter.</p></td>
+      </tr>
+      <tr>
+        <td><p>nbHash</p></td>
+        <td><p>The number of hash functions used in constructing the bloom filter.</p></td>
+      </tr>
+      <tr>
+        <td><p>filename</p></td>
+        <td><p>The file containing the serialized Bloom filter (as produced by storing the output of BuildBloom).</p></td>
+      </tr>
+    </table>
+    <p>See <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom Filter</a> for
+      a discussion of how to select the number of bits and the number of hash
+      functions.
+    </p>
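+    <p>As a rough illustration, assume a hypothetical input of about 1,000,000
+      distinct keys and a target false positive rate of 1%; the standard
+      formulas referenced above then suggest a vector of roughly 9.6 million
+      bits and about 7 hash functions:
+    </p>
+<source>
+  -- illustrative sizing sketch; n and p below are assumptions, not recommendations
+  -- n = 1,000,000 expected keys, p = 0.01 desired false positive rate
+  -- vectorSize: m = -n * ln(p) / (ln 2)^2, approximately 9,585,059 bits
+  -- nbHash:     k = (m / n) * ln(2),       approximately 7 hash functions
+  define bb BuildBloom('jenkins', 'fixed', '9585059', '7');
+</source>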
+  </section>
+
+  <section>
+    <title>Usage</title>
+    <p>Bloom filters are a common way to select a limited set of records before
+      moving data for a join or other heavyweight operation. For example, if
+      one wanted to join a very large data set L with a smaller set S, and it
+      was known that the number of keys in L that will match with S is small,
+      building a bloom filter on S and then applying it to L before the join
+      can greatly reduce the number of records from L that have to be moved
+      from the map to the reduce, thus speeding up the join.
+    </p>
+    <p>The implementation uses Hadoop's bloom filters
+      (org.apache.hadoop.util.bloom.BloomFilter) internally.
+    </p>
+  </section>
+  <section>
+    <title>Examples</title>
+<source>
+  -- build the bloom filter on the join key x of the small input S and store it
+  define bb BuildBloom('jenkins', 'fixed', '128', '3');
+  small = load 'S' as (x, y, z);
+  grpd = group small all;
+  fltrd = foreach grpd generate bb(small.x);
+  store fltrd into 'mybloom';
+  -- run the statements above so 'mybloom' exists before it is read below
+  exec;
+  -- filter the large input L by the stored filter, then join
+  define bloom Bloom('mybloom');
+  large = load 'L' as (a, b, c);
+  flarge = filter large by bloom(a);
+  joined = join small by x, flarge by a;
+  store joined into 'results';
+</source>
+  </section>
+</section>
    
    <!-- ++++++++++++++++++++++++++++++++++++++++++++++ --> 
    <section id="concat">