Posted to commits@datafu.apache.org by mh...@apache.org on 2015/10/21 19:00:40 UTC

svn commit: r1709884 [1/8] - in /incubator/datafu/site: ./ blog/ blog/2012/01/10/ blog/2013/01/24/ blog/2013/09/04/ blog/2013/10/03/ blog/2014/04/27/ community/ docs/ docs/datafu/ docs/datafu/guide/ docs/hourglass/ javascripts/ stylesheets/

Author: mhayes
Date: Wed Oct 21 17:00:40 2015
New Revision: 1709884

URL: http://svn.apache.org/viewvc?rev=1709884&view=rev
Log:
Update datafu website

Added:
    incubator/datafu/site/community/contributing.html
    incubator/datafu/site/docs/quick-start.html
Removed:
    incubator/datafu/site/docs/datafu/contributing.html
    incubator/datafu/site/docs/hourglass/contributing.html
Modified:
    incubator/datafu/site/blog/2012/01/10/introducing-datafu.html
    incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html
    incubator/datafu/site/blog/2013/09/04/datafu-1-0.html
    incubator/datafu/site/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html
    incubator/datafu/site/blog/2014/04/27/datafu-at-apachecon.html
    incubator/datafu/site/blog/index.html
    incubator/datafu/site/community/mailing-lists.html
    incubator/datafu/site/docs/datafu/getting-started.html
    incubator/datafu/site/docs/datafu/guide.html
    incubator/datafu/site/docs/datafu/guide/bag-operations.html
    incubator/datafu/site/docs/datafu/guide/estimation.html
    incubator/datafu/site/docs/datafu/guide/hashing.html
    incubator/datafu/site/docs/datafu/guide/link-analysis.html
    incubator/datafu/site/docs/datafu/guide/more-tips-and-tricks.html
    incubator/datafu/site/docs/datafu/guide/sampling.html
    incubator/datafu/site/docs/datafu/guide/sessions.html
    incubator/datafu/site/docs/datafu/guide/set-operations.html
    incubator/datafu/site/docs/datafu/guide/statistics.html
    incubator/datafu/site/docs/datafu/javadoc.html
    incubator/datafu/site/docs/hourglass/concepts.html
    incubator/datafu/site/docs/hourglass/getting-started.html
    incubator/datafu/site/docs/hourglass/javadoc.html
    incubator/datafu/site/index.html
    incubator/datafu/site/javascripts/all.js
    incubator/datafu/site/sitemap.xml
    incubator/datafu/site/stylesheets/all.css
    incubator/datafu/site/stylesheets/highlight.css

Modified: incubator/datafu/site/blog/2012/01/10/introducing-datafu.html
URL: http://svn.apache.org/viewvc/incubator/datafu/site/blog/2012/01/10/introducing-datafu.html?rev=1709884&r1=1709883&r2=1709884&view=diff
==============================================================================
--- incubator/datafu/site/blog/2012/01/10/introducing-datafu.html (original)
+++ incubator/datafu/site/blog/2012/01/10/introducing-datafu.html Wed Oct 21 17:00:40 2015
@@ -1,3 +1,5 @@
+
+
 <!doctype html>
 <html>
   <head>
@@ -10,11 +12,9 @@
     <!-- Use title if it's in the page YAML frontmatter -->
     <title>Introducing DataFu, an open source collection of useful Apache Pig UDFs</title>
     
-    <link href="/stylesheets/all.css" media="screen" rel="stylesheet" type="text/css" />
-<link href="/stylesheets/highlight.css" media="screen" rel="stylesheet" type="text/css" />
-    <script src="/javascripts/all.js" type="text/javascript"></script>
+    <link href="/stylesheets/all.css" rel="stylesheet" /><link href="/stylesheets/highlight.css" rel="stylesheet" />
+    <script src="/javascripts/all.js"></script>
 
-    
     <script type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-30533336-2']);
@@ -26,14 +26,14 @@
         var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
       })();
     </script>
-    
   </head>
   
   <body class="blog blog_2012 blog_2012_01 blog_2012_01_10 blog_2012_01_10_introducing-datafu">
 
     <div class="container">
 
-      <div class="header">
+      
+<div class="header">
 
   <ul class="nav nav-pills pull-right">
     <li><a href="/">Home</a></li>
@@ -49,9 +49,7 @@
   <article class="col-lg-10">
     <h1>Introducing DataFu, an open source collection of useful Apache Pig UDFs</h1>
     <h5 class="text-muted"><time>Jan 10, 2012</time></h5>
-    
       <h5 class="text-muted">Matthew Hayes</h5>
-    
 
     <hr>
 
@@ -61,7 +59,7 @@
 
 <p>DataFu includes UDFs for common statistics tasks, PageRank, set operations, bag operations, and a comprehensive suite of tests. Read on to learn more.</p>
 
-<h3 id="toc_0">What&#39;s included?</h3>
+<h3 id="what-39-s-included">What&#39;s included?</h3>
 
 <p>Here&#39;s a taste of what you can do with DataFu:</p>
 
@@ -74,7 +72,7 @@
 <li>And <a href="/docs/datafu/1.2.0/">lots more</a>.</li>
 </ul>
 
-<h3 id="toc_1">Example: Computing Quantiles</h3>
+<h3 id="example-computing-quantiles">Example: Computing Quantiles</h3>
 
 <p>Let&#39;s walk through an example of how we could use DataFu. We will compute <a href="http://en.wikipedia.org/wiki/Quantile">quantiles</a> for a fake data set. You can grab all the code for this example, including scripts to generate test data, from this gist.</p>
 
@@ -85,7 +83,7 @@
 <p>We can use DataFu to compute quantiles using the <a href="/docs/datafu/1.2.0/datafu/pig/stats/Quantile.html">Quantile UDF</a>. The constructor for the UDF takes the quantiles to be computed. In this case we provide 0.25, 0.5, and 0.75 to compute the 25th, 50th, and 75th percentiles (a.k.a <a href="http://en.wikipedia.org/wiki/Quartile">quartiles</a>). We also provide 0.0 and 1.0 to compute the min and max.</p>
 
 <p>Quantile UDF example script:</p>
-<pre class="highlight pig"><span class="k">define</span> <span class="n">Quartile</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">stats</span><span class="p">.</span><span class="n">Quantile</span><span class="p">(</span><span class="s1">'0.0'</span><span class="p">,</span><span class="s1">'0.25'</span><span class="p">,</span><span class="s1">'0.5'</span><span class="p">,</span><span class="s1">'0.75'</span><span class="p">,</span><span class="s1">'1.0'</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="k">define</span> <span class="n">Quartile</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">stats</span><span class="p">.</span><span class="n">Quantile</span><span class="p">(</span><span class="s1">'0.0'</span><span class="p">,</span><span class="s1">'0.25'</span><span class="p">,</span><span class="s1">'0.5'</span><span class="p">,</span><span class="s1">'0.75'</span><span class="p">,</span><span class="s1">'1.0'</span><span class="p">);</span>
 
 <span class="n">temperature</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'temperature.txt'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">id</span><span class="p">:</span><span class="n">chararray</span><span class="p">,</span> <span class="n">temp</span><span class="p">:</span><span class="n">double</span><span class="p">);</span>
 
@@ -97,20 +95,22 @@
 <span class="p">}</span>
 
 <span class="k">DUMP</span> <span class="n">temperature_quartiles</span>
-</pre>
+</code></pre>
+
 <p>Quantile UDF example output, 10,000 measurements:</p>
-<pre class="highlight text">(1,(41.58171454288797,56.559375253601715,59.91093458980706,63.335574106080365,79.2841731889925))
+<pre class="highlight plaintext"><code>(1,(41.58171454288797,56.559375253601715,59.91093458980706,63.335574106080365,79.2841731889925))
 (2,(14.393515179526304,43.39558395897533,50.081758806889766,56.54245916209963,91.03574746442487))
 (3,(29.865710766927595,37.86257868882021,39.97075970657039,41.989584898364704,51.31349575866486))
-</pre>
+</code></pre>
+
 <p>The values in each row of the output are the min, 25th percentile, 50th percentile (median), 75th percentile, and max.</p>
 
-<h3 id="toc_2">StreamingQuantile UDF</h3>
+<h3 id="streamingquantile-udf">StreamingQuantile UDF</h3>
 
 <p>The Quantile UDF determines the quantiles by reading the input values for a key in sorted order and picking out the quantiles based on the size of the input DataBag. Alternatively we can estimate quantiles using the <a href="/docs/datafu/1.2.0/datafu/pig/stats/StreamingQuantile.html">StreamingQuantile UDF</a>, contributed to DataFu by <a href="http://www.linkedin.com/pub/josh-wills/0/82b/138">Josh Wills of Cloudera</a>, which does not require that the input data be sorted.</p>
 
 <p>StreamingQuantile UDF example script:</p>
-<pre class="highlight pig"><span class="k">define</span> <span class="n">Quartile</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">stats</span><span class="p">.</span><span class="n">StreamingQuantile</span><span class="p">(</span><span class="s1">'0.0'</span><span class="p">,</span><span class="s1">'0.25'</span><span class="p">,</span><span class="s1">'0.5'</span><span class="p">,</span><span class="s1">'0.75'</span><span class="p">,</span><span class="s1">'1.0'</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="k">define</span> <span class="n">Quartile</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">stats</span><span class="p">.</span><span class="n">StreamingQuantile</span><span class="p">(</span><span class="s1">'0.0'</span><span class="p">,</span><span class="s1">'0.25'</span><span class="p">,</span><span class="s1">'0.5'</span><span class="p">,</span><span class="s1">'0.75'</span><span class="p">,</span><span class="s1">'1.0'</span><span class="p">);</span>
 
 <span class="n">temperature</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'temperature.txt'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">id</span><span class="p">:</span><span class="n">chararray</span><span class="p">,</span> <span class="n">temp</span><span class="p">:</span><span class="n">double</span><span class="p">);</span>
 
@@ -122,39 +122,43 @@
 <span class="p">}</span>
 
 <span class="k">DUMP</span> <span class="n">temperature_quartiles</span>
-</pre>
+</code></pre>
+
 <p>StreamingQuantile UDF example output, 10,000 measurements:</p>
-<pre class="highlight text">(1,(41.58171454288797,56.24183579452584,59.61727093346221,62.919576028265375,79.2841731889925))
+<pre class="highlight plaintext"><code>(1,(41.58171454288797,56.24183579452584,59.61727093346221,62.919576028265375,79.2841731889925))
 (2,(14.393515179526304,42.55929349057328,49.50432161293486,56.020101184758644,91.03574746442487))
 (3,(29.865710766927595,37.64744333815733,39.84941055349095,41.77693877565934,51.31349575866486))
-</pre>
+</code></pre>
+
 <p>Notice that the 25th, 50th, and 75th percentile values computed by StreamingQuantile are fairly close to the exact values computed by Quantile.</p>
 
-<h3 id="toc_3">Accuracy vs. Runtime</h3>
+<h3 id="accuracy-vs-runtime">Accuracy vs. Runtime</h3>
 
 <p>StreamingQuantile samples the data with in-memory buffers. It implements the <a href="http://pig.apache.org/docs/r0.7.0/udf.html#Accumulator+Interface">Accumulator interface</a>, which makes it much more efficient than the Quantile UDF for very large input data. Where Quantile needs access to all the input data, StreamingQuantile can be fed the data incrementally. With Quantile, the input data will be spilled to disk as the DataBag is materialized if it is too large to fit in memory. For very large input data, this can be significant.</p>
 
 <p>To demonstrate this, we can change our experiment so that instead of processing three sets of 10,000 measurements, we will process three sets of 1 billion. Let’s compare the output of Quantile and StreamingQuantile on this data set:</p>
 
 <p>Quantile UDF example output, 1 billion measurements:</p>
-<pre class="highlight text">(1,(30.524038,56.62764,60.000134,63.372384,90.561695))
+<pre class="highlight plaintext"><code>(1,(30.524038,56.62764,60.000134,63.372384,90.561695))
 (2,(-9.845137,43.25512,49.999536,56.74441,109.714687))
 (3,(21.564769,37.976644,40.000025,42.023622,58.057268))
-</pre>
+</code></pre>
+
 <p>StreamingQuantile UDF example output, 1 billion measurements:</p>
-<pre class="highlight text">(1,(30.524038,55.993967,59.488968,62.775554,90.561695))
+<pre class="highlight plaintext"><code>(1,(30.524038,55.993967,59.488968,62.775554,90.561695))
 (2,(-9.845137,41.95725,48.977708,55.554239,109.714687))
 (3,(21.564769,37.569332,39.692373,41.666762,58.057268))
-</pre>
+</code></pre>
+
 <p>The 25th, 50th, and 75th percentile values computed using StreamingQuantile are only estimates, but they are pretty close to the exact values computed with Quantile. With StreamingQuantile and Quantile there is a tradeoff between accuracy and runtime. The script using Quantile takes <strong>5 times as long</strong> to run as the one using StreamingQuantile when the input is the three sets of 1 billion measurements.</p>
 
-<h3 id="toc_4">Testing</h3>
+<h3 id="testing">Testing</h3>
 
 <p>DataFu has a suite of unit tests for each UDF. Instead of just testing the Java code for a UDF directly, which might overlook issues with the way the UDF works in an actual Pig script, we used <a href="http://pig.apache.org/docs/r0.8.1/pigunit.html">PigUnit</a> to do our testing. This let us run Pig scripts locally and still integrate our tests into a framework such as <a href="http://www.junit.org/">JUnit</a> or <a href="http://testng.org/">TestNG</a>.</p>
 
 <p>We have also integrated the code coverage tracking tool <a href="http://cobertura.sourceforge.net/">Cobertura</a> into our Ant build file. This helps us flag areas in DataFu which lack sufficient testing.</p>
 
-<h3 id="toc_5">Conclusion</h3>
+<h3 id="conclusion">Conclusion</h3>
 
 <p>We hope this gives you a taste of what you can do with DataFu. We are accepting contributions, so if you are interested in helping out, please fork the code and send us your pull requests!</p>
 
@@ -163,8 +167,9 @@
 </div>
 
     
-      <div class="footer">
-Copyright &copy; 2011-2014 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
+      
+<div class="footer">
+Copyright &copy; 2011-2015 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
 Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
 </div>
 

Modified: incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html
URL: http://svn.apache.org/viewvc/incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html?rev=1709884&r1=1709883&r2=1709884&view=diff
==============================================================================
--- incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html (original)
+++ incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html Wed Oct 21 17:00:40 2015
@@ -1,3 +1,5 @@
+
+
 <!doctype html>
 <html>
   <head>
@@ -10,11 +12,9 @@
     <!-- Use title if it's in the page YAML frontmatter -->
     <title>DataFu, The WD-40 of Big Data</title>
     
-    <link href="/stylesheets/all.css" media="screen" rel="stylesheet" type="text/css" />
-<link href="/stylesheets/highlight.css" media="screen" rel="stylesheet" type="text/css" />
-    <script src="/javascripts/all.js" type="text/javascript"></script>
+    <link href="/stylesheets/all.css" rel="stylesheet" /><link href="/stylesheets/highlight.css" rel="stylesheet" />
+    <script src="/javascripts/all.js"></script>
 
-    
     <script type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-30533336-2']);
@@ -26,14 +26,14 @@
         var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
       })();
     </script>
-    
   </head>
   
   <body class="blog blog_2013 blog_2013_01 blog_2013_01_24 blog_2013_01_24_datafu-the-wd-40-of-big-data">
 
     <div class="container">
 
-      <div class="header">
+      
+<div class="header">
 
   <ul class="nav nav-pills pull-right">
     <li><a href="/">Home</a></li>
@@ -49,9 +49,7 @@
   <article class="col-lg-10">
     <h1>DataFu, The WD-40 of Big Data</h1>
     <h5 class="text-muted"><time>Jan 24, 2013</time></h5>
-    
       <h5 class="text-muted">Matthew Hayes, Sam Shah</h5>
-    
 
     <hr>
 
@@ -90,10 +88,10 @@ G = foreach F generate
 
 <p>You can grab sample data and code you can run on your own for this sessionization example below.</p>
 
-<h3 id="toc_0">Sessionization Example</h3>
+<h3 id="sessionization-example">Sessionization Example</h3>
 
 <p>Suppose that we have a stream of page views from which we have extracted a member ID and UNIX timestamp. It might look something like this:</p>
-<pre class="highlight text">memberId timestamp      url
+<pre class="highlight plaintext"><code>memberId timestamp      url
 1        1357718725941  /
 1        1357718871442  /profile
 1        1357719038706  /inbox
@@ -102,11 +100,12 @@ G = foreach F generate
 2        1357752955401  /inbox
 2        1357752982385  /profile
 ...
-</pre>
+</code></pre>
+
 <p>The full data set for this example can be found <a href="https://gist.github.com/raw/4614332/8231534822295e4626af75b3341239177ec44fbe/clicks.csv">here</a>.</p>
 
 <p>Using DataFu we can assign session IDs to each of these events and group by session ID in order to compute the length of each session. From there we can complete the exercise by simply applying the statistics UDFs provided by DataFu.</p>
-<pre class="highlight pig"><span class="k">REGISTER</span> <span class="n">piggybank</span><span class="p">.</span><span class="n">jar</span><span class="p">;</span>
+<pre class="highlight pig"><code><span class="k">REGISTER</span> <span class="n">piggybank</span><span class="p">.</span><span class="n">jar</span><span class="p">;</span>
 <span class="k">REGISTER</span> <span class="n">datafu</span><span class="o">-</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">6</span><span class="p">.</span><span class="n">jar</span><span class="p">;</span>
 <span class="k">REGISTER</span> <span class="n">guava</span><span class="o">-</span><span class="mi">13</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">1</span><span class="p">.</span><span class="n">jar</span><span class="p">;</span> <span class="c1">-- needed by StreamingQuantile
 </span>
@@ -149,16 +148,18 @@ G = foreach F generate
 
 <span class="k">DUMP</span> <span class="n">session_stats</span>
 <span class="c1">--(15.737532575757575,31.29552045993877,(2.848041666666667),(14.648516666666666,31.88788333333333,86.69525))
-</span></pre>
-<p>This is just a taste. There’s plenty more in the library for you to peruse. Take a look <a href="http://data.linkedin.com/opensource/datafu">here</a>. DataFu is freely available under the Apache 2 license. We welcome contributions, so please send us your pull requests!</p>
+</span></code></pre>
+
+<p>This is just a taste. There’s plenty more in the library for you to peruse. Take a look <a href="/docs/datafu/guide.html">here</a>. DataFu is freely available under the Apache 2 license. We welcome contributions, so please send us your pull requests!</p>
 
 
   </article>
 </div>
 
     
-      <div class="footer">
-Copyright &copy; 2011-2014 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
+      
+<div class="footer">
+Copyright &copy; 2011-2015 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
 Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
 </div>