Posted to commits@datafu.apache.org by wv...@apache.org on 2014/01/28 00:51:15 UTC

[50/51] [partial] DATAFU-20 Initial commit of website content

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/424e3b48/site/source/blog/2013-09-04-datafu-1-0.markdown
----------------------------------------------------------------------
diff --git a/site/source/blog/2013-09-04-datafu-1-0.markdown b/site/source/blog/2013-09-04-datafu-1-0.markdown
new file mode 100644
index 0000000..85a1c25
--- /dev/null
+++ b/site/source/blog/2013-09-04-datafu-1-0.markdown
@@ -0,0 +1,597 @@
+---
+title: DataFu 1.0
+author: William Vaughan
+---
+
+[DataFu](http://data.linkedin.com/opensource/datafu) is an open-source collection of user-defined functions for working with large-scale data in [Hadoop](http://hadoop.apache.org/) and [Pig](http://pig.apache.org/).
+
+About two years ago, we recognized a need for a stable, well-tested library of Pig UDFs that could assist in common data mining and statistics tasks. Over the years, we had developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily broke when someone made a change. Along came [PigUnit](http://pig.apache.org/docs/r0.11.1/test.html#pigunit), which made UDF testing practical, so we spent the time to clean up these routines by adding documentation and rigorous unit tests. We thought this cleaned-up “datafoo” package would be useful to the community at large, and that became the initial release of DataFu.
+
+Since then, the project has continued to evolve. We have accepted contributions from a number of sources, improved the style and quality of testing, and adapted to the changing features and versions of Pig. During this time DataFu has been used extensively at LinkedIn for many of our data-driven products like "People You May Know" and "Skills and Endorsements." The library is used at numerous companies, and it has also been included in Cloudera's Hadoop distribution ([CDH](http://www.cloudera.com/content/cloudera/en/products/cdh.html)) as well as the [Apache BigTop](http://bigtop.apache.org/) project. DataFu has matured, and we are proud to announce the [1.0 release](https://github.com/linkedin/datafu/blob/master/changes.md).
+
+This release of DataFu has a number of new features that can make writing Pig easier, cleaner, and more efficient. In this post, we are going to highlight some of these new features by walking through a large number of examples. Think of this as a HowTo Pig + DataFu guide.
+
+## Counting events
+
+Let's consider a hypothetical recommendation system. In this system, a user is recommended an item (an impression). The user can then accept that recommendation, explicitly reject that recommendation, or simply ignore it. A common task in such a system is to count how many times users have seen and acted on items. How could we construct a Pig script to implement this task?
+
+## Setup
+
+Before we start, it's best to define exactly what we want to do: our inputs and our outputs. The task is to generate, for each user, a list of all items that user has seen, with counts of how many times each item was seen, how many times it was accepted, and how many times it was rejected.
+
+In summary, our desired output schema is:
+
+```
+features: {user_id:int, items:{(item_id:int, impression_count:int, accept_count:int, reject_count:int)}}
+```
+
+For input, we can load a record for each event:
+
+```pig
+impressions = LOAD '$impressions' AS (user_id:int, item_id:int, timestamp:long);
+accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);
+rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);
+```
+
+## A naive approach
+
+The straightforward approach to this problem generates each of the counts that we want, joins all of these counts together, and then groups them by user to produce the desired output:
+
+```pig
+impressions_counted = FOREACH (GROUP impressions BY (user_id, item_id)) GENERATE
+  FLATTEN(group) as (user_id, item_id), COUNT_STAR(impressions) as count;
+accepts_counted = FOREACH (GROUP accepts BY (user_id, item_id)) GENERATE
+  FLATTEN(group) as (user_id, item_id), COUNT_STAR(accepts) as count;
+rejects_counted = FOREACH (GROUP rejects BY (user_id, item_id)) GENERATE
+  FLATTEN(group) as (user_id, item_id), COUNT_STAR(rejects) as count;
+ 
+joined_accepts = JOIN impressions_counted BY (user_id, item_id) LEFT OUTER, accepts_counted BY (user_id, item_id);  
+joined_accepts = FOREACH joined_accepts GENERATE 
+  impressions_counted::user_id as user_id,
+  impressions_counted::item_id as item_id,
+  impressions_counted::count as impression_count,
+  ((accepts_counted::count is null)?0:accepts_counted::count) as accept_count;
+ 
+joined_accepts_rejects = JOIN joined_accepts BY (user_id, item_id) LEFT OUTER, rejects_counted BY (user_id, item_id);
+joined_accepts_rejects = FOREACH joined_accepts_rejects GENERATE 
+  joined_accepts::user_id as user_id,
+  joined_accepts::item_id as item_id,
+  joined_accepts::impression_count as impression_count,
+  joined_accepts::accept_count as accept_count,
+  ((rejects_counted::count is null)?0:rejects_counted::count) as reject_count;
+ 
+features = FOREACH (GROUP joined_accepts_rejects BY user_id) GENERATE 
+  group as user_id, joined_accepts_rejects.(item_id, impression_count, accept_count, reject_count) as items;
+```
+
+Unfortunately, this approach is not very efficient. It generates six MapReduce jobs during execution and streams a lot of the same data through these jobs.
+
+## A better approach
+
+Recognizing that we can combine the outer joins and group operations into a single `cogroup` allows us to reduce the number of mapreduce jobs.
+
+```pig
+features_grouped = COGROUP impressions BY (user_id, item_id), accepts BY (user_id, item_id), rejects BY (user_id, item_id);
+features_counted = FOREACH features_grouped GENERATE 
+  FLATTEN(group) as (user_id, item_id),
+  COUNT_STAR(impressions) as impression_count,
+  COUNT_STAR(accepts) as accept_count,
+  COUNT_STAR(rejects) as reject_count;
+ 
+features = FOREACH (GROUP features_counted BY user_id) GENERATE 
+  group as user_id,
+  features_counted.(item_id, impression_count, accept_count, reject_count) as items;
+```
+
+However, we still have to perform an extra group operation to bring everything together by `user_id`, for a total of two MapReduce jobs.
+
+## The best approach: DataFu
+
+The two grouping operations in the last example operate on the same set of data. It would be great if we could just get rid of one of them somehow.
+
+One thing that we have noticed is that even very big data will frequently get reasonably small once you segment it sufficiently. In this case, we have to segment down to the user level for our output. That's small enough to fit in memory. So, with a little bit of DataFu, we can group up all of the data for that user, and process it in one pass:
+
+```pig
+DEFINE CountEach datafu.pig.bags.CountEach('flatten');
+DEFINE BagLeftOuterJoin datafu.pig.bags.BagLeftOuterJoin();
+DEFINE Coalesce datafu.pig.util.Coalesce();
+ 
+features_grouped = COGROUP impressions BY user_id, accepts BY user_id, rejects BY user_id;
+ 
+features_counted = FOREACH features_grouped GENERATE 
+  group as user_id,
+  CountEach(impressions.item_id) as impressions,
+  CountEach(accepts.item_id) as accepts,
+  CountEach(rejects.item_id) as rejects;
+ 
+features_joined = FOREACH features_counted GENERATE
+  user_id,
+  BagLeftOuterJoin(
+    impressions, 'item_id',
+    accepts, 'item_id',
+    rejects, 'item_id'
+  ) as items;
+ 
+features = FOREACH features_joined {
+  projected = FOREACH items GENERATE
+    impressions::item_id as item_id,
+    impressions::count as impression_count,
+    Coalesce(accepts::count, 0) as accept_count,
+    Coalesce(rejects::count, 0) as reject_count;
+  GENERATE user_id, projected as items;
+}
+```
+
+So, let's step through this example and see how it works and what our data looks like along the way.
+
+### Group the features
+
+First we group all of the data together by user, which gives us, for each user, a bag of each type of event data.
+
+```pig
+features_grouped = COGROUP impressions BY user_id, accepts BY user_id, rejects BY user_id;
+
+--features_grouped: {group: int,impressions: {(user_id: int,item_id: int,timestamp: long)},accepts: {(user_id: int,item_id: int,timestamp: long)},rejects: {(user_id: int,item_id: int,timestamp: long)}}
+```
+
+### CountEach
+
+Next we count the occurrences of each item in the impression, accept, and reject bags.
+
+```pig
+DEFINE CountEach datafu.pig.bags.CountEach('flatten');
+
+features_counted = FOREACH features_grouped GENERATE 
+    group as user_id,
+    CountEach(impressions.item_id) as impressions,
+    CountEach(accepts.item_id) as accepts,
+    CountEach(rejects.item_id) as rejects;
+
+--features_counted: {user_id: int,impressions: {(item_id: int,count: int)},accepts: {(item_id: int,count: int)},rejects: {(item_id: int,count: int)}}
+```
+
+CountEach is a new UDF in DataFu that iterates through a bag, counting the number of occurrences of each distinct tuple. In this case, we want to count occurrences of items, so we project the inner tuples of the bag to contain just the `item_id`. Since we specified the optional 'flatten' argument in the constructor, the output of the UDF is a bag containing each distinct input tuple (item_id) with a count field appended. For example, an input bag of {(7),(7),(9)} produces {(7,2),(9,1)}.
+
+### BagLeftOuterJoin
+
+Now, we want to combine all of the separate counts for each type of event together into one tuple per item.
+
+```pig
+DEFINE BagLeftOuterJoin datafu.pig.bags.BagLeftOuterJoin();
+
+features_joined = FOREACH features_counted GENERATE
+    user_id,
+    BagLeftOuterJoin(
+        impressions, 'item_id',
+        accepts, 'item_id',
+        rejects, 'item_id'
+    ) as items;
+
+--features_joined: {user_id: int,items: {(impressions::item_id: int,impressions::count: int,accepts::item_id: int,accepts::count: int,rejects::item_id: int,rejects::count: int)}}
+```
+
+This is a join operation, but unfortunately, the only join operation that Pig allows on bags (in a nested FOREACH) is `CROSS`. DataFu provides the BagLeftOuterJoin UDF to make up for this limitation. This UDF performs an in-memory hash join of the bags using the specified field as the join key. The output of this UDF mimics what you would expect from this bit of not (yet) valid Pig:
+
+```pig
+features_joined = FOREACH features_counted {
+  items = JOIN impressions BY item_id LEFT OUTER, accepts BY item_id, rejects BY item_id;
+  GENERATE
+    user_id, items;
+}
+```
+
+Because `BagLeftOuterJoin` is a UDF and works in memory, no separate MapReduce job is launched. This will save us some time, as we'll see later in the analysis.
+
+### Coalesce
+
+Finally, we have our data in about the right shape. We just need to clean up the schema and put some default values in place.
+
+```pig
+DEFINE Coalesce datafu.pig.util.Coalesce();
+
+features = FOREACH features_joined {
+    projected = FOREACH items GENERATE
+        impressions::item_id as item_id,
+        impressions::count as impression_count,
+        Coalesce(accepts::count, 0) as accept_count,
+        Coalesce(rejects::count, 0) as reject_count;
+  GENERATE user_id, projected as items;
+}
+
+--features: {user_id: int,items: {(item_id: int,impression_count: int,accept_count: int,reject_count: int)}}
+```
+
+The various counts were joined together using an outer join in the previous step because a user has not necessarily performed an accept or reject action on each item that he or she has seen. If the user has not acted, those fields will be null. `Coalesce` returns its first non-null parameter (for example, `Coalesce(null, 5, 2)` returns 5), which allows us to cleanly replace that null with a zero, avoiding the need for a bincond operator and maintaining the correct schema. Done!
+
+## Analysis
+
+OK, great: we now have three ways to write the same script. We know that the naive way will trigger six MapReduce jobs, the better way two, and the DataFu way one, but does that really equate to a difference in performance?
+
+Since we happened to have a dataset with a few billion records lying around, we decided to compare the three. We looked at two different metrics for evaluation. One is the best-case wall clock time, which is basically the sum of the durations of the slowest map and reduce tasks for each job (using Pig's default parallelism estimates). The other is total compute time, which is the sum of all map and reduce task durations.
+
+<table style="width: 80%; margin-bottom: 1em; cellpadding: 4px;">
+    <thead>
+        <tr>
+            <th>Version</th>
+            <th style="text-align: center">Wall clock time %</th>
+            <th style="text-align: center">Total compute time %</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+            <td>naive</td><td style="text-align: right;">100.0%</td><td style="text-align: right;">100.0%</td>
+        </tr>
+        <tr>
+            <td>better</td><td style="text-align: right;">30.2%</td><td style="text-align: right;">62.6%</td>
+        </tr>
+        <tr>
+            <td>datafu</td><td style="text-align: right;">10.5%</td><td style="text-align: right;">43.0%</td>
+        </tr>
+    </tbody>
+</table>
+
+As we can see, the DataFu version provides a noticeable improvement in both metrics. Glad to know that work wasn't all for naught.
+
+## Creating a custom purpose UDF
+
+Many UDFs, such as those presented in the previous section, are general purpose. DataFu serves to collect these UDFs and make sure they are tested and easily available. If you are writing such a UDF, then we will happily accept contributions. However, frequently when you sit down to write a UDF, it is because you need to insert some sort of custom business logic or calculation into your Pig script. These types of UDFs can easily become complex, involving a large number of parameters or nested structures.
+
+## Positional notation is bad
+
+Even once the code is written, you are not done. You have to maintain it.
+
+One of the difficult parts about this maintenance is that, as the Pig script that uses the UDF changes, a developer has to be careful not to change the parameters passed to the UDF. Worse, because a standard UDF references fields by position, it's very easy to introduce a subtle change with an unintended side effect that triggers no errors at runtime, for example when two fields of the same type swap positions.
+
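+For example, a conventional `EvalFunc` reads tuple fields by position, as in this minimal sketch (the class and the meaning of its fields are hypothetical, not DataFu code):
+
+```java
+import java.io.IOException;
+
+import org.apache.pig.EvalFunc;
+import org.apache.pig.data.Tuple;
+
+// A conventional EvalFunc reads fields by position. If the Pig script later
+// swaps two fields of the same type, this still runs without error but
+// silently reads the wrong values.
+public class InterestCharge extends EvalFunc<Double> {
+  @Override
+  public Double exec(Tuple input) throws IOException {
+    Double principal = (Double)input.get(0); // assumed to be the principal
+    Double rate = (Double)input.get(1);      // assumed to be the interest rate
+    return principal * rate;
+  }
+}
+```
+ 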
+## Aliases can be better
+
+Using aliases instead of positions makes it easier to maintain a consistent mapping between the UDF and the Pig script. If an alias is removed, the UDF will fail with an error. If an alias changes position in a tuple, the UDF does not need to care. The alias also has some semantic meaning to the developer, which can aid in the maintenance process.
+
+## AliasableEvalFunc
+
+Unfortunately, there is a problem with using aliases. As of Pig 0.11.1, they are not available when the UDF's `exec` method runs on the back-end; they are only available on the front-end. The solution is to capture a mapping of alias to position on the front-end, store that mapping in the UDF context, retrieve it on the back-end, and use it to look up each position by alias. You also need to handle a few issues with complex schemas (nested tuples and bags), keep track of UDF instances, and so on. To make this process simpler, DataFu provides `AliasableEvalFunc`, an extension of the standard `EvalFunc` with all of this behavior included.
+
+### Mortgage payment example
+
+Using `AliasableEvalFunc` is pretty simple; the primary difference is that you need to override `getOutputSchema` instead of `outputSchema`, and you get access to the alias-to-position map through a number of convenience methods. Consider the following example:
+
+```java
+public class MortgagePayment extends AliasableEvalFunc<DataBag> {
+  @Override
+  public Schema getOutputSchema(Schema input) {
+    try {
+      Schema tupleSchema = new Schema();
+      tupleSchema.add(new Schema.FieldSchema("monthly_payment", DataType.DOUBLE));
+      Schema bagSchema;
+    
+      bagSchema = new Schema(new Schema.FieldSchema(this.getClass().getName().toLowerCase(), tupleSchema, DataType.BAG));
+      return bagSchema;
+    } catch (FrontendException e) {
+      throw new RuntimeException(e);
+    }
+  }
+ 
+  @Override
+  public DataBag exec(Tuple input) throws IOException  {
+    DataBag output = BagFactory.getInstance().newDefaultBag();
+    
+    // get a value from the input tuple by alias
+    Double principal = getDouble(input, "principal");
+    Integer numPayments = getInteger(input, "num_payments");
+    DataBag interestRates = getBag(input, "interest_rates");
+    
+    for (Tuple interestTuple : interestRates) {
+      // get a value from the inner bag tuple by alias
+      Double interest = getDouble(interestTuple, getPrefixedAliasName("interest_rates", "interest_rate"));
+      double monthlyPayment = computeMonthlyPayment(principal, numPayments, interest);
+      output.add(TupleFactory.getInstance().newTuple(monthlyPayment));
+    }
+    
+    return output;
+  }
+ 
+  private double computeMonthlyPayment(Double principal, Integer numPayments, Double interest) {
+    return principal * (interest * Math.pow(interest+1, numPayments)) / (Math.pow(interest+1, numPayments) - 1.0);
+  }
+}
+```
+
+In this UDF we retrieve a couple of different types of fields from the input tuple by alias. One of these fields is a bag, and we also want to get values from the tuples in that bag. To avoid namespace collisions among the different levels of nested tuples, AliasableEvalFunc prepends the name of the enclosing bag or tuple. Thus, we use `getPrefixedAliasName` to find the field `interest_rate` inside the bag named `interest_rates`. That's all there is to using aliases in a UDF. As an added benefit, being able to dump schema information on errors helps in developing and debugging the UDF (see `datafu.pig.util.DataFuException`).
+
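+To put the UDF in context, it might be invoked from Pig along these lines (a sketch: the load path and schema below are made up, and the package name is assumed to mirror the blog's other examples):
+
+```pig
+DEFINE MortgagePayment datafu.test.blog.MortgagePayment();
+
+mortgages = LOAD 'mortgages.dat' AS (principal:double, num_payments:int,
+                                     interest_rates:bag{(interest_rate:double)});
+
+-- pass the whole tuple so the UDF can look up fields by alias
+payments = FOREACH mortgages GENERATE MortgagePayment(*) AS monthly_payments;
+```
+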
+### LinearRegression example
+
+Having access to the schema opens up UDF development possibilities. Let's look back at the recommendation system example from the first part. The script in that part generated a bunch of features about the items that users saw and clicked. That's a good start to a recommendation workflow, but the end goal is to select which items to recommend. A common way to do this is to assign a score to each item based on some sort of machine learning algorithm. A simple algorithm for this task is linear regression. OK, let's say we've trained our first linear regression model and are ready to plug it into our workflow to produce our scores.
+
+We could develop a custom UDF for this model that computes the score. It is just a weighted sum of the features. So, using `AliasableEvalFunc` we could retrieve each field that we need, multiply by the correct coefficient, and then sum these together. But, then every time we change the model, we are going to have to change the UDF to update the fields and coefficients. We know that our first model is not going to be very good and want to make it easy to plug in new models.
+
+The model for a linear regression is pretty simple; it's just a mapping of fields to coefficient values. The only things that will change between models are which fields we are interested in and what the coefficient for those fields will be. So, let's just pass in a string representation of the model and then let the UDF do the work of figuring out how to apply it.
+
+```pig
+DEFINE LinearRegression datafu.test.blog.LinearRegression('intercept:1,impression_count:-0.1,accept_count:2.0,reject_count:-1.0');
+ 
+features = LOAD 'test/pig/datafu/test/blog/features.dat' AS (user_id:int, items:bag{(item_id:int,impression_count:int,accept_count:int,reject_count:int)});
+ 
+recommendations = FOREACH features {
+  scored_items = FOREACH items GENERATE item_id, LinearRegression(*) as score;
+  GENERATE user_id, scored_items as items;
+}
+```
+
+Nice, that's clean, and we could even pass that model string in as a parameter so we don't have to change the Pig script to change the model either -- very reusable.
+
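+As a sketch of that idea, the coefficient string could come from Pig parameter substitution (the parameter name `model` here is arbitrary):
+
+```pig
+-- e.g. invoke with: pig -param model='intercept:1,impression_count:-0.1,accept_count:2.0,reject_count:-1.0' score.pig
+DEFINE LinearRegression datafu.test.blog.LinearRegression('$model');
+```
+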
+Now for the hard work: writing the UDF:
+
+```java
+public class LinearRegression extends AliasableEvalFunc<Double>
+{
+  Map<String, Double> parameters;
+  
+  public LinearRegression(String parameterString) {
+    parameters = new HashMap<String, Double>();
+    for (String token : parameterString.split(",")) {
+      String[] keyValue = token.split(":");
+      parameters.put(keyValue[0].trim(), Double.parseDouble(keyValue[1].trim()));
+    }     
+  }
+ 
+  @Override
+  public Schema getOutputSchema(Schema input) {
+    return new Schema(new Schema.FieldSchema("score", DataType.DOUBLE));
+  }
+ 
+  @Override
+  public Double exec(Tuple input) throws IOException {
+    double score = 0.0;
+    for (String parameter : parameters.keySet()) {
+      double coefficient = parameters.get(parameter);
+      if (parameter.equals("intercept")) {
+        score += coefficient;
+      } else {
+        score += coefficient * getDouble(input, parameter);
+      }
+    }
+    return score;
+  }
+}
+```
+
+OK, maybe not that hard... The UDF parses out the mapping of field to coefficient in the constructor and then looks up the specified fields by name in the exec function. So, what happens when we change the model? If we decide to drop a field from our model, it just gets ignored, even if it is in the input tuple. If we add a new feature that's already available in the data, it will just work. If we try to use a model with a new feature and forget to update the Pig script, it will throw an error and tell us which feature does not exist (as part of the behavior of `getDouble()`).
+
+Combining this example with the feature counting example presented earlier, we have the basis for a recommendation system that was easy to write, will execute quickly, and will be simple to maintain.
+
+## Sampling the data
+
+Working with big data can be a bit overwhelming and time-consuming. Sometimes you want to avoid some of this hassle and just look at a portion of the data. Pig has built-in support for random sampling with the `SAMPLE` operator. But sometimes a random percentage of the records is not quite what you need. Fortunately, DataFu has a few sampling UDFs that will help in some situations, and as always, we would be happy to accept any contributions of additional sampling UDFs, if you happen to have some lying around.
+
+These things are always easier to understand with a bit of code, so let's go back to our recommendation system context and look at a few more examples.
+
+## Example 1. Generate training data
+
+We mentioned previously that we were going to use a machine learning algorithm, linear regression, to generate scores for our items. We waved our hands and it just happened, but generally this task involves some work. One of the first steps is to generate the training data set for the learning algorithm. In order to make this training efficient, we only want to use a sample of all of our raw data.
+
+### Setup
+
+Given impressions, accepts, rejects, and some pre-computed features about a user and items, we'd like to generate a training set, which will have all of this information for each `(user_id, item_id)` pair, for some sample of users.
+
+So, from this input:
+
+```pig
+impressions = LOAD '$impressions' AS (user_id:int, item_id:int, timestamp:long);
+accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);
+rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);
+features = LOAD '$features' AS (user_id:int, item_id:int, feature_1:int, feature_2:int);
+```
+
+We want to produce this type of output:
+
+    {user_id, item_id, is_impressed, is_accepted, is_rejected, feature_1, feature_2}
+
+One key point on sampling here: we want the sampling to be done by `user_id`. This means that if we choose a `user_id` to be included in the sample, all the data for that `user_id` should be included in the sample. This requirement preserves the original characteristics of the raw data in the sampled data.
+
+### Naive approach
+
+The straightforward solution for this task is to group the tracking data for each (user, item) pair, then group it by `user_id`, sample this grouped data, and then flatten it all out again:
+
+```pig
+grouped = COGROUP impressions BY (user_id, item_id), accepts BY (user_id, item_id), rejects BY (user_id, item_id), features BY (user_id, item_id);
+full_result = FOREACH grouped GENERATE 
+  FLATTEN(group) AS (user_id, item_id),
+  ((impressions::timestamp is null)?0:1) AS is_impressed,
+  ((accepts::timestamp is null)?0:1) AS is_accepted,
+  ((rejects::timestamp is null)?0:1) AS is_rejected,
+  Coalesce(features::feature_1, 0) AS feature_1,
+  Coalesce(features::feature_2, 0) AS feature_2;
+ 
+grouped_full_result = GROUP full_result BY user_id;
+sampled = SAMPLE grouped_full_result 0.01;
+result = FOREACH sampled GENERATE 
+  group AS user_id,
+  FLATTEN(full_result);
+```
+
+This job includes two group operations, which translates to two MapReduce jobs. Also, the group operation is being done on the full data even though we will sample it down later. Can we do any better than this?
+
+### A sample of DataFu -- SampleByKey
+
+Yep.
+
+```pig
+DEFINE SampleByKey datafu.pig.sampling.SampleByKey('whatever_the_salt_you_want_to_use','0.01');
+ 
+impressions = FILTER impressions BY SampleByKey(user_id);
+accepts = FILTER accepts BY SampleByKey(user_id);
+rejects = FILTER rejects BY SampleByKey(user_id);
+features = FILTER features BY SampleByKey(user_id);
+ 
+grouped = COGROUP impressions BY (user_id, item_id), accepts BY (user_id, item_id), rejects BY (user_id, item_id), features BY (user_id, item_id);
+result = FOREACH grouped GENERATE 
+  FLATTEN(group) AS (user_id, item_id),
+  ((impressions::timestamp is null)?0:1) AS is_impressed,
+  ((accepts::timestamp is null)?0:1) AS is_accepted,
+  ((rejects::timestamp is null)?0:1) AS is_rejected,
+  Coalesce(features::feature_1, 0) AS feature_1,
+  Coalesce(features::feature_2, 0) AS feature_2;
+```
+
+We can use the `SampleByKey` FilterFunc to do this with only one group operation. And, since the group operates on the already sampled (significantly smaller) data, this job will be far more efficient.
+
+`SampleByKey` lets you designate which fields you want to use as keys for the sampling, and it guarantees that for each selected key, all other records with that key will also be selected, which is exactly what we want. Another characteristic of `SampleByKey` is that it is deterministic, as long as the same salt is given on initialization. Thanks to this, we were able to sample each relation separately before joining them in the example above.
+
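+As a sketch of what that determinism buys you (the `clicks` relation below is hypothetical), any other script that uses the same salt and probability keeps exactly the same set of users, so its sampled output lines up with ours:
+
+```pig
+DEFINE SampleByKey datafu.pig.sampling.SampleByKey('whatever_the_salt_you_want_to_use','0.01');
+
+-- the same user_ids pass this filter as in the script above, because the
+-- keep/drop decision depends only on the salt and the key
+clicks = LOAD '$clicks' AS (user_id:int, item_id:int, timestamp:long);
+clicks_sampled = FILTER clicks BY SampleByKey(user_id);
+```
+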
+## Example 2. Recommending your output
+
+OK, we've now created some training data that we used to build a model which produces a score for each recommendation. So now we've got to pick which items to show the user. But we've got a bit of a problem: we only have limited real estate on the screen to present our recommendations, so how do we select which ones to show? We've got a score from our model, so we could just always pick the top-scoring items. But then we might be showing the same recommendations all the time, and we want to shake things up a bit so things aren't so static (OK, yes, I admit this is a contrived example; you wouldn't do it this way in real life). So let's take a sample of the output.
+
+### Setup
+
+With this input:
+
+```pig
+recommendations = LOAD '$recommendations' AS (user_id:int, recs:bag{(item_id:int, score:double)});
+```
+
+We want to produce the exact same output, but with fewer items per user -- let's say no more than 10.
+
+### Naive approach
+
+We can take a random subset using Pig's built-in `SAMPLE` operator and then cap it with `LIMIT`.
+
+```pig
+results = FOREACH recommendations {
+  sampled = SAMPLE recs 1;
+  limited = LIMIT sampled 10;
+  GENERATE user_id, limited AS recs;
+}
+```
+
+The problem with this approach is that items are sampled from the population uniformly at random. The score you created with your learning algorithm has no effect on the final results.
+
+### The DataFu you most likely need -- WeightedSample
+
+We should use that score we generated to help bias our sample.
+
+```pig
+DEFINE WeightedSample datafu.pig.sampling.WeightedSample();
+results = FOREACH recommendations GENERATE user_id, WeightedSample(recs, 1, 10);
+-- from recs, using index 1 (the second column) as the weight, select up to 10 items
+```
+
+Fortunately, `WeightedSample` can do exactly that. It selects randomly from the candidates, but each candidate's score is used as the probability that the candidate will be selected. So tuples with higher weight have a higher chance of being included in the sample -- perfect.
+
+## Additional Examples
+
+If you've made it this far into the post, you deserve an encore. So here are two more examples of how DataFu can make writing Pig a bit simpler for you:
+
+## Filtering with In
+
+One case where conditional logic can be painful is filtering based on a set of values. Suppose you want to keep tuples whose field equals one of a list of values. In Pig this can be achieved by joining a list of conditional checks with OR:
+
+```pig
+data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
+  
+dump data;
+-- (roses,red)
+-- (violets,blue)
+-- (sugar,sweet)
+  
+data2 = FILTER data BY adj == 'red' OR adj == 'blue';
+  
+dump data2;
+-- (roses,red)
+-- (violets,blue)
+```
+
+However, as the number of items to check grows, this becomes very verbose. The `In` filter function solves this and makes the resulting code very concise:
+
+```pig
+DEFINE In datafu.pig.util.In();
+ 
+data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
+  
+dump data;
+-- (roses,red)
+-- (violets,blue)
+-- (sugar,sweet)
+  
+data2 = FILTER data BY In(adj, 'red','blue');
+  
+dump data2;
+-- (roses,red)
+-- (violets,blue)
+```
+
+## Left Outer Join of three or more relations with EmptyBagToNullFields
+
+Pig's `JOIN` operator supports performing left outer joins on two relations only. If you want to perform a join on more than two relations you have two options. One is to perform a sequence of joins.
+
+```pig
+input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
+input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
+input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
+  
+data1 = JOIN input1 BY val1 LEFT, input2 BY val1;
+data1 = FILTER data1 BY input1::val1 IS NOT NULL;
+  
+data2 = JOIN data1 BY input1::val1 LEFT, input3 BY val1;
+data2 = FILTER data2 BY input1::val1 IS NOT NULL;
+```
+
+However, this can be inefficient, as it requires multiple MapReduce jobs. For many situations, a better option is to use a single `COGROUP`, which requires only a single MapReduce job. However, the code gets pretty ugly.
+
+```pig
+input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
+input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
+input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
+  
+data1 = COGROUP input1 BY val1, input2 BY val1, input3 BY val1;
+data2 = FOREACH data1 GENERATE
+  FLATTEN(input1), -- left join on this
+  FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2)) 
+      as (input2::val1,input2::val2),
+  FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3)) 
+      as (input3::val1,input3::val2);
+```
+
+This code uses the insight that the input1 bag will be empty when there is no match, and flattening it will remove the entire record. If the input2 or input3 bags are empty, though, we don't want flattening them to remove the record, so we replace them with a bag containing a single tuple of null elements. When these are flattened we get a single tuple with null elements. But we want our output to have the correct schema, so we have to specify it manually. Once we do all of this, the approach successfully replicates the left join behavior. It's more efficient, but it's really ugly to type and read.
+
+To clean up this code we have created `EmptyBagToNullFields`, which replicates the same logic as in the example above, but in a much more concise and readable fashion.
+
+```pig
+DEFINE EmptyBagToNullFields datafu.pig.bags.EmptyBagToNullFields();
+ 
+input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
+input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
+input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
+  
+data1 = COGROUP input1 BY val1, input2 BY val1, input3 BY val1;
+data2 = FOREACH data1 GENERATE
+  FLATTEN(input1),
+  FLATTEN(EmptyBagToNullFields(input2)),
+  FLATTEN(EmptyBagToNullFields(input3));
+```
+
+Notice that we do not need to specify the schema as with the previous `COGROUP` example. The reason is that `EmptyBagToNullFields` produces the same schema as the input bag. So much cleaner.
+
+## Final Example
+
+OK, a second encore, but no more. If you are doing a lot of these, you can turn this into a macro:
+
+```pig
+DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined {
+  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;
+  $joined = FOREACH cogrouped GENERATE 
+    FLATTEN($relation1), 
+    FLATTEN(EmptyBagToNullFields($relation2)), 
+    FLATTEN(EmptyBagToNullFields($relation3));
+}
+```
+
+Then all you need to do is call your macro:
+
+```pig
+features = left_outer_join(input1, val1, input2, val2, input3, val3);
+```
+
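+If the macro lives in a separate file, it can be pulled in with Pig's `IMPORT` statement (the file name here is hypothetical):
+
+```pig
+IMPORT 'left_outer_join.pig';
+
+features = left_outer_join(input1, val1, input2, val2, input3, val3);
+```
+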
+## Wrap-up
+
+So, that's a lot to digest, but it's just a highlight of a few interesting pieces of DataFu. Check out the [DataFu 1.0 release](http://data.linkedin.com/opensource/datafu), as there's even more in store.
+
+We hope that it proves valuable to you and as always welcome any contributions. Please let us know how you're using the library — we would love to hear from you.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/424e3b48/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown
----------------------------------------------------------------------
diff --git a/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown b/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown
new file mode 100644
index 0000000..c8f619b
--- /dev/null
+++ b/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown
@@ -0,0 +1,444 @@
+---
+title: DataFu's Hourglass, Incremental Data Processing in Hadoop
+author: Matthew Hayes
+---
+
+For a large scale site such as LinkedIn, tracking metrics accurately and efficiently is an important task. For example, imagine we need a dashboard that shows the number of visitors to every page on the site over the last thirty days. To keep this dashboard up to date, we can schedule a query that runs daily and gathers the stats for the last 30 days. However, this simple implementation would be wasteful: only one day of data has changed, but we'd be consuming and recalculating the stats for all 30.
+
+A more efficient solution is to make the query incremental: using basic arithmetic, we can update the previous day's output by adding and subtracting input data (for example, the new 30-day count is yesterday's count plus the newest day minus the day that just dropped out of the window). This enables the job to process only the new data, significantly reducing the computational resources required. Unfortunately, although there are many benefits to the incremental approach, getting incremental jobs right is hard:
+
+* The job must maintain state to keep track of what has already been done, and compare this against the input to determine what to process.
+* If the previous output is reused, then the job needs to be written to consume not just new input data, but also previous outputs.
+* There are more things that can go wrong with an incremental job, so you typically need to spend more time writing automated tests to make sure things are working.
+
+To solve these problems, we are happy to announce that we have open sourced [Hourglass](https://github.com/linkedin/datafu/tree/master/contrib/hourglass), a framework that makes it much easier to write incremental Hadoop jobs. We are releasing Hourglass under the Apache 2.0 License as part of the [DataFu](https://github.com/linkedin/datafu) project. We will be presenting our "Hourglass: a Library for Incremental Processing on Hadoop" paper at the [IEEE BigData 2013](http://cci.drexel.edu/bigdata/bigdata2013/index.htm) conference on October 9th.
+
+In this post, we will give an overview of the basic concepts behind Hourglass and walk through examples of using the framework to solve processing tasks incrementally. The first example presents a job that counts how many times a member has logged in to a site. The second example presents a job that estimates the number of members who have visited in the past thirty days. Lastly, we will show you how to get the code and start writing your own incremental Hadoop jobs.
+
+## Basic Concepts
+
+Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.
+
+We have found that two types of sliding window computations are extremely common in practice:
+
+* **Fixed-length**: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the number of visitors to a site from the past 30 days.
+* **Fixed-start**: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.
+
+We designed Hourglass with these two use cases in mind. Our goal was to design building blocks that could efficiently solve these problems while maintaining a programming model familiar to developers of MapReduce jobs.
+
+The two major building blocks of incremental processing with Hourglass are a pair of Hadoop jobs:
+
+* **Partition-preserving**: consumes partitioned input data and produces partitioned output.
+* **Partition-collapsing**: consumes partitioned input data and merges it to produce a single output.
+
+We'll discuss these two jobs in the next two sections.
+
+## Partition-preserving job
+
+![partition-preserving job](/images/Hourglass-Concepts-Preserving.png)
+
+In the partition-preserving job, input data that is partitioned by day is consumed and output data is produced that is also partitioned by day. This is equivalent to running one MapReduce job separately for each day of input data. Suppose that the input data is a page view event and the goal is to count the number of page views by member. This job would produce the page view counts per member, partitioned by day.
+
+## Partition-collapsing job
+
+![partition-collapsing job](/images/Hourglass-Concepts-Collapsing.png)
+
+In the partition-collapsing job, input data that is partitioned by day is consumed and a single output is produced. If the input data is a page view event and the goal is to count the number of page views by member, then this job would produce the page view counts per member over the entire `n` days.
+
+Functionally, the partition-collapsing job is not too different from a standard MapReduce job. However, one very useful feature it has is the ability to reuse its previous output, enabling it to avoid reprocessing input data. So if day `n+1` arrives, it can merge it with the previous output, without having to reprocess days `1` through `n`. For many aggregation problems, this makes the computation much more efficient.
+
+![partition-collapsing job reusing previous output](/images/Hourglass-Concepts-CollapsingReuse.png)
+
+## Hourglass programming model
+
+The Hourglass jobs are implemented as MapReduce jobs for Hadoop:
+
+![Hourglass map, combine, and reduce](/images/Hourglass-MapCombineReduce.png)
+
+The `map` method receives values from the input data and, for each input, produces zero or more key-value pairs as output. Implementing the map operation is similar to implementing a mapper in standard Hadoop, just with a different interface. For the partition-preserving job, Hourglass automatically keeps the data partitioned by day, so the developer can focus purely on the application logic.
+
+The `reduce` method receives each key and a list of values. For each key, it produces the same key and a single new value; in some cases, it may produce no output at all for a particular key. In a standard Hadoop `reduce` implementation, the programmer is given an interface to the list of values and is responsible for pulling each value from it. Hourglass instead uses an *accumulator*, where this is reversed: the values are passed in one at a time, and at most one value can be produced as output.
+
+The `combine` method is optional and can be used as an optimization. Its purpose is to reduce the amount of data that is passed to the `reducer`, limiting I/O costs. Like the `reducer`, it also uses an accumulator. For each key, it produces the same key and a single new value, where the input and output values have the same type.
+
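+An accumulator has roughly the following shape (a simplified sketch of the contract, not the exact Hourglass signatures; see the Hourglass javadocs for the real interface):
+
+```java
+// Sketch of the accumulator contract used by Hourglass reducers and combiners.
+public interface Accumulator<V, O> {
+  void accumulate(V value); // called once for each value belonging to the current key
+  O getFinal();             // called after the last value; returns the aggregated output
+  void cleanup();           // resets any state before the next key is processed
+}
+```
+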
+Hourglass uses [Avro](http://avro.apache.org/) for all of the input and output data types in the diagram above, namely `k`, `v`, `v2`, and `v3`. One of the tasks when programming with Hourglass is to define the schemas for these types. The exception is the input schema, which is implicitly determined by the jobs when the input is inspected.
+
+## Example 1: Counting Events Per Member
+
+With the basic concepts out of the way, let's look at an example. Suppose that we have a website that tracks user logins as an event, and for each event, the member ID is recorded. These events are collected and stored in HDFS in Avro under paths with the format `/data/event/yyyy/MM/dd`. Suppose for this example our Avro schema is:
+
+    {
+      "type" : "record", "name" : "ExampleEvent", 
+      "namespace" : "datafu.hourglass.test",
+      "fields" : [ {
+        "name" : "id",
+        "type" : "long",
+        "doc" : "ID"
+      } ]
+    }
+
+The goal is to count how many times each member has logged in over the entire history and produce a daily report containing these counts. One solution is to simply consume all data under `/data/event` each day and aggregate by member ID. While this solution works, it is very wasteful (and only gets more wasteful over time), as it recomputes all the data every day, even though only one day's worth of data has changed. Wouldn't it be better if we could merge the previous result with the new data? With Hourglass you can.
+
+To continue our example, let's say there are two days of data currently available, 2013/03/15 and 2013/03/16, and that their contents are:
+
+    2013/03/15:
+    {"id": 1}, {"id": 1}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 3}
+     
+    2013/03/16:
+    {"id": 1}, {"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}, 
+
+Let's aggregate the counts by member ID using Hourglass. To perform the aggregation we will use [PartitionCollapsingIncrementalJob](/docs/hourglass/0.1.3/datafu/hourglass/jobs/PartitionCollapsingIncrementalJob.html), which takes a partitioned data set and collapses all the partitions together into a single output. The goal is to aggregate the two days of input and produce a single day of output, as in the following diagram:
+
+![collapsing two days of input into a single output](/images/Hourglass-Example1-Step1.png)
+
+First, create the job:
+
+```java
+PartitionCollapsingIncrementalJob job = 
+    new PartitionCollapsingIncrementalJob(Example.class);
+```
+
+Next, we will define schemas for the key and value used by the job. The key affects how data is grouped in the reducer when we perform the aggregation. In this case, it will be the member ID. The value is the piece of data being aggregated, which will be an integer representing the count.
+
+```java
+final String namespace = "com.example";
+ 
+final Schema keySchema = 
+  Schema.createRecord("Key",null,namespace,false);
+ 
+keySchema.setFields(Arrays.asList(
+  new Field("member_id",Schema.create(Type.LONG),null,null)));
+ 
+final String keySchemaString = keySchema.toString(true);
+ 
+final Schema valueSchema = 
+  Schema.createRecord("Value",null,namespace,false);
+ 
+valueSchema.setFields(Arrays.asList(
+  new Field("count",Schema.create(Type.INT),null,null)));
+ 
+final String valueSchemaString = valueSchema.toString(true);
+```
+
+This produces the following representation:
+
+    {
+      "type" : "record", "name" : "Key", "namespace" : "com.example",
+      "fields" : [ {
+        "name" : "member_id",
+        "type" : "long"
+      } ]
+    }
+     
+    {
+      "type" : "record", "name" : "Value", "namespace" : "com.example",
+      "fields" : [ {
+        "name" : "count",
+        "type" : "int"
+      } ]
+    }
+
+Now we can tell the job what our schemas are. Hourglass allows two different value types. One is the intermediate value type that is produced by the mapper and combiner. The other is the output value type, the product of the reducer. In this case we will use the same value type for each.
+
+```java
+job.setKeySchema(keySchema);
+job.setIntermediateValueSchema(valueSchema);
+job.setOutputValueSchema(valueSchema);
+```
+
+Next, we will tell Hourglass where to find the data, where to write the data, and that we want to reuse the previous output.
+
+```java
+job.setInputPaths(Arrays.asList(new Path("/data/event")));
+job.setOutputPath(new Path("/output"));
+job.setReusePreviousOutput(true);
+```
+
+Now let's get into some application logic. The mapper will produce a key-value pair from each input record, consisting of the member ID and a count, which for each input record will just be `1`.
+
+```java
+job.setMapper(new Mapper<GenericRecord,GenericRecord,GenericRecord>()
+{
+  private transient Schema kSchema;
+  private transient Schema vSchema;
+  
+  @Override
+  public void map(
+    GenericRecord input,
+    KeyValueCollector<GenericRecord, GenericRecord> collector) 
+  throws IOException, InterruptedException 
+  {
+    if (kSchema == null) 
+      kSchema = new Schema.Parser().parse(keySchemaString);
+ 
+    if (vSchema == null) 
+      vSchema = new Schema.Parser().parse(valueSchemaString);
+ 
+    GenericRecord key = new GenericData.Record(kSchema);
+    key.put("member_id", input.get("id"));
+ 
+    GenericRecord value = new GenericData.Record(vSchema);
+    value.put("count", 1);
+ 
+    collector.collect(key,value);
+  }      
+});
+```
+
+An accumulator is responsible for aggregating this data. Records will be grouped by member ID and then passed to the accumulator one-by-one. The accumulator keeps a running total and adds each input count to it. When all data has been passed to it, the `getFinal()` method will be called, which returns the output record containing the count.
+
+```java
+job.setReducerAccumulator(new Accumulator<GenericRecord,GenericRecord>() 
+{
+  private transient int count;
+  private transient Schema vSchema;
+  
+  @Override
+  public void accumulate(GenericRecord value) {
+    this.count += (Integer)value.get("count");
+  }
+ 
+  @Override
+  public GenericRecord getFinal() {
+    if (vSchema == null) 
+      vSchema = new Schema.Parser().parse(valueSchemaString);
+ 
+    GenericRecord output = new GenericData.Record(vSchema);
+    output.put("count", count);
+ 
+    return output;
+  }
+ 
+  @Override
+  public void cleanup() {
+    this.count = 0;
+  }      
+});
+```
+
+Since the intermediate and output values have the same schema, the accumulator can also be used for the combiner, so let's indicate that we want it to be used for that:
+
+```java
+job.setCombinerAccumulator(job.getReducerAccumulator());
+job.setUseCombiner(true);
+```
+
+Finally, we run the job.
+
+```java
+job.run();
+```
+
+When we inspect the output we find that the counts match what we expect:
+
+    {"key": {"member_id": 1}, "value": {"count": 5}}
+    {"key": {"member_id": 2}, "value": {"count": 3}}
+    {"key": {"member_id": 3}, "value": {"count": 3}}
+
+Now suppose that a new day of data becomes available:
+
+    2013/03/17:
+    {"id": 1}, {"id": 1}, {"id": 2}, {"id": 2}, {"id": 2},
+    {"id": 3}, {"id": 3}
+
+Let's run the job again. Since Hourglass already has a result for the previous day, it consumes the new day of input and the previous output, rather than all the input data it already processed.
+
+![consuming the new day of input along with the previous output](/images/Hourglass-Example1-Step2.png)
+
+The previous output is passed to the accumulator, where it is aggregated with the new data. This produces the output we expect:
+
+    {"key": {"member_id": 1}, "value": {"count": 7}}
+    {"key": {"member_id": 2}, "value": {"count": 6}}
+    {"key": {"member_id": 3}, "value": {"count": 5}}
+
+In this example, we only have a few days of input data, so the impact of incrementally processing the new data is small. However, as the size of the input data grows, the benefit of incrementally processing data becomes very significant.
+
+## Example 2: Cardinality Estimation
+
+Suppose that we have another event that tracks every page view that occurs on the site. One piece of information recorded in the event is the member ID. We want to use this event to tackle another problem: a daily report that estimates the number of members who have been active on the site in the past 30 days.
+
+The straightforward approach is to read in the past 30 days of data, perform a `distinct` operation on the member IDs, and then count the IDs. However, as in the previous case, the input data from day to day is practically the same; it only differs in the days at the beginning and end of the window. This means that each day we are repeating much of the same work. But unlike the previous case, we cannot simply merge the new day of data with the previous output, because the window length is fixed and we want the oldest day to be removed when the window advances. The `PartitionCollapsingIncrementalJob` class alone from the previous example will not solve this problem.
+
+Hourglass includes another class to address the fixed-length sliding window use case: the [PartitionPreservingIncrementalJob](/docs/hourglass/0.1.3/datafu/hourglass/jobs/PartitionPreservingIncrementalJob.html). This type of job consumes partitioned input data, just like the collapsing version, but unlike that job, its output is also partitioned. It keeps the data partitioned as it processes it and uses Hadoop's multiple-outputs feature to produce data partitioned by day. This is equivalent to running a MapReduce job for each individual day of input data, but much more efficient.
+
+With the `PartitionPreservingIncrementalJob`, we can perform aggregation per day and then use the `PartitionCollapsingIncrementalJob` to produce the final result. For basic arithmetic-based operations like summation, we could save ourselves even more work by reusing the previous output, subtracting off the oldest day and adding the newest one.
+
+So how can we use the two jobs together to get the cardinality of active members over the past 30 days? One solution is to use `PartitionPreservingIncrementalJob` to produce daily sets of distinct member IDs. That is, each day of data produced has all the IDs for members that accessed the site that day. In other words, this is a `distinct` operation. Then `PartitionCollapsingIncrementalJob` can consume this data, perform `distinct` again, and count the number of IDs. The benefit of this approach is that when a new day of data arrives, the partition-preserving job only needs to process that new day and nothing else, as the previous days have already been processed. This idea is outlined below.
+
+![counting distinct members with partition-preserving and partition-collapsing jobs](/images/Hourglass-Example2-DistinctMembers.png)
+
+This solution should be more efficient than the naive solution. However, if an estimate of member cardinality is satisfactory, then we could make the job even more efficient. [HyperLogLog](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.9475) is capable of estimating the cardinality of large data sets very accurately using a relatively small amount of memory. For example, cardinalities in the billions can be estimated to within 2% accuracy using only 1.5 KB of memory. It's also friendly to distributed computing, as multiple HyperLogLog estimators can be merged together.
+
+HyperLogLog is a good fit for this use case. For this example, we will use [HyperLogLogPlus](https://github.com/clearspring/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java) from [stream-lib](https://github.com/clearspring/stream-lib), an implementation based on [this paper](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40671.pdf) that includes some enhancements to the original algorithm. We can use a HyperLogLogPlus estimator for each day of input data in the partition-preserving job and serialize the estimator's bytes as the output. Then the partition-collapsing job can merge together the estimators for the time window to produce the final estimate.
+
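+As a quick orientation to the stream-lib calls used below, here is a minimal standalone sketch (the sample IDs are made up):
+
+```java
+import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
+
+public class HllSketch {
+  public static void main(String[] args) {
+    // Estimate the number of distinct member IDs seen in a stream.
+    HyperLogLogPlus estimator = new HyperLogLogPlus(20); // precision parameter
+    for (long memberId : new long[] {1L, 2L, 2L, 3L}) {
+      estimator.offer(memberId); // add a value to the estimator
+    }
+    System.out.println(estimator.cardinality()); // approximately 3
+  }
+}
+```
+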
+Let's start by defining the mapper. The key it uses is just a dummy value, as we are only producing a single statistic in this case. For the value we use a record with two fields: one is the count estimate; the other we'll just call "data", which can be either a single member ID or the bytes from the serialized estimator. For the map output we use the member ID.
+
+```java
+Mapper<GenericRecord,GenericRecord,GenericRecord> mapper = 
+  new Mapper<GenericRecord,GenericRecord,GenericRecord>() {
+    private transient Schema kSchema;
+    private transient Schema vSchema;
+  
+    @Override
+    public void map(
+      GenericRecord input,
+      KeyValueCollector<GenericRecord, GenericRecord> collector) 
+    throws IOException, InterruptedException
+    {
+      if (kSchema == null) 
+        kSchema = new Schema.Parser().parse(keySchemaString);
+      
+      if (vSchema == null) 
+        vSchema = new Schema.Parser().parse(valueSchemaString);
+      
+      GenericRecord key = new GenericData.Record(kSchema);
+      key.put("name", "member_count");
+      
+      GenericRecord value = new GenericData.Record(vSchema);
+      value.put("data",input.get("id")); // member id
+      value.put("count", 1L);            // just a single member
+      
+      collector.collect(key,value);        
+    }      
+  };
+```
+
+Next, we'll define the accumulator, which can be used for both the combiner and the reducer. This accumulator can handle either member IDs or estimator bytes. When it receives a member ID it adds it to the HyperLogLog estimator. When it receives an estimator it merges it with the current estimator to produce a new one. To produce the final result, it gets the current estimate and also serializes the current estimator as a sequence of bytes.
+
+```java
+Accumulator<GenericRecord,GenericRecord> accumulator = 
+  new Accumulator<GenericRecord,GenericRecord>() {
+    private transient HyperLogLogPlus estimator;
+    private transient Schema vSchema;
+  
+    @Override
+    public void accumulate(GenericRecord value)
+    {
+      if (estimator == null) estimator = new HyperLogLogPlus(20);
+      Object data = value.get("data");
+      if (data instanceof Long)
+      {
+        estimator.offer(data);
+      }
+      else if (data instanceof ByteBuffer)
+      {
+        ByteBuffer bytes = (ByteBuffer)data;
+        HyperLogLogPlus newEstimator;
+        try
+        {
+          newEstimator = 
+            HyperLogLogPlus.Builder.build(bytes.array());
+ 
+          estimator = 
+            (HyperLogLogPlus)estimator.merge(newEstimator);
+        }
+        catch (IOException e)
+        {
+          throw new RuntimeException(e);
+        }
+        catch (CardinalityMergeException e)
+        {
+          throw new RuntimeException(e);
+        }      
+      }
+    }
+ 
+    @Override
+    public GenericRecord getFinal()
+    {
+      if (vSchema == null) 
+        vSchema = new Schema.Parser().parse(valueSchemaString);
+      
+      GenericRecord output = new GenericData.Record(vSchema);
+      
+      try
+      {
+        ByteBuffer bytes = 
+          ByteBuffer.wrap(estimator.getBytes());
+        output.put("data", bytes);
+        output.put("count", estimator.cardinality());
+      }
+      catch (IOException e)
+      {
+        throw new RuntimeException(e);
+      }
+      return output;
+    }
+ 
+    @Override
+    public void cleanup()
+    {
+      estimator = null;
+    }      
+  };
+```
+
+So there you have it. With the mapper and accumulator defined, it is just a matter of passing them to the two jobs and providing some additional configuration. The key piece is to ensure that the second job uses a 30-day sliding window:
+
+```java
+PartitionCollapsingIncrementalJob job2 = 
+  new PartitionCollapsingIncrementalJob(Example.class);    
+ 
+// ...
+ 
+job2.setNumDays(30); // 30 day sliding window
+```
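+
+For reference, the full wiring might look roughly like the sketch below. Apart from `setNumDays`, the setter and `run` method names are assumptions about the Hourglass API of the time rather than something taken from this post, and input/output path configuration is still omitted.
+
+```java
+PartitionCollapsingIncrementalJob job2 =
+  new PartitionCollapsingIncrementalJob(Example.class);
+
+// Hypothetical wiring of the mapper and accumulator defined above.
+job2.setMapper(mapper);                   // assumed setter name
+job2.setCombinerAccumulator(accumulator); // assumed setter name
+job2.setReducerAccumulator(accumulator);  // assumed setter name
+job2.setNumDays(30);                      // 30-day sliding window
+
+job2.run();                               // assumed method name
+```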
+
+## Try it yourself!
+
+Here is how you can start using Hourglass. We'll test out the job from the first example against some test data we'll create in Hadoop. First, clone the DataFu repository and navigate to the Hourglass directory:
+
+    git clone https://github.com/linkedin/datafu.git
+    cd datafu/contrib/hourglass
+
+Build the Hourglass JAR, as well as a test JAR that contains the example jobs above.
+
+    ant jar
+    ant testjar
+
+Define some variables that we'll need for the `hadoop jar` command. These list the JAR dependencies, as well as the two JARs we just built.
+
+    export LIBJARS=$(find "lib/common" -name '*.jar' | xargs echo | tr ' ' ',')
+    export LIBJARS=$LIBJARS,$(find "build" -name '*.jar' | xargs echo | tr ' ' ',')
+    export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
+
+Generate some test data under `/data/event` using the `generate` tool included in the test JAR. This will create random events for each date from 2013/03/01 through 2013/03/14. Each record consists of just a single long value in the range 1-100.
+
+    hadoop jar build/datafu-hourglass-test.jar generate -libjars ${LIBJARS} /data/event 2013/03/01-2013/03/14
+
+To get a sense of what the data looks like, we can copy it locally and dump the first several records.
+
+    hadoop fs -copyToLocal /data/event/2013/03/01/part-00000.avro temp.avro
+    java -jar lib/test/avro-tools-jar-1.7.4.jar tojson temp.avro | head
+
+Now run the `countbyid` tool, which executes the job from the first example that we defined earlier. This will count the number of events for each ID value. In the output you will notice that it reads all fourteen days of input that are available.
+
+    hadoop jar build/datafu-hourglass-test.jar countbyid -libjars ${LIBJARS} /data/event /output
+
+We can see what this produced by copying the output locally and dumping the first several records. Each record consists of an ID and a count.
+
+    rm temp.avro
+    hadoop fs -copyToLocal /output/20130314/part-r-00000.avro temp.avro
+    java -jar lib/test/avro-tools-jar-1.7.4.jar tojson temp.avro | head
+
+Now let's generate an additional day of data for 2013/03/15.
+
+    hadoop jar build/datafu-hourglass-test.jar generate -libjars ${LIBJARS} /data/event 2013/03/15
+
+We can run the incremental job again. This time it will reuse the previous output and will only consume the new day of input.
+
+    hadoop jar build/datafu-hourglass-test.jar countbyid -libjars ${LIBJARS} /data/event /output
+
+We can download the new output and inspect the counts:
+
+    rm temp.avro
+    hadoop fs -copyToLocal /output/20130315/part-r-00000.avro temp.avro
+    java -jar lib/test/avro-tools-jar-1.7.4.jar tojson temp.avro | head
+
+Both of the examples in this post are also included as unit tests in the `Example` class within the source code. Some code has been omitted from the examples in this post for the sake of space, so please check the original source if you're interested in more of the details.
+
+If you're interested in the project, we also encourage you to try the unit tests, which can be run either in Eclipse once the project is loaded there or with `ant test` at the command line.
+
+## Conclusion
+
+We hope this whets your appetite for incremental data processing with DataFu's Hourglass. The [code](https://github.com/linkedin/datafu/tree/master/contrib/hourglass) is available on GitHub in the [DataFu](https://github.com/linkedin/datafu) repository under the Apache 2.0 license. Documentation is available [here](/docs/hourglass/current/). We are accepting contributions, so if you are interested in helping out, please fork the code and send us your pull requests!
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/424e3b48/site/source/blog/index.html.erb
----------------------------------------------------------------------
diff --git a/site/source/blog/index.html.erb b/site/source/blog/index.html.erb
new file mode 100644
index 0000000..e62c26c
--- /dev/null
+++ b/site/source/blog/index.html.erb
@@ -0,0 +1,19 @@
+---
+title: "Blog - DataFu"
+---
+
+<% blog.articles.each do |article| %>
+<div class="row">
+  <article class="col-lg-10">
+    <h2><%= link_to article.title, article %></h2>
+    <h5 class="text-muted"><time><%= article.date.strftime('%b %e, %Y') %></time></h5>
+    <% if article.data.author %>
+    <h5 class="text-muted">
+        <%= article.data.author %>
+    </h5>
+    <% end %>
+    <%= article.summary %>
+    <%= link_to "Read more...", article %>
+  </article>
+</div>
+<% end %>

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/424e3b48/site/source/docs/datafu/1.0.0/allclasses-frame.html
----------------------------------------------------------------------
diff --git a/site/source/docs/datafu/1.0.0/allclasses-frame.html b/site/source/docs/datafu/1.0.0/allclasses-frame.html
new file mode 100644
index 0000000..8fcd276
--- /dev/null
+++ b/site/source/docs/datafu/1.0.0/allclasses-frame.html
@@ -0,0 +1,135 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<!--NewPage-->
+<HTML>
+<HEAD>
+<!-- Generated by javadoc (build 1.6.0_27) on Wed Sep 04 12:22:40 PDT 2013 -->
+<TITLE>
+All Classes (DataFu 1.0.0)
+</TITLE>
+
+<META NAME="date" CONTENT="2013-09-04">
+
+<LINK REL ="stylesheet" TYPE="text/css" HREF="stylesheet.css" TITLE="Style">
+
+
+</HEAD>
+
+<BODY BGCOLOR="white">
+<FONT size="+1" CLASS="FrameHeadingFont">
+<B>All Classes</B></FONT>
+<BR>
+
+<TABLE BORDER="0" WIDTH="100%" SUMMARY="">
+<TR>
+<TD NOWRAP><FONT CLASS="FrameItemFont"><A HREF="datafu/pig/util/AliasableEvalFunc.html" title="class in datafu.pig.util" target="classFrame">AliasableEvalFunc</A>
+<BR>
+<A HREF="datafu/pig/bags/AppendToBag.html" title="class in datafu.pig.bags" target="classFrame">AppendToBag</A>
+<BR>
+<A HREF="datafu/pig/util/Assert.html" title="class in datafu.pig.util" target="classFrame">Assert</A>
+<BR>
+<A HREF="datafu/pig/bags/BagConcat.html" title="class in datafu.pig.bags" target="classFrame">BagConcat</A>
+<BR>
+<A HREF="datafu/pig/bags/BagGroup.html" title="class in datafu.pig.bags" target="classFrame">BagGroup</A>
+<BR>
+<A HREF="datafu/pig/bags/BagLeftOuterJoin.html" title="class in datafu.pig.bags" target="classFrame">BagLeftOuterJoin</A>
+<BR>
+<A HREF="datafu/pig/bags/BagSplit.html" title="class in datafu.pig.bags" target="classFrame">BagSplit</A>
+<BR>
+<A HREF="datafu/pig/util/BoolToInt.html" title="class in datafu.pig.util" target="classFrame">BoolToInt</A>
+<BR>
+<A HREF="datafu/pig/util/Coalesce.html" title="class in datafu.pig.util" target="classFrame">Coalesce</A>
+<BR>
+<A HREF="datafu/pig/util/ContextualEvalFunc.html" title="class in datafu.pig.util" target="classFrame">ContextualEvalFunc</A>
+<BR>
+<A HREF="datafu/pig/bags/CountEach.html" title="class in datafu.pig.bags" target="classFrame">CountEach</A>
+<BR>
+<A HREF="datafu/pig/util/DataFuException.html" title="class in datafu.pig.util" target="classFrame">DataFuException</A>
+<BR>
+<A HREF="datafu/pig/bags/DistinctBy.html" title="class in datafu.pig.bags" target="classFrame">DistinctBy</A>
+<BR>
+<A HREF="datafu/pig/bags/EmptyBagToNull.html" title="class in datafu.pig.bags" target="classFrame">EmptyBagToNull</A>
+<BR>
+<A HREF="datafu/pig/bags/EmptyBagToNullFields.html" title="class in datafu.pig.bags" target="classFrame">EmptyBagToNullFields</A>
+<BR>
+<A HREF="datafu/pig/bags/Enumerate.html" title="class in datafu.pig.bags" target="classFrame">Enumerate</A>
+<BR>
+<A HREF="datafu/pig/util/FieldNotFound.html" title="class in datafu.pig.util" target="classFrame">FieldNotFound</A>
+<BR>
+<A HREF="datafu/pig/bags/FirstTupleFromBag.html" title="class in datafu.pig.bags" target="classFrame">FirstTupleFromBag</A>
+<BR>
+<A HREF="datafu/pig/geo/HaversineDistInMiles.html" title="class in datafu.pig.geo" target="classFrame">HaversineDistInMiles</A>
+<BR>
+<A HREF="datafu/pig/util/In.html" title="class in datafu.pig.util" target="classFrame">In</A>
+<BR>
+<A HREF="datafu/pig/util/IntToBool.html" title="class in datafu.pig.util" target="classFrame">IntToBool</A>
+<BR>
+<A HREF="datafu/pig/stats/MarkovPairs.html" title="class in datafu.pig.stats" target="classFrame">MarkovPairs</A>
+<BR>
+<A HREF="datafu/pig/hash/MD5.html" title="class in datafu.pig.hash" target="classFrame">MD5</A>
+<BR>
+<A HREF="datafu/pig/stats/Median.html" title="class in datafu.pig.stats" target="classFrame">Median</A>
+<BR>
+<A HREF="datafu/pig/bags/NullToEmptyBag.html" title="class in datafu.pig.bags" target="classFrame">NullToEmptyBag</A>
+<BR>
+<A HREF="datafu/pig/linkanalysis/PageRank.html" title="class in datafu.pig.linkanalysis" target="classFrame">PageRank</A>
+<BR>
+<A HREF="datafu/pig/linkanalysis/PageRankImpl.html" title="class in datafu.pig.linkanalysis" target="classFrame">PageRankImpl</A>
+<BR>
+<A HREF="datafu/pig/bags/PrependToBag.html" title="class in datafu.pig.bags" target="classFrame">PrependToBag</A>
+<BR>
+<A HREF="datafu/pig/stats/Quantile.html" title="class in datafu.pig.stats" target="classFrame">Quantile</A>
+<BR>
+<A HREF="datafu/pig/stats/QuantileUtil.html" title="class in datafu.pig.stats" target="classFrame">QuantileUtil</A>
+<BR>
+<A HREF="datafu/pig/random/RandInt.html" title="class in datafu.pig.random" target="classFrame">RandInt</A>
+<BR>
+<A HREF="datafu/pig/sampling/ReservoirSample.html" title="class in datafu.pig.sampling" target="classFrame">ReservoirSample</A>
+<BR>
+<A HREF="datafu/pig/sampling/ReservoirSample.Final.html" title="class in datafu.pig.sampling" target="classFrame">ReservoirSample.Final</A>
+<BR>
+<A HREF="datafu/pig/sampling/ReservoirSample.Initial.html" title="class in datafu.pig.sampling" target="classFrame">ReservoirSample.Initial</A>
+<BR>
+<A HREF="datafu/pig/sampling/ReservoirSample.Intermediate.html" title="class in datafu.pig.sampling" target="classFrame">ReservoirSample.Intermediate</A>
+<BR>
+<A HREF="datafu/pig/bags/ReverseEnumerate.html" title="class in datafu.pig.bags" target="classFrame">ReverseEnumerate</A>
+<BR>
+<A HREF="datafu/pig/sampling/SampleByKey.html" title="class in datafu.pig.sampling" target="classFrame">SampleByKey</A>
+<BR>
+<A HREF="datafu/pig/sessions/SessionCount.html" title="class in datafu.pig.sessions" target="classFrame">SessionCount</A>
+<BR>
+<A HREF="datafu/pig/sessions/Sessionize.html" title="class in datafu.pig.sessions" target="classFrame">Sessionize</A>
+<BR>
+<A HREF="datafu/pig/sets/SetIntersect.html" title="class in datafu.pig.sets" target="classFrame">SetIntersect</A>
+<BR>
+<A HREF="datafu/pig/sets/SetUnion.html" title="class in datafu.pig.sets" target="classFrame">SetUnion</A>
+<BR>
+<A HREF="datafu/pig/util/SimpleEvalFunc.html" title="class in datafu.pig.util" target="classFrame">SimpleEvalFunc</A>
+<BR>
+<A HREF="datafu/pig/stats/StreamingMedian.html" title="class in datafu.pig.stats" target="classFrame">StreamingMedian</A>
+<BR>
+<A HREF="datafu/pig/stats/StreamingQuantile.html" title="class in datafu.pig.stats" target="classFrame">StreamingQuantile</A>
+<BR>
+<A HREF="datafu/pig/util/TransposeTupleToBag.html" title="class in datafu.pig.util" target="classFrame">TransposeTupleToBag</A>
+<BR>
+<A HREF="datafu/pig/bags/UnorderedPairs.html" title="class in datafu.pig.bags" target="classFrame">UnorderedPairs</A>
+<BR>
+<A HREF="datafu/pig/urls/UserAgentClassify.html" title="class in datafu.pig.urls" target="classFrame">UserAgentClassify</A>
+<BR>
+<A HREF="datafu/pig/stats/VAR.html" title="class in datafu.pig.stats" target="classFrame">VAR</A>
+<BR>
+<A HREF="datafu/pig/stats/VAR.Final.html" title="class in datafu.pig.stats" target="classFrame">VAR.Final</A>
+<BR>
+<A HREF="datafu/pig/stats/VAR.Initial.html" title="class in datafu.pig.stats" target="classFrame">VAR.Initial</A>
+<BR>
+<A HREF="datafu/pig/stats/VAR.Intermediate.html" title="class in datafu.pig.stats" target="classFrame">VAR.Intermediate</A>
+<BR>
+<A HREF="datafu/pig/sampling/WeightedSample.html" title="class in datafu.pig.sampling" target="classFrame">WeightedSample</A>
+<BR>
+<A HREF="datafu/pig/stats/WilsonBinConf.html" title="class in datafu.pig.stats" target="classFrame">WilsonBinConf</A>
+<BR>
+</FONT></TD>
+</TR>
+</TABLE>
+
+</BODY>
+</HTML>

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/424e3b48/site/source/docs/datafu/1.0.0/allclasses-noframe.html
----------------------------------------------------------------------
diff --git a/site/source/docs/datafu/1.0.0/allclasses-noframe.html b/site/source/docs/datafu/1.0.0/allclasses-noframe.html
new file mode 100644
index 0000000..3f4bd4e
--- /dev/null
+++ b/site/source/docs/datafu/1.0.0/allclasses-noframe.html
@@ -0,0 +1,135 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<!--NewPage-->
+<HTML>
+<HEAD>
+<!-- Generated by javadoc (build 1.6.0_27) on Wed Sep 04 12:22:40 PDT 2013 -->
+<TITLE>
+All Classes (DataFu 1.0.0)
+</TITLE>
+
+<META NAME="date" CONTENT="2013-09-04">
+
+<LINK REL ="stylesheet" TYPE="text/css" HREF="stylesheet.css" TITLE="Style">
+
+
+</HEAD>
+
+<BODY BGCOLOR="white">
+<FONT size="+1" CLASS="FrameHeadingFont">
+<B>All Classes</B></FONT>
+<BR>
+
+<TABLE BORDER="0" WIDTH="100%" SUMMARY="">
+<TR>
+<TD NOWRAP><FONT CLASS="FrameItemFont"><A HREF="datafu/pig/util/AliasableEvalFunc.html" title="class in datafu.pig.util">AliasableEvalFunc</A>
+<BR>
+<A HREF="datafu/pig/bags/AppendToBag.html" title="class in datafu.pig.bags">AppendToBag</A>
+<BR>
+<A HREF="datafu/pig/util/Assert.html" title="class in datafu.pig.util">Assert</A>
+<BR>
+<A HREF="datafu/pig/bags/BagConcat.html" title="class in datafu.pig.bags">BagConcat</A>
+<BR>
+<A HREF="datafu/pig/bags/BagGroup.html" title="class in datafu.pig.bags">BagGroup</A>
+<BR>
+<A HREF="datafu/pig/bags/BagLeftOuterJoin.html" title="class in datafu.pig.bags">BagLeftOuterJoin</A>
+<BR>
+<A HREF="datafu/pig/bags/BagSplit.html" title="class in datafu.pig.bags">BagSplit</A>
+<BR>
+<A HREF="datafu/pig/util/BoolToInt.html" title="class in datafu.pig.util">BoolToInt</A>
+<BR>
+<A HREF="datafu/pig/util/Coalesce.html" title="class in datafu.pig.util">Coalesce</A>
+<BR>
+<A HREF="datafu/pig/util/ContextualEvalFunc.html" title="class in datafu.pig.util">ContextualEvalFunc</A>
+<BR>
+<A HREF="datafu/pig/bags/CountEach.html" title="class in datafu.pig.bags">CountEach</A>
+<BR>
+<A HREF="datafu/pig/util/DataFuException.html" title="class in datafu.pig.util">DataFuException</A>
+<BR>
+<A HREF="datafu/pig/bags/DistinctBy.html" title="class in datafu.pig.bags">DistinctBy</A>
+<BR>
+<A HREF="datafu/pig/bags/EmptyBagToNull.html" title="class in datafu.pig.bags">EmptyBagToNull</A>
+<BR>
+<A HREF="datafu/pig/bags/EmptyBagToNullFields.html" title="class in datafu.pig.bags">EmptyBagToNullFields</A>
+<BR>
+<A HREF="datafu/pig/bags/Enumerate.html" title="class in datafu.pig.bags">Enumerate</A>
+<BR>
+<A HREF="datafu/pig/util/FieldNotFound.html" title="class in datafu.pig.util">FieldNotFound</A>
+<BR>
+<A HREF="datafu/pig/bags/FirstTupleFromBag.html" title="class in datafu.pig.bags">FirstTupleFromBag</A>
+<BR>
+<A HREF="datafu/pig/geo/HaversineDistInMiles.html" title="class in datafu.pig.geo">HaversineDistInMiles</A>
+<BR>
+<A HREF="datafu/pig/util/In.html" title="class in datafu.pig.util">In</A>
+<BR>
+<A HREF="datafu/pig/util/IntToBool.html" title="class in datafu.pig.util">IntToBool</A>
+<BR>
+<A HREF="datafu/pig/stats/MarkovPairs.html" title="class in datafu.pig.stats">MarkovPairs</A>
+<BR>
+<A HREF="datafu/pig/hash/MD5.html" title="class in datafu.pig.hash">MD5</A>
+<BR>
+<A HREF="datafu/pig/stats/Median.html" title="class in datafu.pig.stats">Median</A>
+<BR>
+<A HREF="datafu/pig/bags/NullToEmptyBag.html" title="class in datafu.pig.bags">NullToEmptyBag</A>
+<BR>
+<A HREF="datafu/pig/linkanalysis/PageRank.html" title="class in datafu.pig.linkanalysis">PageRank</A>
+<BR>
+<A HREF="datafu/pig/linkanalysis/PageRankImpl.html" title="class in datafu.pig.linkanalysis">PageRankImpl</A>
+<BR>
+<A HREF="datafu/pig/bags/PrependToBag.html" title="class in datafu.pig.bags">PrependToBag</A>
+<BR>
+<A HREF="datafu/pig/stats/Quantile.html" title="class in datafu.pig.stats">Quantile</A>
+<BR>
+<A HREF="datafu/pig/stats/QuantileUtil.html" title="class in datafu.pig.stats">QuantileUtil</A>
+<BR>
+<A HREF="datafu/pig/random/RandInt.html" title="class in datafu.pig.random">RandInt</A>
+<BR>
+<A HREF="datafu/pig/sampling/ReservoirSample.html" title="class in datafu.pig.sampling">ReservoirSample</A>
+<BR>
+<A HREF="datafu/pig/sampling/ReservoirSample.Final.html" title="class in datafu.pig.sampling">ReservoirSample.Final</A>
+<BR>
+<A HREF="datafu/pig/sampling/ReservoirSample.Initial.html" title="class in datafu.pig.sampling">ReservoirSample.Initial</A>
+<BR>
+<A HREF="datafu/pig/sampling/ReservoirSample.Intermediate.html" title="class in datafu.pig.sampling">ReservoirSample.Intermediate</A>
+<BR>
+<A HREF="datafu/pig/bags/ReverseEnumerate.html" title="class in datafu.pig.bags">ReverseEnumerate</A>
+<BR>
+<A HREF="datafu/pig/sampling/SampleByKey.html" title="class in datafu.pig.sampling">SampleByKey</A>
+<BR>
+<A HREF="datafu/pig/sessions/SessionCount.html" title="class in datafu.pig.sessions">SessionCount</A>
+<BR>
+<A HREF="datafu/pig/sessions/Sessionize.html" title="class in datafu.pig.sessions">Sessionize</A>
+<BR>
+<A HREF="datafu/pig/sets/SetIntersect.html" title="class in datafu.pig.sets">SetIntersect</A>
+<BR>
+<A HREF="datafu/pig/sets/SetUnion.html" title="class in datafu.pig.sets">SetUnion</A>
+<BR>
+<A HREF="datafu/pig/util/SimpleEvalFunc.html" title="class in datafu.pig.util">SimpleEvalFunc</A>
+<BR>
+<A HREF="datafu/pig/stats/StreamingMedian.html" title="class in datafu.pig.stats">StreamingMedian</A>
+<BR>
+<A HREF="datafu/pig/stats/StreamingQuantile.html" title="class in datafu.pig.stats">StreamingQuantile</A>
+<BR>
+<A HREF="datafu/pig/util/TransposeTupleToBag.html" title="class in datafu.pig.util">TransposeTupleToBag</A>
+<BR>
+<A HREF="datafu/pig/bags/UnorderedPairs.html" title="class in datafu.pig.bags">UnorderedPairs</A>
+<BR>
+<A HREF="datafu/pig/urls/UserAgentClassify.html" title="class in datafu.pig.urls">UserAgentClassify</A>
+<BR>
+<A HREF="datafu/pig/stats/VAR.html" title="class in datafu.pig.stats">VAR</A>
+<BR>
+<A HREF="datafu/pig/stats/VAR.Final.html" title="class in datafu.pig.stats">VAR.Final</A>
+<BR>
+<A HREF="datafu/pig/stats/VAR.Initial.html" title="class in datafu.pig.stats">VAR.Initial</A>
+<BR>
+<A HREF="datafu/pig/stats/VAR.Intermediate.html" title="class in datafu.pig.stats">VAR.Intermediate</A>
+<BR>
+<A HREF="datafu/pig/sampling/WeightedSample.html" title="class in datafu.pig.sampling">WeightedSample</A>
+<BR>
+<A HREF="datafu/pig/stats/WilsonBinConf.html" title="class in datafu.pig.stats">WilsonBinConf</A>
+<BR>
+</FONT></TD>
+</TR>
+</TABLE>
+
+</BODY>
+</HTML>

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/424e3b48/site/source/docs/datafu/1.0.0/constant-values.html
----------------------------------------------------------------------
diff --git a/site/source/docs/datafu/1.0.0/constant-values.html b/site/source/docs/datafu/1.0.0/constant-values.html
new file mode 100644
index 0000000..37c52ce
--- /dev/null
+++ b/site/source/docs/datafu/1.0.0/constant-values.html
@@ -0,0 +1,174 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<!--NewPage-->
+<HTML>
+<HEAD>
+<!-- Generated by javadoc (build 1.6.0_27) on Wed Sep 04 12:22:39 PDT 2013 -->
+<TITLE>
+Constant Field Values (DataFu 1.0.0)
+</TITLE>
+
+<META NAME="date" CONTENT="2013-09-04">
+
+<LINK REL ="stylesheet" TYPE="text/css" HREF="stylesheet.css" TITLE="Style">
+
+<SCRIPT type="text/javascript">
+function windowTitle()
+{
+    if (location.href.indexOf('is-external=true') == -1) {
+        parent.document.title="Constant Field Values (DataFu 1.0.0)";
+    }
+}
+</SCRIPT>
+<NOSCRIPT>
+</NOSCRIPT>
+
+</HEAD>
+
+<BODY BGCOLOR="white" onload="windowTitle();">
+<HR>
+
+
+<!-- ========= START OF TOP NAVBAR ======= -->
+<A NAME="navbar_top"><!-- --></A>
+<A HREF="#skip-navbar_top" title="Skip navigation links"></A>
+<TABLE BORDER="0" WIDTH="100%" CELLPADDING="1" CELLSPACING="0" SUMMARY="">
+<TR>
+<TD COLSPAN=2 BGCOLOR="#EEEEFF" CLASS="NavBarCell1">
+<A NAME="navbar_top_firstrow"><!-- --></A>
+<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="3" SUMMARY="">
+  <TR ALIGN="center" VALIGN="top">
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="overview-summary.html"><FONT CLASS="NavBarFont1"><B>Overview</B></FONT></A>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <FONT CLASS="NavBarFont1">Package</FONT>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <FONT CLASS="NavBarFont1">Class</FONT>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <FONT CLASS="NavBarFont1">Use</FONT>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="overview-tree.html"><FONT CLASS="NavBarFont1"><B>Tree</B></FONT></A>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="deprecated-list.html"><FONT CLASS="NavBarFont1"><B>Deprecated</B></FONT></A>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="index-all.html"><FONT CLASS="NavBarFont1"><B>Index</B></FONT></A>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="help-doc.html"><FONT CLASS="NavBarFont1"><B>Help</B></FONT></A>&nbsp;</TD>
+  </TR>
+</TABLE>
+</TD>
+<TD ALIGN="right" VALIGN="top" ROWSPAN=3><EM>
+</EM>
+</TD>
+</TR>
+
+<TR>
+<TD BGCOLOR="white" CLASS="NavBarCell2"><FONT SIZE="-2">
+&nbsp;PREV&nbsp;
+&nbsp;NEXT</FONT></TD>
+<TD BGCOLOR="white" CLASS="NavBarCell2"><FONT SIZE="-2">
+  <A HREF="index.html?constant-values.html" target="_top"><B>FRAMES</B></A>  &nbsp;
+&nbsp;<A HREF="constant-values.html" target="_top"><B>NO FRAMES</B></A>  &nbsp;
+&nbsp;<SCRIPT type="text/javascript">
+  <!--
+  if(window==top) {
+    document.writeln('<A HREF="allclasses-noframe.html"><B>All Classes</B></A>');
+  }
+  //-->
+</SCRIPT>
+<NOSCRIPT>
+  <A HREF="allclasses-noframe.html"><B>All Classes</B></A>
+</NOSCRIPT>
+
+
+</FONT></TD>
+</TR>
+</TABLE>
+<A NAME="skip-navbar_top"></A>
+<!-- ========= END OF TOP NAVBAR ========= -->
+
+<HR>
+<CENTER>
+<H1>
+Constant Field Values</H1>
+</CENTER>
+<HR SIZE="4" NOSHADE>
+<B>Contents</B><UL>
+<LI><A HREF="#datafu.pig">datafu.pig.*</A>
+</UL>
+
+<A NAME="datafu.pig"><!-- --></A>
+<TABLE BORDER="1" WIDTH="100%" CELLPADDING="3" CELLSPACING="0" SUMMARY="">
+<TR BGCOLOR="#CCCCFF" CLASS="TableHeadingColor">
+<TH ALIGN="left"><FONT SIZE="+2">
+datafu.pig.*</FONT></TH>
+</TR>
+</TABLE>
+
+<P>
+
+<TABLE BORDER="1" CELLPADDING="3" CELLSPACING="0" SUMMARY="">
+<TR BGCOLOR="#EEEEFF" CLASS="TableSubHeadingColor">
+<TH ALIGN="left" COLSPAN="3">datafu.pig.geo.<A HREF="datafu/pig/geo/HaversineDistInMiles.html" title="class in datafu.pig.geo">HaversineDistInMiles</A></TH>
+</TR>
+<TR BGCOLOR="white" CLASS="TableRowColor">
+<A NAME="datafu.pig.geo.HaversineDistInMiles.EARTH_RADIUS"><!-- --></A><TD ALIGN="right"><FONT SIZE="-1">
+<CODE>public&nbsp;static&nbsp;final&nbsp;double</CODE></FONT></TD>
+<TD ALIGN="left"><CODE><A HREF="datafu/pig/geo/HaversineDistInMiles.html#EARTH_RADIUS">EARTH_RADIUS</A></CODE></TD>
+<TD ALIGN="right"><CODE>3958.75</CODE></TD>
+</TR>
+</FONT></TD>
+</TR>
+</TABLE>
+
+<P>
+
+<P>
+<HR>
+
+
+<!-- ======= START OF BOTTOM NAVBAR ====== -->
+<A NAME="navbar_bottom"><!-- --></A>
+<A HREF="#skip-navbar_bottom" title="Skip navigation links"></A>
+<TABLE BORDER="0" WIDTH="100%" CELLPADDING="1" CELLSPACING="0" SUMMARY="">
+<TR>
+<TD COLSPAN=2 BGCOLOR="#EEEEFF" CLASS="NavBarCell1">
+<A NAME="navbar_bottom_firstrow"><!-- --></A>
+<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="3" SUMMARY="">
+  <TR ALIGN="center" VALIGN="top">
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="overview-summary.html"><FONT CLASS="NavBarFont1"><B>Overview</B></FONT></A>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <FONT CLASS="NavBarFont1">Package</FONT>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <FONT CLASS="NavBarFont1">Class</FONT>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <FONT CLASS="NavBarFont1">Use</FONT>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="overview-tree.html"><FONT CLASS="NavBarFont1"><B>Tree</B></FONT></A>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="deprecated-list.html"><FONT CLASS="NavBarFont1"><B>Deprecated</B></FONT></A>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="index-all.html"><FONT CLASS="NavBarFont1"><B>Index</B></FONT></A>&nbsp;</TD>
+  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="help-doc.html"><FONT CLASS="NavBarFont1"><B>Help</B></FONT></A>&nbsp;</TD>
+  </TR>
+</TABLE>
+</TD>
+<TD ALIGN="right" VALIGN="top" ROWSPAN=3><EM>
+</EM>
+</TD>
+</TR>
+
+<TR>
+<TD BGCOLOR="white" CLASS="NavBarCell2"><FONT SIZE="-2">
+&nbsp;PREV&nbsp;
+&nbsp;NEXT</FONT></TD>
+<TD BGCOLOR="white" CLASS="NavBarCell2"><FONT SIZE="-2">
+  <A HREF="index.html?constant-values.html" target="_top"><B>FRAMES</B></A>  &nbsp;
+&nbsp;<A HREF="constant-values.html" target="_top"><B>NO FRAMES</B></A>  &nbsp;
+&nbsp;<SCRIPT type="text/javascript">
+  <!--
+  if(window==top) {
+    document.writeln('<A HREF="allclasses-noframe.html"><B>All Classes</B></A>');
+  }
+  //-->
+</SCRIPT>
+<NOSCRIPT>
+  <A HREF="allclasses-noframe.html"><B>All Classes</B></A>
+</NOSCRIPT>
+
+
+</FONT></TD>
+</TR>
+</TABLE>
+<A NAME="skip-navbar_bottom"></A>
+<!-- ======== END OF BOTTOM NAVBAR ======= -->
+
+<HR>
+Matthew Hayes, Sam Shah
+</BODY>
+</HTML>