Posted to commits@datafu.apache.org by wv...@apache.org on 2014/01/23 20:28:51 UTC

git commit: Clean up README file

Updated Branches:
  refs/heads/master e80841468 -> 862a7fb3a


Clean up README file

This makes the README file much more concise and to the point.  It generally directs the user to the official website, but also includes a few useful pointers to help them get started.

Signed-off-by: William Vaughan <wv...@linkedin.com>


Project: http://git-wip-us.apache.org/repos/asf/incubator-datafu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-datafu/commit/862a7fb3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-datafu/tree/862a7fb3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-datafu/diff/862a7fb3

Branch: refs/heads/master
Commit: 862a7fb3abb1df6b8488af708548e182b9f92ae9
Parents: e808414
Author: Matt Hayes <mh...@linkedin.com>
Authored: Thu Jan 23 11:20:26 2014 -0800
Committer: William Vaughan <wv...@linkedin.com>
Committed: Thu Jan 23 11:27:14 2014 -0800

----------------------------------------------------------------------
 README.md | 218 ++++++++-------------------------------------------------
 1 file changed, 30 insertions(+), 188 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/862a7fb3/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 771d265..41059cb 100644
--- a/README.md
+++ b/README.md
@@ -1,33 +1,28 @@
-# DataFu [![Build Status](https://travis-ci.org/linkedin/datafu.png?branch=master)](https://travis-ci.org/linkedin/datafu)
+# Apache DataFu
 
-[DataFu](http://data.linkedin.com/opensource/datafu) is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics. It is used at LinkedIn in many of our off-line workflows for data-derived products like "People You May Know" and "Skills & Endorsements". It contains functions for:
+[Apache DataFu](http://datafu.incubator.apache.org) is a collection of libraries for working with large-scale data in Hadoop.
+The project was inspired by the need for stable, well-tested libraries for data mining and statistics.
 
-* PageRank
-* Statistics (e.g. quantiles, median, variance, etc.)
-* Sampling (e.g. weighted, reservoir, etc.)
-* Sessionization
-* Convenience bag functions (e.g. enumerating items)
-* Convenience utility functions (e.g. assertions, easier writing of EvalFuncs)
-* Set operations (intersect, union)
-* and [more](http://linkedin.github.com/datafu/docs/current/)...
+It consists of two libraries:
 
-Each function is unit tested and code coverage is being tracked for the entire library.
+* **Apache DataFu Pig**: a collection of user-defined functions for [Apache Pig](http://pig.apache.org/) (example sketched below)
+* **Apache DataFu Hourglass**: an incremental processing framework for [Apache Hadoop](http://hadoop.apache.org/) in MapReduce
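+
+As a taste, once the DataFu Pig JAR is registered in a Pig script, its UDFs can be defined and used like any other Pig UDF.  Below is a minimal sketch using the library's `StreamingMedian` UDF (the relation and input path names are placeholders):
+
+```
+define Median datafu.pig.stats.StreamingMedian();
+
+-- e.g. input values: 3,5,4,1,2
+vals = LOAD 'input' AS (val:int);
+grouped = GROUP vals ALL;
+
+-- produces a median of 3
+medians = FOREACH grouped GENERATE Median(vals.val);
+```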
 
-We have also contributed a framework called [Hourglass](https://github.com/linkedin/datafu/tree/master/contrib/hourglass) for incrementally
-processing data in Hadoop.
+For more information please visit the website:
 
-## Pig Compatibility
+* [http://datafu.incubator.apache.org/](http://datafu.incubator.apache.org/)
 
-The current version of DataFu has been tested against Pig 0.11.1 and 0.12.0.  DataFu should be compatible with some older versions of Pig, however
-we do not do any sort of testing with prior versions of Pig and do not guarantee compatibility.  
-Our policy is to test against the most recent version of Pig whenever we release and make sure DataFu works with that version. 
+If you'd like to jump in and get started, check out the corresponding guides for each library:
+
+* [Apache DataFu Pig - Getting Started](http://datafu.incubator.apache.org/docs/datafu/getting-started.html)
+* [Apache DataFu Hourglass - Getting Started](http://datafu.incubator.apache.org/docs/hourglass/getting-started.html)
 
 ## Blog Posts
 
-* [Introducing DataFu](http://engineering.linkedin.com/open-source/introducing-datafu-open-source-collection-useful-apache-pig-udfs)
-* [DataFu: The WD-40 of Big Data](http://hortonworks.com/blog/datafu/)
-* [DataFu 1.0](http://engineering.linkedin.com/datafu/datafu-10)
-* [DataFu's Hourglass: Incremental Data Processing in Hadoop](http://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop)
+* [Introducing DataFu](http://datafu.incubator.apache.org/blog/2012/01/10/introducing-datafu.html)
+* [DataFu: The WD-40 of Big Data](http://datafu.incubator.apache.org/blog/2013/01/24/datafu-the-wd-40-of-big-data.html)
+* [DataFu 1.0](http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html)
+* [DataFu's Hourglass: Incremental Data Processing in Hadoop](http://datafu.incubator.apache.org/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html)
 
 ## Presentations
 
@@ -39,176 +34,23 @@ Our policy is to test against the most recent version of Pig whenever we release
 
 * [Hourglass: a Library for Incremental Processing on Hadoop (IEEE BigData 2013)](http://www.slideshare.net/matthewterencehayes/hourglass-27038297)
 
-## What can you do with it?
-
-Here's a taste of what you can do in Pig.
-
-### Statistics
-  
-Compute the [median](http://en.wikipedia.org/wiki/Median) with the [StreamingMedian UDF](http://linkedin.github.com/datafu/docs/current/datafu/pig/stats/StreamingMedian.html):
-
-    define Median datafu.pig.stats.StreamingMedian();
-
-    -- input: 3,5,4,1,2
-    input = LOAD 'input' AS (val:int);
-
-    grouped = GROUP input ALL;
-    -- produces median of 3
-    medians = FOREACH grouped GENERATE Median(input.val);
-  
-Similarly, compute any arbitrary [quantiles](http://en.wikipedia.org/wiki/Quantile) with [StreamingQuantile](http://linkedin.github.com/datafu/docs/current/datafu/pig/stats/StreamingQuantile.html):
-
-    define Quantile datafu.pig.stats.StreamingQuantile('0.0','0.5','1.0');
-
-    -- input: 9,10,2,3,5,8,1,4,6,7
-    input = LOAD 'input' AS (val:int);
-
-    grouped = GROUP input ALL;
-    -- produces: (1,5.5,10)
-    quantiles = FOREACH grouped GENERATE Quantile(input.val);
-
-Or how about the [variance](http://en.wikipedia.org/wiki/Variance) using [VAR](http://linkedin.github.com/datafu/docs/current/datafu/pig/stats/VAR.html):
-
-    define VAR datafu.pig.stats.VAR();
-
-    -- input: 1,2,3,4,5,6,7,8,9
-    input = LOAD 'input' AS (val:int);
-
-    grouped = GROUP input ALL;
-    -- produces variance of 6.666666666666668
-    variance = FOREACH grouped GENERATE VAR(input.val);
- 
-### Set Operations
-
-Treat sorted bags as sets and compute their intersection with [SetIntersect](http://linkedin.github.com/datafu/docs/current/datafu/pig/sets/SetIntersect.html):
-
-    define SetIntersect datafu.pig.sets.SetIntersect();
-  
-    -- ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)})
-    input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
-
-    -- ({(1),(4),(5)})
-    intersected = FOREACH input {
-      sorted_b1 = ORDER B1 by val;
-      sorted_b2 = ORDER B2 by val;
-      GENERATE SetIntersect(sorted_b1,sorted_b2);
-    }
-      
-Compute the set union with [SetUnion](http://linkedin.github.com/datafu/docs/current/datafu/pig/sets/SetUnion.html):
-
-    define SetUnion datafu.pig.sets.SetUnion();
-
-    -- ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)})
-    input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
-
-    -- ({(3),(4),(1),(2),(7),(5),(6),(0),(10)})
-    unioned = FOREACH input GENERATE SetUnion(B1,B2);
-      
-You can even operate on several bags at once:
-
-    -- assuming the input also contains a third bag B3
-    unioned = FOREACH input GENERATE SetUnion(B1,B2,B3);
-
-### Bag operations
-
-Concatenate two or more bags with [BagConcat](http://linkedin.github.com/datafu/docs/current/datafu/pig/bags/BagConcat.html):
-
-    define BagConcat datafu.pig.bags.BagConcat();
-
-    -- ({(1),(2),(3)},{(4),(5)},{(6),(7)})
-    input = LOAD 'input' AS (B1: bag{T: tuple(v:INT)}, B2: bag{T: tuple(v:INT)}, B3: bag{T: tuple(v:INT)});
-
-    -- ({(1),(2),(3),(4),(5),(6),(7)})
-    output = FOREACH input GENERATE BagConcat(B1,B2,B3);
-
-Append a tuple to a bag with [AppendToBag](http://linkedin.github.com/datafu/docs/current/datafu/pig/bags/AppendToBag.html):
-
-    define AppendToBag datafu.pig.bags.AppendToBag();
-
-    -- ({(1),(2),(3)},(4))
-    input = LOAD 'input' AS (B: bag{T: tuple(v:INT)}, T: tuple(v:INT));
-
-    -- ({(1),(2),(3),(4)})
-    output = FOREACH input GENERATE AppendToBag(B,T);
-
-### PageRank
-
-Run PageRank on a large number of independent graphs through the [PageRank UDF](http://linkedin.github.com/datafu/docs/current/datafu/pig/linkanalysis/PageRank.html):
-
-    define PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true');
-
-    topic_edges = LOAD 'input_edges' as (topic:INT,source:INT,dest:INT,weight:DOUBLE);
-
-    topic_edges_grouped = GROUP topic_edges by (topic, source) ;
-    topic_edges_grouped = FOREACH topic_edges_grouped GENERATE
-      group.topic as topic,
-      group.source as source,
-      topic_edges.(dest,weight) as edges;
-
-    topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic; 
-
-    topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE
-      group as topic,
-      FLATTEN(PageRank(topic_edges_grouped.(source,edges))) as (source,rank);
-
-    final_ranks = FOREACH topic_ranks GENERATE
-      topic, source, rank;
-    
-This implementation stores the nodes and edges (mostly) in memory. It is therefore best suited when one needs to compute PageRank on many reasonably sized graphs in parallel.
-    
-## Start Using It
-
-The JAR can be found [here](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.linkedin.datafu%22) in the Maven central repository.  The GroupId and ArtifactId are `com.linkedin.datafu` and `datafu`, respectively.
-
-If you are using Ivy:
-
-    <dependency org="com.linkedin.datafu" name="datafu" rev="1.0.0"/>
-    
-If you are using Maven:
-
-    <dependency>
-      <groupId>com.linkedin.datafu</groupId>
-      <artifactId>datafu</artifactId>
-      <version>1.0.0</version>
-    </dependency>
-
-Or [download](https://github.com/linkedin/datafu/archive/master.zip) the code.
-    
-## Working with the source code
-
-Here are some common tasks when working with the source code.
-
-### Eclipse
-
-To generate eclipse files:
-
-    ant eclipse
-
-### Build the JAR
-
-    ant jar
-    
-### Run all tests
-
-    ant test
-
-### Run specific tests
-
-Override `testclasses.pattern`, which defaults to `**/*.class`.  For example, to run all tests defined in `QuantileTests`:
-
-    ant test -Dtestclasses.pattern=**/QuantileTests.class
-
-### Compute code coverage
-
-    ant coverage
+## Getting Help
 
-### Notes on Eclipse
+Bugs and feature requests can be filed [here](https://issues.apache.org/jira/browse/DATAFU).  For other help please see the [discussion group](http://groups.google.com/group/datafu).
 
-#### Adjusting heap size for TestNG plugin
+## Building the Code
 
-You may run out of heap when executing tests in Eclipse.  To fix this, adjust the heap settings for the TestNG plugin: go to Eclipse->Preferences, select TestNG->Run/Debug, and add "-Xmx1G" to the JVM args.
+The Apache DataFu Pig library can be built by running the command below.  More information about working with the source
+code can be found in the [DataFu Pig Contributing Guide](http://datafu.incubator.apache.org/docs/datafu/contributing.html).
 
-## Contribute
+```
+ant jar
+```
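+
+The resulting JAR can then be registered in a Pig script to make its UDFs available (a sketch; the exact JAR file name depends on the version built):
+
+```
+REGISTER datafu-1.0.0.jar;  -- adjust to the JAR file that the build produces
+```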
 
-The source code is available under the Apache 2.0 license.  
+The Apache DataFu Hourglass library can be built by running the commands below.  More information about working with the source
+code can be found in the [DataFu Hourglass Contributing Guide](http://datafu.incubator.apache.org/docs/hourglass/contributing.html).
 
-For help please see the [discussion group](http://groups.google.com/group/datafu).  Bugs and feature requests can be filed [here](http://github.com/linkedin/datafu/issues).
+```
+cd contrib/hourglass
+ant jar
+```