Posted to commits@datafu.apache.org by mh...@apache.org on 2019/10/25 18:42:32 UTC

[datafu] branch master updated: DATAFU-128 Add documentation for macros (and Spark)

This is an automated email from the ASF dual-hosted git repository.

mhayes pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/datafu.git


The following commit(s) were added to refs/heads/master by this push:
     new a05782f  DATAFU-128 Add documentation for macros (and Spark)
a05782f is described below

commit a05782f96dde826a1e950cd8a935ec6aac9dde6b
Author: Eyal Allweil <ey...@apache.org>
AuthorDate: Thu Oct 10 16:48:41 2019 +0300

    DATAFU-128 Add documentation for macros (and Spark)
    
    Signed-off-by: Matthew Hayes <mh...@apache.org>
---
 README.md                                          |  17 +-
 datafu-spark/README.md                             |   2 +-
 .../src/main/scala/datafu/spark/SparkDFUtils.scala |   3 +-
 ...ook-at-paypals-contributions-to-datafu.markdown |   2 +-
 site/source/community/contributing.html.markdown   |   8 +-
 site/source/docs/datafu/guide.html.markdown.erb    |   6 +-
 .../docs/datafu/guide/macros.html.markdown.erb     | 131 +++++++++++++
 .../docs/datafu/guide/sampling.html.markdown.erb   |  26 ++-
 site/source/docs/download.html.markdown.erb        |   2 +
 .../docs/spark/getting-started.html.markdown.erb   |  75 ++++++++
 site/source/docs/spark/guide.html.markdown.erb     | 211 +++++++++++++++++++++
 site/source/index.markdown.erb                     |  15 +-
 site/source/layouts/_docs_nav.erb                  |   9 +-
 13 files changed, 481 insertions(+), 26 deletions(-)

diff --git a/README.md b/README.md
index 28c3c29..0020e53 100644
--- a/README.md
+++ b/README.md
@@ -3,8 +3,9 @@
 [Apache DataFu](http://datafu.apache.org) is a collection of libraries for working with large-scale data in Hadoop.
 The project was inspired by the need for stable, well-tested libraries for data mining and statistics.
 
-It consists of two libraries:
+It consists of three libraries:
 
+* **Apache DataFu Spark**: a collection of utils and user-defined functions for [Apache Spark](http://spark.apache.org/)
 * **Apache DataFu Pig**: a collection of user-defined functions for [Apache Pig](http://pig.apache.org/)
 * **Apache DataFu Hourglass**: an incremental processing framework for [Apache Hadoop](http://hadoop.apache.org/) in MapReduce
 
@@ -14,6 +15,7 @@ For more information please visit the website:
 
 If you'd like to jump in and get started, check out the corresponding guides for each library:
 
+* [Apache DataFu Spark - Getting Started](http://datafu.apache.org/docs/spark/getting-started.html)
 * [Apache DataFu Pig - Getting Started](http://datafu.apache.org/docs/datafu/getting-started.html)
 * [Apache DataFu Hourglass - Getting Started](http://datafu.apache.org/docs/hourglass/getting-started.html)
 
@@ -23,6 +25,7 @@ If you'd like to jump in and get started, check out the corresponding guides for
 * [DataFu: The WD-40 of Big Data](http://datafu.apache.org/blog/2013/01/24/datafu-the-wd-40-of-big-data.html)
 * [DataFu 1.0](http://datafu.apache.org/blog/2013/09/04/datafu-1-0.html)
 * [DataFu's Hourglass: Incremental Data Processing in Hadoop](http://datafu.apache.org/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html)
+* [A Look at PayPal's Contributions to DataFu](http://datafu.apache.org/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html)
 
 ## Presentations
 
@@ -69,9 +72,7 @@ To build DataFu from a git checkout or binary release, run:
 
     ./gradlew clean assemble
 
-The datafu-pig JAR can be found under `datafu-pig/build/libs`.  The artifact name will be of the form `datafu-pig-incubating-x.y.z.jar` if this is a source release and `datafu-pig-incubating-x.y.z-SNAPSHOT.jar` if this is being built from the code repository.
-
-The datafu-hourglass can be found in the `datafu-hourglass/build/libs` directory.
+Each project's jars can be found under the corresponding subdirectory. For example, the datafu-pig JAR can be found under `datafu-pig/build/libs`.  The artifact name will be of the form `datafu-pig-x.y.z.jar` if this is a source release and `datafu-pig-x.y.z-SNAPSHOT.jar` if this is being built from the code repository.
 
 ### Generating Eclipse Files
 
@@ -96,19 +97,15 @@ To run all the tests:
 
     ./gradlew test
 
-To run only the DataFu Pig tests:
+To run only one module's tests - for example, only the DataFu Pig tests:
 
     ./gradlew :datafu-pig:test
 
-To run only the DataFu Hourglass tests:
-
-    ./gradlew :datafu-hourglass:test
-
 To run tests for a single class, use the `test.single` property.  For example, to run only the QuantileTests:
 
     ./gradlew :datafu-pig:test -Dtest.single=QuantileTests
 
-The tests can also be run from within eclipse.  You'll need to install the TestNG plugin for Eclipse.  See: http://testng.org/doc/download.html.
+The tests can also be run from within Eclipse.  To run the DataFu Pig and Hourglass tests in Eclipse, you'll need to install the TestNG plugin for Eclipse.  See: http://testng.org/doc/download.html.
 
 Potential issues and workaround:
 * You may run out of heap when executing tests in Eclipse. To fix this adjust your heap settings for the TestNG plugin. Go to Eclipse->Preferences. Select TestNG->Run/Debug. Add "-Xmx1G" to the JVM args.
diff --git a/datafu-spark/README.md b/datafu-spark/README.md
index 429f136..b72221e 100644
--- a/datafu-spark/README.md
+++ b/datafu-spark/README.md
@@ -44,7 +44,7 @@ df_people = sqlContext.createDataFrame([
      ("c", "Zoey", 36)],
      ["id", "name", "age"])
 
-func_dedup_res = df_utils.dedup(dataFrame=df_people, groupCol=df_people.id,
+func_dedup_res = df_utils.dedup_with_order(dataFrame=df_people, groupCol=df_people.id,
                                 orderCols=[df_people.age.desc(), df_people.name.desc()])
 
 func_dedup_res.registerTempTable("dedup")
diff --git a/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala b/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala
index 0ee1520..14c1fae 100644
--- a/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala
+++ b/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala
@@ -366,8 +366,7 @@ object SparkDFUtils {
     * The main problem this function addresses is that doing naive explode on the ranges can result
     * in a huge table.
     * requires:
-    * 1. point table needs to be distinct on the point column. there could be a few corresponding
-    *    ranges to each point, so we choose the minimal range.
+    * 1. point table needs to be distinct on the point column.
     * 2. the range and point columns need to be numeric.
     *
     * TIMES:
diff --git a/site/source/blog/2019-01-29-a-look-at-paypals-contributions-to-datafu.markdown b/site/source/blog/2019-01-29-a-look-at-paypals-contributions-to-datafu.markdown
index 534b44e..444b713 100644
--- a/site/source/blog/2019-01-29-a-look-at-paypals-contributions-to-datafu.markdown
+++ b/site/source/blog/2019-01-29-a-look-at-paypals-contributions-to-datafu.markdown
@@ -104,7 +104,7 @@ The result will be all the records from our original table for customers 2, 4 an
 ---
 <br>
 
-**3\. Comparing expected and actual results for regression tests — the diff\_macro**
+**3\. Comparing expected and actual results for regression tests — the _diff\_macro_**
 
 After making changes in an application’s logic, we are often interested in the effect they have on our output. One common use case is when we refactor — we don’t expect our output to change. Another is a surgical change which should only affect a very small subset of records. For easily performing such regression tests on actual data, we use the _diff\_macro_, which is based on DataFu’s _TupleDiff_ UDF.
 
diff --git a/site/source/community/contributing.html.markdown b/site/source/community/contributing.html.markdown
index dc3802b..aea5d4d 100644
--- a/site/source/community/contributing.html.markdown
+++ b/site/source/community/contributing.html.markdown
@@ -54,12 +54,11 @@ All the JARs for the project can be built with the following command:
 
     ./gradlew assemble
 
-This builds SNAPSHOT versions of the JARs for both DataFu Pig and Hourglass.  The built JARs can be found under `datafu-pig/build/libs` and `datafu-hourglass/build/libs`, respectively.
+This builds SNAPSHOT versions of the JARs for DataFu Pig, Spark and Hourglass.  The built JARs can be found under `datafu-pig/build/libs`, `datafu-spark/build/libs` and `datafu-hourglass/build/libs`, respectively.
 
-The Apache DataFu Pig library can be built by running the command below.
+A single project - for example, DataFu Pig - may be built by running the command below.
 
     ./gradlew :datafu-pig:assemble
-    ./gradlew :datafu-hourglass:assemble
 
 ### Running Tests
 
@@ -69,10 +68,9 @@ Tests can be run with the following command:
 
 All the tests can also be run from within eclipse.
 
-To run the DataFu Pig or Hourglass tests specifically:
+To run a single project's tests - for example, for DataFu Pig only:
 
     ./gradlew :datafu-pig:test
-    ./gradlew :datafu-hourglass:test
 
 To run a specific set of tests from the command line, you can define the `test.single` system property with a value matching the test class you want to run.  For example, to run all tests defined in the `QuantileTests` test class for DataFu Pig:
 
diff --git a/site/source/docs/datafu/guide.html.markdown.erb b/site/source/docs/datafu/guide.html.markdown.erb
index afc8178..006c8a0 100644
--- a/site/source/docs/datafu/guide.html.markdown.erb
+++ b/site/source/docs/datafu/guide.html.markdown.erb
@@ -21,16 +21,17 @@ license: >
 
 # Guide
 
-Apache DataFu Pig is a collection of user-defined functions for working with large scale data in [Apache Pig](https://pig.apache.org/).
+Apache DataFu Pig is a collection of user-defined functions and macros for working with large scale data in [Apache Pig](https://pig.apache.org/).
 It has a number of useful functions available.  This guide provides examples of how to use these functions and serves as an overview for working with the library.
 
 * [Statistics](/docs/datafu/guide/statistics.html): median, quantiles, variance
 * [Bag Operations](/docs/datafu/guide/bag-operations.html): join, prepend, append, count items, concat
 * [Set Operations](/docs/datafu/guide/set-operations.html): set intersection, union, difference
 * [Sessions](/docs/datafu/guide/sessions.html): sessionize streams of data
-* [Sampling](/docs/datafu/guide/sampling.html): simple random sample with/without replacement, weighted sample
+* [Sampling](/docs/datafu/guide/sampling.html): simple random sample with/without replacement, weighted sample, sample by keys
 * [Hashing](/docs/datafu/guide/hashing.html): SHA and MD5
 * [Link Analysis](/docs/datafu/guide/link-analysis.html): PageRank
+* [Assorted Macros](/docs/datafu/guide/macros.html): deduplication of tables, human-readable diffs and more
 * [More Tips and Tricks](/docs/datafu/guide/more-tips-and-tricks.html)
 
 There are also [Javadocs](https://datafu.apache.org/docs/datafu/<%= current_page.data.version %>/) available for all UDFs in the library.  We continue to add
@@ -47,6 +48,7 @@ Our policy is to test against the most recent version of Pig whenever we release
 * [Introducing DataFu](/blog/2012/01/10/introducing-datafu.html)
 * [DataFu: The WD-40 of Big Data](/blog/2013/01/24/datafu-the-wd-40-of-big-data.html)
 * [DataFu 1.0](/blog/2013/09/04/datafu-1-0.html)
+* [A Look at PayPal's Contributions to DataFu](/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html)
 
 ## Slides
 
diff --git a/site/source/docs/datafu/guide/macros.html.markdown.erb b/site/source/docs/datafu/guide/macros.html.markdown.erb
new file mode 100644
index 0000000..1eb5ce5
--- /dev/null
+++ b/site/source/docs/datafu/guide/macros.html.markdown.erb
@@ -0,0 +1,131 @@
+---
+title: Macros - Guide - Apache DataFu Pig
+version: 1.5.0
+section_name: Apache DataFu Pig - Guide
+license: >
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+---
+
+## Macros
+
+### Finding the most recent update of a given record — the _dedup_ (de-duplication) macro
+
+A common scenario in data sent to the HDFS — the Hadoop Distributed File System — is multiple rows representing updates for the same logical data. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let’s consider the following simplified example.
+
+<br>
+<script src="https://gist.github.com/eyala/65b6750b2539db5895738a49be3d8c98.js"></script>
+<center>Raw customers’ data, with more than one row per customer</center>
+<br>
+
+We can see that though most of the customers only appear once, _julia_ and _quentin_ have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? For this we can use the _dedup_ macro, as below:
+
+```pig
+REGISTER datafu-pig-<%= current_page.data.version %>.jar;
+
+IMPORT 'datafu/dedup.pig';
+
+data = LOAD 'customers.csv' AS (id: int, name: chararray, purchases: int, date_updated: chararray);
+
+dedup_data = dedup(data, 'id', 'date_updated');
+
+STORE dedup_data INTO 'dedup_out';
+```
+
+Our result will be as expected — each customer only appears once, as you can see below:
+
+<br>
+<script src="https://gist.github.com/eyala/1dddebc39e9a3fe4501638a95f577752.js"></script>
+<center>“Deduplicated” data, with only the most recent record for each customer</center>
+<br>
+
+One nice thing about this macro is that you can use more than one field to dedup the data. For example, if we wanted to use both the _id_ and _name_ fields, we would change this line:
+
+```pig
+dedup_data = dedup(data, 'id', 'date_updated');
+```
+
+to this:
+
+```pig
+dedup_data = dedup(data, '(id, name)', 'date_updated');
+```
+
+---
+<br>
+
+### Comparing expected and actual results for regression tests — the _diff\_macro_
+
+After making changes in an application’s logic, we are often interested in the effect they have on our output. One common use case is when we refactor — we don’t expect our output to change. Another is a surgical change which should only affect a very small subset of records. For easily performing such regression tests on actual data, we use the _diff\_macro_, which is based on DataFu’s _TupleDiff_ UDF.
+
+Let’s look at a table which is exactly like _dedup\_out_, but with four changes.
+
+1.  We will remove record 1, _quentin_
+2.  We will change _date\_updated_ for record 2, _julia_
+3.  We will change _purchases_ and _date\_updated_ for record 4, _alice_
+4.  We will add a new row, record 8, _amanda_
+
+<br>
+<script src="https://gist.github.com/eyala/699942d65471f3c305b0dcda09944a95.js"></script>
+<br>
+
+We’ll run the following Pig script, using DataFu’s _diff\_macro_:
+
+```pig
+REGISTER datafu-pig-<%= current_page.data.version %>.jar;
+
+IMPORT 'datafu/diff_macros.pig';
+
+data = LOAD 'dedup_out.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
+
+changed = LOAD 'dedup_out_changed.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
+
+diffs = diff_macro(data,changed,id,'');
+
+DUMP diffs;
+```
+
+The results look like this:
+
+<br>
+<script src="https://gist.github.com/eyala/3d36775faf081daad37a102f25add2a4.js"></script>
+<br>
+
+Let’s take a moment to look at these results. They have the same general structure. Rows that start with _missing_ indicate records that were in the first relation, but aren’t in the new one. Conversely, rows that start with _added_ indicate records that are in the new relation, but not in the old one. Each of these rows is followed by the relevant tuple from the relations.
+
+The rows that start with _changed_ are more interesting. The word _changed_ is followed by a list of the fields which have changed values in the new table. For the row with _id_ 2, this is the _date\_updated_ field. For the row with _id_ 4, this is the _purchases_ and _date\_updated_ fields.
+
+Obviously, one thing we might want to ignore is the _date\_updated_ field. If the only difference in the fields is when it was last updated, we might just want to skip these records for a more concise diff. For this, we need to change the following row in our original Pig script, from this:
+
+```pig
+diffs = diff_macro(data,changed,id,'');
+```
+
+to become this:
+
+```pig
+diffs = diff_macro(data,changed,id,'date_updated');
+```
+
+If we run our changed Pig script, we’ll get the following result.
+
+<br>
+<script src="https://gist.github.com/eyala/d9b0d5c60ad4d8bbccc79c3527f99aca.js"></script>
+<br>
+
+The row for _julia_ is missing from our diff, because only _date\_updated_ has changed, but the row for _alice_ still appears, because the _purchases_ field has also changed.
+
+There’s one implementation detail that’s important to know — the macro uses a replicated join in order to be able to run quickly on very large tables, so the sample table needs to be able to fit in memory.
+
diff --git a/site/source/docs/datafu/guide/sampling.html.markdown.erb b/site/source/docs/datafu/guide/sampling.html.markdown.erb
index 8f4209a..476a856 100644
--- a/site/source/docs/datafu/guide/sampling.html.markdown.erb
+++ b/site/source/docs/datafu/guide/sampling.html.markdown.erb
@@ -252,4 +252,28 @@ joined_sample = FOREACH (COGROUP impressions BY (user_id,item_id),
   ((SIZE(clicks) > 0 ? 1 : 0)) as is_clicked;
 ```
 
-Since we have sampled before joining the data, this should be much more efficient.
\ No newline at end of file
+Since we have sampled before joining the data, this should be much more efficient.
+
+### Manually Sampling By Key
+
+All the previous methods are based on random selection. If you wish to create a sample of a given table based on a (manually created) table of ids, you can use the _sample\_by\_keys_ macro.
+
+For example, let's assume our list of customers is stored on HDFS as _customers.csv_, and the keys for our sample are in _sample.csv_, which only contains customers 2, 4 and 6 from the original _customers.csv_.
+
+We can use the following Pig script:
+
+```pig
+REGISTER datafu-pig-<%= current_page.data.version %>.jar;
+
+IMPORT 'datafu/sample_by_keys.pig';
+
+data = LOAD 'customers.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, updated: chararray);
+
+customers = LOAD 'sample.csv' AS (cust_id: int);
+
+sampled = sample_by_keys(data, customers, id, cust_id);
+
+STORE sampled INTO 'sample_out';
+```
+
+The result will be all the records from our original table for customers 2, 4 and 6.
diff --git a/site/source/docs/download.html.markdown.erb b/site/source/docs/download.html.markdown.erb
index dc53630..2351720 100644
--- a/site/source/docs/download.html.markdown.erb
+++ b/site/source/docs/download.html.markdown.erb
@@ -78,6 +78,7 @@ To build the JARs, run:
 
 This will produce JARs in the following directories:
 
+* `datafu-spark/build/libs`
 * `datafu-pig/build/libs`
 * `datafu-hourglass/build/libs`
 
@@ -131,5 +132,6 @@ Maven:
 
 See the following guides for next steps:
 
+* [Getting started with DataFu for Spark](/docs/spark/getting-started.html)
 * [Getting started with DataFu for Pig](/docs/datafu/getting-started.html)
 * [Getting started with DataFu Hourglass](/docs/hourglass/getting-started.html)
diff --git a/site/source/docs/spark/getting-started.html.markdown.erb b/site/source/docs/spark/getting-started.html.markdown.erb
new file mode 100644
index 0000000..1a2cb52
--- /dev/null
+++ b/site/source/docs/spark/getting-started.html.markdown.erb
@@ -0,0 +1,75 @@
+---
+title: Apache DataFu Spark - Getting Started
+version: 1.5.0
+section_name: Getting Started
+license: >
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+---
+
+# DataFu Spark
+
+Apache DataFu Spark is a collection of utils and user-defined functions for working with large scale data in [Apache Spark](http://spark.apache.org/).
+
+A list of some of the things you can do with DataFu Spark is given below:
+
+* ["Dedup" a table](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L139) - remove duplicates based on a key and ordering (typically a date updated field, to get only the most recently updated record).
+
+* [Join a table that has a numeric field with a table that has a range](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L361)
+
+* [Do a skewed join between tables](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L274) (where the small table is still too big to fit in memory)
+
+* [Count distinct up to](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/src/main/scala/datafu/spark/SparkUDAFs.scala#L224) - an efficient implementation when you just want to verify that a certain minimum of distinct rows appear in a table
+
+* Call Python code from Spark Scala, or Scala code from PySpark
+
+If you'd like to read more details about these functions, check out the [Guide](/docs/spark/guide.html).  Otherwise if you are
+ready to get started using DataFu Spark, keep reading.
+
+The rest of this page assumes you already have a built JAR available.  If this is not the case, please see the [Download](/docs/download.html) page.
+
+This jar should be added to the Spark class path. You can verify that you've done this correctly by trying to import one of our DataFu classes, for example, _DataFrameOps_.
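+
+For example, from the Spark shell (a minimal sketch; the jar file name below is illustrative, so substitute the one you actually built):
+
+```scala
+// Launch the shell with the DataFu jar on the class path, e.g.:
+//   spark-shell --jars datafu-spark_2.11-x.y.z.jar
+// If the jar is visible, this import resolves; otherwise the shell reports an error.
+import datafu.spark.DataFrameOps._
+```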
+
+## Basic Example: Finding the most recent update of a given record
+
+A common scenario in data sent to the HDFS — the Hadoop Distributed File System — is multiple rows representing updates for the same logical data. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let’s consider the following simplified example.
+
+<br>
+<script src="https://gist.github.com/eyala/65b6750b2539db5895738a49be3d8c98.js"></script>
+<center>Raw customers’ data, with more than one row per customer</center>
+<br>
+
+We can see that though most of the customers only appear once, _julia_ and _quentin_ have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? We can use DataFu's _dedupWithOrder_ method.
+
+```scala
+import datafu.spark.DataFrameOps._
+import spark.implicits._  // for the $"col" syntax; already in scope in spark-shell
+
+val customers = spark.read.format("csv").option("header", "true").load("customers.csv")
+
+customers.dedupWithOrder($"id", $"date_updated".desc).show
+```
+
+Our result will be as expected — each customer only appears once, as you can see below:
+
+<br>
+<script src="https://gist.github.com/eyala/1dddebc39e9a3fe4501638a95f577752.js"></script>
+<center>“Deduplicated” data, with only the most recent record for each customer (though not in order)</center>
+<br>
+
+There are two additional variants of _dedupWithOrder_ in datafu-spark. The _dedupWithCombiner_ method provides similar functionality, but uses a UDAF to take advantage of map-side aggregation. The _dedupTopN_ method allows retaining more than one record for each key.
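+
+As a rough sketch of the last of these (the parameter order here is an assumption, mirroring _dedupWithOrder_ with a leading count; consult the datafu-spark Scaladocs for the exact signature), keeping the two most recent records per customer might look like this:
+
+```scala
+// Sketch only: assumes dedupTopN takes the number of records to keep, the grouping
+// column and the ordering columns. Verify against the Scaladocs before relying on it.
+customers.dedupTopN(2, $"id", $"date_updated".desc).show
+```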
+
+## Next Steps
+
+Check out the [Guide](/docs/spark/guide.html) for more information on what you can do with DataFu Spark.
diff --git a/site/source/docs/spark/guide.html.markdown.erb b/site/source/docs/spark/guide.html.markdown.erb
new file mode 100644
index 0000000..34d1feb
--- /dev/null
+++ b/site/source/docs/spark/guide.html.markdown.erb
@@ -0,0 +1,211 @@
+---
+title: Guide - Apache DataFu Spark
+version: 1.5.0
+section_name: Apache DataFu Spark
+license: >
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+---
+
+# Guide
+
+Apache DataFu Spark is a collection of utils and user-defined functions for working with large scale data in [Apache Spark](https://spark.apache.org/).
+It has a number of useful functions available.  This guide provides examples of how to use these functions and serves as an overview for working with the library.
+
+## Spark Compatibility
+
+The current version of DataFu has been tested against Spark versions 2.1.x - 2.4.x, in Scala 2.10, 2.11 and 2.12 (where applicable).
+
+## Calling DataFu Spark functions from PySpark
+
+In order to call the datafu-spark APIs from PySpark, you can do the following (tested on a Hortonworks VM).
+
+First, launch pyspark with the following parameters:
+
+```bash
+export PYTHONPATH=datafu-spark_2.11_2.3.0-<%= current_page.data.version %>.jar
+
+pyspark --jars datafu-spark_2.11_2.3.0-<%= current_page.data.version %>.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.11_2.3.0-<%= current_page.data.version %>.jar
+```
+
+The following is an example of calling the datafu _dedup\_with\_order_ method from PySpark:
+
+```python
+from pyspark_utils.df_utils import PySparkDFUtils
+
+df_utils = PySparkDFUtils()
+
+df_people = sqlContext.createDataFrame([
+     ("a", "Alice", 34),
+     ("a", "Sara", 33),
+     ("b", "Bob", 36),
+     ("b", "Charlie", 30),
+     ("c", "David", 29),
+     ("c", "Esther", 32),
+     ("c", "Fanny", 36),
+     ("c", "Zoey", 36)],
+     ["id", "name", "age"])
+
+func_dedup_res = df_utils.dedup_with_order(dataFrame=df_people, groupCol=df_people.id,
+                                orderCols=[df_people.age.desc(), df_people.name.desc()])
+
+func_dedup_res.registerTempTable("dedup")
+
+func_dedup_res.show()
+```
+
+This should produce the following output:
+
+<pre>
++---+-----+---+
+| id| name|age|
++---+-----+---+
+|  c| Zoey| 36|
+|  b|  Bob| 36|
+|  a|Alice| 34|
++---+-----+---+
+</pre>
+
+## Calling PySpark functions from Scala using DataFu
+
+## Using DataFu to do Skewed Joins
+
+DataFu-Spark contains two methods for doing skewed joins.
+
+_broadcastJoinSkewed_ can be used when one data frame is skewed and the other is not. It splits both data frames into two parts according to the skewed keys.
+For example, let's say we have two data frames, _customers_, which isn't skewed:
+
+<pre>
++-------+-----------+
+|company|year_joined|
++-------+-----------+
+| paypal|       2017|
+| myshop|       2019|
++-------+-----------+
+</pre>
+
+And _transactions_, which is skewed on the field _company_:
+
+<pre>
++--------------+-------+
+|transaction_id|company|
++--------------+-------+
+|             1| paypal|
+|             2| paypal|
+|             3| paypal|
+|             4| paypal|
+|             5| paypal|
+|             6| paypal|
+|             7| paypal|
+|             8| paypal|
+|             9| myshop|
++--------------+-------+
+</pre>
+
+In order to join them, we need to determine how many rows we would like to broadcast. In our case, with only one skewed key, we would use 1, like this:
+
+```scala
+val result = customers.broadcastJoinSkewed(transactions, "company", 1)
+```
+
+The result will look like this, just as if we had used a regular join.
+
+<pre>
++-------+-----------+--------------+
+|company|year_joined|transaction_id|
++-------+-----------+--------------+
+| myshop|       2019|             9|
+| paypal|       2017|             1|
+| paypal|       2017|             2|
+| paypal|       2017|             3|
+| paypal|       2017|             4|
+| paypal|       2017|             5|
+| paypal|       2017|             6|
+| paypal|       2017|             7|
+| paypal|       2017|             8|
++-------+-----------+--------------+
+</pre>
+
+## Doing a join between a number and a range
+
+An interesting type of join that DataFu allows you to do is between a point and a range. A naive solution for this might explode the range columns, but this would cause the table to become huge.
+The DataFu _joinWithRange_ method takes a decrease factor in order to deal with this problem.
+
+As an example, let's imagine two data frames: _grades_, containing graded papers, and _scores_, representing a scoring system.
+
+The dataframe for grades might look like this:
+
+<pre>
++-----+-------+
+|grade|student|
++-----+-------+
+|   37| tyrion|
+|   72|   robb|
+|   83| renley|
+|   64|    ned|
+|   95|  sansa|
+|   88|   arya|
+|   79| cersei|
+|   81|  jaime|
++-----+-------+
+</pre>
+
+The scoring system might look like this:
+
+<pre>
++-----+---+---+
+|grade|min|max|
++-----+---+---+
+|    A| 90|100|
+|    B| 80| 90|
+|    C| 70| 80|
+|    D| 60| 70|
+|    F|  0| 60|
++-----+---+---+
+</pre>
+
+We will use a decrease factor of 10, since each range is of size at least 10.
+
+```scala
+grades.joinWithRange("grade", scores, "min", "max", 10).show
+```
+
+Our result will be as follows:
+
+<pre>
++-----+-------+-----+---+---+
+|grade|student|grade|min|max|
++-----+-------+-----+---+---+
+|   37| tyrion|    F|  0| 60|
+|   72|   robb|    C| 70| 80|
+|   83| renley|    B| 80| 90|
+|   64|    ned|    D| 60| 70|
+|   95|  sansa|    A| 90|100|
+|   88|   arya|    B| 80| 90|
+|   79| cersei|    C| 70| 80|
+|   81|  jaime|    B| 80| 90|
++-----+-------+-----+---+---+
+</pre>
+
+In order to use _joinWithRange_ on tables, they need to meet two requirements:
+
+1. the points table (_grades_ in our example) needs to be distinct on the point column
+2. the range and point columns need to be numeric
+
+If there are ranges that overlap, a point that matches will be joined to all the ranges that include it. In order to take only one range per point, you can use the _joinWithRangeAndDedup_ method.
+
+It takes the same parameters as _joinWithRange_, with one addition - whether to match the largest or smallest range that contains a point.
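+
+A hedged sketch, reusing the _grades_ and _scores_ data frames from above (the trailing boolean and its meaning are assumptions here, so check the Scaladocs for the actual signature):
+
+```scala
+// Sketch only: the final flag is assumed to pick the smallest matching range when true
+// and the largest when false.
+grades.joinWithRangeAndDedup("grade", scores, "min", "max", 10, true).show
+```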
+
+
diff --git a/site/source/index.markdown.erb b/site/source/index.markdown.erb
index 30fa843..100381a 100644
--- a/site/source/index.markdown.erb
+++ b/site/source/index.markdown.erb
@@ -22,15 +22,24 @@ license: >
 Apache DataFu&trade; is a collection of libraries for working with large-scale data in Hadoop.
 The project was inspired by the need for stable, well-tested libraries for data mining and statistics.
 
-It consists of two libraries:
+It consists of three libraries:
 
-* **Apache DataFu Pig**: a collection of user-defined functions for [Apache Pig](http://pig.apache.org/)
+* **Apache DataFu Spark**: a collection of utils and user-defined functions for [Apache Spark](http://spark.apache.org/)
+* **Apache DataFu Pig**: a collection of user-defined functions and macros for [Apache Pig](http://pig.apache.org/)
 * **Apache DataFu Hourglass**: an incremental processing framework for [Apache Hadoop](http://hadoop.apache.org/) in MapReduce
 
 To begin using it, see our [Download](/docs/download.html) page.  If you'd like to help contribute, see [Contributing](/community/contributing.html).
 
 ## About the Project
 
+### Apache DataFu Spark
+
+Apache DataFu Spark is a collection of utils and user-defined functions for [Apache Spark](http://spark.apache.org/).
+This library is based on an internal PayPal project and was open sourced in 2019. It has been used by production workflows at PayPal since 2017.
+All of the code is unit tested to ensure quality.
+
+Check out the [Getting Started](/docs/spark/getting-started.html) guide to learn more.
+
 ### Apache DataFu Pig
 
 Apache DataFu Pig is a collection of useful user-defined functions for data analysis in [Apache Pig](http://pig.apache.org/).
@@ -55,4 +64,4 @@ Work on this library began in early 2013, which led to a
 [presented](http://www.slideshare.net/matthewterencehayes/hourglass-a-library-for-incremental-processing-on-hadoop)
 at [IEEE BigData 2013](http://cci.drexel.edu/bigdata/bigdata2013/).  It is currently in production use at LinkedIn.
 
-Check out the [Getting Started](/docs/hourglass/getting-started.html) guide to learn more.
\ No newline at end of file
+Check out the [Getting Started](/docs/hourglass/getting-started.html) guide to learn more.
diff --git a/site/source/layouts/_docs_nav.erb b/site/source/layouts/_docs_nav.erb
index ecc7683..3c3d4ab 100644
--- a/site/source/layouts/_docs_nav.erb
+++ b/site/source/layouts/_docs_nav.erb
@@ -20,10 +20,17 @@
 <h4>Getting Started</h4>
 <ul class="nav nav-pills nav-stacked">
   <li><a href="/docs/download.html">Download</a></li>
+  <li><a href="/docs/spark/getting-started.html">DataFu Spark</a></li>
   <li><a href="/docs/datafu/getting-started.html">DataFu Pig</a></li>
   <li><a href="/docs/hourglass/getting-started.html">DataFu Hourglass</a></li>
 </ul>
 
+<h4>DataFu Spark Docs</h4>
+<ul class="nav nav-pills nav-stacked">
+
+  <li><a href="/docs/spark/guide.html">Guide</a></li>
+</ul>
+
 <h4>DataFu Pig Docs</h4>
 <ul class="nav nav-pills nav-stacked">
 
@@ -53,4 +60,4 @@
   <li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li>
   <li><a href="https://www.apache.org/security/" target="_blank">Security</a></li>
   <li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
-</ul>
\ No newline at end of file
+</ul>