Posted to commits@datafu.apache.org by ey...@apache.org on 2019/02/26 14:15:53 UTC

[datafu] branch spark-tmp updated: Add initial README file

This is an automated email from the ASF dual-hosted git repository.

eyal pushed a commit to branch spark-tmp
in repository https://gitbox.apache.org/repos/asf/datafu.git


The following commit(s) were added to refs/heads/spark-tmp by this push:
     new a493967  Add initial README file
a493967 is described below

commit a493967cdcf495cf11f02be83634bc40fd43bd61
Author: Eyal Allweil <ey...@apache.org>
AuthorDate: Tue Feb 26 16:15:19 2019 +0200

    Add initial README file
---
 datafu-spark/README.md | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/datafu-spark/README.md b/datafu-spark/README.md
new file mode 100644
index 0000000..ebad8e1
--- /dev/null
+++ b/datafu-spark/README.md
@@ -0,0 +1,55 @@
+# datafu-spark
+
+datafu-spark contains a number of Spark APIs and a "Scala-Python bridge" that makes it easier to call Scala code from Python, and vice versa.
+
+-----------
+
+To call the datafu-spark APIs from PySpark, you can do the following (tested on a Hortonworks VM).
+
+First, launch pyspark with the following parameters:
+
+```bash
+export PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar
+
+pyspark  --jars datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar
+```
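+
+To verify that the jar is on the Python path, you can try importing the bridge module from inside the pyspark shell (a quick sanity check; a failed import usually means PYTHONPATH does not point at the jar):
+
+```python
+# should succeed silently if PYTHONPATH is set correctly
+import pyspark_utils.df_utils
+```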
+
+The following is an example of calling the Spark version of the datafu _dedup_ method:
+
+```python
+from pyspark_utils.df_utils import PySparkDFUtils
+
+df_utils = PySparkDFUtils()
+
+# sqlContext is created automatically by the pyspark shell
+df_people = sqlContext.createDataFrame([
+    ("a", "Alice", 34),
+    ("a", "Sara", 33),
+    ("b", "Bob", 36),
+    ("b", "Charlie", 30),
+    ("c", "David", 29),
+    ("c", "Esther", 32),
+    ("c", "Fanny", 36),
+    ("c", "Zoey", 36)],
+    ["id", "name", "age"])
+
+# keep one row per id: the highest age, with ties broken by name (descending)
+func_dedup_res = df_utils.dedup(dataFrame=df_people, groupCol=df_people.id,
+                                orderCols=[df_people.age.desc(), df_people.name.desc()])
+
+func_dedup_res.registerTempTable("dedup")
+
+func_dedup_res.show()
+```
+
+This should produce the following output:
+
+<pre>
++---+-----+---+                                                                 
+| id| name|age|
++---+-----+---+
+|  c| Zoey| 36|
+|  b|  Bob| 36|
+|  a|Alice| 34|
++---+-----+---+
+</pre>
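+
+Because the deduplicated result was registered as a temporary table, it can also be queried with SQL. A minimal sketch (the table name "dedup" comes from the registerTempTable call above):
+
+```python
+# query the temp table registered above; sqlContext comes from the pyspark shell
+sqlContext.sql("SELECT name, age FROM dedup WHERE age >= 36").show()
+```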
+
+