Posted to commits@datafu.apache.org by ey...@apache.org on 2019/02/26 14:15:53 UTC
[datafu] branch spark-tmp updated: Add initial README file
This is an automated email from the ASF dual-hosted git repository.
eyal pushed a commit to branch spark-tmp
in repository https://gitbox.apache.org/repos/asf/datafu.git
The following commit(s) were added to refs/heads/spark-tmp by this push:
new a493967 Add initial README file
a493967 is described below
commit a493967cdcf495cf11f02be83634bc40fd43bd61
Author: Eyal Allweil <ey...@apache.org>
AuthorDate: Tue Feb 26 16:15:19 2019 +0200
Add initial README file
---
datafu-spark/README.md | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/datafu-spark/README.md b/datafu-spark/README.md
new file mode 100644
index 0000000..ebad8e1
--- /dev/null
+++ b/datafu-spark/README.md
@@ -0,0 +1,55 @@
+# datafu-spark
+
+datafu-spark contains a number of Spark APIs and a "Scala-Python bridge" that makes calling Scala code from Python, and vice versa, easier.
+
+-----------
+
+In order to call the datafu-spark APIs from PySpark, you can do the following (tested on a Hortonworks VM).
+
+First, start pyspark with the following parameters:
+
+```bash
+export PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar
+
+pyspark --jars datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar
+```
+
+The following is an example of calling the Spark version of the datafu _dedup_ method:
+
+```python
+from pyspark_utils.df_utils import PySparkDFUtils
+
+df_utils = PySparkDFUtils()
+
+df_people = sqlContext.createDataFrame([
+    ("a", "Alice", 34),
+    ("a", "Sara", 33),
+    ("b", "Bob", 36),
+    ("b", "Charlie", 30),
+    ("c", "David", 29),
+    ("c", "Esther", 32),
+    ("c", "Fanny", 36),
+    ("c", "Zoey", 36)],
+    ["id", "name", "age"])
+
+func_dedup_res = df_utils.dedup(dataFrame=df_people, groupCol=df_people.id,
+                                orderCols=[df_people.age.desc(), df_people.name.desc()])
+
+func_dedup_res.registerTempTable("dedup")
+
+func_dedup_res.show()
+```
+
+This should produce the following output:
+
+<pre>
++---+-----+---+
+| id| name|age|
++---+-----+---+
+| c| Zoey| 36|
+| b| Bob| 36|
+| a|Alice| 34|
++---+-----+---+
+</pre>
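For reference, the dedup semantics shown above can be sketched in plain Python. This is only an illustration of what the method computes (one surviving row per `id`, chosen by the given ordering), not the datafu API itself:

```python
# Plain-Python sketch of the dedup semantics (not the datafu-spark API):
# keep one row per "id", choosing the row that sorts first by (age desc, name desc).
rows = [
    ("a", "Alice", 34), ("a", "Sara", 33),
    ("b", "Bob", 36), ("b", "Charlie", 30),
    ("c", "David", 29), ("c", "Esther", 32),
    ("c", "Fanny", 36), ("c", "Zoey", 36),
]

deduped = {}
for id_, name, age in rows:
    best = deduped.get(id_)
    # A candidate replaces the current best if its (age, name) key is larger,
    # which corresponds to ordering by age desc, then name desc.
    if best is None or (age, name) > (best[2], best[1]):
        deduped[id_] = (id_, name, age)

print(sorted(deduped.values()))
# [('a', 'Alice', 34), ('b', 'Bob', 36), ('c', 'Zoey', 36)]
```

Note how the `c` group keeps Zoey rather than Fanny: the ages tie at 36, so the name ordering breaks the tie, matching the table above.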
+
+