You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by ma...@apache.org on 2013/11/27 05:56:04 UTC

[7/7] git commit: Merge pull request #146 from JoshRosen/pyspark-custom-serializers

Merge pull request #146 from JoshRosen/pyspark-custom-serializers

Custom Serializers for PySpark

This pull request adds support for custom serializers to PySpark.  For now, all Python-transformed (or parallelize()d RDDs) are serialized with the same serializer that's specified when creating SparkContext.

For now, PySpark includes `PickleSerDe` and `MarshalSerDe` classes for using Python's `pickle` and `marshal` serializers.  It's pretty easy to add support for other serializers, although I still need to add instructions on this.

A few notable changes:

- The Scala `PythonRDD` class no longer manipulates Pickled objects; data from `textFile` is written to Python as MUTF-8 strings.  The Python code performs the appropriate bookkeeping to track which deserializer should be used when reading an underlying JavaRDD.  This mechanism could also be used to support other data exchange formats, such as MsgPack.
- Several magic numbers were refactored into constants.
- Batching is implemented by wrapping / decorating an unbatched SerDe.


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/fb6875dd
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/fb6875dd
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/fb6875dd

Branch: refs/heads/master
Commit: fb6875dd5c9334802580155464cef9ac4d4cc1f0
Parents: 330ada1 1b74a27
Author: Matei Zaharia <ma...@eecs.berkeley.edu>
Authored: Tue Nov 26 20:55:40 2013 -0800
Committer: Matei Zaharia <ma...@eecs.berkeley.edu>
Committed: Tue Nov 26 20:55:40 2013 -0800

----------------------------------------------------------------------
 .../org/apache/spark/api/python/PythonRDD.scala | 149 +++------
 python/epydoc.conf                              |   2 +-
 python/pyspark/accumulators.py                  |   6 +-
 python/pyspark/context.py                       |  71 +++--
 python/pyspark/rdd.py                           |  97 +++---
 python/pyspark/serializers.py                   | 301 ++++++++++++++++---
 python/pyspark/tests.py                         |   3 +-
 python/pyspark/worker.py                        |  44 ++-
 python/run-tests                                |   1 +
 9 files changed, 428 insertions(+), 246 deletions(-)
----------------------------------------------------------------------