Posted to commits@spark.apache.org by pw...@apache.org on 2014/01/02 22:21:17 UTC

[13/13] git commit: Merge pull request #297 from tdas/window-improvement

Merge pull request #297 from tdas/window-improvement

Improvements to DStream window ops and refactoring of Spark's CheckpointSuite

- Added a new RDD, PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified into a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, the preferred location of partition 0 of the unified RDD will be wherever most of the parent RDDs' partition 0 are located.
- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both of these operations now work by doing a per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads.
- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.
- Added mapSideCombine option to combineByKeyAndWindow.
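
To make the partitioner-aware union idea above concrete, here is a minimal plain-Python model of it. It is only a sketch and is not the Spark implementation or API. A "partitioned dataset" is modelled as a list of p lists of (key, value) pairs, and the helper names (hash_partition, partitioner_aware_union, preferred_location, reduce_partition) and the toy partitioner are all made up for illustration. It shows that when every batch is already partitioned the same way, the window can be built by concatenating partition i of each batch, so the final per-key reduce needs no further data movement.

```python
from collections import Counter
from operator import add


def hash_partition(pairs, num_partitions):
    """Split (key, value) pairs into buckets using a toy deterministic partitioner."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Hypothetical partitioner: sum of character codes modulo p.
        parts[sum(map(ord, key)) % num_partitions].append((key, value))
    return parts


def partitioner_aware_union(datasets):
    """Union m datasets with p partitions each into one dataset with p partitions.

    Partition i of the result is the concatenation of partition i of every
    parent, so a key stays in the partition its partitioner assigned it to.
    """
    num_partitions = len(datasets[0])
    if any(len(d) != num_partitions for d in datasets):
        raise ValueError("all parents must have the same number of partitions")
    return [
        [pair for d in datasets for pair in d[i]]
        for i in range(num_partitions)
    ]


def preferred_location(parent_locations):
    """Pick the most common location among the parents' corresponding partitions."""
    return Counter(parent_locations).most_common(1)[0][0]


def reduce_partition(partition, func):
    """Combine values per key inside a single partition (no cross-partition traffic)."""
    out = {}
    for key, value in partition:
        out[key] = func(out[key], value) if key in out else value
    return sorted(out.items())


# Three batches that fall inside one window.
batches = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]

# Step 1: per-batch reduce, with every batch partitioned the same way.
per_batch = [
    [reduce_partition(p, add) for p in hash_partition(batch, 2)]
    for batch in batches
]

# Step 2: union across the window without repartitioning.
window = partitioner_aware_union(per_batch)

# Step 3: reduce again inside each partition; keys never cross partitions.
reduced = [reduce_partition(p, add) for p in window]
counts = dict(pair for part in reduced for pair in part)
```

In this model, counts ends up as {"a": 10, "b": 7, "c": 4}, and preferred_location(["host1", "host2", "host1"]) returns "host1" (the host names are placeholders). The second reduce only touches data already sitting in the same partition, which is the shuffle that the commit message describes eliminating.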


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/588a1695
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/588a1695
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/588a1695

Branch: refs/heads/master
Commit: 588a1695f4b0b7763ecfa8ea56e371783810dd68
Parents: 7bafb68 5fde456
Author: Patrick Wendell <pw...@gmail.com>
Authored: Thu Jan 2 13:20:54 2014 -0800
Committer: Patrick Wendell <pw...@gmail.com>
Committed: Thu Jan 2 13:20:54 2014 -0800

----------------------------------------------------------------------
 .../spark/rdd/PartitionerAwareUnionRDD.scala    | 110 ++++++
 .../apache/spark/rdd/RDDCheckpointData.scala    |   2 +-
 .../org/apache/spark/CheckpointSuite.scala      | 361 +++++++++++--------
 .../scala/org/apache/spark/rdd/RDDSuite.scala   |  27 ++
 .../spark/streaming/PairDStreamFunctions.scala  |  13 +-
 .../streaming/api/java/JavaPairDStream.scala    |  18 +-
 .../streaming/dstream/ShuffledDStream.scala     |   9 +-
 .../streaming/dstream/WindowedDStream.scala     |  16 +-
 .../spark/streaming/WindowOperationsSuite.scala |   4 +-
 9 files changed, 388 insertions(+), 172 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spark/blob/588a1695/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-spark/blob/588a1695/core/src/test/scala/org/apache/spark/CheckpointSuite.scala
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-spark/blob/588a1695/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-spark/blob/588a1695/streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
----------------------------------------------------------------------