You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by rx...@apache.org on 2013/11/04 05:43:38 UTC

[5/5] git commit: Merge pull request #70 from rxin/hash1

Merge pull request #70 from rxin/hash1

Fast, memory-efficient hash set, hash table implementations optimized for primitive data types.

This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (3X faster - note that the current AppendOnlyMap is already much better than the Java map) while uses much less space (1/4 of the space).

Details:

This PR first adds a open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as building blocks for more advanced structures. It is currently used to build the following two hash tables, but can be used in the future to build multi-valued hash tables as well (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.

Building on top of OpenHashSet, this PR adds two different hash tables implementations:
1. OpenHashSet: for nullable keys, optional specialization for primitive values
2. PrimitiveKeyOpenHashMap: for primitive keys that are not nullable, and optional specialization for primitive values

I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:

Int to Int: ~30%
java.lang.Integer to java.lang.Integer: ~100%
Int to java.lang.Integer: ~50%
java.lang.Integer to Int: ~85%


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/b5dc3393
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/b5dc3393
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/b5dc3393

Branch: refs/heads/master
Commit: b5dc3393a586099229fcd293d96f909e596a11e6
Parents: 41ead7a eb5f8a3
Author: Reynold Xin <rx...@apache.org>
Authored: Sun Nov 3 20:43:15 2013 -0800
Committer: Reynold Xin <rx...@apache.org>
Committed: Sun Nov 3 20:43:15 2013 -0800

----------------------------------------------------------------------
 .../scala/org/apache/spark/util/Utils.scala     |  12 +-
 .../apache/spark/util/collection/BitSet.scala   | 103 +++++++
 .../spark/util/collection/OpenHashMap.scala     | 152 +++++++++++
 .../spark/util/collection/OpenHashSet.scala     | 271 +++++++++++++++++++
 .../collection/PrimitiveKeyOpenHashMap.scala    | 121 +++++++++
 .../spark/util/collection/BitSetSuite.scala     |  73 +++++
 .../util/collection/OpenHashMapSuite.scala      | 148 ++++++++++
 .../util/collection/OpenHashSetSuite.scala      | 145 ++++++++++
 .../PrimitiveKeyOpenHashSetSuite.scala          |  90 ++++++
 9 files changed, 1108 insertions(+), 7 deletions(-)
----------------------------------------------------------------------