You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by pc...@apache.org on 2018/07/06 20:51:11 UTC
[arrow] branch master updated: ARROW-2798: [Plasma] Use hashing
function that takes into account all UniqueID bytes
This is an automated email from the ASF dual-hosted git repository.
pcmoritz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 9d1432e ARROW-2798: [Plasma] Use hashing function that takes into account all UniqueID bytes
9d1432e is described below
commit 9d1432eda73ea7cfac8ccd96207fd0191b159c2f
Author: songqing <zh...@163.com>
AuthorDate: Fri Jul 6 13:51:03 2018 -0700
ARROW-2798: [Plasma] Use hashing function that takes into account all UniqueID bytes
Now, the hashing of UniqueID in plasma is too simple which has caused a problem. In some cases(for example, in github/ray, UniqueID is composed of a taskID and a index), the UniqueID may be like "ffffffffffffffffffff00", "ffffffffffffffffff01", "fffffffffffffffffff02" ... . The current hashing method is only to copy the first few bytes of a UniqueID and the result is that most of the hashed ids are same, so when the hashed ids put to plasma store, it will become very slow when se [...]
In fact, the same PR has been merged into ray, see https://github.com/ray-project/ray/pull/2174.
and I have tested the perf between the new hashing method and the original one with putting lots of objects continuously, it seems the new hashing method doesn't cost more time.
Author: songqing <zh...@163.com>
Closes #2220 from songqing/oid-hashing and squashes the following commits:
5c803aa0 <songqing> modify murmurhash LICENSE
8b8aa3e1 <songqing> add murmurhash LICENSE
d8d5f93f <songqing> lint fix
426cd1e2 <songqing> lint fix
4767751d <songqing> Use hashing function that takes into account all UniqueID bytes
---
LICENSE.txt | 13 ++++++++++++
cpp/src/plasma/common.cc | 53 ++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 62 insertions(+), 4 deletions(-)
diff --git a/LICENSE.txt b/LICENSE.txt
index 30966d3..57299c4 100644
--- a/LICENSE.txt
+++ b/LICENSE.txt
@@ -297,6 +297,19 @@ You can contact the author at :
--------------------------------------------------------------------------------
+src/plasma/common.cc (some portions)
+
+Copyright (c) Austin Appleby (aappleby (AT) gmail)
+
+Some portions of this file are derived from code in the MurmurHash project
+
+All code is released to the public domain. For business purposes, Murmurhash is
+under the MIT license.
+
+https://sites.google.com/site/murmurhash/
+
+--------------------------------------------------------------------------------
+
src/arrow/util (some portions): Apache 2.0, and 3-clause BSD
Some portions of this module are derived from code in the Chromium project,
diff --git a/cpp/src/plasma/common.cc b/cpp/src/plasma/common.cc
index 2e3899b..ae55fb9 100644
--- a/cpp/src/plasma/common.cc
+++ b/cpp/src/plasma/common.cc
@@ -68,12 +68,57 @@ std::string UniqueID::hex() const {
return result;
}
-size_t UniqueID::hash() const {
- size_t result;
- std::memcpy(&result, id_, sizeof(size_t));
- return result;
+// This code is from https://sites.google.com/site/murmurhash/
+// and is public domain.
+uint64_t MurmurHash64A(const void* key, int len, unsigned int seed) {
+ const uint64_t m = 0xc6a4a7935bd1e995;
+ const int r = 47;
+
+ uint64_t h = seed ^ (len * m);
+
+ const uint64_t* data = reinterpret_cast<const uint64_t*>(key);
+ const uint64_t* end = data + (len / 8);
+
+ while (data != end) {
+ uint64_t k = *data++;
+
+ k *= m;
+ k ^= k >> r;
+ k *= m;
+
+ h ^= k;
+ h *= m;
+ }
+
+ const unsigned char* data2 = reinterpret_cast<const unsigned char*>(data);
+
+ switch (len & 7) {
+ case 7:
+ h ^= uint64_t(data2[6]) << 48;
+ case 6:
+ h ^= uint64_t(data2[5]) << 40;
+ case 5:
+ h ^= uint64_t(data2[4]) << 32;
+ case 4:
+ h ^= uint64_t(data2[3]) << 24;
+ case 3:
+ h ^= uint64_t(data2[2]) << 16;
+ case 2:
+ h ^= uint64_t(data2[1]) << 8;
+ case 1:
+ h ^= uint64_t(data2[0]);
+ h *= m;
+ }
+
+ h ^= h >> r;
+ h *= m;
+ h ^= h >> r;
+
+ return h;
}
+size_t UniqueID::hash() const { return MurmurHash64A(&id_[0], kUniqueIDSize, 0); }
+
bool UniqueID::operator==(const UniqueID& rhs) const {
return std::memcmp(data(), rhs.data(), kUniqueIDSize) == 0;
}