You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by pc...@apache.org on 2018/07/06 20:51:11 UTC
[arrow] branch master updated: ARROW-2798: [Plasma] Use hashing function that takes into account all UniqueID bytes

This is an automated email from the ASF dual-hosted git repository.

pcmoritz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 9d1432e  ARROW-2798: [Plasma] Use hashing function that takes into account all UniqueID bytes
9d1432e is described below

commit 9d1432eda73ea7cfac8ccd96207fd0191b159c2f
Author: songqing <zh...@163.com>
AuthorDate: Fri Jul 6 13:51:03 2018 -0700

    ARROW-2798: [Plasma] Use hashing function that takes into account all UniqueID bytes
    
    Now, the hashing of UniqueID in plasma is too simple which has caused a problem.  In some cases(for example, in github/ray, UniqueID is composed of a taskID and a index),  the UniqueID may be like "ffffffffffffffffffff00", "ffffffffffffffffff01", "fffffffffffffffffff02" ...  . The current hashing method is only to copy the first few bytes of a UniqueID and the result is that most of the hashed ids  are same, so when the hashed ids  put to plasma store, it will become very slow when se [...]
    
    In fact, the same PR has been merged into ray, see https://github.com/ray-project/ray/pull/2174.
    
    and I have tested the perf between the new hashing method and the original one with putting lots of objects continuously, it seems the new hashing method doesn't cost more time.
    
    Author: songqing <zh...@163.com>
    
    Closes #2220 from songqing/oid-hashing and squashes the following commits:
    
    5c803aa0 <songqing> modify murmurhash LICENSE
    8b8aa3e1 <songqing> add murmurhash LICENSE
    d8d5f93f <songqing> lint fix
    426cd1e2 <songqing> lint fix
    4767751d <songqing> Use hashing function that takes into account all UniqueID bytes
---
 LICENSE.txt              | 13 ++++++++++++
 cpp/src/plasma/common.cc | 53 ++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/LICENSE.txt b/LICENSE.txt
index 30966d3..57299c4 100644
--- a/LICENSE.txt
+++ b/LICENSE.txt
@@ -297,6 +297,19 @@ You can contact the author at :
 
 --------------------------------------------------------------------------------
 
+src/plasma/common.cc (some portions)
+
+Copyright (c) Austin Appleby (aappleby (AT) gmail)
+
+Some portions of this file are derived from code in the MurmurHash project
+
+All code is released to the public domain. For business purposes, Murmurhash is
+under the MIT license.
+
+https://sites.google.com/site/murmurhash/
+
+--------------------------------------------------------------------------------
+
 src/arrow/util (some portions): Apache 2.0, and 3-clause BSD
 
 Some portions of this module are derived from code in the Chromium project,
diff --git a/cpp/src/plasma/common.cc b/cpp/src/plasma/common.cc
index 2e3899b..ae55fb9 100644
--- a/cpp/src/plasma/common.cc
+++ b/cpp/src/plasma/common.cc
@@ -68,12 +68,57 @@ std::string UniqueID::hex() const {
   return result;
 }
 
-size_t UniqueID::hash() const {
-  size_t result;
-  std::memcpy(&result, id_, sizeof(size_t));
-  return result;
+// This code is from https://sites.google.com/site/murmurhash/
+// and is public domain.
+uint64_t MurmurHash64A(const void* key, int len, unsigned int seed) {
+  const uint64_t m = 0xc6a4a7935bd1e995;
+  const int r = 47;
+
+  uint64_t h = seed ^ (len * m);
+
+  const uint64_t* data = reinterpret_cast<const uint64_t*>(key);
+  const uint64_t* end = data + (len / 8);
+
+  while (data != end) {
+    uint64_t k = *data++;
+
+    k *= m;
+    k ^= k >> r;
+    k *= m;
+
+    h ^= k;
+    h *= m;
+  }
+
+  const unsigned char* data2 = reinterpret_cast<const unsigned char*>(data);
+
+  switch (len & 7) {
+    case 7:
+      h ^= uint64_t(data2[6]) << 48;
+    case 6:
+      h ^= uint64_t(data2[5]) << 40;
+    case 5:
+      h ^= uint64_t(data2[4]) << 32;
+    case 4:
+      h ^= uint64_t(data2[3]) << 24;
+    case 3:
+      h ^= uint64_t(data2[2]) << 16;
+    case 2:
+      h ^= uint64_t(data2[1]) << 8;
+    case 1:
+      h ^= uint64_t(data2[0]);
+      h *= m;
+  }
+
+  h ^= h >> r;
+  h *= m;
+  h ^= h >> r;
+
+  return h;
 }
 
+size_t UniqueID::hash() const { return MurmurHash64A(&id_[0], kUniqueIDSize, 0); }
+
 bool UniqueID::operator==(const UniqueID& rhs) const {
   return std::memcmp(data(), rhs.data(), kUniqueIDSize) == 0;
 }