You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by jo...@apache.org on 2020/06/04 18:01:00 UTC

[impala] 02/02: IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix

This is an automated email from the ASF dual-hosted git repository.

joemcdonnell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 3713d5db8dcac540ce0b5cb45974054ca87792db
Author: Gabor Kaszab <ga...@cloudera.com>
AuthorDate: Tue Jun 2 22:08:38 2020 +0200

    IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix
    
    There is a bug in DataSketches HLL MurmurHash where long strings are
    over-read resulting a cardinality estimate that is more than 15% off
    from the correct cardinality number. A recent upstream fix in Apache
    DataSketches addresses this issue and this patch pulls it to Impala.
    
    https://issues.apache.org/jira/browse/DATASKETCHES-5
    
    Testing:
      - I used ds_hll_sketch() and ds_hll_estimate() functions from
        IMPALA-9632 to trigger DataSketches HLL functionality.
      - Ran DataSketches HLL on lineitem.l_comment in TPCH25_parquet to
        reproduce the issue. The symptom was that the actual result was
        around 15% off from the correct cardinality result (~69M vs 79M).
      - After applying this fix re-running the query gives much closer
        results, usually under 3% error range.
    
    Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
    Reviewed-on: http://gerrit.cloudera.org:8080/16026
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/thirdparty/datasketches/MurmurHash3.h | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/be/src/thirdparty/datasketches/MurmurHash3.h b/be/src/thirdparty/datasketches/MurmurHash3.h
index 45a64c6..f68e989 100644
--- a/be/src/thirdparty/datasketches/MurmurHash3.h
+++ b/be/src/thirdparty/datasketches/MurmurHash3.h
@@ -104,14 +104,12 @@ FORCE_INLINE void MurmurHash3_x64_128(const void* key, int lenBytes, uint64_t se
   out.h2 = seed;
 
   // Number of full 128-bit blocks of 16 bytes.
-  // Possible exclusion fo a remainder of up to 15 bytes.
+  // Possible exclusion of a remainder of up to 15 bytes.
   const int nblocks = lenBytes >> 4; // bytes / 16 
 
-  // Process the 128-bit blocks (the body) into teh hash
+  // Process the 128-bit blocks (the body) into the hash
   const uint64_t* blocks = (const uint64_t*)(data);
   for (int i = 0; i < nblocks; ++i) { // 16 bytes per block
-    //uint64_t k1 = getblock64(blocks, 0);
-    //uint64_t k2 = getblock64(blocks, 1);
     uint64_t k1 = getblock64(blocks,i*2+0);
     uint64_t k2 = getblock64(blocks,i*2+1);
 
@@ -124,12 +122,9 @@ FORCE_INLINE void MurmurHash3_x64_128(const void* key, int lenBytes, uint64_t se
     out.h2 = ROTL64(out.h2,31);
     out.h2 += out.h1;
     out.h2 = out.h2*5+0x38495ab5;
-
-    blocks += 2;
   }
 
   // tail
-  //const uint8_t * tail = (const uint8_t*)blocks;
   const uint8_t * tail = (const uint8_t*)(data + (nblocks << 4));
 
   uint64_t k1 = 0;
@@ -175,4 +170,4 @@ FORCE_INLINE void MurmurHash3_x64_128(const void* key, int lenBytes, uint64_t se
 
 //-----------------------------------------------------------------------------
 
-#endif // _MURMURHASH3_H_
\ No newline at end of file
+#endif // _MURMURHASH3_H_