You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by mi...@apache.org on 2023/08/24 16:35:21 UTC
[impala] 03/03: IMPALA-12393: Fix inconsistent hash for TimestampValue in DictEncoder

This is an automated email from the ASF dual-hosted git repository.

michaelsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit d96341ed537a3e321d5fa6a0235ab06b5d9169a2
Author: Joe McDonnell <jo...@cloudera.com>
AuthorDate: Tue Aug 22 10:58:22 2023 -0700

    IMPALA-12393: Fix inconsistent hash for TimestampValue in DictEncoder
    
    Currently, DictEncoder uses the default hash function for
    TimestampValue, which means it is hashing the entire
    TimestampValue struct. This can be inconsistent, because
    TimestampValue contains some padding that may not be zero
    in some cases. For TimestampValues that are part of a Tuple,
    the padding is zero, so this is mainly present in test cases.
    
    This was discovered when fixing a Clang Tidy performance-for-range-copy
    warning by iterating with a const reference rather than
    making a copy of the value. DictTest.TestTimestamps became
    flaky with that change, because the hash was no longer
    consistent. The copy must have had consistent content for
    the padding through the iteration, but the const reference
    did not.
    
    This adds a template specialization of the Hash function
    for TimestampValue. The specialization uses TimestampValue::Hash(),
    which hashes only the non-padding pieces of the struct. This
    also includes the change to dict-test.cc that uncovered the
    issue. This fix is mostly to unblock IMPALA-12390.
    
    Testing:
     - Ran dict-test in a loop for a few hundred iterations
     - Hand tested inserting many timestamps into a Parquet table
       with dictionary encoding and verified that the performance didn't
       change.
    
    Change-Id: Iad86e9b0f645311c3389cf2804dcc1a346ff10a9
    Reviewed-on: http://gerrit.cloudera.org:8080/20396
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
    Reviewed-by: Daniel Becker <da...@cloudera.com>
    Reviewed-by: Michael Smith <mi...@cloudera.com>
---
 be/src/util/dict-encoding.h | 7 +++++++
 be/src/util/dict-test.cc    | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/be/src/util/dict-encoding.h b/be/src/util/dict-encoding.h
index a2a92c7bf..6c3b741ed 100644
--- a/be/src/util/dict-encoding.h
+++ b/be/src/util/dict-encoding.h
@@ -459,6 +459,13 @@ inline uint32_t DictEncoder<StringValue>::Hash(const StringValue& value) const {
   return HashUtil::Hash(value.ptr, value.len, 0);
 }
 
+template<>
+inline uint32_t DictEncoder<TimestampValue>::Hash(const TimestampValue& value) const {
+  // TimestampValue needs to use its own hash function, because it has padding
+  // that must be ignored for consistency.
+  return value.Hash();
+}
+
 template<typename T>
 inline int DictEncoder<T>::AddToTable(const T& value, NodeIndex* bucket) {
   DCHECK_GT(encoded_value_size_, 0);
diff --git a/be/src/util/dict-test.cc b/be/src/util/dict-test.cc
index a8d8c6829..097370740 100644
--- a/be/src/util/dict-test.cc
+++ b/be/src/util/dict-test.cc
@@ -60,7 +60,7 @@ void ValidateDict(const vector<InternalType>& values,
   MemPool pool(&tracker);
   DictEncoder<InternalType> encoder(&pool, fixed_buffer_byte_size, &track_encoder);
   encoder.UsedbyTest();
-  for (InternalType i: values) encoder.Put(i);
+  for (const InternalType& i: values) encoder.Put(i);
   bytes_alloc = encoder.DictByteSize();
   EXPECT_EQ(track_encoder.consumption(), bytes_alloc);
   EXPECT_EQ(encoder.num_entries(), values_set.size());