You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Amogh Margoor (Code Review)" <ge...@cloudera.org> on 2021/07/04 16:44:12 UTC

[Impala-ASF-CR] IMPALA-7635: Reducing HashTable size by packing it's buckets efficiently.

Amogh Margoor has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/17592 )

Change subject: IMPALA-7635: Reducing HashTable size by packing it's buckets efficiently.
......................................................................

IMPALA-7635: Reducing HashTable size by packing it's buckets efficiently.

HashTable implementation comprises of contiguos
array of Buckets and each Bucket would either
have Data or will point to linked list of
duplicate entries named DuplicateNode. These
are the structures of Bucket and DuplicateNode:

  struct DuplicateNode {
    bool matched;
    DuplicateNode* next;
    HtData htdata;
  };

  struct Bucket {
    bool filled;
    bool matched;
    bool hasDuplicates;
    uint32_t hash;
    union {
      HtData htdata;
      DuplicateNode* duplicates;
    } bucketData;
  };

Size of Bucket is currently 16 bytes and Size of Duplicate Node is
24 bytes. If we can remove the booleans from both struct size of
Bucket would reduce to 12 bytes and DuplicateNode will be 16 bytes.
One of the ways we can remove booleans is to fold it into pointers
already part of struct. Pointers store addresses and on
architectures like x86 and ARM the linear address is only 48 bits
long. With level 5 paging intel is planning to expand it to 57-bit
long which means we can use most significant 7 bits i.e., 58 to 64
bits to store these booleans. This patch reduces the size of Bucket
and DuplicateNode by implementing this folding. However, there is
another requirement regarding Size of Bucket to be power of 2 and
also for the number of buckets in Hash table to be power of 2.
These requirements are for the following reasons:
1. Memory Allocator allocates memory in power of 2 to avoid
   internal fragmentation. Hence, num of buckets * sizeof(Buckets)
   should be power of 2.
2. Number of buckets being power of 2 enables faster modulo
   operation i.e., instead of slow modulo: (hash % N), faster
   (hash & (N-1)) can be used.

Due to this, 4 bytes 'hash' field from Bucket is removed and
stored sperately in new array hash_array_ in HashTable.
This ensures sizeof(Bucket) is 8 which is power of 2.

New Classes:
------------
As a part of patch, TaggedPointer is introduced which is template
class to store pointers and tag together in 64 bit integer. This
structure contains the ownership of the pointer and will take care
of allocation and deallocation the the object being pointed to.
However derived classes can opt out of the ownership of the object
and let the client manage it. It's derived classes for Bucket and
DuplicateNode do the same. These classes are TaggedBucketData and
TaggedDuplicateNode.

Benchmark:
----------
As a part ofthis patch a new Micro Benchmark for HashTable has
been introduced, which will help in measuring these:
1. Runtime for Building Hash Table and Probing the table.
2. Memory consumed after building the Table.
This would help measuring the impact of changes to the HashTable's
data structure and algorithm.

Change-Id: I72912ae9353b0d567a976ca712d2d193e035df9b
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/util/CMakeLists.txt
M be/src/util/benchmark.cc
M be/src/util/benchmark.h
A be/src/util/tagged-ptr-test.cc
A be/src/util/tagged-ptr.h
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/PlannerContext.java
14 files changed, 980 insertions(+), 136 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/92/17592/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17592
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I72912ae9353b0d567a976ca712d2d193e035df9b
Gerrit-Change-Number: 17592
Gerrit-PatchSet: 5
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>