Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2021/03/07 09:14:19 UTC

[GitHub] [incubator-doris] shiyi23 commented on issue #4718: [Proposal]Support bitmap_hash64 to calculate string value to 64 bit signature in data load

shiyi23 commented on issue #4718:
URL: https://github.com/apache/incubator-doris/issues/4718#issuecomment-792243732


   > **Is your feature request related to a problem? Please describe.**
   > At present, Doris uses a 32-bit integer signature for string values in a bitmap, e.g. count(distinct v1), where v1 is a bitmap-type column that uses bitmap_hash to calculate the hash value.
   > 
   > Although a 32-bit integer signature performs better than a 64-bit one, its precision is low because of hash collisions.
   > 
   > As a result, the value Doris returns is inconsistent with the value calculated offline, and we have to explain the data diff to users: is it caused by the hash error or by a bug in the SQL code? Over time, users stop trusting the results Doris returns.
   > 
   > An erroneous result is less acceptable than a slow query.
   > 
   > **Describe the solution you'd like**
   > The result returned by Doris should be accurate (100% consistent with the result calculated offline, e.g. with sort -u | wc -l).
   > 
   > **Describe alternatives you've considered**
   > The current bitmap_hash computes a 32-bit integer signature. We can add a 64-bit signature function. It would be better to name the signature algorithm explicitly, e.g. murmur3_hash64, so that other signature algorithms can be added later under their own names, e.g. xxx_hash64.
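
A minimal sketch of the effect described above, assuming a distinct count is taken over hash signatures rather than over the strings themselves. The helper names `signature` and `distinct_by_signature` are hypothetical, and BLAKE2b from Python's standard library is only a stand-in for a well-mixed hash; it is not the murmur hash that Doris uses.

```python
import hashlib


def signature(value: str, bits: int) -> int:
    """Return a `bits`-wide integer signature of `value`.

    BLAKE2b is a stand-in for illustration, not the hash Doris uses.
    """
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=bits // 8).digest()
    return int.from_bytes(digest, "little")


def distinct_by_signature(values, bits: int) -> int:
    """Count distinct signatures, which is what a bitmap of signatures reports."""
    return len({signature(v, bits) for v in values})


# 200,000 strings that are all distinct by construction.
values = [f"user-{i}" for i in range(200_000)]

exact = len(set(values))                      # the offline "sort -u | wc -l" answer
count32 = distinct_by_signature(values, 32)   # can fall short of `exact` on collisions
count64 = distinct_by_signature(values, 64)   # collisions are extremely unlikely here

print(f"exact={exact} 32-bit={count32} 64-bit={count64}")
```

Any shortfall of the 32-bit count relative to the exact count comes only from two different strings sharing a signature, so widening the signature is what closes the gap, whichever algorithm is chosen.
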
   > 
   > **Additional context**
   > At present, the Doris code contains the murmur2 and murmur3 signature algorithms:
   > 
   > * murmur_hash3.h: murmur_hash3_x64_64
   > * hash_util.hpp: murmur_hash3_32/murmur_hash2_64/murmur_hash64A (the latter two produce the same results)
   > * seed in the hash function: we use 104729
   > 
   > If we use a 32-bit murmur signature on nearly 100 million values, the error is between 2‰ and 8‰.
   > With a 64-bit murmur signature, the error is zero.
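
As a rough cross-check of these error figures, the expected undercount can be estimated analytically. The sketch below assumes an idealized hash that spreads n distinct values uniformly over m = 2^bits signatures, so the expected number of distinct signatures is m * (1 - (1 - 1/m)^n); the function name `expected_undercount_ratio` is hypothetical. It is an estimate under that assumption, not a reproduction of the measurements quoted above.

```python
import math


def expected_undercount_ratio(n: int, bits: int) -> float:
    """Expected relative undercount for n distinct values and `bits`-wide signatures.

    Assumes an idealized uniform hash (an assumption, not a property of any
    particular murmur variant).
    """
    m = 2.0 ** bits
    # m * (1 - (1 - 1/m)**n), written with log1p/expm1 so that it stays
    # numerically accurate even for m = 2**64.
    expected_distinct = -m * math.expm1(n * math.log1p(-1.0 / m))
    return 1.0 - expected_distinct / n


for n in (20_000_000, 50_000_000, 100_000_000):
    r32 = expected_undercount_ratio(n, 32)
    r64 = expected_undercount_ratio(n, 64)
    print(f"n={n:>11,}  32-bit: {r32 * 1000:.2f} per mille  64-bit: {r64 * 1000:.2e} per mille")
```

For n much smaller than m the ratio is approximately n / (2m), which for a 32-bit signature puts tens of millions of values in the per-mille range, the same order of magnitude as the figures reported here, while for a 64-bit signature it is many orders of magnitude smaller.
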
   > 
   > In addition, given how the bitmap is implemented, try to choose a signature algorithm which has smaller high 32 bits.
   > 
   > We find that the distributions of the high 32-bit values of the signatures produced by the 64-bit signature functions are consistent (< 5%%). None is significantly larger than the others.
   
   @yangzhg I would like to fix this issue. Can you assign it to me?

