Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/07 22:40:37 UTC

[GitHub] [druid] leerho commented on pull request #11201: Add "stringEncoding" parameter to DataSketches HLL.

leerho commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-834830853


   @gianm 
   I want to clarify a comment made above.
   
   > HLL sketches used UTF-16LE encoding when hashing strings. 
   
   This is not correct, at least for the HLL sketch in datasketches-java (I'm not sure what the Druid adapter does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, just convert your string to a char[] and the HLL sketch will accept that as well. The sketch really doesn't care what the string encoding is; it treats the input as a stream of bytes (byte[]) or chars (char[]). The UTF-8 encoding was specified in the string update method to help users ensure consistency (in case the string happened to be encoded in something else). Nonetheless, whatever you decide, you will **always** need to stick with your choice. Otherwise, you will destroy the unique identity of whatever you are feeding the sketch. As a result, counts, merging, etc. will be meaningless!
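
   To illustrate why the encoding choice must stay fixed, here is a minimal sketch using only the Java standard library. It does not use the datasketches-java API: the class `EncodingConsistencyDemo` and its `distinctCount` helper are hypothetical, and an exact `HashSet` over the raw bytes stands in for a sketch that only ever sees the bytes it is given. The same string encoded two ways yields two different byte sequences, so it would be counted as two distinct items.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of datasketches-java or Druid.
class EncodingConsistencyDemo {

    // Exact distinct count over raw byte sequences. A real HLL sketch would
    // hash the bytes and return an estimate, but it likewise sees only bytes.
    static int distinctCount(byte[]... inputs) {
        Set<ByteBuffer> seen = new HashSet<>();
        for (byte[] input : inputs) {
            // ByteBuffer.equals compares content, so identical bytes collapse.
            seen.add(ByteBuffer.wrap(input));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String user = "user-42"; // assumed example value
        byte[] utf8 = user.getBytes(StandardCharsets.UTF_8);
        byte[] utf16le = user.getBytes(StandardCharsets.UTF_16LE);

        // Same encoding both times: one distinct item.
        System.out.println(distinctCount(utf8, utf8));
        // Mixed encodings: the same string now looks like two distinct items.
        System.out.println(distinctCount(utf8, utf16le));
    }
}
```

   Under these assumptions, sticking to one encoding gives a count of 1 for the repeated string, while mixing UTF-8 and UTF-16LE gives 2. The same effect would apply when merging sketches that were built with different encodings.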
   
   I have some comments about [PR 353](https://github.com/apache/datasketches-java/pull/353), but I will make them in the PR itself.
   
   Lee.
   
   
   
   
   
   

