Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/06/10 07:16:30 UTC

[GitHub] [pinot] richardstartin commented on pull request #8878: Optimize the immutable STRING/BYTES dictionary lookup

richardstartin commented on PR #8878:
URL: https://github.com/apache/pinot/pull/8878#issuecomment-1152049617

I’m not sure this is valid for UTF-8 anyway unless the text is normalized, because Unicode doesn’t guarantee a unique representation of characters, so the UTF-8 bytes of equivalent strings can differ. For example, both "\u00e9" and "\u0065\u0301" represent 'é'.

Adding some tests with text from European languages (e.g. French, Czech) should surface this. Normalizing UTF-8 on ingestion solves this problem, if it is not done already.
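
To illustrate the normalization point, here is a minimal sketch using only the JDK. The class name NormalizationSketch and the helper nfcUtf8 are hypothetical, and the choice of NFC is an assumption (the comment does not say which form to use); this is not how Pinot ingests data. It only shows that the two spellings of 'é' encode to different UTF-8 bytes until they are normalized.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

class NormalizationSketch {
  // Hypothetical helper: normalize to NFC (an assumed choice), then encode as UTF-8.
  static byte[] nfcUtf8(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC).getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String precomposed = "\u00e9";      // one code point
    String decomposed = "\u0065\u0301"; // 'e' followed by a combining acute accent

    byte[] a = precomposed.getBytes(StandardCharsets.UTF_8); // 2 bytes: C3 A9
    byte[] b = decomposed.getBytes(StandardCharsets.UTF_8);  // 3 bytes: 65 CC 81

    // The raw encodings differ, so a byte-wise lookup treats them as different keys.
    System.out.println(Arrays.equals(a, b)); // false

    // After normalization both spellings produce the same bytes.
    System.out.println(Arrays.equals(nfcUtf8(precomposed), nfcUtf8(decomposed))); // true
  }
}
```

A test along these lines, with French or Czech text in both composed and decomposed form, would show whether a byte-wise lookup finds the same dictionary entry for both.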

The size of this improvement depends on the data: how long the values are and how much common prefix the strings in the set share.

The best case for this optimisation relative to the baseline is uniformly random long strings (so the loop terminates on the first byte 255 times in 256, and the cost of materialising the string is exacerbated), but natural language text is never uniformly distributed and some sequences of bytes are very common.

When there is more regularity in the data, you might get a different relative outcome. Imagine the strings are English book titles with an average length of ~16 bytes, where a good percentage of them start with “The ” but others start with “Their” or other common words prefixed by “The”. Comparing byte by byte would be even worse with URLs, which have very regular and long common prefixes. It would be better to read the unpadded bytes into the buffer and perform a vectorized comparison with Arrays.mismatch; this will likely regress on randomly generated data but will be much faster otherwise. Please also compare with the existing baseline for URLs, book titles, people’s names written in English from a range of cultures, etc.
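
The Arrays.mismatch suggestion could look roughly like the sketch below. The class MismatchCompareSketch, the method compareUnsigned, and the reusable buffer are hypothetical names for illustration, not code from the PR. It assumes the dictionary is ordered by unsigned byte value and that the unpadded value has already been copied into the first `length` bytes of the buffer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MismatchCompareSketch {
  // Compares key against buffer[0, length) as unsigned bytes (assumed dictionary order).
  // Returns a negative, zero, or positive value, like Comparator.compare.
  static int compareUnsigned(byte[] key, byte[] buffer, int length) {
    // Arrays.mismatch (Java 9+) finds the first differing index, or -1 if the ranges are equal.
    int i = Arrays.mismatch(key, 0, key.length, buffer, 0, length);
    if (i < 0) {
      return 0;
    }
    if (i == key.length || i == length) {
      // One range is a proper prefix of the other: the shorter one sorts first.
      return key.length - length;
    }
    return Byte.toUnsignedInt(key[i]) - Byte.toUnsignedInt(buffer[i]);
  }

  public static void main(String[] args) {
    byte[] key = "The Hobbit".getBytes(StandardCharsets.UTF_8);
    byte[] value = "Their Eyes".getBytes(StandardCharsets.UTF_8);

    // Reusable buffer, larger than the value; only the first value.length bytes are compared.
    byte[] buffer = new byte[64];
    System.arraycopy(value, 0, buffer, 0, value.length);

    // The shared prefix "The" is skipped in one call; ' ' (0x20) sorts before 'i' (0x69).
    System.out.println(compareUnsigned(key, buffer, value.length) < 0); // true
  }
}
```

The benchmark would still need to cover both random data and data with long shared prefixes, since the cost of filling the buffer is paid even when the first byte already differs.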

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
