Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/01/07 18:43:59 UTC

[GitHub] gianm opened a new issue #6814: [Discuss] Replacing hyperUnique as 'default' distinct count sketch

URL: https://github.com/apache/incubator-druid/issues/6814
 
 
   Branching off the discussion in #6743. See @leerho's comment https://github.com/apache/incubator-druid/issues/6743#issuecomment-449490568 for the rationale behind transitioning to HllSketch. The transition would be a delicate one, though, and requires solutions to the following problems.
   
   - The on-disk formats are not compatible, and cannot be made compatible, because the two sketches use different hash functions. The migration would therefore be an extended one, and we should expect that some users will never migrate because they cannot reindex their data. (Not all users retain copies of their raw data, even though doing so is a best practice.)
   - The new sketch lives in an extension while the old one lives in core, which invites user confusion. Ideally both would be in core or both would be in extensions.
   - Druid SQL's `COUNT(DISTINCT x)` operator currently uses hyperUnique. Ideally, when run on a complex column, it would adapt to use whichever sketch aggregator matches your segments. When run on strings, it should use the 'best' available sketch.
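
To illustrate the first point, here is a minimal sketch of why registers built with different hash functions cannot be combined. It uses a toy HyperLogLog-style structure, not Druid's hyperUnique or the DataSketches HllSketch code. The class `ToyHll`, the helpers `build` and `naive_register_merge`, and the choice of `sha256` and `blake2b` as stand-in hashes are all assumptions made for illustration.

```python
import hashlib
import math


class ToyHll:
    """Toy HyperLogLog-style sketch; NOT Druid's or DataSketches' implementation."""

    def __init__(self, hash_name, p=10):
        self.hash_name = hash_name  # stand-in hash: any name accepted by hashlib.new
        self.p = p                  # log2 of the register count (alpha below assumes p >= 7)
        self.registers = [0] * (1 << p)

    def _hash64(self, value):
        digest = hashlib.new(self.hash_name, value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        width = 64 - self.p
        bucket = h >> width                   # top p bits pick the register
        rest = h & ((1 << width) - 1)         # remaining bits determine the rank
        rank = width - rest.bit_length() + 1  # leading zeros in `rest`, plus one
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def merge(self, other):
        # Registers are only comparable when both sides hashed values the same way.
        if (other.hash_name, other.p) != (self.hash_name, self.p):
            raise ValueError("sketches built with different hashes or sizes cannot be merged")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = len(self.registers)
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # small-range (linear counting) correction
        return raw


def build(hash_name, values, p=10):
    """Build a toy sketch over `values` using the named stand-in hash."""
    sketch = ToyHll(hash_name, p)
    for v in values:
        sketch.add(v)
    return sketch


def naive_register_merge(a, b):
    """Max the registers while ignoring the hash mismatch (deliberately wrong)."""
    out = ToyHll(a.hash_name, a.p)
    out.registers = [max(x, y) for x, y in zip(a.registers, b.registers)]
    return out


if __name__ == "__main__":
    values = [f"user-{i}" for i in range(5000)]
    old = build("sha256", values)
    new = build("blake2b", values)
    print("same data, one hash:       ", round(old.estimate()))
    print("same data, mixed registers:", round(naive_register_merge(old, new).estimate()))
```

Because each hash sends the same value to an unrelated register and rank, the mixed registers behave as if the two sketches had seen disjoint sets, so the combined estimate lands near double the true count. This is why a format-level converter is not enough, and reindexing from raw data is needed.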
   
   Alternative approaches?
   
   - Patching hyperUnique's implementation to improve its error characteristics. I'm not sure whether this is possible while retaining the same on-disk format. If it is not, the patched implementation would need to be able to read both the current format and a new format.
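
As a rough sketch of what "read both the current format and a new format" could look like, the reader below dispatches on a leading version byte and decodes either layout into the same in-memory register list. Both layouts, the constants `FORMAT_V1` and `FORMAT_V2`, and the functions `decode` and `encode_v2` are hypothetical; they do not describe hyperUnique's actual serialization.

```python
FORMAT_V1 = 0x01  # hypothetical "current" layout: one register per byte
FORMAT_V2 = 0x02  # hypothetical "new" layout: two 4-bit registers per byte


def decode(buf):
    """Decode either hypothetical layout into a plain list of register values."""
    if not buf:
        raise ValueError("empty buffer")
    version, payload = buf[0], buf[1:]
    if version == FORMAT_V1:
        return list(payload)
    if version == FORMAT_V2:
        registers = []
        for byte in payload:
            registers.append(byte >> 4)    # high nibble comes first
            registers.append(byte & 0x0F)  # then the low nibble
        return registers
    raise ValueError(f"unknown sketch format version {version}")


def encode_v2(registers):
    """Write the hypothetical new layout; readers still accept the old one."""
    if len(registers) % 2 or any(not 0 <= r <= 15 for r in registers):
        raise ValueError("need an even number of registers, each in 0..15")
    out = bytearray([FORMAT_V2])
    for hi, lo in zip(registers[0::2], registers[1::2]):
        out.append((hi << 4) | lo)
    return bytes(out)
```

Under this assumption, old segments stay readable without reindexing, and only newly written data uses the new layout.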
