
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/03/20 07:53:28 UTC

[GitHub] [incubator-druid] samarthjain opened a new issue #7303: [PROPOSAL]

samarthjain opened a new issue #7303: [PROPOSAL]
URL: https://github.com/apache/incubator-druid/issues/7303

### Motivation

T-Digest (https://github.com/tdunning/t-digest) is a popular data structure for accurate online accumulation of rank-based statistics such as quantiles and trimmed means. It is also designed for parallel use cases such as distributed aggregations or MapReduce jobs, because combining two intermediate t-digests is easy and efficient.

Several other projects, including Apache Mahout, stream-lib and Elasticsearch, have adopted t-digest. It would be good to add t-digest based aggregators to Druid as well. They would be complementary to the approximate sketch algorithms already available in Druid, such as the moments sketch and the Yahoo quantiles sketch.

### Proposed changes

A new module called druid-tdigestsketch will be added under the extensions-contrib module. The proposal is to add the following aggregators and post aggregator:
1) buildTDigestSketch - this aggregator will generate t-digest based sketches over numeric values. It would generally be used during the indexing phase, where a pre-aggregated sketch over a metric's values is created. It could also be used to generate sketches on the fly at query time.
2) mergeTDigestSketch - this aggregator will combine existing t-digest based sketches. It will generally be used at query time to combine the sketches that the buildTDigestSketch aggregator generated during the indexing phase.
3) quantilesFromTDigestSketch - this post aggregator will take an array of fractions and compute the corresponding quantiles from the t-digest sketches produced by the two aggregators above.
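
The build, merge, and quantile steps listed above can be sketched in a few lines. The example below is a toy, self-contained illustration only: `ToyDigest`, `build_sketch`, `merge_sketches`, and `quantiles_from_sketch` are hypothetical names chosen to mirror the proposed aggregators, and the compression rule (merging adjacent centroids pairwise) is a deliberate simplification, not the real t-digest algorithm, which sizes centroids with a scale function to keep the tails accurate.

```python
from bisect import insort


class ToyDigest:
    """Toy mergeable quantile sketch: a sorted list of (mean, weight) centroids.

    Assumption: this only illustrates the build/merge/query flow. It is not
    the t-digest algorithm and does not share its accuracy guarantees.
    """

    def __init__(self, max_centroids=100):
        self.max_centroids = max_centroids
        self.centroids = []  # kept sorted by mean

    def add(self, value, weight=1.0):
        insort(self.centroids, (float(value), float(weight)))
        # Compress lazily, once the buffer is twice the centroid budget.
        if len(self.centroids) > 2 * self.max_centroids:
            self._compress()

    def merge(self, other):
        # Combining two sketches: pool the centroids, then re-compress.
        self.centroids = sorted(self.centroids + other.centroids)
        self._compress()

    def _compress(self):
        # Merge adjacent pairs into weighted means until within budget.
        while len(self.centroids) > self.max_centroids:
            merged = []
            it = iter(self.centroids)
            for a in it:
                b = next(it, None)
                if b is None:
                    merged.append(a)
                else:
                    w = a[1] + b[1]
                    merged.append(((a[0] * a[1] + b[0] * b[1]) / w, w))
            self.centroids = merged

    def quantile(self, fraction):
        if not self.centroids:
            raise ValueError("empty sketch")
        target = fraction * sum(w for _, w in self.centroids)
        seen = 0.0
        for mean, weight in self.centroids:
            seen += weight
            if seen >= target:
                return mean
        return self.centroids[-1][0]


def build_sketch(values):
    """Indexing-time step: one sketch per segment of raw values."""
    sketch = ToyDigest()
    for v in values:
        sketch.add(v)
    return sketch


def merge_sketches(sketches):
    """Query-time step: combine per-segment sketches without mutating them."""
    result = ToyDigest()
    for s in sketches:
        result.merge(s)
    return result


def quantiles_from_sketch(sketch, fractions):
    """Post-aggregation step: one quantile estimate per requested fraction."""
    return [sketch.quantile(f) for f in fractions]


segment_a = build_sketch(range(1, 501))
segment_b = build_sketch(range(501, 1001))
merged = merge_sketches([segment_a, segment_b])
print(quantiles_from_sketch(merged, [0.25, 0.5, 0.75]))
```

The point of the illustration is that the merged sketch is built from the per-segment sketches alone, without revisiting the raw values, which is what makes pre-aggregation at indexing time useful.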

Changes to user-facing interfaces:
SQL language - support for these aggregators will be added to the Druid SQL grammar.

### Rationale

At my workplace, several data engineering teams have been using t-digest based sketch aggregations both inside and outside of Druid. They have found it to be a good fit for their use cases.

### Operational impact

No operational impact.

### Test plan (optional)

The performance and correctness of t-digest itself are already well covered in existing literature. For the Druid integration, part of the work would be to manually verify that the reported quantiles are correct and similar to the results reported by other frameworks such as Spark and MapReduce.
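
One possible way to make that comparison less manual is to measure the rank error of each reported quantile against the raw values. The helper below is a minimal sketch under that assumption; `exact_quantile` and `max_rank_error` are hypothetical names, and the acceptable error threshold is left to whoever runs the check.

```python
import math
from bisect import bisect_right


def exact_quantile(values, fraction):
    """Exact quantile by the nearest-rank definition, used as ground truth."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def max_rank_error(values, fractions, reported):
    """Largest gap between a requested fraction and the true rank of its estimate.

    For each (fraction, estimate) pair, the true rank is the share of raw
    values that are <= the estimate.
    """
    ordered = sorted(values)
    n = len(ordered)
    worst = 0.0
    for fraction, estimate in zip(fractions, reported):
        true_rank = bisect_right(ordered, estimate) / n
        worst = max(worst, abs(true_rank - fraction))
    return worst
```

Here `reported` would be the quantiles returned by the Druid query (or by another framework), and `values` the raw metric values for the same rows.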

### Future work (optional)

If a new version of the t-digest library is released with a different serialization format, it would be tricky to make sketches written by the old and new versions interoperable. One option would be to write a new module every time the t-digest library is updated. Alternatively, we would need to devise a scheme for versioning the aggregators.
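
As one illustration of what a versioning scheme could look like, the serialized sketch can be wrapped in an envelope whose first byte names the format, so a reader can dispatch to the matching decoder. Everything in the example below is an assumption made for illustration: the two payload layouts (32-bit and 64-bit centroid pairs) are invented and do not correspond to the t-digest library's actual formats.

```python
import struct

# Hypothetical payload layouts: a count followed by (mean, weight) pairs.
PAIR_FORMATS = {
    1: ">ff",  # version 1: 32-bit floats (assumed, for illustration)
    2: ">dd",  # version 2: 64-bit floats (assumed, for illustration)
}


def serialize(centroids, version):
    """Prefix the payload with a single version byte."""
    pair_format = PAIR_FORMATS[version]
    payload = struct.pack(">i", len(centroids))
    for mean, weight in centroids:
        payload += struct.pack(pair_format, mean, weight)
    return bytes([version]) + payload


def deserialize(data):
    """Read the version byte, then decode with the matching layout."""
    version = data[0]
    if version not in PAIR_FORMATS:
        raise ValueError(f"unsupported sketch format version: {version}")
    pair_format = PAIR_FORMATS[version]
    pair_size = struct.calcsize(pair_format)
    (count,) = struct.unpack_from(">i", data, 1)
    return [
        struct.unpack_from(pair_format, data, 5 + i * pair_size)
        for i in range(count)
    ]
```

With an envelope like this, segments written under an older format remain readable after a library upgrade, at the cost of keeping the older decoders around.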
