Posted to dev@druid.apache.org by Chi Cao Minh <ch...@imply.io> on 2019/10/28 19:28:39 UTC

Discussion: Moving DataSketches to core

To support range partitioning for native parallel batch indexing, I’m considering moving DataSketches from extensions to core (see https://github.com/apache/incubator-druid/issues/8769 for details). Having DataSketches in core would also allow us to switch usages of HyperLogLogCollector to the better HLL implementation available in DataSketches. One drawback is that moving DataSketches to core will possibly block the work to upgrade DataSketches to the latest version: https://github.com/apache/incubator-druid/pull/8647.

Any other thoughts on the pros/cons?

Thanks,
Chi

Re: Discussion: Moving DataSketches to core

Posted by Chi Cao Minh <ch...@imply.io>.
I tested moving DataSketches to core, and it does not appear to bring in any additional dependencies:

> [INFO] --------------------< org.apache.druid:druid-core >---------------------
> [INFO] Building druid-core 0.17.0-incubating-SNAPSHOT
> [INFO] --------------------------------[ jar ]---------------------------------
> [INFO]
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ druid-core ---
> [INFO] org.apache.druid:druid-core:jar:0.17.0-incubating-SNAPSHOT
> [INFO] +- com.yahoo.datasketches:sketches-core:jar:0.13.4:compile
> [INFO] +- com.yahoo.datasketches:memory:jar:0.12.2:compile
> [INFO] +- commons-io:commons-io:jar:2.6:compile

A diff of the distribution build before and after moving datasketches:

> diff -r before after | grep -v Binary
> Only in before/extensions/druid-datasketches: memory-0.12.2.jar
> Only in before/extensions/druid-datasketches: sketches-core-0.13.4.jar
> Only in before/extensions/druid-datasketches: slf4j-api-1.7.25.jar
> Only in after/lib: memory-0.12.2.jar
> Only in after/lib: sketches-core-0.13.4.jar


Thanks,
Chi


> On Oct 31, 2019, at 9:15 AM, Charles Allen <cr...@apache.org> wrote:
> 
> Any time we discuss moving things into core Druid I would love to see a
> list of dependencies that comes with it.
> 
> On Wed, Oct 30, 2019, 6:08 PM Jihoon Son <ji...@apache.org> wrote:
> 
>> +1 on moving too.
>> 
>> On Mon, Oct 28, 2019 at 12:46 PM Fangjin Yang <fa...@imply.io> wrote:
>> 
>>> +1 on moving datasketches to core
>>> 
>>> On Mon, Oct 28, 2019 at 12:36 PM Chi Cao Minh <ch...@imply.io>
>>> wrote:
>>> 
>>>> To support range partitioning for native parallel batch indexing, I’m
>>>> considering moving DataSketches from extensions to core (see
>>>> https://github.com/apache/incubator-druid/issues/8769 for details).
>>>> Having DataSketches in core would also allow us to switch usages of
>>>> HyperLogLogCollector to the better HLL implementation available in
>>>> DataSketches. One drawback is that moving DataSketches to core will
>>>> possibly block the work to upgrade DataSketches to the latest version:
>>>> https://github.com/apache/incubator-druid/pull/8647.
>>>> 
>>>> Any other thoughts on the pros/cons?
>>>> 
>>>> Thanks,
>>>> Chi
>>> 
>> 




Re: Discussion: Moving DataSketches to core

Posted by Charles Allen <cr...@apache.org>.
Any time we discuss moving things into core Druid I would love to see a
list of dependencies that comes with it.

Re: Discussion: Moving DataSketches to core

Posted by Jihoon Son <ji...@apache.org>.
+1 on moving too.

Re: Discussion: Moving DataSketches to core

Posted by Fangjin Yang <fa...@imply.io>.
+1 on moving datasketches to core
